While scrolling through Twitter (now X) the other evening, I stumbled upon a provocative little programming snippet:
> C devs will tell you this is peak API and Programming Languages design, and nothing better has been invented since 1972. pic.twitter.com/mqELKMVEOb
>
> — Dmitrii Kovanikov (@ChShersh), November 20, 2025
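The code in question only appears as an image in the tweet, so here is a sketch of the C idiom being praised (my reconstruction, not necessarily the exact snippet):

```c
/* Reconstruction of the C-style linear search from the tweet (the
 * original is only visible as an image, so details are an assumption):
 * returns the index of x in items, or -1 if it is not present. */
int find(const int items[], int n, int x) {
    for (int i = 0; i < n; i++) {
        if (items[i] == x) return i;
    }
    return -1;
}
```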
This got me wondering: how fast are “modern” alternatives?
Searching for an element in C++
One of the C++ proponents in the thread suggested the following “ultra-modern” solution using C++23 std::views::enumerate:

```cpp
std::optional<size_t> find(std::span<const int> items, int x) {
for (auto [i, value] : std::views::enumerate(items)) {
if (value == x) return i;
}
return std::nullopt;
}
```

I’m a big fan of structured bindings, available since C++17. They can be used to write very clean code.
I didn’t know about enumerate, but some quick searching showed that an enumerate-like mechanism was already possible in C++17.
It is nice to find out this is now part of the latest standard.
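For reference, here is one way to get an indexed loop in C++17, using std::vector since std::span is C++20 (a minimal sketch, not necessarily the mechanism referred to above):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// C++17: carry the index manually alongside a range-based for loop.
std::optional<std::size_t> find17(const std::vector<int>& items, int x) {
    std::size_t i = 0;
    for (int value : items) {
        if (value == x) return i;
        ++i;
    }
    return std::nullopt;
}
```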
One reason for using std::optional for the return value is that you can use the full range of size_t for the returned index.
In the C approach, an empty array or absence of the item would be indicated by a negative value.
This implies the use of a signed integer type for the result, which limits the range.
The difference matters if you need to search more than 2^31 - 1 elements.
In the types of problems I deal with, this wouldn’t be a problem.
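To make the limit concrete (a small illustration, assuming a typical platform with 32-bit int):

```cpp
#include <cstdint>
#include <limits>

// A signed 32-bit index tops out at 2^31 - 1 elements...
static_assert(std::numeric_limits<std::int32_t>::max() == 2147483647);
// ...while std::optional<std::size_t> covers every valid index and uses
// std::nullopt, rather than a sentinel value, to signal absence.
```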
The second benefit of std::optional is that the result can be used as a boolean:

```cpp
if (auto res = find(items,42)) {
std::cout << "found element 42 at index " << *res << '\n';
}
```

The index contained in the std::optional is then accessed using the familiar dereference operator.
With the C-like API you’d need to store the result before using it, in one of two ways:

```cpp
// 1. store result and check condition
auto idx = find(items,10000,42);
if (idx >= 0) { /* ... */ }
// 2. using C++17 init-statement
if (auto idx = find(items,10000,42); idx >= 0) { /* ... */ }
```

Both are more verbose than using std::optional, and it’s easy to introduce mistakes.
If we accidentally use idx as the condition, it will get converted to a boolean!
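Here is what that mistake looks like (a hypothetical fragment):

```cpp
// BUG: idx is an index, not a truth value. A match at index 0 converts
// to false and is silently treated as "not found", while the -1
// "not found" sentinel converts to true.
if (auto idx = find(items, 10000, 42)) {
    std::cout << "found element 42 at index " << idx << '\n';
}
```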
If we limit ourselves to C++20, there are a couple of other idiomatic alternatives. The first would be the classic iterator-based approach:

```cpp
template<class Iter, class T>
Iter find_iter(Iter begin, Iter end, const T& x) {
for (Iter it = begin; it != end; ++it) {
if (*it == x) return it;
}
return end;
}
```

Another option is to use a range-based for loop:

```cpp
template<class T>
std::optional<std::size_t> find_rbfor(std::span<const T> items, const T& x) {
for (const T& v : items) {
if (v == x) return std::distance(items.data(), &v);
}
return std::nullopt;
}
```

We could also use the std::ranges::find algorithm and avoid writing the loop ourselves:

```cpp
template<class T>
std::optional<std::size_t> find_find(std::span<const T> items, const T& x) {
auto it = std::ranges::find(items, x);
if (it == items.end()) return std::nullopt;
return std::ranges::distance(items.begin(), it);
}
```

This illustrates one of the challenges I personally face with C++: there are often many ways to solve the same problem. As long as the code works correctly and the client code remains clean, any of the implementations will do. Still, it takes time and experience to develop a good intuition for which approach is best. Since I only use C++ sporadically, I get lost in the options easily.
API elegance aside, performance often tells a different story.
Searching a double array
Now when it comes to performance, things are not as straightforward. Most CPUs today have some level of SIMD capabilities and it would be nice to know if they are used. For this purpose I prepared a micro-benchmark using the Google Benchmark framework. The benchmark can be found in my “shed” of code snippets.
For the sake of the experiment, we’ll scan an array of doubles of size 32768 (2^15), searching for the last element.
The times reported below were measured on an Apple M2 Pro.
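The driver for the C-like variant looks roughly like this (a sketch under my own assumptions; the fixture name BM_find_c matches the raw output shown at the end of the post, but the actual code in the shed may differ):

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// C-like search over doubles: index of x, or -1 if absent.
static int find_c(const double items[], int n, double x) {
    for (int i = 0; i < n; i++) {
        if (items[i] == x) return i;
    }
    return -1;
}

static void BM_find_c(benchmark::State& state) {
    const int n = static_cast<int>(state.range(0));
    std::vector<double> items(n, 0.0);
    items.back() = 42.0;  // worst case: the needle is the last element
    for (auto _ : state) {
        benchmark::DoNotOptimize(find_c(items.data(), n, 42.0));
    }
    // Report bandwidth as bytes scanned per unit time.
    state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * n *
                            static_cast<int64_t>(sizeof(double)));
}
BENCHMARK(BM_find_c)->Arg(32768);

BENCHMARK_MAIN();
```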
Using GCC 15.2 I got:
| Variant | Time (ns) | Bandwidth (GiB/s) |
|---|---|---|
| C-like | 5036 | 50.0 |
| Iterators | 10084 | 24.6 |
| Range-based for | 10705 | 24.2 |
| find algorithm | 9847 | 25.0 |
Using Apple clang version 17 I got:
| Variant | Time (ns) | Bandwidth (GiB/s) |
|---|---|---|
| C-like | 10342 | 24.0 |
| Iterators | 9656 | 25.4 |
| Range-based for | 10443 | 23.6 |
| find algorithm | 10074 | 24.6 |
In both cases I compiled the benchmark driver with -O3 -mcpu=native.
If you’re not interested in assembly dumps, feel free to skim through the next few paragraphs.
As we can see, with GCC the C-like loop is the fastest. Looking at the generated assembly in Compiler Explorer, the C-like variant gives the following innermost loop:

```asm
.L7:
add v29.4s, v29.4s, v27.4s
cmp x3, x2
beq .L34
.L9:
ldp q31, q30, [x2], 32
fcmeq v31.2d, v31.2d, v28.2d
fcmeq v30.2d, v30.2d, v28.2d
orr v31.16b, v31.16b, v30.16b
umaxp v31.4s, v31.4s, v31.4s
fmov x5, d31
cbz x5, .L7
```
Here GCC unrolls the loop 4-way and uses vector comparisons, which explains the speed.
The other variants appear to use only 2-way unrolling, a possible explanation for the 2x difference:

```asm
.L40:
add v30.2d, v30.2d, v28.2d
cmp x5, x1
beq .L55
.L43:
lsl x2, x1, 4
add x1, x1, 1
ldr q31, [x4, x2]
fcmeq v31.2d, v31.2d, v29.2d
umaxp v31.4s, v31.4s, v31.4s
fmov x2, d31
cbz x2, .L40
```
With clang, on the other hand, all variants compile to loops of scalar instructions:

```asm
.LBB0_3:
ldr d1, [x8, x0, lsl #3]
fcmp d1, d0
b.eq .LBB0_6
add x0, x0, #1
cmp x9, x0
b.ne .LBB0_3
```
I was interested in what the Intel C++ Compiler (the newer LLVM-based version) would do.
Interestingly, only the C-like variant appears to be vectorized.
For example, targeting the Intel Skylake microarchitecture (-march=skylake), it produces a rather aggressively vectorized loop:

```asm
.LBB0_16:
lea rbx, [r11 + 8*r9]
vcmpeqpd ymm2, ymm1, ymmword ptr [rbx]
vcmpeqpd ymm3, ymm1, ymmword ptr [rbx + 32]
vpackssdw ymm2, ymm2, ymm3
vcmpeqpd ymm3, ymm1, ymmword ptr [rbx + 64]
vpermq ymm2, ymm2, 216
vcmpeqpd ymm4, ymm1, ymmword ptr [rbx + 96]
vpackssdw ymm3, ymm3, ymm4
vpermq ymm3, ymm3, 216
vpackssdw ymm2, ymm2, ymm3
vpermq ymm2, ymm2, 216
vcmpeqpd ymm3, ymm1, ymmword ptr [rbx + 128]
vcmpeqpd ymm4, ymm1, ymmword ptr [rbx + 160]
vpackssdw ymm3, ymm3, ymm4
vpermq ymm3, ymm3, 216
vcmpeqpd ymm4, ymm1, ymmword ptr [rbx + 192]
vcmpeqpd ymm5, ymm1, ymmword ptr [rbx + 224]
vpackssdw ymm4, ymm4, ymm5
vpermq ymm4, ymm4, 216
vpackssdw ymm3, ymm3, ymm4
vpermq ymm3, ymm3, 216
vpacksswb ymm2, ymm2, ymm3
vpmovmskb ebx, ymm2
test ebx, ebx
jne .LBB0_17
add r9, 32
cmp r9, r10
jbe .LBB0_16
```
Here it works on chunks of 32 elements as seen from the 8 memory references and 4 doubles per YMM register.
I did not attempt to measure the performance on an AMD64 machine.
Fortran findloc intrinsic
Fortran 2008 offers a built-in function, findloc, which makes code very clear and concise.
It is generic over intrinsic types and also provides several optional arguments:
```fortran
result = findloc(array, value, dim [, mask] [, kind] [, back])
```
But conciseness isn’t everything—what about performance?
To test this, we can wrap the intrinsic in a short C-compatible wrapper function, allowing it to be called from C or C++:

```fortran
! int f_findloc(double items[], int size, double item);
function f_findloc(items, size, item) bind(c)
use, intrinsic :: iso_c_binding, only: c_int, c_double
implicit none
integer(c_int), value :: size
real(c_double), intent(in) :: items(size)
real(c_double), intent(in), value :: item
integer(c_int) :: f_findloc
! Note: Fortran uses 1-based indexing by default
f_findloc = findloc(items,item,dim=1) - 1
end function
```

One difference from C and C++ is that Fortran uses 1-based indexing by default, and findloc returns 0 to indicate absence of the item. Hence we need to subtract 1 to match the convention used earlier in the C-like variant.
For completeness we will also look at a loop-based version in Fortran. For theatrical purposes, I’m writing it in legacy fixed-form Fortran, complete with implicit typing and labels:

```fortran
INTEGER FUNCTION DFIND(ITEMS, N, X)
DOUBLE PRECISION ITEMS(N), X
DO 10 I = 1, N
IF (X .EQ. ITEMS(I)) THEN
DFIND = I
RETURN
ENDIF
10 CONTINUE
DFIND = 0
RETURN
END
```

This time, we have to rely on the platform calling convention to call the routine from C++. We can achieve this with the following prototype:

```cpp
extern "C" int dfind_(double items[], int *n, double *x);
```

For the tests we’ll be using gfortran 15.2 and flang 20.1 with the flags -O3 -mcpu=apple-m2.
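Before looking at the numbers, here is a quick illustration of how the routine is called from C++ (a sketch; gfortran lowercases the symbol and appends a trailing underscore, and scalar arguments are passed by reference, hence the pointers):

```cpp
#include <iostream>

extern "C" int dfind_(double items[], int *n, double *x);

int main() {
    double items[] = {1.0, 2.5, 42.0, 7.0};
    int n = 4;
    double x = 42.0;
    int pos = dfind_(items, &n, &x);  // 1-based index, 0 if absent
    std::cout << "0-based index: " << pos - 1 << '\n';  // prints 2
}
```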
The performance results are as follows:
| Compiler | Time (ns) | Bandwidth (GiB/s) |
|---|---|---|
| gfortran findloc | 10422 | 24.1 |
| gfortran DFIND | 4891 | 50.0 |
| flang findloc | 48566 | 5.0 |
| flang DFIND | 9998 | 24.4 |
With gfortran the performance is comparable to the previous gcc results.
Once again, the difference between findloc and raw loops is probably due to the implicit compiler choice between 2-way and 4-way unrolling.
It would be interesting to determine which compilation or language factor causes this difference.
On the other hand, flang’s findloc implementation proves to be considerably slower.
We can look for an explanation for the low performance by peeking into the assembly.
Unlike gfortran, which inlines findloc, flang calls a runtime function (__FortranAFindlocDim).
The implementation of this function can be found in the LLVM repository.
For the record, I’m using flang installed via the brew package manager.
Maybe some LLVM compilation tweaks could improve the performance of this routine?
Hopefully the performance will improve in future flang releases.
For completeness, I also looked at the output of the Intel Fortran compilers. Neither ifort 2021.11 (the last pre-deprecation release) nor ifx 2024.0.0 vectorizes the loop, and both use scalar instructions instead.
So if you need a fast search over an array of doubles, and performance is the priority, a straightforward hand-written loop remains an efficient option. It may not be the peak of API or programming language design, but it gets the job done.
Searching an int array
The original discussion on Twitter concerned an integer array, so I ran the same micro-benchmarks for int as well.
In this scenario, there is less variation across C++ variants.
Integer search using GCC 15.2:
| Variant | Time (ns) | Bandwidth (GiB/s) |
|---|---|---|
| C-like | 4840 | 25.2 |
| Iterators | 4863 | 25.1 |
| Range-based for | 4909 | 24.9 |
| find algorithm | 4899 | 24.9 |
| Fortran findloc | 4834 | 25.3 |
| Fortran IFIND | 4801 | 25.4 |
Integer search using clang 17.0 and flang 20.1:
| Variant | Time (ns) | Bandwidth (GiB/s) |
|---|---|---|
| C-like | 9690 | 12.6 |
| Iterators | 9972 | 12.4 |
| Range-based for | 11138 | 11.9 |
| find algorithm | 9932 | 12.3 |
| Fortran findloc | 49764 | 2.5 |
| Fortran IFIND | 9913 | 12.3 |
With both compiler families the C-like solution happened to be a tiny fraction faster than the other C++ variants in this particular run. In general I observed a few percent of variation when rerunning the microbenchmark driver. I didn’t inspect the generated assembly this time, but the differences are small enough to be considered negligible.
As before, the findloc implementation in flang lags behind the others.
Replicating the results
As mentioned earlier, the code can be found in the shed/perf folder (which also contains a bunch of other stuff).
You can also download a standalone tarball (12.4 KB).
To unpack the files use:

```sh
tar xf shed-main_perf.tar.gz && cd shed
```
Create the build directory and configure the CMake project with your desired options:

```sh
mkdir build && cd build
cmake .. \
-DCMAKE_Fortran_COMPILER=gfortran \
-DCMAKE_Fortran_FLAGS="-O3 -mcpu=native" \
-DCMAKE_C_COMPILER=gcc-15 \
-DCMAKE_C_FLAGS="-O3 -mcpu=native" \
-DCMAKE_CXX_COMPILER=g++-15 \
-DCMAKE_CXX_FLAGS="-O3 -mcpu=native" \
    -DCMAKE_BUILD_TYPE=Release
```
If configuration succeeds, you can build the executable and run the benchmark driver with:

```sh
make findloc
./findloc --benchmark_filter=32768
```
The raw output will look similar to this:

```text
2025-11-22T20:38:38+01:00
Running ./findloc
Run on (10 X 24.0006 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x10)
Load Average: 193.07, 174.62, 183.12
------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------
BM_find_fortran/32768 10330 ns 10020 ns 69115 bytes_per_second=24.3652Gi/s
BM_find_dfind/32768 4914 ns 4909 ns 141821 bytes_per_second=49.7352Gi/s
BM_find_c/32768 4855 ns 4855 ns 143112 bytes_per_second=50.2876Gi/s
BM_find_iter/32768 9760 ns 9754 ns 71777 bytes_per_second=25.0292Gi/s
BM_find_rbfor/32768 9756 ns 9747 ns 70924 bytes_per_second=25.0469Gi/s
BM_find_range/32768 9795 ns 9791 ns 72052 bytes_per_second=24.9351Gi/s
BM_find_int_fortran/32768 4853 ns 4853 ns 143246 bytes_per_second=25.1543Gi/s
BM_find_int_ifind/32768 4838 ns 4837 ns 144714 bytes_per_second=25.235Gi/s
BM_find_int_c/32768 4854 ns 4850 ns 143823 bytes_per_second=25.1675Gi/s
BM_find_int_iter/32768 4851 ns 4850 ns 142512 bytes_per_second=25.1686Gi/s
BM_find_int_rbfor/32768 4834 ns 4832 ns 144733 bytes_per_second=25.2625Gi/s
BM_find_int_range/32768 5175 ns 5010 ns 100000 bytes_per_second=24.3661Gi/s
```