Is the Humble For-Loop a Good Way to Search an Array?

While scrolling through Twitter (now X) the other evening, I stumbled upon a provocative little programming snippet:
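
The snippet itself isn’t reproduced here, but judging from the calling code later in this post, the C-like function at the heart of the debate looked roughly like this (a sketch on my part; the exact signature and the negative “not found” convention are inferred, not quoted):

int find(const int items[], int size, int x) {
    for (int i = 0; i < size; ++i) {
        if (items[i] == x) return i;
    }
    return -1; // a negative value signals absence
}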

This got me wondering: how fast are “modern” alternatives?

Searching for an element in C++

One of the C++ proponents in the thread suggested the following “ultra-modern” solution using C++23 std::views::enumerate:

#include <cstddef>
#include <optional>
#include <ranges>
#include <span>

std::optional<size_t> find(std::span<const int> items, int x) {
	// views::enumerate yields (index, element) pairs
	for (auto [i, value] : std::views::enumerate(items)) {
		if (value == x) return i;
	}
	return std::nullopt;
}

I’m a big fan of structured bindings, available since C++17. They can be used to write very clean code.

I didn’t know about enumerate, but some quick searching showed that an enumerate-like mechanism was already possible in C++17. It is nice to find out this is now part of the latest standard.
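
For the curious, here is one way such a helper can be written in C++17. This is a sketch adapted from a pattern that has circulated in the community, not a standard facility; the name enumerate is mine:

#include <cstddef>
#include <iterator>
#include <tuple>
#include <utility>

// Wraps any iterable and yields (index, element) pairs
// suitable for structured bindings.
template <typename T,
          typename Iter = decltype(std::begin(std::declval<T>()))>
auto enumerate(T&& iterable) {
    struct iterator {
        std::size_t i;
        Iter it;
        bool operator!=(const iterator& other) const { return it != other.it; }
        void operator++() { ++i; ++it; }
        auto operator*() const { return std::tie(i, *it); }
    };
    struct wrapper {
        T iterable;
        auto begin() { return iterator{0, std::begin(iterable)}; }
        auto end() { return iterator{0, std::end(iterable)}; }
    };
    return wrapper{std::forward<T>(iterable)};
}

With this helper, the loop above reads essentially the same as the C++23 version.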

One reason for using std::optional for the return value is that the full range of size_t remains available for the returned index. In the C approach, an empty array or the absence of the item is indicated by a negative value, which forces a signed integer type for the result and thereby limits the range. The difference matters only if you need to search more than 2^31 - 1 elements (assuming a 32-bit int). In the types of problems I deal with, this wouldn’t be a problem.

The second benefit of std::optional is that the result can be used directly as a boolean condition:

if (auto res = find(items,42)) {
	std::cout << "found element 42 at index " << *res << '\n';
}

The index contained in the std::optional is then accessed using the familiar dereference operator.
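
As an aside, std::optional also offers checked access: value() throws std::bad_optional_access when the optional is empty, which can be preferable to the unchecked dereference:

auto res = find(items, 42);
std::size_t idx = res.value(); // throws std::bad_optional_access if empty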

With the C-like API you’d need to store the result before using it in one of two ways:

// 1. store result and check condition
auto idx = find(items,10000,42);
if (idx >= 0) { /* ... */ }

// 2. using C++17 init-statement
if (auto idx = find(items,10000,42); idx >= 0) { /* ... */ }

Both are more verbose than using std::optional, and it’s easy to introduce mistakes. If we accidentally use idx itself as the condition, it gets converted to a boolean, and the negative “not found” value is nonzero, hence true!
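
To make the hazard concrete, consider this contrived sketch:

// BUG: idx itself is used as the condition; the "not found"
// value -1 is nonzero and therefore converts to true
if (auto idx = find(items, 10000, 42)) {
    // this branch also runs when 42 is absent from the array
}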

If we limit ourselves to C++20, there are a couple of other idiomatic alternatives. The first would be the classic iterator-based approach:

template<class Iter, class T>
Iter find_iter(Iter begin, Iter end, const T& x) {
    for (Iter it = begin; it != end; ++it) {
        if (*it == x) return it;
    }
    return end;
}
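
Like std::find, this variant signals absence by returning the end iterator, so the index has to be recovered at the call site. A usage sketch, assuming items is a contiguous container such as std::vector or std::span:

auto it = find_iter(items.begin(), items.end(), 42);
if (it != items.end()) {
    auto idx = static_cast<std::size_t>(it - items.begin());
    std::cout << "found element 42 at index " << idx << '\n';
}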

Another option is to use a range-based for loop:

template<class T>
std::optional<std::size_t> find_rbfor(std::span<const T> items, const T& x) {
    for (const T& v : items) {
        // span is contiguous, so the element's address encodes its index
        if (v == x) return std::distance(items.data(), &v);
    }
    return std::nullopt;
}

We could also use the standard find algorithm (here in its std::ranges form) and avoid writing the loop ourselves:

template<class T>
std::optional<std::size_t> find_find(std::span<const T> items, const T& x) {
    auto it = std::ranges::find(items, x);
    if (it == items.end()) return std::nullopt;
    return std::ranges::distance(items.begin(), it);
}

This illustrates one of the challenges I personally face with C++: there are often many ways to solve the same problem. As long as the code works correctly and the client code remains clean, any of the implementations will do. Still, it takes time and experience to develop a good intuition for which approach is best. Since I only use C++ sporadically, I get lost in the options easily.

API elegance aside, performance often tells a different story.

Searching a double array

Now, when it comes to performance, things are not as straightforward. Most CPUs today have some level of SIMD capability, and it would be nice to know whether it is actually used. For this purpose I prepared a micro-benchmark using the Google Benchmark framework. The benchmark can be found in my “shed” of code snippets.

For the sake of the experiment, we’ll scan an array of doubles of size 32768 (2^15), searching for the last element, i.e. the worst case for a linear search. The resulting time is measured on an Apple M2 Pro.
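
A condensed sketch of how one such benchmark case can be written is shown below. It is not the actual driver from the repository; the double overload of the C-like find is an assumption on my part, and DoNotOptimize prevents the compiler from eliding the call:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// C-like search over doubles (assumed counterpart of the int version)
static int find(const double items[], int size, double x) {
    for (int i = 0; i < size; ++i) {
        if (items[i] == x) return i;
    }
    return -1;
}

static void BM_find_c(benchmark::State& state) {
    const int n = static_cast<int>(state.range(0));
    std::vector<double> items(n, 0.0);
    items.back() = 42.0; // worst case: the match sits at the very end
    for (auto _ : state) {
        benchmark::DoNotOptimize(find(items.data(), n, 42.0));
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * n * int64_t(sizeof(double)));
}
BENCHMARK(BM_find_c)->Arg(32768);
BENCHMARK_MAIN();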

Using GCC 15.2 I got:

Variant              Time (ns)   Bandwidth (GiB/s)
C-like                    5036                50.0
Iterators                10084                24.6
Range-based for          10705                24.2
find algorithm            9847                25.0

Using Apple clang version 17 I got:

Variant              Time (ns)   Bandwidth (GiB/s)
C-like                   10342                24.0
Iterators                 9656                25.4
Range-based for          10443                23.6
find algorithm           10074                24.6

In both cases I compiled the benchmark driver with -O3 -mcpu=native. If you’re not interested in assembly dumps, feel free to skim through the next few paragraphs.

As we can see, with gcc the C-like loop is the fastest. Looking into the generated assembly in Compiler Explorer, the C-like variant in GCC gives the following innermost loop:

.L7:
        add     v29.4s, v29.4s, v27.4s
        cmp     x3, x2
        beq     .L34
.L9:
        ldp     q31, q30, [x2], 32
        fcmeq   v31.2d, v31.2d, v28.2d
        fcmeq   v30.2d, v30.2d, v28.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x5, d31
        cbz     x5, .L7

Here GCC uses 4-way unrolling, comparing four doubles per iteration with vector instructions, which explains the speed.

The other variants appear to use only 2-way unrolling, a possible explanation for the 2x difference:

.L40:
        add     v30.2d, v30.2d, v28.2d
        cmp     x5, x1
        beq     .L55
.L43:
        lsl     x2, x1, 4
        add     x1, x1, 1
        ldr     q31, [x4, x2]
        fcmeq   v31.2d, v31.2d, v29.2d
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x2, d31
        cbz     x2, .L40

With clang, on the other hand, all variants use loop nests with scalar instructions:

.LBB0_3:
        ldr     d1, [x8, x0, lsl #3]
        fcmp    d1, d0
        b.eq    .LBB0_6
        add     x0, x0, #1
        cmp     x9, x0
        b.ne    .LBB0_3

I was interested in what the Intel C++ Compiler (the newer LLVM-based version) would do. Interestingly, only the C-like variant appears to be vectorized. For example targeting the Intel Skylake microarchitecture (-march=skylake), it produces a rather aggressively vectorized loop:

.LBB0_16:
        lea     rbx, [r11 + 8*r9]
        vcmpeqpd        ymm2, ymm1, ymmword ptr [rbx]
        vcmpeqpd        ymm3, ymm1, ymmword ptr [rbx + 32]
        vpackssdw       ymm2, ymm2, ymm3
        vcmpeqpd        ymm3, ymm1, ymmword ptr [rbx + 64]
        vpermq  ymm2, ymm2, 216
        vcmpeqpd        ymm4, ymm1, ymmword ptr [rbx + 96]
        vpackssdw       ymm3, ymm3, ymm4
        vpermq  ymm3, ymm3, 216
        vpackssdw       ymm2, ymm2, ymm3
        vpermq  ymm2, ymm2, 216
        vcmpeqpd        ymm3, ymm1, ymmword ptr [rbx + 128]
        vcmpeqpd        ymm4, ymm1, ymmword ptr [rbx + 160]
        vpackssdw       ymm3, ymm3, ymm4
        vpermq  ymm3, ymm3, 216
        vcmpeqpd        ymm4, ymm1, ymmword ptr [rbx + 192]
        vcmpeqpd        ymm5, ymm1, ymmword ptr [rbx + 224]
        vpackssdw       ymm4, ymm4, ymm5
        vpermq  ymm4, ymm4, 216
        vpackssdw       ymm3, ymm3, ymm4
        vpermq  ymm3, ymm3, 216
        vpacksswb       ymm2, ymm2, ymm3
        vpmovmskb       ebx, ymm2
        test    ebx, ebx
        jne     .LBB0_17
        add     r9, 32
        cmp     r9, r10
        jbe     .LBB0_16

Here it works on chunks of 32 elements as seen from the 8 memory references and 4 doubles per YMM register.

I did not attempt to measure the performance on an AMD64 machine.

Fortran findloc intrinsic

Fortran 2008 offers a built-in function, findloc, which makes code very clear and concise. It is generic over intrinsic types and also provides several optional arguments:

result = findloc(array, value, dim [, mask] [, kind] [, back])

But conciseness isn’t everything—what about performance?

To test this, we can wrap the intrinsic in a short C-compatible wrapper function, allowing it to be called from C or C++:

! int f_findloc(double items[], int size, double item);
function f_findloc(items, size, item) bind(c)
    use, intrinsic :: iso_c_binding, only: c_int, c_double
    implicit none
    integer(c_int), value :: size
    real(c_double), intent(in) :: items(size)
    real(c_double), intent(in), value :: item
    integer(c_int) :: f_findloc

    ! findloc returns 0 when the item is absent; subtracting 1 converts
    ! the 1-based Fortran index to the 0-based C convention (-1 = absent)
    f_findloc = findloc(items,item,dim=1) - 1

end function

One difference from C and C++ is that Fortran uses 1-based indexing by default, and the convention of findloc is to return 0 to indicate absence of the item. Hence we need to subtract 1 to match the 0-based convention used earlier in the C-like variant.

For completeness we will also look at a loop-based version in Fortran. For theatrical purposes, I’m writing it in legacy fixed-form Fortran complete with implicit typing and labels:

      INTEGER FUNCTION DFIND(ITEMS, N, X)
      DOUBLE PRECISION ITEMS(N), X
      DO 10 I = 1, N
         IF (X .EQ. ITEMS(I)) THEN
            DFIND = I
            RETURN
         ENDIF
   10 CONTINUE
      DFIND = 0
      RETURN
      END

This time there is no bind(c), so we have to rely on the platform calling convention to call the routine from C++: gfortran mangles the name with a trailing underscore, and arguments are passed by reference. We can achieve this with the following prototype:

extern "C" int dfind_(double items[], int *n, double *x);

For the tests we’ll be using gfortran 15.2 and flang 20.1 with the flags -O3 -mcpu=apple-m2.

The performance results look as follows:

Variant              Time (ns)   Bandwidth (GiB/s)
gfortran findloc         10422                24.1
gfortran DFIND            4891                50.0
flang findloc            48566                 5.0
flang DFIND               9998                24.4

With gfortran the performance is comparable to the previous gcc results. Once again, the difference between findloc and the raw loop is probably due to the compiler’s implicit choice between 2-way and 4-way unrolling. It would be interesting to determine which compilation or language factor causes this difference.

On the other hand, flang’s findloc implementation proves to be much slower. We can look for an explanation for the low performance by peeking into the assembly. Unlike gfortran, which inlines findloc, flang calls a runtime function (__FortranAFindlocDim). The implementation of this function can be found in the LLVM repository. For the record, I’m using flang installed via the brew package manager; maybe some LLVM compilation tweaks could improve the performance of this routine. Hopefully the performance will improve in future flang releases.

For completeness, I also looked at the output of the Intel Fortran compilers. Neither ifort 2021.11 (the last pre-deprecation release) nor ifx 2024.0.0 vectorizes the loop; both fall back to scalar instructions.

So if you need to search an array of doubles and performance is the priority, a straightforward hand-written loop remains an efficient option. It may not be the peak of API or programming language design, but it gets the job done.

Searching an int array

The original discussion on Twitter concerned an integer array, so I ran the same micro-benchmarks for int as well. In this scenario, there is less variation across C++ variants.

Integer search using GCC 15.2 and gfortran 15.2:

Variant              Time (ns)   Bandwidth (GiB/s)
C-like                    4840                25.2
Iterators                 4863                25.1
Range-based for           4909                24.9
find algorithm            4899                24.9
Fortran findloc           4834                25.3
Fortran IFIND             4801                25.4

Integer search using clang 17.0 and flang 20.1:

Variant              Time (ns)   Bandwidth (GiB/s)
C-like                    9690                12.6
Iterators                 9972                12.4
Range-based for          11138                11.9
find algorithm            9932                12.3
Fortran findloc          49764                 2.5
Fortran IFIND             9913                12.3

With both compiler families the C-like solution happened to be a tiny fraction faster in this particular run; in general I observed a few percent of variation when rerunning the micro-benchmark driver. I didn’t inspect the generated assembly this time, but the differences are small enough to be considered negligible.

As before, the findloc implementation in flang lags behind the others.

Replicating the results

As mentioned earlier, the code can be found in the shed/perf folder (which also contains a bunch of other stuff). You can also download a standalone tarball (12.4 KB). To unpack the files use:

tar xf shed-main_perf.tar.gz && cd shed

Create the build directory and configure the CMake project with your desired options:

mkdir build && cd build
cmake .. \
  -DCMAKE_Fortran_COMPILER=gfortran \
  -DCMAKE_Fortran_FLAGS="-O3 -mcpu=native" \
  -DCMAKE_C_COMPILER=gcc-15 \
  -DCMAKE_C_FLAGS="-O3 -mcpu=native" \
  -DCMAKE_CXX_COMPILER=g++-15 \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=native" \
  -DCMAKE_BUILD_TYPE=Release

If configuration succeeds, you can build the executable and run the benchmark driver with:

make findloc
./findloc --benchmark_filter=32768

The raw output will look similar to this:

2025-11-22T20:38:38+01:00
Running ./findloc
Run on (10 X 24.0006 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
Load Average: 193.07, 174.62, 183.12
------------------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------
BM_find_fortran/32768          10330 ns        10020 ns        69115 bytes_per_second=24.3652Gi/s
BM_find_dfind/32768             4914 ns         4909 ns       141821 bytes_per_second=49.7352Gi/s
BM_find_c/32768                 4855 ns         4855 ns       143112 bytes_per_second=50.2876Gi/s
BM_find_iter/32768              9760 ns         9754 ns        71777 bytes_per_second=25.0292Gi/s
BM_find_rbfor/32768             9756 ns         9747 ns        70924 bytes_per_second=25.0469Gi/s
BM_find_range/32768             9795 ns         9791 ns        72052 bytes_per_second=24.9351Gi/s
BM_find_int_fortran/32768       4853 ns         4853 ns       143246 bytes_per_second=25.1543Gi/s
BM_find_int_ifind/32768         4838 ns         4837 ns       144714 bytes_per_second=25.235Gi/s
BM_find_int_c/32768             4854 ns         4850 ns       143823 bytes_per_second=25.1675Gi/s
BM_find_int_iter/32768          4851 ns         4850 ns       142512 bytes_per_second=25.1686Gi/s
BM_find_int_rbfor/32768         4834 ns         4832 ns       144733 bytes_per_second=25.2625Gi/s
BM_find_int_range/32768         5175 ns         5010 ns       100000 bytes_per_second=24.3661Gi/s

#C++   #Fortran   #Performance   #Findloc