The -march flag itself is GCC-specific, but the general advice is universal: don’t forget to tell your compiler that it can take full advantage of your spiffy new CPU! I should know better, but I’ve been forgetting to specify -march when compiling upb.

Here’s an extreme example of why. Take an innocent-looking function like:

int float_to_int(float f) {
  return (int)f;
}

Looks simple enough, right? Unfortunately, float -> int casts are stupidly expensive on x86 when the compiler is limited to the x87 floating-point unit. Without any -m flags, gcc (targeting 32-bit x86) compiles this to:

sub      $0x8, %esp       ; allocate stack space
fnstcw   0x6(%esp)        ; save floating-point control word
flds     0xc(%esp)        ; push floating-point param onto fp stack
movzwl   0x6(%esp), %eax  ; move prev fp control word into %eax
mov      $0xc, %ah        ; set rounding mode of control word to "truncate"
mov      %ax, 0x4(%esp)   ; save it *back* to the stack
fldcw    0x4(%esp)        ; set the floating-point control word to truncate
fistp    0x2(%esp)        ; store integer from the fp stack to the stack
fldcw    0x6(%esp)        ; set the fp control word back to what it was
movzwl   0x2(%esp), %eax  ; read the value into eax (the return value)
add      $0x8, %esp       ; give the stack space back
ret

This would be funny if it weren’t so sad. All these gymnastics are needed because the C standard requires the cast to truncate toward zero, but the x87 floating-point unit normally runs in round-to-nearest mode. So the rounding mode has to be switched just for the conversion and then switched back.
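To make the rounding requirement concrete, here is a minimal sketch of what the cast has to produce. The floor_to_int helper is hypothetical (it is not part of upb or the example above) and is only there to contrast truncation with flooring; it assumes the input fits comfortably in an int.

```c
/* The cast from the example above: (int)f discards the fractional part,
   i.e. it truncates toward zero. */
int float_to_int(float f) {
  return (int)f;
}

/* Hypothetical helper for contrast: rounds toward negative infinity
   (floor), built on top of the truncating cast.
   Assumes f is well within the range of int. */
int floor_to_int(float f) {
  int t = (int)f;              /* truncated toward zero */
  return (t > f) ? t - 1 : t;  /* step down for negative non-integers */
}
```

For positive inputs the two agree (2.7f gives 2 either way). They differ for negative non-integers: float_to_int(-2.7f) gives -2, while floor_to_int(-2.7f) gives -3. Neither matches round-to-nearest, which is why the x87 code has to change modes.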

Compiling exactly the same code with -msse2 allows the compiler to take advantage of an SSE-only instruction, and the above is replaced with:

cvttss2si  0x4(%esp), %eax     ; convert value to integer with truncation
ret

The difference in this case is astounding. Hopefully this will motivate you never to forget the -march flag!
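One way to catch a forgotten flag is to ask the compiler what it thinks it is allowed to use. GCC predefines macros such as __SSE__, __SSE2__ and __SSE3__ when the corresponding instruction sets are enabled, so a rough sketch of a check could look like this. The function names sse_level and report_sse_level are my own, not anything from upb.

```c
#include <stdio.h>

/* Highest SSE level the compiler was told it may use, judged from the
   macros GCC predefines. Returns 0 if none of them are defined. */
int sse_level(void) {
#if defined(__SSE3__)
  return 3;
#elif defined(__SSE2__)
  return 2;
#elif defined(__SSE__)
  return 1;
#else
  return 0;
#endif
}

/* Hypothetical helper: print the level so it shows up in a build log
   or at program startup. */
void report_sse_level(void) {
  printf("compiled with SSE level: %d\n", sse_level());
}
```

Calling report_sse_level() early in a test binary makes it obvious when a build went out without the flags you intended. The answer depends entirely on the target and the flags passed, so treat this as a diagnostic rather than a guarantee about the generated code.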

The right thing to do in my case is compile with -march=core2. When I compile with -march=core2 or -msse3, the compiler emits the not-quite-as-terse:

sub     $0x4,%esp
flds    0x8(%esp)
fisttpl (%esp)
mov     (%esp),%eax
add     $0x4,%esp
ret

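I’m really not sure why gcc prefers this version when SSE3 is available. The fisttp instruction arrived with SSE3 and always truncates, whatever rounding mode the control word specifies, so at least the control-word dance is gone. Still, it seems to be more work than the SSE2 version. I tend to believe gcc knows what it’s doing here, but I’d love to learn why.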