The -march flag itself is GCC-specific, but the general advice is universal: don’t forget to tell your compiler that it can take full advantage of your spiffy new CPU! I should know better, but I’ve been forgetting to specify -march when compiling upb.

Here’s an extreme example of why. Take an innocent-looking function like:

int float_to_int(float f) {
  return (int)f;
}

Looks simple enough, right? Unfortunately, float -> int casts are stupidly expensive on x86. Without any -m flags, gcc compiles this to:

sub      $0x8, %esp       ; allocate stack space
fnstcw   0x6(%esp)        ; save floating-point control word
flds     0xc(%esp)        ; push floating-point param onto fp stack
movzwl   0x6(%esp), %eax  ; move prev fp control word into %eax
mov      $0xc, %ah        ; set rounding mode of control word to truncate
mov      %ax, 0x4(%esp)   ; save it *back* to the stack
fldcw    0x4(%esp)        ; set the floating-point control word to the modified value
fistp    0x2(%esp)        ; store integer from the fp stack to the stack
fldcw    0x6(%esp)        ; set the fp control word back to what it was
movzwl   0x2(%esp), %eax  ; read the value into eax (the return value)
add      $0x8, %esp       ; give the stack space back

This would be funny if it weren’t so sad. All these gymnastics are needed because the C standard requires the cast to truncate toward zero, but the x87 floating-point unit normally runs in round-to-nearest mode. So the compiler has to switch the rounding mode for this one conversion and then switch it back.
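To make the truncation rule concrete, here is a minimal sketch in C. The helper names cast_to_int and floor_to_int are made up for illustration and are not from upb. It shows that a cast discards the fractional part, which differs from rounding down for negative values:

```c
/* Hypothetical helpers for illustration only. */

/* A C cast from float to int discards the fractional part,
 * i.e. it truncates toward zero. */
int cast_to_int(float f) {
    return (int)f;
}

/* Rounding down (toward negative infinity) is a different operation:
 * negative non-integers must step one further down after truncation. */
int floor_to_int(float f) {
    int i = (int)f;                 /* truncates toward zero */
    return (f < (float)i) ? i - 1 : i;
}
```

For example, cast_to_int(-2.7f) gives -2, while floor_to_int(-2.7f) gives -3. Both assume the value fits in an int; out-of-range conversions are undefined behavior in C.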

Compiling exactly the same code with -msse2 allows the compiler to take advantage of an SSE-only instruction, and the above is replaced with:

cvttss2si  0x4(%esp), %eax     ; convert value to integer with truncation

The difference in this case is astounding. Hopefully this will motivate you never to forget the -march flag!
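One way to notice a forgotten flag is to ask the compiler what it thinks it may use. GCC defines the macros __SSE__, __SSE2__ and __SSE3__ when the corresponding instruction sets are enabled, so a small check like the sketch below can report the level in effect. The function name simd_level is hypothetical, and other compilers may not define these macros the same way:

```c
/* Hypothetical helper: reports the highest SSE level the compiler
 * was told it may use, based on GCC's predefined macros. */
const char *simd_level(void) {
#if defined(__SSE3__)
    return "sse3";
#elif defined(__SSE2__)
    return "sse2";
#elif defined(__SSE__)
    return "sse";
#else
    return "none";   /* no SSE enabled, or not an x86 target */
#endif
}
```

Printing the result of simd_level() from a build is a cheap sanity check that the -march or -msse flags you intended actually reached the compiler.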

The right thing to do in my case is compile with -march=core2. When I compile with -march=core2 or -msse3, the compiler emits the not-quite-as-terse:

sub     $0x4,%esp
flds    0x8(%esp)
fisttpl (%esp)
mov     (%esp),%eax
add     $0x4,%esp

I’m really not sure why gcc prefers this version when SSE3 is available. It seems to be more work than the SSE2 version. I tend to believe gcc knows what it’s doing here, but I’d love to learn why.