Shiny New Text-Processing Instructions in SSE 4.2

SSE 4.2 includes text processing instructions. In the words of Ars Technica:

Intel has added a number of new instructions to Nehalem and it has sped up others. The 4.2 version of Intel's SSE vector extensions takes the x86 ISA back to the future just a bit by adding new string manipulation instructions. I say "back to the future" because ISA-level support for string processing is a hallmark of CISC architectures that was actively deprecated in the post-RISC years; typically, when a writer wants to give an example of crufty old corners of the x86 ISA that have caused pain for chip architects, string manipulation instructions are what he or she reaches for. But the new SSE 4.2 string instructions are aimed at accelerating XML processing, which makes them Web-friendly and therefore modern (i.e., not crufty).

I chuckled a bit when I read this. I’m not much of a purist when it comes to hardware. If these instructions will make my parsers faster, then they sound great to me!

The four new instructions are:

pcmpestri: packed compare of explicit length strings, returning index
pcmpestrm: packed compare of explicit length strings, returning mask
pcmpistri: packed compare of implicit length strings, returning index
pcmpistrm: packed compare of implicit length strings, returning mask

The variants are as follows:

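Implicit-length strings are NUL-terminated: the string ends at the first zero element, or fills the whole register if there is none. Explicit-length strings carry their lengths separately, in EAX for the first operand and EDX for the second.
The instructions can return either an index into the source string (if you are searching for something) or a mask (if you want to test each character of the input).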
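To make the implicit/explicit distinction concrete, here is a minimal sketch using the compiler intrinsics for pcmpistri and pcmpestri. It assumes GCC or Clang on an x86 machine that has SSE 4.2; the helper load_padded and the function names are made up for illustration, and lengths are assumed to be in the range 0 to 16.

```c
#include <nmmintrin.h>  /* SSE 4.2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Lets GCC/Clang emit SSE 4.2 code without a -msse4.2 flag. */
#define SSE42 __attribute__((target("sse4.2")))

/* "Equal any": report the first byte of the text that matches any byte of the set. */
#define FIND_ANY (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical helper: copy at most 16 bytes into a zero-padded register. */
SSE42 static __m128i load_padded(const char *s, size_t n) {
    char buf[16] = {0};
    memcpy(buf, s, n < 16 ? n : 16);
    return _mm_loadu_si128((const __m128i *)buf);
}

/* Implicit length: both operands end at their first zero byte.
   Returns the index of the first match, or 16 if there is none. */
SSE42 int first_of_implicit(const char *set, const char *text) {
    __m128i s = load_padded(set, strlen(set));
    __m128i t = load_padded(text, strlen(text));
    return _mm_cmpistri(s, t, FIND_ANY);
}

/* Explicit length: the lengths travel separately, so a zero byte is
   ordinary data and can even be the thing you search for. */
SSE42 int first_of_explicit(const char *set, int set_len,
                            const char *text, int text_len) {
    __m128i s = load_padded(set, (size_t)set_len);
    __m128i t = load_padded(text, (size_t)text_len);
    return _mm_cmpestri(s, set_len, t, text_len, FIND_ANY);
}
```

With these definitions, first_of_implicit("<&", "abc<def") should return 3, and first_of_explicit("\0", 1, "ab\0cd", 5) should return 2, because in the explicit form the zero byte is just another character.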

All four let you scan a 128-bit SSE register (treating it as either 16 8-bit characters or 8 16-bit characters) and perform all kinds of searches/comparisons. The instructions are configurable; you supply an 8-bit immediate control byte that selects among the different variations. For example, are the input values signed or unsigned, are we comparing against ranges or specific values, etc.
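As a rough illustration of the control byte, the sketch below builds two configurations from the _SIDD_* constants in nmmintrin.h: one that compares unsigned bytes against a range and returns a bit mask, and one that negates the result and returns an index. It assumes GCC or Clang on an x86 machine with SSE 4.2; the helper and function names are invented for this example.

```c
#include <nmmintrin.h>  /* SSE 4.2 intrinsics and _SIDD_* constants */
#include <stddef.h>
#include <string.h>

#define SSE42 __attribute__((target("sse4.2")))

/* Unsigned bytes, range comparison, result as a bit mask (bit i = byte i). */
#define IN_RANGE_MASK (_SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | _SIDD_BIT_MASK)

/* Same comparison, but negated for the valid bytes only, returning the
   index of the first byte that falls outside the range. */
#define FIRST_OUTSIDE (_SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | \
                       _SIDD_MASKED_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical helper: copy at most 16 bytes into a zero-padded register. */
SSE42 static __m128i load_padded(const char *s, size_t n) {
    char buf[16] = {0};
    memcpy(buf, s, n < 16 ? n : 16);
    return _mm_loadu_si128((const __m128i *)buf);
}

/* Bit i of the result is set when text[i] is in '0'..'9'. */
SSE42 int digit_mask(const char *text) {
    __m128i ranges = load_padded("09", 2);  /* pairs of (low, high) bounds */
    __m128i t = load_padded(text, strlen(text));
    return _mm_cvtsi128_si32(_mm_cmpistrm(ranges, t, IN_RANGE_MASK));
}

/* Index of the first non-digit before the terminator, or 16 if none. */
SSE42 int first_non_digit(const char *text) {
    __m128i ranges = load_padded("09", 2);
    __m128i t = load_padded(text, strlen(text));
    return _mm_cmpistri(ranges, t, FIRST_OUTSIDE);
}
```

For the input "12ab3", digit_mask should give 0x13 (bits 0, 1 and 4) and first_non_digit should give 2. The masked form of negative polarity is used so that the bytes after the terminator are not counted as non-digits.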

The throughput of these instructions is good (a reciprocal throughput of 2 cycles), but the latency is annoyingly long (9 cycles). This means that a new one can start every two cycles, but you have to wait nine cycles after issuing an instruction before you can use its result. It’s hard to think of too many useful things you can execute in parallel while you’re waiting for that answer. As a side note, these figures come from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, which says that the latency number is a worst-case estimate:

Actual performance of these instructions by the out-of-order core execution unit can range from somewhat faster to significantly faster than the latency data shown in these tables.

I’m not enough of a hardware geek to know what to actually expect.

Still, that’s nine cycles to wait before getting a lot of really useful information. In addition to returning the index or mask, the instructions set several of the flags in useful ways.
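Here is a sketch of how the flags might be used when scanning a string longer than one register. The intrinsics _mm_cmpistrc and _mm_cmpistrz read the carry flag (something matched) and the zero flag (the second operand contains the terminator). The code assumes GCC or Clang on an x86 machine with SSE 4.2, and it assumes the caller's buffer has a capacity that is a multiple of 16 bytes so that full-register loads never run past it. The function and helper names are hypothetical.

```c
#include <nmmintrin.h>  /* SSE 4.2 intrinsics */
#include <stddef.h>
#include <string.h>

#define SSE42 __attribute__((target("sse4.2")))
#define FIND_ANY (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical helper: copy at most 16 bytes into a zero-padded register. */
SSE42 static __m128i load_padded(const char *s, size_t n) {
    char buf[16] = {0};
    memcpy(buf, s, n < 16 ? n : 16);
    return _mm_loadu_si128((const __m128i *)buf);
}

/* Find the first byte of the NUL-terminated string in buf that occurs in
   set (at most 16 characters). cap is the buffer capacity and is assumed
   to be a multiple of 16. Returns the offset, or -1 if nothing matches. */
SSE42 long find_first_of(const char *buf, size_t cap, const char *set) {
    __m128i s = load_padded(set, strlen(set));
    for (size_t i = 0; i + 16 <= cap; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        if (_mm_cmpistrc(s, chunk, FIND_ANY))      /* CF: a byte matched */
            return (long)i + _mm_cmpistri(s, chunk, FIND_ANY);
        if (_mm_cmpistrz(s, chunk, FIND_ANY))      /* ZF: terminator seen */
            return -1;
    }
    return -1;
}
```

The three intrinsic calls take identical operands, so an optimizing compiler can typically emit a single pcmpistri and branch on its flags rather than executing the comparison three times. Checking the carry flag first matters: a chunk can contain both a match and the terminator, and only bytes before the terminator count as matches.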

So what processors have SSE 4.2? Or in other words, how long will my impatient self have to wait to try them out? Penryn, the second-gen Core 2 that debuted in 2007/2008, only gets as far as SSE 4.1; SSE 4.2 makes its first appearance in the new Nehalem. Penryn uses a “45 nm process”, which I’m sure means something to hardware geeks but not to me. All I know is that it’s not the Core 2 that’s inside the MacBook Pro sitting on my lap.
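Rather than guessing from the processor's name, a program can ask the CPU directly. The sketch below uses the cpuid.h header that ships with GCC and Clang to read CPUID leaf 1, where bit 20 of ECX reports SSE 4.2 support. The function name has_sse42 is made up for this example, and the code assumes an x86 target.

```c
#include <cpuid.h>  /* __get_cpuid, provided by GCC and Clang on x86 */

/* Returns 1 if the running CPU reports SSE 4.2, otherwise 0. */
int has_sse42(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 is not available */
    return (int)((ecx >> 20) & 1u);  /* CPUID.01H:ECX bit 20 = SSE 4.2 */
}
```

A program would call has_sse42() once at startup and fall back to a plain byte-at-a-time loop when it returns 0. On GCC and Clang, __builtin_cpu_supports("sse4.2") is a shorter way to get the same answer.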