<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Josh Haberman &#187; Josh</title>
	<atom:link href="http://blog.reverberate.org/author/admin/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.reverberate.org</link>
	<description>parsing, performance, minimalism with C99</description>
	<lastBuildDate>Mon, 30 Jan 2012 00:15:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>State of the hash functions, 2012</title>
		<link>http://blog.reverberate.org/2012/01/29/state-of-the-hash-functions-2012/</link>
		<comments>http://blog.reverberate.org/2012/01/29/state-of-the-hash-functions-2012/#comments</comments>
		<pubDate>Mon, 30 Jan 2012 00:15:14 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=536</guid>
		<description><![CDATA[The state-of-the-art in non-cryptographic hash functions has advanced rapidly in the last few years. When I did some searching this week I was happy to see that new cutting-edge hash functions had been released even since last time I looked &#8230; <a href="http://blog.reverberate.org/2012/01/29/state-of-the-hash-functions-2012/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The state-of-the-art in non-cryptographic hash functions has advanced rapidly in the last few years.  When I did some searching this week I was happy to see that new cutting-edge hash functions had been released even since last time I looked into this 6 months or a year ago.</p>
<p>Non-cryptographic hash functions take a string as input and compute an integer output.  The desirable property of a hash function is that the outputs are evenly distributed across the domain of possible outputs, especially for inputs that are similar.  Unlike a cryptographic hash function, these functions are <em>not</em> designed to withstand an effort by an attacker to find a collision.  Cryptographic hash functions have this property, but are much slower: <a href="http://www.cryptopp.com/benchmarks.html">SHA-1 is on the order of 0.09 bytes/cycle</a> whereas the newest non-cryptographic hash functions are on the order of 3 bytes/cycle.  So non-cryptographic hashes are roughly 33x faster, at the cost of not being able to withstand attacks.  Non-cryptographic hashes are most often used for hash tables.</p>
<p>As an interesting aside, <a href="http://thread.gmane.org/gmane.comp.lang.lua.general/87491">there is a debate going on in the Lua community right now</a> about what, if anything, should be done about the fact that Lua&#8217;s hash function could theoretically be attacked to force its hash table implementation into its O(n) worst-case lookup performance.  This could let an attacker DoS you if he is feeding you input that you are putting into a Lua hash table.  The Lua authors are somewhat skeptical about how realistic this attack is (and whether it would be cheaper than other DoS alternatives), but are moving ahead anyway with a plan to generate a random seed at startup that the hash function will use.  This is an interesting alternative to cryptographic hash functions that should be able to give you the same collision resistance as a cryptographic hash function (presuming you have an entropy source that can give you truly random bits), but at the cost of non-reproducible output.</p>
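<p>To make the seeding idea concrete, here is a minimal sketch of a seeded hash.  It uses 32-bit FNV-1a purely as a stand-in (it is not Lua&#8217;s hash function), and mixing the seed into the starting state by XOR is my own assumption for illustration, not a vetted defense against a determined attacker.  The function name <code>seeded_fnv1a</code> is hypothetical.</p>
<pre lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* 32-bit FNV-1a with a caller-supplied seed mixed into the starting state.
 * Seeding by XOR is an illustrative assumption, not Lua's implementation. */
uint32_t seeded_fnv1a(const void *data, size_t len, uint32_t seed) {
  const unsigned char *p = data;
  uint32_t h = 2166136261u ^ seed;  /* FNV offset basis, perturbed by seed */
  for (size_t i = 0; i &lt; len; i++) {
    h ^= p[i];        /* fold in one input byte */
    h *= 16777619u;   /* multiply by the FNV prime */
  }
  return h;
}</pre>
<p>A program would pick the seed once at startup from its entropy source and pass the same seed to every call, so outputs are stable within one run but differ between runs.</p>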
<p>Since there are lots of options out there for non-cryptographic hash functions and this number keeps expanding, I thought I&#8217;d summarize my knowledge of what is out there.</p>
<h2>Bob Jenkins&#8217; Functions</h2>
<p><a href="http://burtleburtle.net/bob/">Bob Jenkins</a> has been working on hash functions for 15 years or so.  In 1997 he published an article about hash functions in Dr. Dobbs Journal; the article is available now on the web with more content added since its original publication: <a href="http://www.burtleburtle.net/bob/hash/doobs.html">A hash function for hash Table lookup</a>.  In this article Bob catalogs existing hash functions extensively and presents his own, called &#8220;lookup2.&#8221;  Bob subsequently published <a href="http://burtleburtle.net/bob/c/lookup3.c">lookup3</a> in 2006, which for the purposes of this article I will consider the first &#8220;modern&#8221; hash function, in the sense that it is both fast (0.5 bytes/cycle, according to Bob) and free of any serious flaws.</p>
<p>More information about Bob&#8217;s functions can be found on Wikipedia: <a href="http://en.wikipedia.org/wiki/Jenkins_hash_function">Jenkins hash function</a>.</p>
<h2>Second generation: MurmurHash</h2>
<p>In 2008 Austin Appleby published a new hash function called <a href="https://sites.google.com/site/murmurhash/">MurmurHash</a>.  In its most recent version it is roughly 2x the speed of lookup3 (so roughly 1 byte/cycle), and it comes in both 32 and 64-bit versions.  The 32-bit version uses only 32-bit math and gives you a 32-bit hash, the 64-bit version uses 64-bit math and gives a 64-bit hash.  According to Austin&#8217;s analysis it has excellent properties, though Bob Jenkins says in his expanded Dr. Dobbs article &#8220;I can see [MurmurHash is] weaker than my lookup3, but I don&#8217;t by how much, I haven&#8217;t tested it.&#8221;  MurmurHash quickly became popular thanks to its excellent speed and statistical properties.</p>
<h2>Third generation: CityHash and SpookyHash</h2>
<p>In 2011 two hash functions were released that both improve on MurmurHash due largely to greater instruction-level parallelism.  Google released <a href="http://code.google.com/p/cityhash/">CityHash</a> (written by Geoff Pike and Jyrki Alakuijala) and Bob Jenkins released a new hash of his own, <a href="http://burtleburtle.net/bob/hash/spooky.html">SpookyHash</a> (so named because it was released on Halloween).  Both functions are on the order of 2x the speed of MurmurHash, but both functions use 64-bit math and have no 32-bit version, and CityHash depends on the CRC32 instruction that is present in SSE 4.2 (Intel Nehalem and later) for its speed.  SpookyHash gives you 128-bit output, whereas CityHash has 64-bit, 128-bit, and 256-bit variants.</p>
<h2>Which function is best/fastest?</h2>
<p>From what I can see, all of the hash functions I mentioned in this article are good enough from a statistical perspective.  One consideration is that only CityHash/SpookyHash give more than 64 bits of output, but for a hash table 32 bits of output is plenty.  Other applications may have use for 128 or 256 bit output.</p>
<p>If you&#8217;re on 32-bit, MurmurHash looks like the clear winner since it&#8217;s the only function faster than lookup3 that has a native 32-bit version.  32-bit machines could probably compile and run City and Spooky, but I would expect them to be much slower because the 64-bit math would have to be emulated.</p>
<p>On 64-bit machines it&#8217;s hard to say which is best without further benchmarking.  I&#8217;d be liable to prefer Spooky to City since the latter depends on the CRC32 instruction for speed which isn&#8217;t available everywhere.</p>
<p>One other consideration is aligned vs. unaligned access.  MurmurHash (unlike City or Spooky) comes in a variant that will only perform aligned reads, since on many architectures unaligned reads will crash or return the wrong data (unaligned reads are undefined behavior in C).  City and Spooky both address the issue by copying the input data into aligned storage with memcpy(); Spooky does the memcpy() a block at a time (if ALLOW_UNALIGNED_READS is not defined), City does the memcpy() an integer at a time!  On machines that can handle unaligned reads (like x86 and x86-64) the memcpy will be optimized away, but I did a test on my little ARM box and found that this:</p>
<pre lang="c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
int32_t read32_unaligned(const void *buf) {
  int32_t ret;
  memcpy(&#038;ret, buf, 4);
  return ret;
}</pre>
<p>compiles to this very inefficient code (this would be a single instruction on x86):</p>
<pre lang="asm">   0:	b500      	push	{lr}
   2:	2204      	movs	r2, #4
   4:	b083      	sub	sp, #12
   6:	4601      	mov	r1, r0
   8:	eb0d 0002 	add.w	r0, sp, r2
   c:	f7ff fffe 	bl	0 &lt;memcpy&gt;
  10:	9801      	ldr	r0, [sp, #4]
  12:	b003      	add	sp, #12
  14:	bd00      	pop	{pc}</pre>
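<p>For comparison, here is a sketch of another way to read an unaligned value without calling <code>memcpy()</code>: assemble it from individual bytes.  The name <code>read32_le_bytes</code> is hypothetical, and I have not measured what this compiles to on ARM, so treat it as an alternative to try rather than a known improvement.  Note that unlike the <code>memcpy()</code> version, which uses the machine&#8217;s native byte order, this one always interprets the bytes as little-endian.</p>
<pre lang="c">#include &lt;stdint.h&gt;

/* Reads a 32-bit little-endian value one byte at a time, so no memory
 * access is wider than a byte and alignment never matters. */
uint32_t read32_le_bytes(const void *buf) {
  const unsigned char *p = buf;
  return (uint32_t)p[0] |
         ((uint32_t)p[1] &lt;&lt; 8) |
         ((uint32_t)p[2] &lt;&lt; 16) |
         ((uint32_t)p[3] &lt;&lt; 24);
}</pre>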
<p>To conclude, MurmurHash still looks like the best option if you need 32-bit or aligned-only reads.  CityHash and SpookyHash look to be faster on x86-64, but I would almost think of them as being specific to that architecture, since I&#8217;m not aware of other architectures that are both 64-bit and allow unaligned reads.</p>
<p>Please let me know of any errors in this article.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2012/01/29/state-of-the-hash-functions-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Refcounting immutable cyclic graphs</title>
		<link>http://blog.reverberate.org/2012/01/21/refcounting-immutable-cyclic-graphs/</link>
		<comments>http://blog.reverberate.org/2012/01/21/refcounting-immutable-cyclic-graphs/#comments</comments>
		<pubDate>Sat, 21 Jan 2012 19:07:30 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=518</guid>
		<description><![CDATA[Cycles are a thorn in the side of refcounting. But yesterday I discovered (or perhaps rediscovered) a simple, efficient scheme for refcounting cyclic graphs if the defs are immutable: find strongly-connected components and make all nodes in each SCC share &#8230; <a href="http://blog.reverberate.org/2012/01/21/refcounting-immutable-cyclic-graphs/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Cycles are a thorn in the side of refcounting.  But yesterday I discovered (or perhaps rediscovered) a simple, efficient scheme for refcounting cyclic graphs if the defs are immutable: find <a href="http://en.wikipedia.org/wiki/Strongly_connected_component">strongly-connected components</a> and make all nodes in each SCC share a refcount.  Beautiful.</p>
<p>This relies on the graph theory result that if each SCC in a graph is replaced with a single node, the resulting graph forms a <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>.</p>
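<p>Here is a minimal sketch of the idea.  It assumes Tarjan&#8217;s algorithm for finding the SCCs (any SCC algorithm would do), and the <code>node</code> type and function names are hypothetical, not upb&#8217;s.  Every node in the same SCC ends up pointing at the same counter.  Allocation failures are not handled, and every edge must point at a node that is also in the array passed in.</p>
<pre lang="c">#include &lt;stdlib.h&gt;

/* Hypothetical node type for illustration; these names are not from upb. */
typedef struct node {
  struct node **out;   /* outgoing edges */
  int nout;
  int *refcount;       /* shared by every node in the same SCC */
  int index, lowlink;  /* Tarjan bookkeeping; index 0 means unvisited */
  int onstack;
} node;

typedef struct {
  node **stack;
  int top;
  int next_index;
} scc_state;

static void visit(scc_state *s, node *n) {
  n-&gt;index = n-&gt;lowlink = ++s-&gt;next_index;
  s-&gt;stack[s-&gt;top++] = n;
  n-&gt;onstack = 1;
  for (int i = 0; i &lt; n-&gt;nout; i++) {
    node *m = n-&gt;out[i];
    if (m-&gt;index == 0) {
      visit(s, m);
      if (m-&gt;lowlink &lt; n-&gt;lowlink) n-&gt;lowlink = m-&gt;lowlink;
    } else if (m-&gt;onstack &amp;&amp; m-&gt;index &lt; n-&gt;lowlink) {
      n-&gt;lowlink = m-&gt;index;
    }
  }
  if (n-&gt;lowlink == n-&gt;index) {
    /* n is the root of an SCC: pop its members and give them one counter. */
    int *rc = calloc(1, sizeof(*rc));
    node *m;
    do {
      m = s-&gt;stack[--s-&gt;top];
      m-&gt;onstack = 0;
      m-&gt;refcount = rc;
    } while (m != n);
  }
}

/* Assigns a shared refcount to every SCC among nodes[0..count-1].
 * Nodes must be zero-initialized apart from their edges. */
void assign_shared_refcounts(node **nodes, int count) {
  scc_state s = {malloc(sizeof(node *) * (size_t)count), 0, 0};
  for (int i = 0; i &lt; count; i++)
    if (nodes[i]-&gt;index == 0) visit(&amp;s, nodes[i]);
  free(s.stack);
}</pre>
<p>Taking a reference to any node then increments the counter of its whole SCC, and when that counter drops to zero the entire component can be freed at once.  Because the graph is immutable, the SCCs only need to be computed once.</p>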
<p>I&#8217;m planning to use this scheme to refcount defs in upb (garbage-collection is a non-starter because I don&#8217;t want to have to track a global list of roots or force the client to periodically call a GC function).</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2012/01/21/refcounting-immutable-cyclic-graphs/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Making Knuth&#8217;s wish come true: the x32 ABI</title>
		<link>http://blog.reverberate.org/2011/09/26/making-knuths-wish-come-true-the-x32-abi/</link>
		<comments>http://blog.reverberate.org/2011/09/26/making-knuths-wish-come-true-the-x32-abi/#comments</comments>
		<pubDate>Mon, 26 Sep 2011 23:34:07 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[upb]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=496</guid>
		<description><![CDATA[Several years ago (though I can&#8217;t say exactly how many since it&#8217;s not dated) Knuth made the following complaint: A Flame About 64-bit Pointers It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less &#8230; <a href="http://blog.reverberate.org/2011/09/26/making-knuths-wish-come-true-the-x32-abi/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Several years ago (though I can&#8217;t say exactly how many since it&#8217;s not dated) Knuth <a href="http://www-cs-faculty.stanford.edu/~uno/news08.html">made the following complaint</a>:</p>
<blockquote><p><b>A Flame About 64-bit Pointers</b></p>
<p>It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.</p>
<p>The gcc manpage advertises an option &#8220;-mlong32&#8221; that sounds like what I want. Namely, I think it would compile code for my x86-64 architecture, taking advantage of the extra registers etc., but it would also know that my program is going to live inside a 32-bit virtual address space.</p>
<p>Unfortunately, the -mlong32 option was introduced only for MIPS computers, years ago. Nobody has yet adopted such conventions for today&#8217;s most popular architecture. Probably that happens because programs compiled with this convention will need to be loaded with a special version of libc.</p>
<p>Please, somebody, make that possible.</p></blockquote>
<p>I always thought this made a lot of sense.  <a href="https://launchpad.net/bugs/185263">People have asked distro-makers for this before without a lot of success</a>, but it looks like this is now being worked on by high-profile people in the Linux community.  It is called <a href="https://sites.google.com/site/x32abi/">The x32 ABI</a> (see <a href="http://lwn.net/Articles/456731/">the LWN coverage</a> for a more digestible description).  It&#8217;s exciting because in some benchmarks this can outperform the x86-64 ABI by 10% or more.  It&#8217;s a tradeoff &#8212; if you don&#8217;t need to address more than 4GB of memory, you can get faster programs because smaller pointers have better cache utilization.  You&#8217;ll use less memory too.</p>
<p>This could have been done in a way that operated nearly the same as &#8220;compatibility mode&#8221; (i.e. running 32-bit binaries on a 64-bit CPU/OS), which would have required only minimal changes to the kernel/toolchain.  But it looks like their plans are more ambitious: they want to be able to use the optimized <code>SYSCALL64</code> instruction (which is &#8220;much faster&#8221; than <code>int 0x80</code> <a href="http://article.gmane.org/gmane.linux.kernel/1184885">according to H. Peter Anvin</a>), and they&#8217;re looking at fixing other problems like 32-bit <code>time_t</code>.  So it&#8217;s a more substantial effort, but it looks like there&#8217;s significant interest and momentum behind this.</p>
<p>Thinking about how this would affect upb, my impression is that I could use my x86-64 JIT unmodified with x32, since it appears to have all of the same calling conventions.  It has the same set of callee-save registers and the same set of registers for parameter transfer, and I think these are the main things upb&#8217;s JIT-ted code depends on.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/09/26/making-knuths-wish-come-true-the-x32-abi/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Beating the compiler</title>
		<link>http://blog.reverberate.org/2011/09/17/beating-the-compiler/</link>
		<comments>http://blog.reverberate.org/2011/09/17/beating-the-compiler/#comments</comments>
		<pubDate>Sat, 17 Sep 2011 22:08:25 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[upb]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=491</guid>
		<description><![CDATA[It&#8217;s been a while since I&#8217;ve posted about upb, but I&#8217;ve been busy improving it! I think the biggest achievement I can mention is that the core upb APIs (upb_handlers, upb_def, and upb_bytestream/upb_bytesink) are converging to the point where I&#8217;m &#8230; <a href="http://blog.reverberate.org/2011/09/17/beating-the-compiler/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a while since I&#8217;ve posted about upb, but I&#8217;ve been busy improving it!  I think the biggest achievement I can mention is that the core upb APIs (upb_handlers, upb_def, and upb_bytestream/upb_bytesink) are converging to the point where I&#8217;m comfortable with people starting to experiment with them.  I&#8217;m not promising they won&#8217;t change at all, but I&#8217;m a lot more confident in their overall structure and semantics than I have been previously.</p>
<p>Notably, I think upb&#8217;s deserialization is ready for casual and experimental use.  Definitely don&#8217;t trust any data to it until it&#8217;s better tested, though.  I won&#8217;t be releasing until things have converged a bit more (and are better tested).</p>
<p>I gave a talk about upb at Google yesterday that was well-received.  One question that comes up is &#8220;how are you beating the generated code from the protobuf compiler?&#8221;  For the record, my JIT appears to be about 25% faster than Google&#8217;s protobuf release on a completely apples-to-apples test.  It is a bit surprising, since my code has basically the same structure as protobuf&#8217;s generated C++ &#8212; I don&#8217;t invent any new optimizations or anything like that.  I think it really comes down to generating better assembly than gcc is.</p>
<p>We live in a day and age where common wisdom is that you can&#8217;t beat a good C++ compiler, or at least not by much, and I think this is probably true for 99% of use cases.  But I was first inspired to think that I could beat the C++ compiler in this case by reading <a href="http://article.gmane.org/gmane.comp.lang.lua.general/75426">this mailing list post from Mike Pall</a> where he explains why you can still beat the compiler for interpreter main loops.  A protobuf parser is surprisingly similar to a byte-code interpreter main loop, so I thought I&#8217;d give it a shot.</p>
<p>Below is just the simplest example I could dig up of a side-by-side comparison of my code vs. the compiler&#8217;s.  What follows is the code to parse a single fixed64 value:</p>
<pre lang="asm">
  ; upb JIT assembly:
  mov    rdx,QWORD PTR [rbx+0x2]    ; load fixed64 val out of buffer
  add    rbx,0xa                    ; advance buffer by 10 (2 for tag)
  mov    QWORD PTR [r12+0x40],rdx   ; store fixed64 value in message
  or     BYTE PTR [r12+0x1],0x4     ; set hasbit
  cmp    rbx,QWORD PTR [r15+0xaf8]  ; check for end-of-buffer
  jae    <end of buffer>
  mov    rcx,QWORD PTR [rbx]        ; load next tag
  cmp    cx,0x1b0                   ; next field+wt in order?
  je     <expected next field>
</pre>
<p>There&#8217;s not a lot left to cut away here.  Compare with protobuf/gcc-generated code:</p>
<pre lang="asm">
  ; protobuf/gcc code:
  mov    ecx,DWORD PTR [rbx+0x10]   ; load buffer end
  mov    rax,QWORD PTR [rbx+0x8]    ; load buffer
  sub    ecx,eax                    ; if (buffer_end_ - buffer_ <= 7)
  cmp    ecx,0x7
  jle    <error>
  mov    rax,QWORD PTR [rax]        ; load fixed64 val
  mov    rdx,QWORD PTR [rbp-0x48]   ; load this
  mov    QWORD PTR [rdx],rax        ; store fixed64 val in this
  add    QWORD PTR [rbx+0x8],0x8    ; advance buffer
  or     DWORD PTR [r12+0x74],0x800 ; set hasbit
  mov    rdx,QWORD PTR [rbx+0x10]   ; load buffer end
  mov    rax,QWORD PTR [rbx+0x8]    ; load buffer
  mov    ecx,edx
  sub    ecx,eax                    ; if (buffer_end_ - buffer_ <= 1)
  cmp    ecx,0x1
  jle    <end of file>
  cmp    BYTE PTR [rax],0xb0        ; check first byte of next tag
  jne    <lookup field>
  cmp    BYTE PTR [rax+0x1],0x1     ; check second byte of next tag
  jne    <lookup field>
</pre>
<p>There is some poor register allocation going on here &#8212; gcc is repeatedly loading <tt>buffer_</tt> and <tt>buffer_end_</tt> even though it has plenty of registers to play with.  All in all, gcc is generating over twice the number of instructions, over twice the number of loads, and more stores too.  Note that this is taken from the middle of a very large C++ function with a big switch statement and a bunch of gotos, and I think it&#8217;s just difficult for a compiler to do good register allocation under these constraints.</p>
<p>If the C++ compiler could know the difference between fast paths and slow paths and do the register allocation solely for the fast paths (spilling everything for the slow paths) it might have a better shot.  But still I think it&#8217;s just a hard problem.</p>
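<p>For readers who prefer C to assembly, the upb fast path shown earlier corresponds roughly to the sketch below.  The <code>msg</code> layout and the function name are hypothetical, chosen only so the example stands alone; it assumes a little-endian host (as on x86-64), and it adds a two-byte bounds check before peeking at the next tag so that the C is well-defined.</p>
<pre lang="c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Hypothetical message layout, chosen only for illustration. */
typedef struct {
  uint8_t hasbits[2];
  uint64_t fixed64_val;
} msg;

enum { NEXT_EXPECTED_TAG = 0x01b0 };  /* tag bytes 0xb0 0x01 */

/* Parses one fixed64 field whose 2-byte tag starts at *buf; the caller must
 * guarantee at least 10 readable bytes.  Returns 1 if the next field is the
 * expected one, 0 if the caller must fall back to the general path. */
int parse_fixed64_fast(const uint8_t **buf, const uint8_t *end, msg *m) {
  uint64_t val;
  memcpy(&amp;val, *buf + 2, sizeof(val));  /* load value after the tag */
  *buf += 10;                           /* 2 tag bytes + 8 value bytes */
  m-&gt;fixed64_val = val;                 /* store value in message */
  m-&gt;hasbits[1] |= 0x4;                 /* set hasbit */
  if (end - *buf &lt; 2) return 0;         /* end of buffer */
  unsigned next = (*buf)[0] | ((unsigned)(*buf)[1] &lt;&lt; 8);
  return next == NEXT_EXPECTED_TAG;     /* next field in order? */
}</pre>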
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/09/17/beating-the-compiler/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using dtrace on OS X to debug a performance problem</title>
		<link>http://blog.reverberate.org/2011/05/08/using-dtrace-on-os-x-to-debug-a-performance-problem/</link>
		<comments>http://blog.reverberate.org/2011/05/08/using-dtrace-on-os-x-to-debug-a-performance-problem/#comments</comments>
		<pubDate>Sun, 08 May 2011 20:28:12 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=467</guid>
		<description><![CDATA[I recently ported upb&#8217;s table-based decoder to use setjmp/longjmp-based error handling. I did this largely for code simplicity and readability, so that the non-error code-paths didn&#8217;t have to check for errors all the time. But unfortunately I noticed a dramatic &#8230; <a href="http://blog.reverberate.org/2011/05/08/using-dtrace-on-os-x-to-debug-a-performance-problem/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I recently ported upb&#8217;s table-based decoder to use <code>setjmp/longjmp</code>-based error handling.  I did this largely for code simplicity and readability, so that the non-error code-paths didn&#8217;t have to check for errors all the time.  But unfortunately I noticed a dramatic 75% performance decrease.  What was going on?</p>
<p>A profile in Shark showed the majority of my time being spent in system calls, but didn&#8217;t make it clear which system calls were involved.  It sounded like a job for dtrace.</p>
<p>I used the following script to dump all the system calls that were being issued by my process:</p>
<pre lang="d">syscall:::entry
/pid == $target/
{
     @[probefunc] = count();
} </pre>
<p>I then ran this script on both my old (fast) benchmark and the new (unexpectedly slow) one:</p>
<pre lang="bash">$ sudo dtrace -c ./old-benchmark -s trace.d

  exit                                             1
  fstat64                                          1
  ioctl                                            1
  write_nocancel                                   1
  munmap                                          93
  mmap                                            94
  getrusage                                     7728    

$ sudo dtrace -c ./new-benchmark -s trace.d

  exit                                             1
  fstat64                                          1
  ioctl                                            1
  munmap                                           1
  write_nocancel                                   1
  getrusage                                      438
  sigaltstack                                 111773
  sigreturn                                   111774
  sigprocmask                                 223546</pre>
<p>Sure enough, the new version is making a ton of system calls that weren&#8217;t happening before, and they appear to be signal related.  I investigated the manpage of <code>setjmp/longjmp</code> and found:</p>
<blockquote><p>The setjmp()/longjmp() pairs save and restore the signal mask while _setjmp()/_longjmp() pairs save and restore only the register set and the stack.  (See sigprocmask(2).)</p>
<p>     The sigsetjmp()/siglongjmp() function pairs save and restore the signal mask if the argument savemask is non-zero; otherwise, only the register set and the stack are saved.</p></blockquote>
<p>I replaced my calls to <code>setjmp/longjmp</code> with <code>sigsetjmp/siglongjmp</code>, passing 0 for the <code>savemask</code> argument so that the signal mask is not saved or restored, and the performance problems went away.  Score one for dtrace.</p>
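<p>A minimal sketch of this error-handling pattern is below.  The <code>decode</code> and <code>decode_step</code> functions are hypothetical stand-ins, not upb&#8217;s actual code; the only part that matters is the 0 passed as the second argument to <code>sigsetjmp()</code>.</p>
<pre lang="c">#define _POSIX_C_SOURCE 200809L
#include &lt;setjmp.h&gt;

static sigjmp_buf err_env;

/* Hypothetical decoder step: bails out through siglongjmp() on bad input. */
static void decode_step(int ok) {
  if (!ok) siglongjmp(err_env, 1);
}

/* Returns 0 on success, -1 if any step reported an error. */
int decode(int input_ok) {
  /* savemask == 0: per the man page, only the register set and the stack
   * are saved, not the signal mask. */
  if (sigsetjmp(err_env, 0) != 0) return -1;
  decode_step(input_ok);
  return 0;
}</pre>
<p>The non-error path never checks a return code from <code>decode_step()</code>, which is the readability benefit I was after in the first place.</p>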
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/05/08/using-dtrace-on-os-x-to-debug-a-performance-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>upb status and preliminary performance numbers</title>
		<link>http://blog.reverberate.org/2011/04/25/upb-status-and-preliminary-performance-numbers/</link>
		<comments>http://blog.reverberate.org/2011/04/25/upb-status-and-preliminary-performance-numbers/#comments</comments>
		<pubDate>Tue, 26 Apr 2011 06:18:53 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[upb]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=441</guid>
		<description><![CDATA[The last few weeks have been very exciting for upb. On April 1 I checked in a JIT compiler for parsing protobufs, which one might think was an April Fool&#8217;s joke, but it&#8217;s real and the performance numbers so far &#8230; <a href="http://blog.reverberate.org/2011/04/25/upb-status-and-preliminary-performance-numbers/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The last few weeks have been very exciting for upb.  On April 1 <a href="https://github.com/haberman/upb/commit/9eb4d695c49a85f7f72ad68c3c31affd61fef984">I checked in a JIT compiler for parsing protobufs</a>, which one might think was an April Fool&#8217;s joke, but it&#8217;s real and the performance numbers so far have exceeded my expectations.</p>
<h3>Why JIT?</h3>
<p>Before I get to the numbers, I should explain what it even means for upb to have a JIT.  If you&#8217;re not interested in the technical details, feel free to skip this section to go straight to the impacts and what this means for upb.</p>
<p>So why a JIT?  After all, protobufs are not an imperative language.  You can&#8217;t write an algorithm in a <code>.proto</code> file.  You can&#8217;t apply compiler techniques like SSA, strength reduction, tracing, or really any of the things you&#8217;d expect from a JIT on a platform like the JVM.  There is no byte code or intermediate representation to speak of.  So why would upb have a JIT and what on earth would it do?</p>
<p>To review, a <code>.proto</code> file defines a schema for some messages.  In this sense, it is comparable to <a href="http://json-schema.org/">JSON Schema</a> or <a href="http://en.wikipedia.org/wiki/XML_Schema_(W3C)">XML Schema</a>.</p>
<pre>message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}</pre>
<p>To support forwards and backwards compatibility, we don&#8217;t make any assumptions about which fields we&#8217;re going to see or in what order.  Instead, on the wire each field is preceded by its tag number.  Logically, the serialized version is something like:</p>
<pre>- field number: 1, wire type: varint, value: 1
- field number: 2, wire type: delimited, value: "Josh Haberman"
- field number: 3, wire type: delimited, value: "jhaberman@gmail.com"</pre>
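<p>On the wire, the field number and wire type are packed together into a single varint tag: the low three bits are the wire type and the remaining bits are the field number.  Here is a minimal sketch of decoding one; the function names are hypothetical.</p>
<pre lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Decodes a base-128 varint from buf.  Returns the number of bytes consumed,
 * or 0 if the input ends early or the varint is longer than 10 bytes. */
size_t decode_varint(const uint8_t *buf, size_t len, uint64_t *out) {
  uint64_t val = 0;
  for (size_t i = 0; i &lt; len &amp;&amp; i &lt; 10; i++) {
    val |= (uint64_t)(buf[i] &amp; 0x7f) &lt;&lt; (7 * i);  /* low 7 bits are data */
    if ((buf[i] &amp; 0x80) == 0) {                   /* high bit clear: last byte */
      *out = val;
      return i + 1;
    }
  }
  return 0;
}

/* A tag is (field_number &lt;&lt; 3) | wire_type. */
uint32_t tag_field_number(uint64_t tag) { return (uint32_t)(tag &gt;&gt; 3); }
uint32_t tag_wire_type(uint64_t tag) { return (uint32_t)(tag &amp; 7); }</pre>
<p>For the <code>Person</code> message above, the <code>id</code> field&#8217;s tag is the single byte <code>0x08</code> (field number 1, wire type 0 for varint) and the <code>name</code> field&#8217;s tag is <code>0x12</code> (field number 2, wire type 2 for delimited).</p>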
<p>If you were going to write a parser for this, you might create a table of fields, keyed by field number, and a main parser loop that looks something like this:</p>
<pre lang="c">
while (!done) {
  int type = fields[get_field_number()];
  switch (type) {
    case WIRE_TYPE_VARINT:
      parse_varint();
      break;
    case WIRE_TYPE_DELIMITED:
      parse_delimited();
      break;
    // etc.
  }
}</pre>
<p>We call this a &#8220;table-based parser.&#8221;  We look up the type in a table of fields, and use that to branch to the correct value parsing function.  There are minor variations on this, like using a function pointer instead of a switch statement, but it&#8217;s the same general idea.  To see an actual program that implements this style of parser, check out my bare-bones 100-line protobuf parser that I wrote a few years ago: <a href="http://blog.reverberate.org/wp-content/uploads/2008/07/pb.c">pb.c</a>.</p>
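<p>As a sketch of the function-pointer variation, the loop below looks up each field&#8217;s parse function in a table keyed by field number and calls it.  Everything here is hypothetical and heavily simplified: it handles only single-byte tags, the &#8220;parsers&#8221; merely skip over values, and it does not check that the wire type matches the field.</p>
<pre lang="c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

typedef struct {
  const uint8_t *ptr, *end;
} reader;

/* Each value parser consumes one value and returns 0 on success. */
typedef int parse_fn(reader *r);

static int skip_varint(reader *r) {
  while (r-&gt;ptr &lt; r-&gt;end) {
    if ((*r-&gt;ptr++ &amp; 0x80) == 0) return 0;  /* high bit clear: last byte */
  }
  return -1;
}

static int skip_fixed64(reader *r) {
  if (r-&gt;end - r-&gt;ptr &lt; 8) return -1;
  r-&gt;ptr += 8;
  return 0;
}

/* Hypothetical field table, indexed by field number (0 is unused). */
static parse_fn *const fields[] = {NULL, skip_varint, skip_fixed64};

/* Returns the number of fields parsed, or -1 on error. */
int parse_message(const uint8_t *buf, size_t len) {
  reader r = {buf, buf + len};
  int count = 0;
  while (r.ptr &lt; r.end) {
    unsigned field = *r.ptr++ &gt;&gt; 3;  /* single-byte tag: drop wire type */
    if (field &gt;= sizeof(fields) / sizeof(fields[0]) || !fields[field])
      return -1;                     /* unknown field */
    if (fields[field](&amp;r) != 0) return -1;
    count++;
  }
  return count;
}</pre>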
<p>In a way this resembles a byte-code interpreter, and the serialized protobuf resembles byte-code.  For example, see <a href="http://www.lua.org/source/5.1/lvm.c.html#luaV_execute">the main loop of the Lua interpreter</a>.  It has a similar pattern where it extracts an opcode and uses that to branch into a giant switch statement.</p>
<p>It turns out that if we generate code that is specific to one message type, and takes advantage of the fact that fields are usually encoded in order, we can beat this table-based parser significantly.  Google&#8217;s main &#8220;protobuf&#8221; software has done this for a long time: their strategy is to generate C++ classes that are specific to a proto message type (like <code>Person</code> in the above example).  This code has been highly tuned and optimized, and the generated parsing code looks something like this:</p>
<pre lang="cpp">while((tag = input->ReadTag()) != 0) {
  field_number = GetFieldNumber(tag);
  wire_type = GetWireType(tag);
  switch(field_number) {
    // optional float field1 = 1;
    case 1:
      if (wire_type != FLOAT_WIRE_TYPE) goto error;
      input->ParseFloat();
      // Saves the switch() and type check branch if fields are in order.
      if (input->ExpectTag(2, DOUBLE_WIRE_TYPE)) goto parse_field2;
      break;

    // optional double field2 = 2;
    case 2:
      if (wire_type != DOUBLE_WIRE_TYPE) goto error;
parse_field2:
      input->ParseDouble();
      // Saves the switch() and type check branch if fields are in order.
      if (input->ExpectTag(3, INT32_WIRE_TYPE)) goto parse_field3;
      break;

    // optional int32 field3 = 3;
    case 3:
    // [...]
  }
}
</pre>
<p>This code also has a big <code>switch()</code> statement at the top level, but the targets are field numbers instead of wire types.  More importantly, each field&#8217;s case has a line like this:</p>
<pre lang="cpp">if (input->ExpectTag(3, INT32_WIRE_TYPE)) goto parse_field3;</pre>
<p>This is key, because if the fields occur in order (which they usually do) then this branch will always be taken and it will be predicted correctly by the CPU.  This can make a huge difference.</p>
<p>While this kind of C++ code generation has benefited Google&#8217;s protobuf software in speed, I always found it inconvenient from a dynamic languages perspective.  What dynamic language user wants to have to generate C++ code and link that into their interpreter?  We&#8217;re not talking about a single C++ extension that you compile once: every single message you define in a .proto file generates <i>different</i> C++ code.  So you can&#8217;t just <code>apt-get install libprotobuf-python</code> and be done with it, you have to generate C++ for every message that your specific app wants to use.</p>
<p>This is where the JIT comes in.  If we can generate machine code at runtime, we can get the speed benefits of the generated code without having to generate, compile, and link C++ into our interpreters or VMs.  You could compile the library once, and after that you can dynamically load message definitions but still get the fastest possible parsing.</p>
<h3>Preliminary results</h3>
<p>I knew all of the theory I just described before I started writing a JIT.  But theory is just that &#8212; theory.  What would actually happen when I implemented a JIT?</p>
<p>The results exceeded my expectations.  I still am being somewhat cautious, because there are so many dimensions to any performance equation, and because benchmarks are not guaranteed to correspond to real-world performance.  That said, here are my preliminary results.  You can reproduce these by <a href="https://github.com/haberman/upb">obtaining upb from GitHub</a> and running <code>make benchmarks</code> followed by <code>make benchmark</code> (you have to define <code>-DUPB_USE_JIT_X64</code> to get the JIT).</p>
<pre>Parsing an 80k protobuf into a data structure repeatedly,
calling Clear() between each parse.  (proto2 == Google protobuf)

proto2 table-based parser      38 MB/s
proto2 generated code parser   265 MB/s
upb table-based parser         340 MB/s
upb JIT parser                 741 MB/s</pre>
<p>The results above are designed to be as apples-to-apples as possible.  In other words, I disabled upb&#8217;s optimization that avoids <code>memcpy()</code> because Google&#8217;s protobuf doesn&#8217;t support it in its open-source release.  I think the reason that even my table-based parser beats Google&#8217;s generated code here is that proto2 implements Clear() in a sub-optimal way that requires an extra pass over the message tree; see <a href="https://github.com/haberman/upb/blob/7cf5893dcc755a1bc706536088db3d34cfc8c46b/src/upb_msg.h#L232">my comment in upb_msg.h</a> for more info.</p>
<p>Things get even better if we allow ourselves to drop the constraint of being apples-to-apples with proto2.  upb is capable of <i>stream-based parsing</i> like SAX, whereas proto2 only supports a DOM-based approach where you parse into a data structure.  If we include the performance numbers for stream parsing, and for DOM-based parsing that avoids memcpy, we get:</p>
<pre>upb table-based stream parser    420 MB/s
upb JIT no-memcpy() DOM parser   870 MB/s
upb JIT stream parser           1430 MB/s</pre>
<p>If you&#8217;re like me when I first saw these numbers, your jaw is on the floor at seeing almost 1.5 GB/s for stream parsing of protocol buffers.  At this point we are more than 5x the speed of proto2&#8217;s highly optimized generated code (1430 MB/s versus 265 MB/s).</p>
<p>upb&#8217;s JIT isn&#8217;t complete and can&#8217;t handle all cases, but these performance numbers should still be valid because these benchmarks only used the parts of the format that <i>are</i> currently supported.</p>
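<p>For readers unfamiliar with the stream-versus-DOM distinction drawn above, here is a minimal C sketch of what a stream (SAX-style) parse looks like.  This is not upb&#8217;s API: the callback type <code>field_cb</code>, the functions <code>stream_parse()</code> and <code>sum_cb()</code>, and the <code>sum_state</code> struct are all hypothetical, and the sketch only understands varint fields.  The point is that the parser hands each value to a callback as soon as it is decoded, so the consumer decides whether to build a tree at all.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: invoked once per decoded varint field. */
typedef void field_cb(void *closure, uint32_t fieldnum, uint64_t value);

/* Decodes one varint; returns the new position, or NULL on malformed input. */
static const uint8_t *read_varint(const uint8_t *p, const uint8_t *end,
                                  uint64_t *out) {
  uint64_t v = 0;
  int shift;
  for (shift = 0; p < end && shift < 64; shift += 7) {
    uint8_t byte = *p++;
    v |= (uint64_t)(byte & 0x7f) << shift;
    if (!(byte & 0x80)) { *out = v; return p; }
  }
  return NULL;
}

/* Stream parse: no message object is allocated or filled in here.
 * Returns 1 on success, 0 on malformed or unsupported input. */
int stream_parse(const uint8_t *p, const uint8_t *end,
                 field_cb *cb, void *closure) {
  assert(cb != NULL);
  while (p < end) {
    uint64_t tag, val;
    if (!(p = read_varint(p, end, &tag))) return 0;
    if ((tag & 7) != 0) return 0;  /* only wire type 0 (varint) is handled */
    if (!(p = read_varint(p, end, &val))) return 0;
    cb(closure, (uint32_t)(tag >> 3), val);
  }
  return 1;
}

/* Example consumer that keeps a running total instead of a data structure. */
typedef struct { uint64_t sum; size_t count; } sum_state;

void sum_cb(void *closure, uint32_t fieldnum, uint64_t value) {
  sum_state *s = closure;
  (void)fieldnum;
  s->sum += value;
  s->count++;
}
```

<p>A DOM-style parser can be layered on top of this by supplying a callback that stores each value into a message struct, which is one way to see why the stream interface has less work to do per field.</p>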
<h3>So when can I USE it?  And can I help?</h3>
<p>One reason I&#8217;ve avoided posting these extremely positive results so far is that I hate to get people excited about something that&#8217;s still not ready for users yet.</p>
<p>Adding support for the JIT required making extremely large and intrusive changes to the core interfaces, like <a href="https://github.com/haberman/upb/commit/8ef6873e0e14309a1715a252a650bab0ae1a33ef">this 3000 line refactoring</a> that had to be completed before I could write even the first line of the JIT.  I <i>need</i> to have the flexibility to change core interfaces like this still, because the design is still converging.  So as much as I wish I could say it&#8217;s ready to go, I still need to hold off on a real release.</p>
<p>The good news is that the design is still making huge strides.  In the last few weeks I&#8217;ve been refining my scheme for how upb will integrate into different VMs and language runtimes, and I feel more confident than ever that the language bindings for Lua, Python, etc. will be some of the fastest extensions ever offered for this kind of functionality.</p>
<p>As a preview of where this is going, I think that upb will even be usable as a JSON library, offering speed that is as good as or better than that of any existing JSON library.  JSON can be directly mapped onto a subset of protobufs (JSON only uses double-precision numbers) and the protobuf text format is already extremely similar to JSON.  So all the work I&#8217;ve done to optimize memory layout and dynamic language object access should apply.</p>
<p>And while I&#8217;m really happy to get offers to help out, it&#8217;s still at a stage of design where I need to be doing most of the work.  Working on upb generally involves pacing around my apartment deep in thought about all the requirements and use cases I want to satisfy and brainstorming a million different approaches until I converge on the one that is the smallest, fastest, and most flexible.  It&#8217;s hard work but I love it, and the more time that passes the more convinced I am that this is going to be big.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/04/25/upb-status-and-preliminary-performance-numbers/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>ARM architecture is mired in oppressive legalese</title>
		<link>http://blog.reverberate.org/2011/04/18/arm-architecture-is-mired-in-oppressive-legalese/</link>
		<comments>http://blog.reverberate.org/2011/04/18/arm-architecture-is-mired-in-oppressive-legalese/#comments</comments>
		<pubDate>Tue, 19 Apr 2011 05:36:47 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=431</guid>
		<description><![CDATA[My Efika MX Smarttop came and I&#8217;ve had some fun compiling upb for it and trying it out. I don&#8217;t have a pressing ARM use case yet, but I want to work ahead a bit and get familiar with the &#8230; <a href="http://blog.reverberate.org/2011/04/18/arm-architecture-is-mired-in-oppressive-legalese/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.reverberate.org/2011/03/31/addicted-to-hardware-new-toy-on-the-way/">My Efika MX Smarttop came</a> and I&#8217;ve had some fun compiling upb for it and trying it out.  I don&#8217;t have a pressing ARM use case yet, but I want to work ahead a bit and get familiar with the architecture so that when I <i>do</i> want to program for it I&#8217;m not starting from scratch.</p>
<p>I went to the ARM website to get my hands on some documentation.  I was prepared to buy physical books if that is their authoritative documentation, but it appears that they have PDFs for the architecture&#8217;s reference manuals.</p>
<p>So far so good &#8212; this is on par with Intel, which has <a href="http://www.intel.com/products/processor/manuals/">a nice and easily accessible website where you can download any of their manuals in like 5 seconds</a>.  I&#8217;ve downloaded them and referenced them countless times.</p>
<p>But I get no such love from ARM.  Strike 1: the ARM website won&#8217;t let you download the manuals unless you have registered first.  Not cool ARM, not cool.  I grudgingly give them my name, email address, company name, country, and state.  And the website warns:</p>
<blockquote><p><b>Note</b>: We recommend using your business email address to ensure you can access all your relevant services</p></blockquote>
<p>This is a vague but ominous warning that if you use a personal email address the registration might not work correctly.</p>
<p>But fine, I go through with the registration.  Now I can download the manual, right?  Turns out no: first I have to accept a EULA!  It begins:</p>
<blockquote><p>USER AGREEMENT FOR THE ARM ARCHITECTURE REFERENCE MANUAL</p>
<p>THIS AGREEMENT (&#8220;AGREEMENT&#8221;) IS A LEGAL AGREEMENT BETWEEN YOU (EITHER A SINGLE INDIVIDUAL, OR SINGLE LEGAL ENTITY) AND ARM LIMITED (&#8220;ARM&#8221;) FOR THE USE OF THE ARM ARCHITECTURE REFERENCE MANUAL. ARM IS ONLY WILLING TO PROVIDE ACCESS TO THE ARM ARCHITECTURE REFERENCE MANUAL TO YOU ON CONDITION THAT YOU ACCEPT ALL OF THE TERMS IN THIS AGREEMENT. BY CLICKING &#8220;I AGREE&#8221; OR BY DOWNLOADING OR OTHERWISE COPYING THE DELIVERABLES YOU INDICATE THAT YOU AGREE TO BE BOUND BY ALL THE TERMS OF THIS LICENCE.</p></blockquote>
<p>This is all just to read some documentation.  Not impressed.  ARM, we&#8217;re off to a rocky start.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/04/18/arm-architecture-is-mired-in-oppressive-legalese/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EINTR and PC loser-ing (The &#8220;Worse Is Better&#8221; case study)</title>
		<link>http://blog.reverberate.org/2011/04/18/eintr-and-pc-loser-ing-the-worse-is-better-case-study/</link>
		<comments>http://blog.reverberate.org/2011/04/18/eintr-and-pc-loser-ing-the-worse-is-better-case-study/#comments</comments>
		<pubDate>Mon, 18 Apr 2011 08:57:36 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=399</guid>
		<description><![CDATA[Richard Gabriel&#8217;s 1989 essay Worse Is Better is a famous comparison between LISP and Unix/C that pops up from time to time and is guaranteed to spark a spirited discussion. The philosophical argument itself is not something I want to &#8230; <a href="http://blog.reverberate.org/2011/04/18/eintr-and-pc-loser-ing-the-worse-is-better-case-study/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Richard Gabriel&#8217;s 1989 essay <a href="http://www.jwz.org/doc/worse-is-better.html">Worse Is Better</a> is a famous comparison between LISP and Unix/C that pops up from time to time and is guaranteed to spark a spirited discussion.  The philosophical argument itself is not something I want to get into right now; I am interested in the technical content of the essay.  What always bothered me about this paper is that I never fully understood Gabriel&#8217;s primary example of a dirty hack vs. &#8220;the right thing.&#8221;</p>
<p>His example is &#8220;the PC loser-ing problem,&#8221; which he describes thus:</p>
<blockquote><p>Two famous people, one from MIT and another from Berkeley (but working on Unix) once met to discuss operating system issues. The person from MIT was knowledgeable about ITS (the MIT AI Lab operating system) and had been reading the Unix sources. He was interested in how Unix solved the PC loser-ing problem. The PC loser-ing problem occurs when a user program invokes a system routine to perform a lengthy operation that might have significant state, such as IO buffers. If an interrupt occurs during the operation, the state of the user program must be saved. Because the invocation of the system routine is usually a single instruction, the PC of the user program does not adequately capture the state of the process. The system routine must either back out or press forward. The right thing is to back out and restore the user program PC to the instruction that invoked the system routine so that resumption of the user program after the interrupt, for example, re-enters the system routine. It is called &#8220;PC loser-ing&#8221; because the PC is being coerced into &#8220;loser mode,&#8221; where &#8220;loser&#8221; is the affectionate name for &#8220;user&#8221; at MIT.</p>
<p>The MIT guy did not see any code that handled this case and asked the New Jersey guy how the problem was handled. The New Jersey guy said that the Unix folks were aware of the problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again. The MIT guy did not like this solution because it was not the right thing.</p>
<p>The New Jersey guy said that the Unix solution was right because the design philosophy of Unix was simplicity and that the right thing was too complex.</p></blockquote>
<p>When I read this I always had a burning desire to know: how did the story end?  How do modern operating systems resolve this problem &#8212; the &#8220;dirty hack&#8221; way or the &#8220;right way?&#8221;  Which parts of our modern POSIX interfaces are affected by this question?</p>
<p>There are several things that never made sense to me about this example.  First of all, why would you need to abort a system call just because an interrupt occurred?  I investigated the Linux source and it seems quite clear that interrupt handlers can return to either the kernel or userspace &#8212; whichever was running when the interrupt fired.  So I don&#8217;t see why you&#8217;d need to &#8220;coerce&#8221; the system into &#8220;loser mode&#8221; at all.</p>
<p>But let&#8217;s suppose you accept this as a given &#8212; we will assume that when a hardware interrupt occurs, you must exit to user mode.  I still don&#8217;t see the difficulty in automatically re-invoking the system call.  It&#8217;s true that invoking the system routine is a single instruction, but why is it that &#8220;the PC of the user program does not adequately capture the state of the process,&#8221; as Gabriel&#8217;s essay states?  What other process state do we need to capture?  The registers must already be saved when the syscall is entered, because they must be restored even with a completely normal syscall return.  So if we want to re-invoke the system routine, it should be as easy as simply re-executing the instruction that made the system call.  Right?</p>
<p>The whole example confused me quite a lot until I had the idea to replace &#8220;interrupt&#8221; in the above description with &#8220;signal.&#8221;  This is not such a stretch, since signals are essentially user-space software interrupts.  With this small change, everything started to make a lot more sense.  If a <i>signal</i> was delivered to a process that was currently inside a system call, that signal handler could invoke a system call itself, which would cause us to re-enter the kernel.  I could easily see how the complexity of dealing with this could have led early UNIX implementors to simply abort the original system call before delivering the signal.</p>
<p>But this is only speculation about what UNIX was like in the mid to late 80s when &#8220;Worse is Better&#8221; was written.  I could be completely off the mark in this analysis &#8212; maybe returning to the kernel from a hardware interrupt handler really wasn&#8217;t implemented at that time.  Or maybe saving user state really was difficult for some reason.  I&#8217;d love to hear from anyone who has more historical context about this.  But the essay contains an important clue that seems to reinforce my speculation that it&#8217;s actually about signals.</p>
<h3>UNIX and EINTR</h3>
<p>If we look closely at the &#8220;Worse is Better&#8221; essay, we get a strong clue about what the Unix guy in the story might have been talking about:</p>
<blockquote><p>The New Jersey guy said that the Unix folks were aware of the problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again.</p></blockquote>
<p>As someone who has done a lot of Unix system-level programming, this sounds to me like it <i>must</i> be describing EINTR, the error code in Unix that means &#8220;Interrupted system call.&#8221;  To give a quick description of EINTR I&#8217;ll enlist the help of my trusty copy of &#8220;Advanced Programming in the Unix Environment&#8221; by W. Richard Stevens:</p>
<blockquote><p>A characteristic of earlier UNIX systems is that if a process caught a signal while the process was blocked in a &#8220;slow&#8221; system call, the system call was interrupted.  The system call returned an error and <code>errno</code> was set to <code>EINTR</code>.  This was done under the assumption that since a signal occurred and the process caught it, there is a good chance that something has happened that should wake up the blocked system call.</p>
<p>[...]</p>
<p>The problem with interrupted system calls is that we now have to handle the error return explicitly.  The typical code sequence (assuming a read operation and assuming that we want to restart the read even if it&#8217;s interrupted) would be</p>
<pre lang="c">
again:
  if ((n = read(fd, buf, BUFFSIZE)) < 0) {
    if (errno == EINTR)
      goto again;  /* just an interrupted system call */
    /* handle other errors */
  }
</pre>
</blockquote>
<p>This sounds an awful lot like the New Jersey guy's approach from the story, which required a correct program "to check the error code to determine whether to simply try the system routine again."  And there's nothing else in Unix that I've ever heard of that's anything like this.  This must be what the New Jersey guy from the story was talking about!</p>
<p>But note that in W. Richard Stevens' explanation this isn't some dirty hack!  It's not a case of cutting corners that is justified by favoring implementation simplicity over interface simplicity.  Stevens describes it as a deliberate design decision that gives programs the ability to abort a long-running system call when they catch a signal in the meantime.  Now you could easily see this as a rationalization of a dirty hack ("it's not a bug, it's a feature!"), but it certainly seems plausible that if you catch a signal while you're blocked on a long system call, the signal might make you decide that you don't want to wait for the long system call any more.  Indeed, Ulrich Drepper claimed in 2000 that <a href="http://lkml.indiana.edu/hypermail/linux/kernel/0011.0/0494.html">"Returning EINTR is necessary for many applications,"</a> though it would have been helpful if he had expanded on this point by giving some examples.</p>
<p>Of course, the price we have paid for this capability is that we have to wrap all of our potentially long system calls in a loop like the example above.  If we don't, our system calls can start failing and causing program errors whenever we catch a signal.  You may think that you don't use any signals yourself, but are you sure that none of your libraries do?  On the flip side, if you're implementing a library you can never know if the main application will use signals or not, so any library that wants to be robust will have to wrap these system calls in a retry loop.</p>
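<p>One common way to pay that price once instead of at every call site is a small wrapper function.  The sketch below is a minimal example under that assumption; <code>read_retry()</code> is a hypothetical helper name, not a POSIX or libc function, and it restarts unconditionally, so it is only appropriate when the program never wants a signal to cancel the read.</p>

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: like read(), but transparently restarts the call
 * whenever it fails with EINTR.  Any other error is returned to the caller
 * with errno left as read() set it. */
ssize_t read_retry(int fd, void *buf, size_t count) {
  ssize_t n;
  assert(buf != NULL);
  do {
    n = read(fd, buf, count);
  } while (n < 0 && errno == EINTR);
  return n;
}
```

<p>A library can call <code>read_retry(fd, buf, sizeof(buf))</code> wherever it would have called <code>read()</code> and stay correct regardless of how the main application configures its signal handlers.  glibc offers a macro in the same spirit, <code>TEMP_FAILURE_RETRY</code>, for code that does not need to be portable beyond it.</p>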
<p>Since the vast majority of programs will always want their system calls to continue even when a signal is received, 4.2BSD (released in 1983) implemented support for automatically retrying most system calls that could previously fail with EINTR.  To me this sounds exactly like what the MIT guy in Richard Gabriel's story was saying is "the right thing."  <b>In other words, Berkeley UNIX was already doing "the right thing" more than five years before "Worse is Better" was written!</b></p>
<p>Modern POSIX APIs allow both behaviors (either restarting the system call automatically or returning <code>EINTR</code>) -- this is controlled per signal handler by the <code>SA_RESTART</code> flag passed to <code>sigaction()</code>.  The following program illustrates both behaviors:</p>
<pre lang="c">

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void doread() {
  char buf[128];
  printf("doing read() into buf %p\n", buf);
  ssize_t ret = read(STDIN_FILENO, buf, sizeof(buf));
  if (ret < 0) {
    printf("read() for buf %p returned error: %s\n", buf, strerror(errno));
  } else {
    printf("read() for buf %p returned data: %.*s", buf, (int)ret, buf);
  }
}

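void sighandler(int signo) {
  // NOTE: printf() is not async-signal-safe; calling it (directly or via
  // doread()) from a handler is tolerable only in a small demo like this.
  printf("received signal %d\n", signo);
  doread();
}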

int main(int argc, char *argv[]) {
  // Register SIGHUP handler.  Pass any argument to get SA_RESTART.
  struct sigaction action;
  action.sa_handler = &#038;sighandler;
  sigemptyset(&#038;action.sa_mask);
  action.sa_flags = (argc > 1) ? SA_RESTART : 0;
  sigaction(SIGHUP, &#038;action, NULL);

  doread();
  return 0;
}
</pre>
<p>Here are the results of running the program three different times. I've bolded the parts where I typed to give the program input on stdin.  You can also see where I sent the program a <code>SIGHUP</code>.</p>
<pre>
$ ./test
doing read() into buf 0x7ec7959c
<b>INPUT FROM TERMINAL</b>
read() for buf 0x7ec7959c returned data: INPUT FROM TERMINAL
$ ./test
doing read() into buf 0x7ef6659c
received signal 1
doing read() into buf 0x7ef66204
<b>INPUT FROM TERMINAL</b>
read() for buf 0x7ef66204 returned data: INPUT FROM TERMINAL
read() for buf 0x7ef6659c returned error: Interrupted system call
$ ./test give_me_sa_restart
doing read() into buf 0x7eb7657c
received signal 1
doing read() into buf 0x7eb761e4
<b>INPUT FROM TERMINAL</b>
read() for buf 0x7eb761e4 returned data: INPUT FROM TERMINAL
<b>INPUT FROM TERMINAL AGAIN</b>
read() for buf 0x7eb7657c returned data: INPUT FROM TERMINAL AGAIN
</pre>
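<p>The same two behaviors can also be reproduced without a terminal or a manually sent signal.  The following is a minimal sketch under some assumptions of my own: it uses a pipe instead of stdin, a one-shot <code>setitimer()</code> timer to deliver <code>SIGALRM</code> while <code>read()</code> is blocked, and a handler that writes one byte into the pipe.  The names <code>interrupted_read()</code>, <code>on_alarm()</code> and <code>wake_fd</code> are hypothetical.  It assumes the 100 ms delay is long enough for the process to reach the blocking <code>read()</code> before the timer fires.</p>

```c
#define _XOPEN_SOURCE 700  /* only matters under strict -std=c99 and similar */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Write end of the pipe; set before the handler is installed. */
static volatile sig_atomic_t wake_fd = -1;

static void on_alarm(int signo) {
  (void)signo;
  /* write() is async-signal-safe, unlike printf(). */
  if (write(wake_fd, "x", 1) < 0) {
    /* Nothing useful can be done about an error inside a handler. */
  }
}

/* Blocks in read() on an empty pipe while a timer delivers SIGALRM.
 * Returns read()'s return value and stores the resulting errno in *err_out. */
ssize_t interrupted_read(int use_sa_restart, int *err_out) {
  int fds[2];
  char byte;
  struct sigaction action;
  struct itimerval timer;
  ssize_t n;

  assert(err_out != NULL);
  if (pipe(fds) != 0) { *err_out = errno; return -1; }
  wake_fd = fds[1];

  memset(&action, 0, sizeof(action));
  action.sa_handler = on_alarm;
  sigemptyset(&action.sa_mask);
  action.sa_flags = use_sa_restart ? SA_RESTART : 0;
  sigaction(SIGALRM, &action, NULL);

  memset(&timer, 0, sizeof(timer));
  timer.it_value.tv_usec = 100000;  /* fire once, after 100 ms */
  setitimer(ITIMER_REAL, &timer, NULL);

  errno = 0;
  n = read(fds[0], &byte, 1);  /* the pipe is empty, so this blocks */
  *err_out = errno;

  close(fds[0]);
  close(fds[1]);
  return n;
}
```

<p>Without <code>SA_RESTART</code> the blocked <code>read()</code> should fail with <code>EINTR</code> even though the handler has just made a byte available.  With <code>SA_RESTART</code> the kernel restarts the call after the handler returns, and the restarted <code>read()</code> picks up that byte.</p>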
<h3>Conclusion</h3>
<p>You might ask "why all the fuss over a little example?"  As I mentioned, my primary motivation in researching all of this was to get to the bottom of this issue and understand how it plays out in modern operating systems.</p>
<p>But if we were going to take all of this information and reflect on the "Worse is Better" argument, my personal observations/conclusions would be:</p>
<ul>
<li>The "worse" system (Unix) did indeed do "the right thing" eventually, even if it didn't at first.  "Worse is better" systems incrementally improve by responding to user needs.  Since users got tired of checking for EINTR, the "worse" system added the functionality for addressing this pain point.</li>
<li>The whole thing did leave a rather large wart, though -- all Unix programs have to wrap these system calls in an EINTR retry loop unless they can be absolutely sure the process will never catch signals that don't have SA_RESTART set.  So there is a price to pay for this incremental evolution.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/04/18/eintr-and-pc-loser-ing-the-worse-is-better-case-study/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Addicted to hardware: new toy on the way</title>
		<link>http://blog.reverberate.org/2011/03/31/addicted-to-hardware-new-toy-on-the-way/</link>
		<comments>http://blog.reverberate.org/2011/03/31/addicted-to-hardware-new-toy-on-the-way/#comments</comments>
		<pubDate>Thu, 31 Mar 2011 08:50:24 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=395</guid>
		<description><![CDATA[I&#8217;m addicted to hardware. I can&#8217;t stop thinking about all of the CPUs that currently exist, how they compare to each other, and how to write the fastest possible code on them. (Actually I want to learn FPGA programming too, &#8230; <a href="http://blog.reverberate.org/2011/03/31/addicted-to-hardware-new-toy-on-the-way/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m addicted to hardware.  I can&#8217;t stop thinking about all of the CPUs that currently exist, how they compare to each other, and how to write the fastest possible code on them.  (Actually I want to learn FPGA programming too, in case they ever start bundling FPGAs with computers).</p>
<p>A year and a half ago I bought a <a href="http://en.wikipedia.org/wiki/SheevaPlug">SheevaPlug</a> which is a little ARM computer that&#8217;s hardly bigger than a wall wart AC adapter.  It has USB and Ethernet and that&#8217;s about it (no display).  Unfortunately when I tried to start playing with it again tonight I discovered it was bricked thanks to a faulty power supply, which is apparently <a href="http://www.google.com/search?&#038;q=sheevaplug+power+supply">a very common problem with SheevaPlugs</a>.</p>
<p>I was a bit sad about this, and I started looking for alternatives.  I wanted something that runs on a non-x86 architecture and that I could stick in a closet and SSH to.  I discovered that there is a new form factor of computers known as a <a href="http://en.wikipedia.org/wiki/Nettop">nettop</a>.  I found a nettop that appears to fulfill my wishes perfectly: the <a href="http://www.genesi-usa.com/products/efika">EFIKA MX Smarttop</a> made by a company called Genesi.  It&#8217;s surprisingly capable for $129: it&#8217;s powered by an ARM Cortex-A8 800MHz CPU, has 512MB RAM, 8GB of flash, Ethernet, WiFi, Bluetooth, USB, and 720p video output through HDMI (with hardware-accelerated video decoding).  Not bad for $129!  And it has a maximum power consumption of 15W.</p>
<p>So I ordered one &#8212; I look forward to seeing if it is really all it&#8217;s cracked up to be!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/03/31/addicted-to-hardware-new-toy-on-the-way/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>When a compiler&#8217;s slow code actually bites you</title>
		<link>http://blog.reverberate.org/2011/03/19/when-a-compilers-slow-code-actually-bites-you/</link>
		<comments>http://blog.reverberate.org/2011/03/19/when-a-compilers-slow-code-actually-bites-you/#comments</comments>
		<pubDate>Sat, 19 Mar 2011 22:26:12 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=374</guid>
		<description><![CDATA[A few days ago I posted GCC: the impressive and the disappointing where I looked at some cases where GCC produces not-quite-optimal code. One of the comments on that post was (emphasis mine): So, it seems like there is a &#8230; <a href="http://blog.reverberate.org/2011/03/19/when-a-compilers-slow-code-actually-bites-you/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A few days ago I posted <a href="http://blog.reverberate.org/2011/03/03/gcc-the-impressive-and-the-disappointing/">GCC: the impressive and the disappointing</a> where I looked at some cases where GCC produces not-quite-optimal code.  One of the comments on that post was (emphasis mine):</p>
<blockquote><p>So, it seems like there is a much better way to give the compiler a shot at doing the right thing: [snip suggestion]. I think you will find the compiler will generate quite efficient code in this case, <b>particularly if you look at the real execution overhead, rather than what the assembler looks like.</b></p></blockquote>
<p>This is a common attitude I encounter when I am discussing my attempts to optimize my protocol buffer decoding library <a href="https://github.com/haberman/upb">upb</a>.  Programmers love to tell other programmers that they are prematurely optimizing, and most of the time they&#8217;re right.  I&#8217;m sure to some people it seems ludicrous that I would be looking at assembly language output to determine whether it is efficient enough.  For 99.99% of programs, it would be.  But I&#8217;m working in one of those rare domains where it actually matters.  And today I encountered pretty convincing evidence that the compiler&#8217;s bad code is actually affecting me.</p>
<p>The compiler&#8217;s bad code in this case is an example of a bug I previously filed on GCC: <a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44194">struct returned by value generates useless stores</a>.  Though I had previously observed that bug only by inspecting assembly language output, today I had it show up on an actual profile as clear as day.  Here is a screenshot from Shark (click to get full-size):</p>
<p><a href="http://blog.reverberate.org/wp-content/uploads/2011/03/badcode.png"><img src="http://blog.reverberate.org/wp-content/uploads/2011/03/badcode-300x161.png" alt="Screenshot from Apple Shark showing the bad code." width="300" height="161" class="aligncenter size-medium wp-image-376" /></a></p>
<p>To summarize, the compiler took the code:</p>
<pre lang="c">
typedef struct {
  upb_flow_t flow;  // An enum defined elsewhere.
  void *closure;
} upb_sflow_t;

upb_flow_t upb_dispatch_startsubmsg([...]) {
  // [...]
  upb_sflow_t sflow = f->cb.startsubmsg([...]);
  if (sflow.flow != UPB_CONTINUE) {
    // [...]
  }
</pre>
<p>&#8230;and turned that function call/test into this awful machine code (here in its Intel-syntax form):</p>
<pre lang="asm">
  call   QWORD PTR [r12 + 16]
  mov    DWORD PTR [rbp - 64], eax
  mov    QWORD PTR [rbp - 56], rdx
  mov    rax, QWORD PTR [rbp - 64]    ; loads rax with data it already has.
  mov    QWORD PTR [rbp - 48], rax    ; stores rax into the stack a second time.
  mov    QWORD PTR [rbp - 40], rdx    ; stores rdx into the stack a second time.
  mov    edx, DWORD PTR [rbp - 48]    ; loads edx with data already in rax.
  test   edx, edx
</pre>
<p>&#8230;and <i>then</i> (this is the important part) in an actual profile it shows up as being 43.4% of the execution time of a hot function in my program.</p>
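<p>If you want to look at what your own compiler does with this pattern, here is a minimal, self-contained sketch of it.  It is not upb&#8217;s code: <code>flow_t</code>, <code>sflow_t</code>, <code>handlers_t</code>, <code>dispatch_startsubmsg()</code> and <code>accept_submsg()</code> are hypothetical stand-ins that mirror the shape of the excerpt above (a call through a function pointer that returns a two-member struct by value, followed by a test of one member).  On x86-64 System V targets a 16-byte struct like this comes back in the RAX/RDX register pair, which matches the <code>eax</code> and <code>rdx</code> stores in the listing.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for upb_flow_t, which the post says is an enum defined elsewhere. */
typedef enum { FLOW_CONTINUE = 0, FLOW_BREAK = 1 } flow_t;

/* Same shape as the struct in the excerpt: an enum plus a pointer. */
typedef struct {
  flow_t flow;
  void *closure;
} sflow_t;

/* Hypothetical callback signature returning the struct by value. */
typedef sflow_t startsubmsg_fn(void *closure);

typedef struct {
  startsubmsg_fn *startsubmsg;
} handlers_t;

/* Calls the callback and tests one member of the returned struct.
 * Ideally this compiles to a call, a test of eax, and a branch. */
flow_t dispatch_startsubmsg(const handlers_t *h, void *closure,
                            void **sub_closure) {
  sflow_t sflow;
  assert(h != NULL && h->startsubmsg != NULL);
  sflow = h->startsubmsg(closure);
  if (sflow.flow != FLOW_CONTINUE) return sflow.flow;
  *sub_closure = sflow.closure;
  return FLOW_CONTINUE;
}

/* Example callback: accept the submessage and reuse the parent closure. */
sflow_t accept_submsg(void *closure) {
  sflow_t ret = { FLOW_CONTINUE, closure };
  return ret;
}
```

<p>With GCC on x86, compiling this with <code>-O2 -S -masm=intel</code> writes Intel-syntax assembly that you can compare against the listing above.  Whether the redundant stores appear will depend on the compiler version and flags, so treat this as a starting point for your own measurements.</p>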
<p>This is not a slam against the GCC developers.  GCC is a big and complex piece of software, and they have to prioritize all sorts of different bugs, feature requests, new hardware, etc.</p>
<p>This is just a reminder to those who jump to dare-I-say &#8220;premature&#8221; conclusions about what is premature optimization: some of us really are working in domains where things like virtual function overhead, branch predictability, and the efficiency of the compiler&#8217;s code make a difference.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/03/19/when-a-compilers-slow-code-actually-bites-you/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

