<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The art of hashing</title>
	<atom:link href="http://blog.reverberate.org/2009/03/01/the-art-of-hashing/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/</link>
	<description>parsing, performance, minimalism with C99</description>
	<lastBuildDate>Wed, 05 May 2010 20:14:07 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Books About Parsing &#187; Josh the Outspoken</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1355</link>
		<dc:creator>Books About Parsing &#187; Josh the Outspoken</dc:creator>
		<pubDate>Sat, 27 Jun 2009 03:53:03 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1355</guid>
		<description>[...] Python&#8217;s Dictionary Implementation: Being All Things to All People, by Andrew Kuchling. Not about parsing, but about hashing, which I have quite recently been very interested in. [...]</description>
		<content:encoded><![CDATA[<p>[...] Python&#8217;s Dictionary Implementation: Being All Things to All People, by Andrew Kuchling. Not about parsing, but about hashing, which I have quite recently been very interested in. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Shlomi Fish</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1297</link>
		<dc:creator>Shlomi Fish</dc:creator>
		<pubDate>Fri, 10 Apr 2009 10:27:03 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1297</guid>
		<description>
Thanks for this article. Once upon a time (back in 2002), I gave &lt;a href=&quot;http://tech.groups.yahoo.com/group/hackers-il/message/1831&quot; rel=&quot;nofollow&quot;&gt;a short overview of some hashing techniques to the Hackers-IL mailing list&lt;/a&gt;. Seems like I missed some stuff you mentioned here, which wasn&#039;t covered in my introductory Data Structures and Algorithms course.



I myself always prefer to use some form of chaining instead of open addressing. That&#039;s because with open addressing, it&#039;s harder to tell whether an item was removed or never existed at all, and as a result the hash becomes much more flimsy. Maybe there&#039;s some magic way to achieve it, but I&#039;m still ignorant of it.
</description>
		<content:encoded><![CDATA[<p>Thanks for this article. Once upon a time (back in 2002), I gave <a href="http://tech.groups.yahoo.com/group/hackers-il/message/1831" rel="nofollow">a short overview of some hashing techniques to the Hackers-IL mailing list</a>. Seems like I missed some stuff you mentioned here, which wasn&#8217;t covered in my introductory Data Structures and Algorithms course.</p>
<p>I myself always prefer to use some form of chaining instead of open addressing. That&#8217;s because with open addressing, it&#8217;s harder to tell whether an item was removed or never existed at all, and as a result the hash becomes much more flimsy. Maybe there&#8217;s some magic way to achieve it, but I&#8217;m still ignorant of it.</p>
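<p>One commonly used answer to the removed-versus-never-present problem is a tombstone marker. The following is only a minimal sketch, assuming linear probing and a fixed-size table; the names ProbingTable, EMPTY and DELETED are hypothetical and not taken from the article.</p>

```python
# Hypothetical sketch: linear probing with tombstones (all names are illustrative).
EMPTY = object()    # slot never used: a lookup may stop here
DELETED = object()  # tombstone: an item was removed, so keep probing past it


class ProbingTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Visit every slot once, starting at the key's home position.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        reusable = None  # first free or tombstoned slot seen so far
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if reusable is None:
                    reusable = i
                break  # the key cannot be stored beyond a never-used slot
            if slot is DELETED:
                if reusable is None:
                    reusable = i
            elif slot[0] == key:
                self.slots[i] = (key, value)  # overwrite existing entry
                return
        if reusable is None:
            raise RuntimeError("table full; a real table would resize")
        self.slots[reusable] = (key, value)

    def get(self, key, default=None):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return default  # never-used slot: the key does not exist
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return default

    def remove(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED  # leave a tombstone, not EMPTY
                return True
        return False
```

<p>The cost of this design is that tombstones accumulate, so a real table would probably rebuild itself once too many slots are marked as deleted.</p>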
]]></content:encoded>
	</item>
	<item>
		<title>By: Kragen Javier Sitaker</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1286</link>
		<dc:creator>Kragen Javier Sitaker</dc:creator>
		<pubDate>Thu, 05 Mar 2009 14:03:01 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1286</guid>
		<description>Oops!  I confused you with the other person writing a novel parsing tool in a very-high-level language that I&#039;m watching on Github, Nathan Sobo, whose parsing system is Treetop (in Ruby).</description>
		<content:encoded><![CDATA[<p>Oops!  I confused you with the other person writing a novel parsing tool in a very-high-level language that I&#8217;m watching on Github, Nathan Sobo, whose parsing system is Treetop (in Ruby).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Serge</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1277</link>
		<dc:creator>Serge</dc:creator>
		<pubDate>Tue, 03 Mar 2009 08:19:24 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1277</guid>
		<description>I recently came across perfect hashing (see http://burtleburtle.net/bob/hash/perfect.html). It&#039;s not at all clear to me that this is necessarily a speed win versus the other hash table strategies you list, but it does avoid the issue of collisions entirely. I suspect the tradeoff is that finding a (possibly minimal) perfect hashing function for a given set of keys might be tricky...</description>
		<content:encoded><![CDATA[<p>I recently came across perfect hashing (see <a href="http://burtleburtle.net/bob/hash/perfect.html" rel="nofollow">http://burtleburtle.net/bob/hash/perfect.html</a>). It&#8217;s not at all clear to me that this is necessarily a speed win versus the other hash table strategies you list, but it does avoid the issue of collisions entirely. I suspect the tradeoff is that finding a (possibly minimal) perfect hashing function for a given set of keys might be tricky&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: josh</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1272</link>
		<dc:creator>josh</dc:creator>
		<pubDate>Mon, 02 Mar 2009 00:34:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1272</guid>
		<description>@Kragen: while shelling out to the system is always a possibility, I&#039;m really trying to stay within portable C99, and not make assumptions about the environment.  A good model for what I&#039;m going for is zlib: I think zlib is ubiquitous in large part because it makes absolutely no assumptions about your system.  It even lets you swap in your own implementations of malloc and free.

For example, I want pbstream to be usable in embedded systems.  Here&#039;s an example of a guy who&#039;s in a situation I want to accommodate: &lt;a href=&quot;http://groups.google.com/group/protobuf/msg/b374951e92d8e192&quot; rel=&quot;nofollow&quot;&gt;GPB on non-Linux, non-Windows OS?&lt;/a&gt;.

I hear what you&#039;re saying wrt. just using Python&#039;s dict.  Although accommodating languages like Python, Ruby, Lua, etc. is one of my goals, I also want this library to perform well standalone, in cases where there is no other hashtable implementation available.  Also, of my two hashtable use cases, the int-&gt;void* lookup that happens in the critical path of parsing is the more important one to me by far, and this happens at a layer below where you&#039;d interface something like Python or Ruby.

It makes me really happy that you&#039;re independently interested in Gazelle!  pbstream and Gazelle are actually soon to be intertwined: Gazelle needs a bytecode format and an AST serialization format (I mean to switch to protobufs from LLVM bitcode), and pbstream needs a way to parse .proto files.  pbstream-&gt;Gazelle won&#039;t be a hard dependency though -- you&#039;ll be able to load .proto definitions in binary form if you want.

I&#039;m afraid I&#039;m drawing a complete blank on your friend Matthew O&#039;Connor though, and I&#039;ve never been associated with Pivotal.  Is there something I&#039;m forgetting?</description>
		<content:encoded><![CDATA[<p>@Kragen: while shelling out to the system is always a possibility, I&#8217;m really trying to stay within portable C99, and not make assumptions about the environment.  A good model for what I&#8217;m going for is zlib: I think zlib is ubiquitous in large part because it makes absolutely no assumptions about your system.  It even lets you swap in your own implementations of malloc and free.</p>
<p>For example, I want pbstream to be usable in embedded systems.  Here&#8217;s an example of a guy who&#8217;s in a situation I want to accommodate: <a href="http://groups.google.com/group/protobuf/msg/b374951e92d8e192" rel="nofollow">GPB on non-Linux, non-Windows OS?</a>.</p>
<p>I hear what you&#8217;re saying wrt. just using Python&#8217;s dict.  Although accommodating languages like Python, Ruby, Lua, etc. is one of my goals, I also want this library to perform well standalone, in cases where there is no other hashtable implementation available.  Also, of my two hashtable use cases, the int->void* lookup that happens in the critical path of parsing is the more important one to me by far, and this happens at a layer below where you&#8217;d interface something like Python or Ruby.</p>
<p>It makes me really happy that you&#8217;re independently interested in Gazelle!  pbstream and Gazelle are actually soon to be intertwined: Gazelle needs a bytecode format and an AST serialization format (I mean to switch to protobufs from LLVM bitcode), and pbstream needs a way to parse .proto files.  pbstream->Gazelle won&#8217;t be a hard dependency though &#8212; you&#8217;ll be able to load .proto definitions in binary form if you want.</p>
<p>I&#8217;m afraid I&#8217;m drawing a complete blank on your friend Matthew O&#8217;Connor though, and I&#8217;ve never been associated with Pivotal.  Is there something I&#8217;m forgetting?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kragen Javier Sitaker</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1270</link>
		<dc:creator>Kragen Javier Sitaker</dc:creator>
		<pubDate>Mon, 02 Mar 2009 00:00:42 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1270</guid>
		<description>Oh hey, I just noticed you&#039;re the guy that&#039;s writing Gazelle.  Cool!  My friend Matthew O&#039;Connor told me about you when he paired with you when applying to work at Pivotal, and I&#039;m watching Gazelle on Github.</description>
		<content:encoded><![CDATA[<p>Oh hey, I just noticed you&#8217;re the guy that&#8217;s writing Gazelle.  Cool!  My friend Matthew O&#8217;Connor told me about you when he paired with you when applying to work at Pivotal, and I&#8217;m watching Gazelle on Github.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kragen Javier Sitaker</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1269</link>
		<dc:creator>Kragen Javier Sitaker</dc:creator>
		<pubDate>Sun, 01 Mar 2009 23:57:59 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1269</guid>
		<description>It&#039;s actually pretty easy to use GCC and dlopen as your JIT; see http://lists.canonical.org/pipermail/kragen-hacks/2003-February/000364.html for a Python example.  It does build a C extension for each arithmetic expression, but that happens automatically and transparently, and it only takes about a second on the really old machine I was using then.  With a little more work (making a version of Python.h that doesn&#039;t depend on the system header files and provides just the definitions you need for the generated extensions), you could probably get it down below 10ms on a modern machine.

But if you&#039;re calling this hash lookup from Python anyway, it&#039;s silly to worry about any of the micro-efficiencies you&#039;re talking about.  Just use Python&#039;s dict; you might be able to do epsilon better, but that will be swallowed up by the bytecode dispatch overhead and function call and return (on the order of 30-300 instructions per bytecode).

BTW, you left out a slight variation of linear probing: you make each hash bucket big enough to contain more than one item.  Of course this doesn&#039;t save you from having to deal with the case where the entire hash bucket fills up, when you either need to use one of the other strategies to figure out where the next item goes, or expand the hash table.</description>
		<content:encoded><![CDATA[<p>It&#8217;s actually pretty easy to use GCC and dlopen as your JIT; see <a href="http://lists.canonical.org/pipermail/kragen-hacks/2003-February/000364.html" rel="nofollow">http://lists.canonical.org/pipermail/kragen-hacks/2003-February/000364.html</a> for a Python example.  It does build a C extension for each arithmetic expression, but that happens automatically and transparently, and it only takes about a second on the really old machine I was using then.  With a little more work (making a version of Python.h that doesn&#8217;t depend on the system header files and provides just the definitions you need for the generated extensions), you could probably get it down below 10ms on a modern machine.</p>
<p>But if you&#8217;re calling this hash lookup from Python anyway, it&#8217;s silly to worry about any of the micro-efficiencies you&#8217;re talking about.  Just use Python&#8217;s dict; you might be able to do epsilon better, but that will be swallowed up by the bytecode dispatch overhead and function call and return (on the order of 30-300 instructions per bytecode).</p>
<p>BTW, you left out a slight variation of linear probing: you make each hash bucket big enough to contain more than one item.  Of course this doesn&#8217;t save you from having to deal with the case where the entire hash bucket fills up, when you either need to use one of the other strategies to figure out where the next item goes, or expand the hash table.</p>
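<p>The multi-item bucket variation can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the article: the bucket size of 4, the BucketedTable name, and the choice to spill into the next bucket when one fills up are all hypothetical, and deletion is left out to keep the lookup rule simple.</p>

```python
# Hypothetical sketch: each bucket holds several items (names are illustrative).
BUCKET_SIZE = 4  # assumed number of slots per bucket


class BucketedTable:
    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_order(self, key):
        # Home bucket first, then linear probing over whole buckets.
        n = len(self.buckets)
        home = hash(key) % n
        return [(home + step) % n for step in range(n)]

    def put(self, key, value):
        for b in self._bucket_order(key):
            bucket = self.buckets[b]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)  # overwrite existing entry
                    return
            if len(bucket) != BUCKET_SIZE:
                bucket.append((key, value))  # room left in this bucket
                return
        raise RuntimeError("every bucket is full; a real table would grow")

    def get(self, key, default=None):
        for b in self._bucket_order(key):
            bucket = self.buckets[b]
            for k, v in bucket:
                if k == key:
                    return v
            if len(bucket) != BUCKET_SIZE:
                # Without deletion, a non-full bucket ends the probe sequence.
                return default
        return default
```

<p>In a C version the point of the bucket would presumably be that all of its slots share a cache line, so scanning them is cheap compared with jumping to another part of the table.</p>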
]]></content:encoded>
	</item>
	<item>
		<title>By: josh</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1268</link>
		<dc:creator>josh</dc:creator>
		<pubDate>Sun, 01 Mar 2009 20:52:01 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1268</guid>
		<description>@Brian: there are many, many reasons that I hope to write up at some point.  Basically I&#039;m trying to implement protobufs in layers, where the lowest layer is blindingly fast and flexible.  Say you have large protobufs (which a lot of people already do) and you want to just pull out the first value for field X.  Existing protobuf libraries will make you parse the whole protobuf, copy it all into a separate data structure (malloc is expensive), just to get that one value!  Mine will let you scan the protobuf for that value, and stop parsing when you find it.

@Kragen: I should have said &quot;lookup speed trumps all within the constraint that I don&#039;t know the table at compile time.&quot;  One of the things I&#039;m trying to deliver is a Python API that doesn&#039;t require you to build a C extension for each different proto type you want to parse (ick!).

I&#039;m considering adding a JIT using LLVM or similar sometime in the future -- if I do that then I could do what you say.</description>
		<content:encoded><![CDATA[<p>@Brian: there are many, many reasons that I hope to write up at some point.  Basically I&#8217;m trying to implement protobufs in layers, where the lowest layer is blindingly fast and flexible.  Say you have large protobufs (which a lot of people already do) and you want to just pull out the first value for field X.  Existing protobuf libraries will make you parse the whole protobuf, copy it all into a separate data structure (malloc is expensive), just to get that one value!  Mine will let you scan the protobuf for that value, and stop parsing when you find it.</p>
<p>@Kragen: I should have said &#8220;lookup speed trumps all within the constraint that I don&#8217;t know the table at compile time.&#8221;  One of the things I&#8217;m trying to deliver is a Python API that doesn&#8217;t require you to build a C extension for each different proto type you want to parse (ick!).</p>
<p>I&#8217;m considering adding a JIT using LLVM or similar sometime in the future &#8212; if I do that then I could do what you say.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kragen Javier Sitaker</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1267</link>
		<dc:creator>Kragen Javier Sitaker</dc:creator>
		<pubDate>Sun, 01 Mar 2009 19:49:43 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1267</guid>
		<description>Hey, uh, if lookup speed really trumps *all*, then I speculate that you should probably compile each set of keys into a machine-code subroutine expressing a balanced binary tree. The routine will contain N-1 compare-and-conditional-jump pairs for N keys, plus N key comparisons if it&#039;s important to return an error when the key isn&#039;t actually in the table.  An immediate-compare-and-conditional-jump is 7 bytes on the x86, so about 9 of them fit into a 64-byte L1 cache line, and you execute 2 lg(N) + 4 or so instructions to look up one of N keys.  (Possibly you&#039;ll occasionally have an additional memory access to fetch another four bytes of the key, but I&#039;m assuming not.)  So if you have, say, 100 keys, you execute 16 or 18 instructions (and perform no memory accesses) to look one up and return the associated value.

A really straightforward way to implement this strategy (for integer keys) is to generate a C file with a switch statement in it, use GCC to compile it to a shared library, and then dlopen the shared library in order to call into it.  (Alternatively you could use LLVM or GNU Lightning.)

The nearest hashing equivalent is &quot;perfect hashing&quot;, where you search for a hash function that hashes each key into a distinct bucket in a minimal-size hash table.  gperf is a widely-available perfect-hash generator.  I haven&#039;t measured and so I don&#039;t know which one of these is faster.</description>
		<content:encoded><![CDATA[<p>Hey, uh, if lookup speed really trumps *all*, then I speculate that you should probably compile each set of keys into a machine-code subroutine expressing a balanced binary tree. The routine will contain N-1 compare-and-conditional-jump pairs for N keys, plus N key comparisons if it&#8217;s important to return an error when the key isn&#8217;t actually in the table.  An immediate-compare-and-conditional-jump is 7 bytes on the x86, so about 9 of them fit into a 64-byte L1 cache line, and you execute 2 lg(N) + 4 or so instructions to look up one of N keys.  (Possibly you&#8217;ll occasionally have an additional memory access to fetch another four bytes of the key, but I&#8217;m assuming not.)  So if you have, say, 100 keys, you execute 16 or 18 instructions (and perform no memory accesses) to look one up and return the associated value.</p>
<p>A really straightforward way to implement this strategy (for integer keys) is to generate a C file with a switch statement in it, use GCC to compile it to a shared library, and then dlopen the shared library in order to call into it.  (Alternatively you could use LLVM or GNU Lightning.)</p>
<p>The nearest hashing equivalent is &#8220;perfect hashing&#8221;, where you search for a hash function that hashes each key into a distinct bucket in a minimal-size hash table.  gperf is a widely-available perfect-hash generator.  I haven&#8217;t measured and so I don&#8217;t know which one of these is faster.</p>
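<p>To make the idea of searching for a perfect hash concrete, here is a small brute-force sketch. It is not how gperf works; it assumes a non-empty set of distinct non-negative integer keys and a caller-supplied prime larger than every key, and the function names are hypothetical.</p>

```python
# Hypothetical sketch: brute-force search for a minimal perfect hash.
def find_perfect_multiplier(keys, prime):
    """Return m such that ((k * m) % prime) % len(keys) is distinct per key.

    Assumes distinct non-negative integer keys and a prime larger than every key.
    """
    n = len(keys)
    for m in range(1, prime):
        slots = {((k * m) % prime) % n for k in keys}
        if len(slots) == n:  # every key landed in its own slot
            return m
    raise ValueError("no multiplier found; try another prime or a larger table")


def build_table(pairs, prime):
    """Place (key, value) pairs into a collision-free table of minimal size."""
    keys = [k for k, _ in pairs]
    m = find_perfect_multiplier(keys, prime)
    table = [None] * len(pairs)
    for k, v in pairs:
        table[((k * m) % prime) % len(pairs)] = (k, v)
    return m, table


def lookup(table, m, prime, key, default=None):
    entry = table[((key * m) % prime) % len(table)]
    # Compare the stored key so that keys outside the set are rejected.
    if entry is not None and entry[0] == key:
        return entry[1]
    return default
```

<p>For the keys 10 and 20 with prime 23, the multiplier 1 sends both keys to slot 0, while the multiplier 2 separates them, so the search returns 2. The lookup itself is then one multiply, two remainders and one key comparison, with no probing.</p>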
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Slesinsky</title>
		<link>http://blog.reverberate.org/2009/03/01/the-art-of-hashing/comment-page-1/#comment-1266</link>
		<dc:creator>Brian Slesinsky</dc:creator>
		<pubDate>Sun, 01 Mar 2009 18:47:43 +0000</pubDate>
		<guid isPermaLink="false">http://blog.reverberate.org/?p=131#comment-1266</guid>
		<description>I&#039;m curious, what made you decide to write a streaming protobuf implementation, versus sending a stream of protobufs embedded in another format?</description>
		<content:encoded><![CDATA[<p>I&#8217;m curious, what made you decide to write a streaming protobuf implementation, versus sending a stream of protobufs embedded in another format?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
