<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Josh Haberman &#187; Gazelle</title>
	<atom:link href="http://blog.reverberate.org/category/gazelle/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.reverberate.org</link>
	<description>parsing, performance, minimalism with C99</description>
	<lastBuildDate>Mon, 30 Jan 2012 00:15:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Gazelle being integrated into a syntax aware editor</title>
		<link>http://blog.reverberate.org/2011/02/13/gazelle-being-integrated-into-a-syntax-aware-editor/</link>
		<comments>http://blog.reverberate.org/2011/02/13/gazelle-being-integrated-into-a-syntax-aware-editor/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 07:53:14 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=358</guid>
		<description><![CDATA[I&#8217;ve been meaning to post this for several days now: I was very excited to hear on the gazelle-users mailing list that Gazelle is being integrated into a syntax-aware editor called Kod. See the screencast for a preliminary demo. This &#8230; <a href="http://blog.reverberate.org/2011/02/13/gazelle-being-integrated-into-a-syntax-aware-editor/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been meaning to post this for several days now: I was very excited to hear on the gazelle-users mailing list that <a href="http://groups.google.com/group/gazelle-users/msg/3221333034e5083c">Gazelle is being integrated into a syntax-aware editor</a> called <a href="http://kodapp.com/">Kod</a>.  See <a href="http://www.google.com/url?sa=D&#038;q=http://www.youtube.com/watch%3Fv%3DnSM41hToa0w">the screencast</a> for a preliminary demo.</p>
<p>This is very exciting for me, because the premise that a text editor could use a real parser as you type was one of my major motivations for creating Gazelle.  Most text editors use clever pattern matching for their syntax highlighting, but since they aren&#8217;t real parsers they can get confused.  More importantly, they can&#8217;t give you true syntax-aware completion.  The notable exception is Java editors like Eclipse, which do this quite well, and I hear that Microsoft&#8217;s IDEs are good at it too.  But I always wanted an editor where you could add new languages easily and have more ability to tinker with how this syntax information is used.  I think it could open up so many doors.</p>
<p>Given this, I was very excited when Rasmus Andersson from the Kod project wrote to the gazelle-users mailing list asking if Gazelle was a dead project, or if it was something he and his team should invest effort into integrating with Kod.  I replied that it was definitely not a dead project, but that it is still subject to major change.  That didn&#8217;t seem to dissuade him though, and I look forward to learning from their needs and folding their improvements back into Gazelle.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/02/13/gazelle-being-integrated-into-a-syntax-aware-editor/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gazelle/upb status</title>
		<link>http://blog.reverberate.org/2011/01/28/gazelleupb-status/</link>
		<comments>http://blog.reverberate.org/2011/01/28/gazelleupb-status/#comments</comments>
		<pubDate>Fri, 28 Jan 2011 18:35:01 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>
		<category><![CDATA[upb]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=343</guid>
		<description><![CDATA[It has been just over a year since I last posted, leading some people to rightfully wonder whether my projects Gazelle and upb are abandoned. The answer to that question is a resounding &#8220;no.&#8221; I am more motivated to complete &#8230; <a href="http://blog.reverberate.org/2011/01/28/gazelleupb-status/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It has been just over a year since I last posted, leading some people to rightfully wonder whether my projects Gazelle and upb are abandoned.  The answer to that question is a resounding &#8220;no.&#8221;  I am more motivated to complete Gazelle and upb than I have ever been, and I have been working on upb actively lately (<a href="https://github.com/haberman/upb/commits/src-refactoring">here are my recent commits</a>).</p>
<p>However, if I&#8217;ve learned one thing over the last several years of working on Gazelle and upb, it&#8217;s that I am extraordinarily bad at knowing how close I am to being ready to release.  I don&#8217;t want to make any predictions or promises.  I honestly thought I was almost ready to release upb a year and a half ago, but I&#8217;ve almost completely rewritten it twice since then.  Each time it gets significantly better and closer to being what I want it to be.  Once I have a release I&#8217;ll describe more about the stages of evolution it went through and how each iteration was objectively better than the one before it.</p>
<p>The core interfaces have gotten to a point where I&#8217;m really happy with them and feel no more need to rework them.  I think they will continue to evolve incrementally, but not in a way that requires redoing them completely.  Here are the most core interfaces &#8212; if you&#8217;re interested in upb, I recommend reading these headers.  I&#8217;ve just added substantial comments to explain them, and more than anything these will give you a taste of what upb is all about:
<ul>
<li><a href="https://github.com/haberman/upb/blob/fbb9fd35e05b88908beeca2c2b88b15aec1fca01/core/upb_stream.h">upb_stream.h</a>, the streaming interfaces for doing SAX-like tree traversal of protobuf data, and abstractions of fread()/fwrite().  These are probably the most important interfaces in all of upb, since all of the encoders and decoders are based on them.</li>
<li><a href="https://github.com/haberman/upb/blob/fbb9fd35e05b88908beeca2c2b88b15aec1fca01/core/upb_string.h">upb_string.h</a>, an immutable, length-delimited (instead of NULL-terminated), reference-counted string type.</li>
<li><a href="https://github.com/haberman/upb/blob/fbb9fd35e05b88908beeca2c2b88b15aec1fca01/core/upb_def.h">upb_def.h</a>, the data structures for a protobuf schema (.proto file) and routines for loading them.</li>
</ul>
<p>By the way, I also mean to write zero-overhead C++ wrappers around the above to give you C++ programmers a nicer interface at no cost.</p>
<p>With those set, I have been rapidly getting everything building/working again.  It&#8217;s a bit annoying to rewrite upb_def.c for the third time (literally) but it feels good knowing the interfaces are right.</p>
<p>So I have renewed optimism that I&#8217;ll be releasing soon.  And once I&#8217;m happy with upb, it&#8217;s back to Gazelle.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2011/01/28/gazelleupb-status/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Gazelle is going to love SSE 4.2</title>
		<link>http://blog.reverberate.org/2009/07/18/gazelle-is-going-to-love-sse-4-2/</link>
		<comments>http://blog.reverberate.org/2009/07/18/gazelle-is-going-to-love-sse-4-2/#comments</comments>
		<pubDate>Sat, 18 Jul 2009 21:09:09 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>
		<category><![CDATA[Hardware]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=227</guid>
		<description><![CDATA[SSE 4.2 includes text processing instructions. In the words of Ars Technica: Intel has added a number of new instructions to Nehalem and it has sped up others. The 4.2 version of Intel&#8217;s SSE vector extensions takes the x86 ISA &#8230; <a href="http://blog.reverberate.org/2009/07/18/gazelle-is-going-to-love-sse-4-2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>SSE 4.2 includes <a href="http://www.reghardware.co.uk/2008/03/18/intel_sse_4_text_tweaks/">text processing instructions</a>.  In the words of <a href="http://arstechnica.com/hardware/news/2008/04/what-you-need-to-know-about-nehalem.ars/3">Ars Technica</a>:</p>
<blockquote><p>Intel has added a number of new instructions to Nehalem and it has sped up others. The 4.2 version of Intel&#8217;s SSE vector extensions takes the x86 ISA back to the future just a bit by adding new string manipulation instructions. I say &#8220;back to the future&#8221; because ISA-level support for string processing is a hallmark of CISC architectures that was actively deprecated in the post-RISC years; typically, when a writer wants to give an example of crufty old corners of the x86 ISA that have caused pain for chip architects, string manipulation instructions are what he or she reaches for. But the new SSE 4.2 string instructions are aimed at accelerating XML processing, which makes them Web-friendly and therefore modern (i.e., not crufty).</p></blockquote>
<p>I chuckled a bit when I read this.  I&#8217;m not very purist when it comes to hardware.  If these instructions will make my parsers faster, then they sound great to me!</p>
<p>The four new instructions are:</p>
<ul>
<li><strong>pcmpestri</strong>: packed compare of <em>explicit</em> length strings, returning <em>index</em></li>
<li><strong>pcmpestrm</strong>: packed compare of <em>explicit</em> length strings, returning <em>mask</em></li>
<li><strong>pcmpistri</strong>: packed compare of <em>implicit</em> length strings, returning <em>index</em></li>
<li><strong>pcmpistrm</strong>: packed compare of <em>implicit</em> length strings, returning <em>mask</em></li>
</ul>
<p>The variants are as follows:</p>
<ul>
<li><em>implicit</em> length strings are NULL-terminated; <em>explicit</em> length strings have a length that you supply explicitly (so the string can fill the whole input register).</li>
<li>they can return an <em>index</em> into the source string (if you were searching for something) or a <em>mask</em> (if you wanted to test each character of the input).</li>
</ul>
<p>All four let you scan a 128-bit SSE register (treating it as either 16 8-bit characters or 8 16-bit characters) and perform all kinds of searches/comparisons.  The instructions are configurable; you supply a control word that specifies all of the different variations of the instructions.  For example, are the input values signed or unsigned, are we comparing against ranges or specific values, etc.</p>
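<p>To make the &#8220;index&#8221; flavor concrete, here is a minimal sketch in plain Python that emulates what one implicit-length, &#8220;match any of these characters&#8221; compare does to a single 16-byte chunk.  The function name and argument names are made up for illustration; this only models the behavior and says nothing about how the hardware does it.</p>
<pre>
```python
def first_match_index(block: bytes, wanted: bytes) -> int:
    """Emulate one 'equal any, return index' compare on an implicit-length string.

    block  -- up to 16 bytes of input (one SSE register of 8-bit characters)
    wanted -- the set of byte values being searched for
    Returns the index of the first matching byte, or 16 if nothing matches.
    """
    if len(block) > 16:
        raise ValueError("an SSE register holds at most 16 8-bit characters")
    for index, byte in enumerate(block):
        if byte == 0:        # implicit length: stop at the terminating zero byte
            break
        if byte in wanted:
            return index
    return 16                # "no match" is reported as the element count


# Find the first JSON structural character or quote in a 16-byte chunk.
print(first_match_index(b'  hello "world" ', b'{}[]":,'))  # prints 8
```
</pre>
<p>A parser would call something like this once per 16-byte chunk to skip over long runs of uninteresting characters.</p>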
<p>The throughput of these instructions is high (a reciprocal throughput of 2 cycles) but the latency is annoyingly long (9 cycles).  This means that you can issue a new one every two cycles, but you have to wait nine cycles after issuing an instruction before you can use its result.  It&#8217;s hard to think of too many useful things you can execute in parallel while you&#8217;re waiting for that answer.  As a side note, these figures come from Intel&#8217;s <a href="http://www.intel.com/products/processor/manuals/">Intel&#174; 64 and IA-32 Architectures Optimization Reference Manual</a>, which says that the latency number is a worst case estimate:</p>
<blockquote><p>Actual performance of these instructions by the out-of-order core execution unit can range from somewhat faster to significantly faster than the latency data shown in these tables.</p></blockquote>
<p>I&#8217;m not enough of a hardware geek to know what to actually expect.</p>
<p>Still, that&#8217;s nine cycles to wait before getting a lot of really useful information.  In addition to returning the index or mask, the instructions set several of the flags in useful ways.</p>
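<p>Here is some back-of-the-envelope arithmetic for the two figures quoted above.  It assumes a deliberately simple pipelining model (one instruction can issue every 2 cycles, and each result is ready 9 cycles after issue); real out-of-order hardware will behave differently, so treat the numbers as illustration only.</p>
<pre>
```python
LATENCY = 9      # cycles before a result can be used (figure quoted above)
ISSUE_GAP = 2    # reciprocal throughput: cycles between issuing two instructions


def dependent_cycles(count: int) -> int:
    """Each instruction needs the previous result, so the latencies add up."""
    return count * LATENCY


def independent_cycles(count: int) -> int:
    """Instructions overlap: one issues every ISSUE_GAP cycles, the last takes LATENCY."""
    return (count - 1) * ISSUE_GAP + LATENCY


for count in (1, 4, 64):
    print(count, dependent_cycles(count), independent_cycles(count))
# prints:
# 1 9 9
# 4 36 15
# 64 576 135
```
</pre>
<p>Under this model the latency only hurts when each compare depends on the one before it, which is exactly the situation a parser tends to be in.</p>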
<p>So what processors have SSE 4.2?  Or in other words, how long will my impatient self have to wait to try them out?  As far as I can tell, SSE 4.2 first shows up in the new <a href="http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)">Nehalem</a>.  <a href="http://en.wikipedia.org/wiki/Intel_Core_2#Penryn">Penryn</a>, the second-gen Core 2 that debuted in 2007/2008, only gets as far as SSE 4.1.  Penryn uses a &#8220;45 nm process&#8221;, which I&#8217;m sure means something to hardware geeks but not to me.  Either way, all I know is that it&#8217;s not the Core 2 that&#8217;s inside the MacBook Pro sitting on my lap.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2009/07/18/gazelle-is-going-to-love-sse-4-2/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Gazelle 0.4 released!</title>
		<link>http://blog.reverberate.org/2009/01/21/gazelle-04-released/</link>
		<comments>http://blog.reverberate.org/2009/01/21/gazelle-04-released/#comments</comments>
		<pubDate>Wed, 21 Jan 2009 19:45:48 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/?p=94</guid>
		<description><![CDATA[I&#8217;m very excited to announce the release of Gazelle 0.4! The most notable change in this release is that Gazelle can again handle whitespace. But there are many other changes that put Gazelle well on the path from being &#8230; <a href="http://blog.reverberate.org/2009/01/21/gazelle-04-released/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m very excited to announce the release of <a href="http://www.reverberate.org/gazelle/">Gazelle</a> 0.4!  The most notable change in this release is that Gazelle can again handle whitespace.  But there are many other changes that put Gazelle well on the path from being a toy to being a tool.  See <a href="http://github.com/haberman/gazelle/blob/v0.4/ReleaseNotes">the ReleaseNotes</a> for all the details.</p>
<p>Next up for Gazelle 0.5: operator-precedence parsing and the ability to build parse trees.  Following up on my earlier post about Bitcode and Protocol Buffers, I&#8217;ve wholeheartedly decided to go the Protocol Buffer route, not only for the bytecode itself, but also for parse trees and abstract syntax trees.  More about this later.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2009/01/21/gazelle-04-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gazelle in the browser</title>
		<link>http://blog.reverberate.org/2009/01/12/gazelle-in-the-browser/</link>
		<comments>http://blog.reverberate.org/2009/01/12/gazelle-in-the-browser/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 15:35:16 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/2009/01/12/gazelle-in-the-browser/</guid>
		<description><![CDATA[While I&#8217;m on a roll with these blog entries, I thought I&#8217;d share just one more secret that&#8217;s been kicking around in my head for a while. I desperately want to see a Gazelle development environment that runs in your &#8230; <a href="http://blog.reverberate.org/2009/01/12/gazelle-in-the-browser/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>While I&#8217;m on a roll with these blog entries, I thought I&#8217;d share just one more secret that&#8217;s been kicking around in my head for a while.  I desperately want to see a Gazelle development environment that runs in your web browser.</p>
<p>I first had this idea when I visited the excellent <a href="http://osteele.com/tools/reanimator/">regular expression reanimator</a>.  It&#8217;s this beautiful little visualization that runs in your browser and shows you how a DFA transitions in response to text.  And all you have to do to use it is click that little link!</p>
<p>My dream is that you can develop your grammars interactively.  You type rules and <i>as you type</i> you see little graphs build themselves up, state by state.  You add a Kleene star and you see the extra transitions add themselves to the existing graph.  You put some sample text in another textarea and watch it start syntax-highlighting itself as you add the rules to properly recognize its structure.  Or you can develop your own syntax highlighting color schemes by typing syntax highlighting rules and again seeing them incrementally take effect on a block of sample text.</p>
<p>None of these ideas are brand-new.  ANTLR of course has <a href="http://www.antlr.org/works/index.html">ANTLRWorks</a>, which is an IDE for developing ANTLR grammars.  But I feel that having this on the web would really lower the barrier to entry and make it accessible to more people.  And despite how people poo-poo JavaScript and HTML, the web is not such a bad platform these days  &#8212; especially now that <a href="http://blog.reverberate.org/2008/08/04/svg-support-in-firefox-3/">SVG support is getting really good</a>.  SVG is the perfect tool for drawing graphs, styling the nodes in various ways, and having them do useful things when you hover/click on them.</p>
<p>So anyway, I <i>really</i> want this.  But the hurdles to making this happen are great!</p>
<ul>
<li>Gazelle&#8217;s compiler is written in Lua.  No real way to run Lua in a browser right now.  Possible options for doing this in the future: a Lua -> JavaScript compiler, a Lua VM written in JavaScript, porting the Gazelle compiler to JavaScript, running the Lua interpreter inside <a href="http://code.google.com/p/nativeclient/">NativeClient</a>, and running the Lua interpreter inside <a href="http://labs.adobe.com/technologies/alchemy/">Alchemy</a>.</li>
<li>Gazelle&#8217;s runtime is written in C.  Possible options: port the runtime to JavaScript, run it inside Alchemy, or run it inside NativeClient.</li>
<li>The Graphviz graph layout package is also written in C.  Possible options are the same as the Gazelle runtime.  I was seriously considering porting just the layout algorithms to JavaScript and had even started the port, but have recently been realizing how CPU-intensive these algorithms are even in C.  In JavaScript they could only be slower.  With Alchemy they would be slower than pure C, though it&#8217;s hard to say how much.  With NativeClient they would run at native speed.</li>
</ul>
<p>So it pretty much looks like I&#8217;m stuck waiting for either Alchemy or NativeClient to be prime-time.  While I could always run all this software on the server and AJAX the results in constantly, I don&#8217;t think that would provide a smooth enough user experience (not to mention the load it would put on my server &#8212; these algorithms are kind of expensive).</p>
<p>As far as Alchemy vs. NativeClient: Alchemy will be slower (2-10x according to them) since it compiles your C/C++ to the Flash VM instead of to machine code, and it will require the Flash plugin.  NativeClient will run at roughly native speeds, but will require a plugin download (most people will have Flash already) and one that has potentially scary security implications if the NativeClient guys made any mistakes.</p>
<p>Time to play with Alchemy and see if I can get it to work!  And hope that NativeClient makes a lot of progress in no time!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2009/01/12/gazelle-in-the-browser/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The importance of being earnest</title>
		<link>http://blog.reverberate.org/2009/01/11/the-importance-of-being-earnest/</link>
		<comments>http://blog.reverberate.org/2009/01/11/the-importance-of-being-earnest/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 01:36:51 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/2009/01/11/the-importance-of-being-earnest/</guid>
		<description><![CDATA[While I&#8217;m at it, I wanted to take a moment to recognize that I&#8217;ve too frequently made claims that are unsupported or premature. The worst example of this was when I claimed to have found an algorithm that computes LL(k) &#8230; <a href="http://blog.reverberate.org/2009/01/11/the-importance-of-being-earnest/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>While I&#8217;m at it, I wanted to take a moment to recognize that I&#8217;ve too frequently made claims that are unsupported or premature.  The worst example of this was <a href="http://blog.reverberate.org/2008/07/27/gazelle-is-at-least-briefly-the-most-powerful-top-down-parser-there-is/">when I claimed to have found an algorithm that computes LL(k) for all k</a> (this is undecidable &#8212; my algorithm turned out to be a heuristic).  In other cases, what I&#8217;ve written is technically true but misleading: for example, my claim that Gazelle is &#8220;more powerful&#8221; than ANTLR did not take into account many features that ANTLR has that Gazelle currently does not, like semantic and syntactic predicates.</p>
<p>I&#8217;m not proud of this.  I&#8217;d like to take the opportunity to make a commitment to putting those days behind me.  I am extremely proud of Gazelle and what I think it brings to the table, and I&#8217;ve tried to consistently acknowledge my debts to prior work, especially ANTLR.  Indeed, in the <a href="http://www.reverberate.org/gazelle/docs/manual.html#_the_gazelle_algorithm">Gazelle manual</a> I say:</p>
<blockquote><p>Gazelle&#8217;s algorithm takes a great deal of inspiration from ANTLR, the LL(*) parser generator written by Terence Parr. While Gazelle seeks to augment the theory that Terence developed, the foundations of Gazelle are strongly based in ANTLR&#8217;s achievements. Like ANTLR, Gazelle is a top-down parser that models its lookahead as possibly-cyclic state machines. Gazelle&#8217;s algorithm for generating this lookahead is based on ANTLR&#8217;s algorithm.</p></blockquote>
<p>Still, I clearly need to be much more careful not to jump to conclusions, and be less susceptible to bravado.  If Gazelle is as good as I think it is, its capabilities should speak for themselves.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2009/01/11/the-importance-of-being-earnest/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Looking to 0.4</title>
		<link>http://blog.reverberate.org/2009/01/11/looking-to-04/</link>
		<comments>http://blog.reverberate.org/2009/01/11/looking-to-04/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 01:11:59 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/2009/01/11/looking-to-04/</guid>
		<description><![CDATA[Given that whitespace has been relatively painless to implement, I think I&#8217;ll tackle a number of smaller things I&#8217;ve been meaning to do before releasing 0.4. Like whitespace-handling, these are things that prevent Gazelle from being more than a toy &#8230; <a href="http://blog.reverberate.org/2009/01/11/looking-to-04/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Given that whitespace has been relatively painless to implement, I think I&#8217;ll tackle a number of smaller things I&#8217;ve been meaning to do before releasing 0.4.  Like whitespace-handling, these are things that prevent Gazelle from being more than a toy until they&#8217;re addressed:</p>
<ol>
<li>when Gazelle encounters a parse error, it should report it to the API in a sane way instead of doing assert(false) or just printing a message to stderr and aborting</li>
<li>Gazelle should provide line and column information to the API (this is more complicated than it might sound &#8212; read <a href="http://en.wikipedia.org/wiki/Newline">the Wikipedia article on &#8220;newline&#8221;</a> and <a href="http://unicode.org/reports/tr13/tr13-9.html">the Unicode recommendations on Newline</a> for some of the complications)</li>
<li>Gazelle&#8217;s C API should be namespaced (with <tt>gzl_</tt> or somesuch on all the identifiers) instead of polluting the global namespace with functions called things like &#8220;parse&#8221;</li>
<li>a &#8220;make install&#8221; target, so that you can just type &#8220;gzlc&#8221; for the compiler instead of &#8220;. lua_path; ./compiler/gzlc&#8221;</li>
</ol>
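<p>As a rough illustration of why the line/column item above is trickier than it sounds, here is a minimal Python sketch that maps a character offset to a line and column while treating CRLF, a lone CR, and a lone LF each as a single line break.  The function name is hypothetical and this is not Gazelle&#8217;s API; it also ignores the other Unicode line separators, which is one of the complications the linked articles describe.</p>
<pre>
```python
def line_and_column(text: str, offset: int) -> tuple:
    """Return the 1-based (line, column) of the character at `offset`.

    CRLF, a lone CR and a lone LF each count as exactly one line break.
    Other Unicode line separators (NEL, LS, PS) are deliberately ignored.
    """
    line, column = 1, 1
    for i in range(offset):
        ch = text[i]
        if ch == "\n" or (ch == "\r" and text[i + 1:i + 2] != "\n"):
            line, column = line + 1, 1   # a line break ends here
        else:
            column += 1                  # includes the CR of a CRLF pair
    return line, column


sample = "ab\r\ncd\rxy\nz"
print(line_and_column(sample, 7))   # prints (3, 1)
```
</pre>
<p>Note that the answer changes if you decide columns count bytes rather than characters, or if tabs should advance to a tab stop, so an API would have to pick and document one convention.</p>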
<p>These sound like a good set of goals for a 0.4 release.  I may drop a few or think of a few others.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2009/01/11/looking-to-04/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gazelle handling whitespace/comments</title>
		<link>http://blog.reverberate.org/2009/01/11/gazelle-handling-whitespacecomments/</link>
		<comments>http://blog.reverberate.org/2009/01/11/gazelle-handling-whitespacecomments/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 01:10:27 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/2009/01/11/gazelle-handling-whitespacecomments/</guid>
		<description><![CDATA[I&#8217;m digging back into Gazelle after a several-month dormancy. If you happen to be someone who syndicates my commits, either directly or via your GitHub news feed (don&#8217;t laugh &#8212; I have 18 &#8220;watchers&#8221; on GitHub!) you&#8217;ll notice a flurry &#8230; <a href="http://blog.reverberate.org/2009/01/11/gazelle-handling-whitespacecomments/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m digging back into Gazelle after a several-month dormancy.  If you happen to be someone who syndicates my commits, either <a href="http://github.com/haberman/gazelle/commits/master">directly</a> or via your GitHub news feed (don&#8217;t laugh &#8212; I have 18 &#8220;watchers&#8221; on GitHub!) you&#8217;ll notice a flurry of commits from the last few days.  I&#8217;ve been working on whitespace handling, and progress is refreshingly brisk: I have things pretty well working after only a few days of work.</p>
<p>Whitespace handling was for a long time the biggest thing preventing Gazelle from being actually useful.  I had a whitespace-handling mechanism back in 0.1, but had to rip it out in 0.2 because it was naive and premature.  It&#8217;s back now, and while it looks mostly the same from the user&#8217;s perspective, it&#8217;s totally different under the hood.</p>
<p>Here&#8217;s the current picture of a JSON grammar, using the new whitespace facility:</p>
<pre>
// Grammar for JSON, as defined at json.org.

@start object;

object   -> "{" (string ":" value) *(,) "}";
array    -> "[" value *(,)              "]";

str_frag -> .chars=/[^\\"]+/ |
            .unicode_char=/\\u ([0-9A-Fa-f]{4})/ |
            .backslash_char=/\\[ntbr"\/\\]/;
string   -> '"' str_frag* '"';

number   -> .sign="-"?
            .integer=/ 0 | ([1-9][0-9]*) /
            .decimal=/ \. [0-9]+ /?
            .exponent=/ [eE] [+-]? [0-9]+ /? ;

value    -> string | number | "true" | "false" | "null" | object | array;

whitespace -> .whitespace=/[\r\n\s\t]+/;

@allow whitespace object ... number, string;</pre>
<p>It&#8217;s the <tt>@allow</tt> statement at the end that triggers the whitespace facility (such facilities are more generally called &#8220;subparsers&#8221;).  That statement means that any time the parser is inside an object (i.e. all the time), but not inside a number or a string, whitespace can appear anywhere.  More specifically, for all states in such rules, the <tt>@allow</tt> statement defines a &#8220;whitespace&#8221; transition back to the same state.</p>
<p>As a result, the rule graphs start looking like this:</p>
<p><a href='http://blog.reverberate.org/wp-content/uploads/2009/01/array.png' title='array'><img src='http://blog.reverberate.org/wp-content/uploads/2009/01/array.thumbnail.png' alt='array' /></a></p>
<p>All the whitespace transitions are a little bit annoying to look at, and I&#8217;ll definitely have to tweak the visualization to show them in a nicer way.  But this gives you an idea for what&#8217;s going on.</p>
<p>What&#8217;s nice about this scheme is that you get to specify the whitespace rules in a simple way (&#8220;whitespace is allowed everywhere except in these certain rules&#8221;), but then the whitespace becomes a part of the same grammar data structure and is analyzed by the same algorithms as everything else.</p>
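<p>The self-loop idea can be sketched in a few lines of Python.  This is only an illustration under assumed data structures (nested dictionaries, made-up state numbers, a hypothetical function name); it is not how Gazelle&#8217;s compiler actually represents its rule graphs.</p>
<pre>
```python
def allow_whitespace(rules: dict, excluded: set) -> dict:
    """Return a copy of `rules` with a "whitespace" self-loop on every state.

    rules    -- {rule_name: {state: {edge_label: next_state}}}
    excluded -- rule names that must not accept whitespace
    """
    result = {}
    for name, states in rules.items():
        new_states = {}
        for state, edges in states.items():
            new_edges = dict(edges)
            if name not in excluded:
                new_edges["whitespace"] = state   # transition back to the same state
            new_states[state] = new_edges
        result[name] = new_states
    return result


# A toy state machine for the array rule (state numbering is made up).
rules = {
    "array": {
        0: {"[": 1},
        1: {"value": 2, "]": 3},
        2: {",": 4, "]": 3},
        3: {},
        4: {"value": 2},
    },
    "number": {0: {"digit": 1}, 1: {"digit": 1}},
}
with_ws = allow_whitespace(rules, excluded={"number", "string"})
print(with_ws["array"][1])    # prints {'value': 2, ']': 3, 'whitespace': 1}
print(with_ws["number"][0])   # prints {'digit': 1}
```
</pre>
<p>Because the result is just more edges in the same structure, any later analysis that walks the graph sees the whitespace transitions without special cases.</p>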
<p>It&#8217;s worth mentioning that this design is significantly different from the traditional design, which is to have a separate lexer simply discard whitespace and comments before they ever hit the parser.  The lovely thing about doing it my way is that the whitespace <i>becomes a part of the parse tree</i> (if you want it to &#8212; you&#8217;ll be able to turn this off).  This means:</p>
<ol>
<li>round-trips from text -> parse tree -> text are lossless</li>
<li>therefore programs can operate on the parse tree to do whitespace-preserving transformations</li>
<li>programs can also analyze the parse tree with regard to how it follows style rules</li>
<li>programs like doxygen that consider comments and whitespace significant could operate on a Gazelle parse tree</li>
</ol>
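<p>The lossless round-trip property in the first item can be demonstrated with a small Python sketch.  A flat token list stands in for the parse tree here, and the token kinds and function names are made up for the example; the point is only that nothing is discarded, so the text can be rebuilt exactly.</p>
<pre>
```python
import re


def tokenize(text: str) -> list:
    """Split text into (kind, lexeme) pairs, keeping whitespace as a token."""
    tokens = []
    for lexeme in re.findall(r"\s+|\w+|[^\w\s]", text):
        if lexeme.isspace():
            kind = "whitespace"
        elif lexeme[0].isalnum() or lexeme[0] == "_":
            kind = "word"
        else:
            kind = "punct"
        tokens.append((kind, lexeme))
    return tokens


def to_text(tokens: list) -> str:
    """Reassemble the original text from the token list."""
    return "".join(lexeme for _, lexeme in tokens)


source = '{ "a" :\t[1, 2]\n}'
tokens = tokenize(source)
assert to_text(tokens) == source          # nothing was thrown away
print([kind for kind, _ in tokens][:4])   # prints ['punct', 'whitespace', 'punct', 'word']
```
</pre>
<p>A style checker or a whitespace-preserving refactoring tool could work on the same list, since the whitespace tokens are still there to inspect or keep.</p>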
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2009/01/11/gazelle-handling-whitespacecomments/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gazelle v0.3</title>
		<link>http://blog.reverberate.org/2008/11/30/gazelle-v03/</link>
		<comments>http://blog.reverberate.org/2008/11/30/gazelle-v03/#comments</comments>
		<pubDate>Sun, 30 Nov 2008 20:40:29 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/2008/11/30/gazelle-v03/</guid>
		<description><![CDATA[Things have been pretty quiet here lately. All I can say about that is &#8220;life happens!&#8221; But I&#8217;m very pleased to announce that I am ready to release Gazelle 0.3! The download link is on the Gazelle homepage. See the &#8230; <a href="http://blog.reverberate.org/2008/11/30/gazelle-v03/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Things have been pretty quiet here lately.  All I can say about that is &#8220;life happens!&#8221;  But I&#8217;m very pleased to announce that I am ready to release Gazelle 0.3!  The download link is on <a href="http://www.reverberate.org/gazelle/">the Gazelle homepage</a>.</p>
<p>See <a href="http://github.com/haberman/gazelle/tree/v0.3/ReleaseNotes">the Release Notes</a> for what has improved since the last release.  The gist of it is that the compiler is a lot more robust: it catches more error cases and will never hang due to errors in the input.  There is also explicit ambiguity resolution, which makes it possible to resolve the ambiguous &#8220;if then else&#8221; case explicitly.  And this is all accompanied by a large test suite, since this grammar analysis is difficult and subtle.</p>
<p>The biggest remaining problem that keeps Gazelle from being practical is still the lack of whitespace handling.  And this is exactly what the next release will fix.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2008/11/30/gazelle-v03/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A syntactic dilemma (and an intro to Gazelle&#8217;s ambiguity resolution)</title>
		<link>http://blog.reverberate.org/2008/08/10/a-syntactic-dilemma-and-an-intro-to-gazelles-ambiguity-resolution/</link>
		<comments>http://blog.reverberate.org/2008/08/10/a-syntactic-dilemma-and-an-intro-to-gazelles-ambiguity-resolution/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 07:55:27 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Gazelle]]></category>

		<guid isPermaLink="false">http://blog.reverberate.org/2008/08/10/a-syntactic-dilemma-and-an-intro-to-gazelles-ambiguity-resolution/</guid>
		<description><![CDATA[As I&#8217;m adding more capabilities to the Gazelle grammar description language, I&#8217;ve come up against a problem for which I can&#8217;t find any solution I like. So I&#8217;m putting it to you, my esteemed readers, to suggest a solution I &#8230; <a href="http://blog.reverberate.org/2008/08/10/a-syntactic-dilemma-and-an-intro-to-gazelles-ambiguity-resolution/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>As I&#8217;m adding more capabilities to the Gazelle grammar description language, I&#8217;ve come up against a problem for which I can&#8217;t find any solution I like. So I&#8217;m putting it to you, my esteemed readers, to suggest a solution I will like more than the ones I&#8217;ve already brainstormed.</p>
<p>Here&#8217;s the problem.  I&#8217;m at a point where I want to support <i>prioritized</i> choice in Gazelle grammars.  Prioritized choice is a way of letting grammar writers explicitly resolve ambiguities that have crept into their grammar.  Take the following grammar rule:</p>
<p><code>s -> "X" | "Y";</code></p>
<p>The two alternatives in this rule (X and Y) have equal priority.  That doesn&#8217;t actually matter in this case, since no text can match them both, but now consider changing the rule a bit:</p>
<p><code>s -> "X" | "X";</code></p>
<p>Now there is a problem, because an X matches both alternatives.  There are two legal parse trees that are both correct according to the grammar.  As a result this grammar is <i>ambiguous</i>.  A parser doesn&#8217;t have any reason to choose one over the other, because they are of equal priority.</p>
<p>The solution to the problem is to let the user specify which alternative should be taken when both match.  If we allow users to do this by introducing <i>prioritized</i> choice, our solution looks very much like how <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">Parsing Expression Grammars</a> (also known as PEG&#8217;s) work.  And the syntax that has become very standard in the PEG literature is to use &#8220;/&#8221; as the operator for prioritized choice.  So our ambiguous grammar would be made unambiguous by using prioritized choice:</p>
<p><code>s -> "X" / "X";</code></p>
<p>Now the parser will always choose the first alternative, and the grammar is no longer ambiguous.  You might wonder &#8220;what&#8217;s the point?&#8221;, since the second alternative will <i>never</i> be taken, but in many other cases of ambiguity, the choice of lesser priority <i>will</i> be taken.  The classic example of ambiguity (which can be resolved in this way) is if-then-else:</p>
<p><code>stmt -> "if" expr "then" stmt |<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"if" expr "then" stmt "else" stmt |<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;expr;<br />
expr -> "true" | "false" | /[0-9]+/;</code></p>
<p>This grammar is ambiguous, because the fragment:</p>
<p><code>if true then if true then 1 else 2</code></p>
<p>&#8230;can be parsed in two different ways.  The &#8220;else&#8221; could be assigned to either &#8220;if&#8221; &#8212; both are valid according to the grammar.</p>
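<p>To make the two readings concrete, here is a rough sketch in Python (not Gazelle, and not how Gazelle works internally) that enumerates every parse tree the grammar above allows for that fragment.  The function names and the tuple-based tree format are made up for this illustration, and tokens are assumed to be separated by whitespace.</p>

```python
import re

def parse_expr(tokens, i):
    """Yield (tree, next_index) for each way an expr matches at position i."""
    for tok in tokens[i:i + 1]:
        if re.fullmatch(r"true|false|[0-9]+", tok):
            yield tok, i + 1

def parse_stmt(tokens, i):
    """Yield (tree, next_index) for every way a stmt matches at position i."""
    # Third alternative: a bare expr.
    yield from parse_expr(tokens, i)
    if tokens[i:i + 1] == ["if"]:
        for cond, j in parse_expr(tokens, i + 1):
            if tokens[j:j + 1] != ["then"]:
                continue
            for body, k in parse_stmt(tokens, j + 1):
                # First alternative: "if" without "else".
                yield ("if", cond, body), k
                # Second alternative: "if" with "else".
                if tokens[k:k + 1] == ["else"]:
                    for alt, m in parse_stmt(tokens, k + 1):
                        yield ("if", cond, body, "else", alt), m

def all_parses(text):
    """Return every tree that consumes the whole input."""
    tokens = text.split()
    return [tree for tree, end in parse_stmt(tokens, 0) if end == len(tokens)]

trees = all_parses("if true then if true then 1 else 2")
for tree in trees:
    print(tree)
```

<p>This prints two trees: one where the &#8220;else&#8221; belongs to the outer &#8220;if&#8221; and one where it belongs to the inner &#8220;if&#8221;.</p>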
<p>So I definitely know I will be introducing prioritized choice.  The question is what syntax to use.  Like I mentioned before, I would really really like to use &#8220;/&#8221; since there is a strong convention from the PEG world for it.  But unfortunately this introduces a major ambiguity in Gazelle&#8217;s own grammar, because it conflicts with the syntax for regex literals.  What does this mean?</p>
<p><code>s -> a / b;<br />
a -> c / d;</code></p>
<p>Is that two rules that both use prioritized choice, or is it one rule that has a giant regular expression in it?</p>
<p><code>s -> a <span style="color: red">/ b;<br />
a -> c /</span> d;</code></p>
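<p>Here is a small Python sketch of the two readings.  The &#8220;regex literal&#8221; pattern is a deliberately naive stand-in (anything between two slashes), not Gazelle&#8217;s actual lexer, and the splitting code is only meant to show what the prioritized-choice reading would produce.</p>

```python
import re

grammar = 's -> a / b;\na -> c / d;'

# Reading 1: "/" opens and closes a regex literal (naive, hypothetical rule).
# re.DOTALL lets the literal run across the newline between the two rules.
regex_literal = re.compile(r"/(.*?)/", re.DOTALL)
match = regex_literal.search(grammar)
regex_reading = match.group(1)
print(repr(regex_reading))

# Reading 2: "/" is prioritized choice, so each rule has two alternatives.
rules = [r.strip() for r in grammar.split(";") if r.strip()]
choice_reading = {
    r.split("->")[0].strip(): [alt.strip() for alt in r.split("->")[1].split("/")]
    for r in rules
}
print(choice_reading)
```

<p>The same characters yield either one rule containing the regex <code> b;\na -> c </code> or two rules with two alternatives each.</p>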
<p>I <i>really</i> don&#8217;t want to give up using slashes to introduce regular expression literals, because that is a really strong convention that a lot of programmers are familiar with.  But I also <i>really</i> don&#8217;t want to give up using the really strongly established convention of using slash for prioritized choice.</p>
<p>Anyone have an idea for a compromise I might like?  The only one I can think of is adding a prefix to the regex literals sort of like Perl&#8217;s <code>m{}</code> form for regular expressions.  It could be something like m/regex/.  But I hate to add that uglification to the grammar, especially since regular expressions are going to be SO much more common than prioritized choice.</p>
<p>Any ideas, anyone?</p>
<h3>Bonus: the other form of ambiguity resolution</h3>
<p>Choice/alternation isn&#8217;t the only possible source of grammar ambiguity.  Gazelle&#8217;s constructs for optional and/or repeating grammar fragments can also be ambiguous.  The if-then-else example is perhaps more idiomatically written in Gazelle like so:</p>
<p><code>stmt -> "if" expr "then" stmt ("else" stmt)? | expr;</code></p>
<p>It&#8217;s written differently, but it&#8217;s just as ambiguous as before.  So we need the same sort of tool to resolve this ambiguity: a way to say &#8220;prefer to match the optional part&#8221; or &#8220;prefer to NOT match the optional part.&#8221;  This is equivalent to defining a &#8220;greediness&#8221; for these operators, like Perl lets you do in regexes.  The syntax I am planning for these is:</p>
<p><code>s -> a b?;  // no preference to match the b or not<br />
s -> a b+;  // no preference to keep matching b or not<br />
s -> a b*;  // no preference to keep matching b or not</code></p>
<p><b>non-greedy variants</b><br />
<code>s -> a b?-;  // prefer to NOT match b (non-greedy)<br />
s -> a b+-;  // prefer to STOP matching b (non-greedy)<br />
s -> a b*-;  // prefer to STOP matching b (non-greedy)</code></p>
<p><b>greedy variants</b><br />
<code>s -> a b?+;  // prefer to KEEP matching b (greedy)<br />
s -> a b++;  // prefer to KEEP matching b (greedy)<br />
s -> a b*+;  // prefer to KEEP matching b (greedy)</code></p>
<p>I&#8217;m not following the Perl convention here (Perl uses &#8220;?&#8221; instead of &#8220;-&#8221; to make the match non-greedy).  But on the other hand:</p>
<ol>
<li>Perl doesn&#8217;t have a way to explicitly make the match greedy &#8212; matches are greedy by default.  So it has no convention for a &#8220;make greedier&#8221; operator.</li>
<li>Not very many people know or use Perl&#8217;s greediness operator anyway.</li>
</ol>
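<p>For anyone who hasn&#8217;t run into the Perl convention: Python&#8217;s <code>re</code> module follows it too, so a quick sketch of greedy versus non-greedy matching in regexes (this is only the regex analogy, not Gazelle syntax) looks like this:</p>

```python
import re

text = "abbb"

# Default quantifiers are greedy: keep matching b as long as possible.
greedy_star = re.match(r"ab*", text).group(0)
# A trailing "?" makes the quantifier non-greedy: stop as soon as possible.
lazy_star = re.match(r"ab*?", text).group(0)

# The same distinction applies to the optional operator.
greedy_opt = re.match(r"ab?", text).group(0)
lazy_opt = re.match(r"ab??", text).group(0)

print(greedy_star)  # abbb
print(lazy_star)    # a
print(greedy_opt)   # ab
print(lazy_opt)     # a
```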
<p>So to write the if-then-else example with explicit ambiguity resolution, you would write:</p>
<p><code>stmt -> "if" expr "then" stmt ("else" stmt)?+ | expr;</code></p>
<p>&#8230;because when there is an ambiguity, we want to bind the else to the most recent if.</p>
<h3>An aside: &#8220;great, so I can just start slapping greedy and non-greedy modifiers on everything, just to be safe, right?&#8221;</h3>
<p>Not so fast there.  Ambiguity resolution isn&#8217;t something you should take lightly, because a grammar with an ambiguity in it presents a user-facing irregularity in your language.  For example, the if-then-else example is a user-facing issue: either &#8220;else&#8221; pairing makes logical sense, and it&#8217;s totally arbitrary how you decide to resolve the ambiguity.  You have to tell your users about it.  You should strive to absolutely minimize the number of such ambiguities in your language whenever you can!</p>
<p>The danger of using ambiguity resolution operators willy-nilly is that you don&#8217;t necessarily know if they&#8217;re actually resolving ambiguities.  For example, if you wrote:</p>
<p><code>s -> "X" / "Y";</code></p>
<p>&#8230;the prioritized choice here is completely extraneous, since this grammar is unambiguous already.  Because ambiguities are so important to keep track of and educate your users about, you don&#8217;t want to have to wonder whether a prioritized choice is resolving an ambiguity or not.  You want to know that it is.</p>
<p>To this end, Gazelle will keep track of all ambiguity-resolution operators and report an error if any of them are never actually used to resolve an ambiguity.  So you can&#8217;t just sprinkle them around &#8220;just to be safe&#8221;; you must only use them where they are effective.  And this will make Gazelle grammars much more useful, because you will know exactly where all your ambiguities are.</p>
<p>By the way, Gazelle can&#8217;t detect ambiguity in <i>every</i> grammar; determining whether an arbitrary context-free grammar is ambiguous is undecidable.  Basically, all grammars fall into one of three categories:</p>
<ol>
<li>Gazelle can handle the grammar (i.e. the grammar is in the green region of the diagram in my previous entry).</li>
<li>Gazelle can handle the grammar <i>except</i> for the fact that it&#8217;s ambiguous.</li>
<li>Gazelle cannot handle the grammar, and would still not be able to even if you resolved the ambiguities.</li>
</ol>
<p>If your grammar falls into category 3, Gazelle doesn&#8217;t know whether the grammar is ambiguous or not.  But for grammars in categories 1 or 2, Gazelle knows exactly where the ambiguities are.</p>
<h3>Bonus #2: a word about Parsing Expression Grammars</h3>
<p>Something you might say (especially if you&#8217;re a PEG fan) is &#8220;gosh Josh, if you&#8217;re going to all this work to introduce ambiguity resolution, why not just use PEG&#8217;s flat-out and abandon all this LL(*) nonsense?&#8221;  Because essentially what I&#8217;m doing is supporting both PEG-based constructs (like ordered choice) <i>and</i> context-free grammar constructs like <i>non</i>-ordered choice.  And PEG&#8217;s support a larger set of grammars than Gazelle ever will (I <i>believe</i> they support any non-left-recursive PEG, though I don&#8217;t have a reference for this).</p>
<p>There are two really major reasons why I don&#8217;t think PEG&#8217;s are the way to go.  The first one is related to the point I was making before.  With PEG&#8217;s all choice is prioritized choice.  However, with PEG-based tools you have no idea if there are real ambiguities in your grammar or not.  PEG&#8217;s are defined such that ambiguity does not exist, but this does not make if-then-else style issues go away, it just sweeps them under the rug.  Even if you don&#8217;t call it &#8220;ambiguity,&#8221; it is still a user-facing issue that you as a language designer definitely want to know about!  If you have the following in a PEG:</p>
<p><code>s -> a / b;</code></p>
<p>&#8230;you have no way of knowing if this represents an ambiguity, which means that you don&#8217;t know if changing the grammar to:</p>
<p><code>s -> b / a;</code></p>
<p>&#8230;will make a substantive difference or not!  You&#8217;re flying blind.</p>
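<p>A tiny Python sketch shows how much the order can matter.  The <code>ordered_choice</code> helper below is hypothetical and handles only literal strings; it just mimics the &#8220;first alternative that matches wins&#8221; rule of PEG prioritized choice.</p>

```python
def ordered_choice(alternatives, text):
    """PEG-style prioritized choice over literal strings: first match wins."""
    for literal in alternatives:
        if text.startswith(literal):
            return literal
    return None

a, b = "if", "ifdef"

# Both alternatives can match this input, so the order decides the result.
first = ordered_choice([a, b], "ifdef X")
second = ordered_choice([b, a], "ifdef X")
print(first)   # if
print(second)  # ifdef

# With alternatives that never overlap, the order makes no difference.
print(ordered_choice(["X", "Y"], "Y"))  # Y
print(ordered_choice(["Y", "X"], "Y"))  # Y
```

<p>Nothing in the grammar itself tells you which of these two situations you are in.</p>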
<p>The second reason is that Gazelle&#8217;s parsing algorithm will be far, far faster than packrat parsing (which is the algorithm used to parse PEG&#8217;s).  Both algorithms will have linear asymptotic complexity (as I briefly mentioned before, I believe that LL(*) parsing can degrade to O(n^2) in degenerate cases, though I will argue that this will never happen in any sane language).  But though both are linear, packrat parsing&#8217;s constant factor is far higher, and packrat parsing uses O(n) space where n is the size of the input.</p>
<p>This might all sound like hot air, but check out these real, hard numbers.  On the homepage of the <a href="http://aurochs.fr/">Aurochs</a> parser generator, which uses PEG&#8217;s and packrat parsing to parse JavaScript, they say:</p>
<blockquote><p><b>How much memory does it use?</b></p>
<p>An awful lot.</p>
<table>
<tbody>
<tr>
<th>Grammar</th>
<th>Input size</th>
<th>Memory consumption</th>
<th>CPU time</th>
<th>Memory overhead</th>
</tr>
<tr>
<td>Javascript</td>
<td>140kb</td>
<td>380Mb</td>
<td>1s</td>
<td>2700</td>
</tr>
<tr>
<td>Javascript</td>
<td>71kb</td>
<td>180Mb</td>
<td>0.45s</td>
<td>2500</td>
</tr>
<tr>
<td>Javascript</td>
<td>1717</td>
<td>5.6Mb</td>
<td>0.01s</td>
<td>3200</td>
</tr>
</tbody>
</table>
</blockquote>
<p>I can&#8217;t vouch for whether this is a great implementation of packrat parsing, but the authors sure seem to know what they&#8217;re doing.  I hope you&#8217;ll agree that 380Mb of memory just to parse 140kb of text is horrendous, and 1s is quite slow too.  Gazelle can&#8217;t parse JavaScript yet, but here is a rough ballpark estimate of what the performance will be like to parse 140kb of text once it can:</p>
<ul>
<li><b>memory</b>: Gazelle uses 24 bytes per entry on its stack, plus 12 bytes for every token of lookahead.  These are the only non-fixed costs.  If we assume a parse tree that goes 100 deep, and an input that requires 100 tokens of lookahead (both ridiculously large assumptions), Gazelle&#8217;s memory footprint is 3.6kb.  This is 0.0009% of the packrat parser&#8217;s memory requirement.</li>
<li><b>CPU</b>: Gazelle currently parses JSON at 18.8Mb/s.  JavaScript will be more complicated than JSON to parse, but Gazelle will also undergo a lot of optimization that hasn&#8217;t been done yet, so I&#8217;ll go with the 18.8Mb/s figure.  That&#8217;s 134x faster than the 140kb/s Aurochs is quoting.</li>
</ul>
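<p>The arithmetic behind those two bullets, spelled out in Python.  The only inputs are the figures quoted above; treating &#8220;kb&#8221; and &#8220;Mb&#8221; as decimal units (1000 and 1000000 bytes) is an assumption made here.</p>

```python
STACK_ENTRY_BYTES = 24       # per stack entry
LOOKAHEAD_TOKEN_BYTES = 12   # per token of lookahead
depth = 100                  # assumed parse tree depth
lookahead = 100              # assumed tokens of lookahead

# Memory: Gazelle's non-fixed costs versus the packrat figure of 380Mb.
gazelle_bytes = STACK_ENTRY_BYTES * depth + LOOKAHEAD_TOKEN_BYTES * lookahead
packrat_bytes = 380 * 1000 * 1000
memory_percent = 100 * gazelle_bytes / packrat_bytes

# CPU: 18.8Mb/s versus 140kb in 1s, both expressed in kb/s.
speedup = (18.8 * 1000) / 140

print(gazelle_bytes)             # 3600 bytes, i.e. 3.6kb
print(round(memory_percent, 5))  # 0.00095, i.e. roughly 0.0009%
print(round(speedup))            # 134
```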
<p>Like I said, the Aurochs authors seem to be very smart &#8212; I&#8217;m not saying they&#8217;ve done a bad job.  I&#8217;m just saying that if Aurochs is any indication of what we can expect from packrat parsing, it is orders of magnitude slower, and takes enormous amounts of memory even for modest amounts of input text.  If anyone has better numbers to show for packrat parsing, I&#8217;m all ears.</p>
<p>So even though it&#8217;s more work, I&#8217;m 110% convinced that using this combination of LL(*) with explicit ambiguity resolution is the way to go, and is the best way to achieve the goals that I have set out for this project.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.reverberate.org/2008/08/10/a-syntactic-dilemma-and-an-intro-to-gazelles-ambiguity-resolution/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

