Archive

Archive for February, 2009

Microsoft Gazelle?

February 22nd, 2009 Josh No comments

Looks like there’s another Gazelle in the herd. Microsoft Research has released a paper about a web browser called Gazelle:

In this paper, we introduce Gazelle, a secure web browser constructed as a multi-principal OS. Gazelle’s Browser Kernel is an operating system that exclusively manages resource protection and sharing across web site principals.

It’s easy to have my first reaction be annoyance. After all, I’ve been using the “Gazelle” name for over a year. For the same reason I was a bit annoyed when a BitTorrent tracker named Gazelle was released last year.

Then I remember that my “Gazelle” isn’t all that discoverable, even for someone looking to do due diligence about this. For reasons I don’t fully understand, my Gazelle is not returned very high in Google search listings (proof that working at Google isn’t a guarantee of good search listings). Here are my search rankings for various search phrases:

  • gazelle software: result 83 (blog)
  • gazelle project: result 86 (blog)
  • gazelle program: result 89 (blog)
  • gazelle system: result 4 (project homepage)
  • gazelle download: result 34 (project homepage)
  • gazelle parse: result 1 (github), result 2 (project homepage)
  • gazelle parser: result 1 (project homepage)

Conclusion: unless you already know Gazelle is for parsing, or you think to call it a “system,” it’s pretty impossible to find it by its name. The reason I tried “gazelle system” is because the HTML title and headline on the homepage is: Gazelle: a system for building fast, reusable parsers. And that seemed to pay off — it has a much higher ranking for that phrase.

Guess I need to learn some search engine optimization. I should either use the words “project”, “software”, etc more on the homepage, or I should put them in a meta tag or something.

Not that Microsoft would have decided to change the name of their browser thing, but being discoverable is good in general.

In related news, I got the domain gazelle-parser.org, and replicated the homepage there, but I haven’t updated any links to point to it yet. Part of my vision for this site is that it will host a repository of known-good grammars, and let you parse snippets of text for these grammars through a web interface.

Categories: Uncategorized Tags:

one malloc to rule them all

February 20th, 2009 Josh 3 comments

Came across this on Reddit today:

TLSF is a general purpose dynamic memory allocator specifically designed to meet real-time requirements:

  • Bounded Response Time . The worst-case execution time (WCET) of memory allocation and deallocation has got to be known in advance and be independent of application data. TLSF has a constant cost O(1).
  • Fast. Additionally to a bounded cost, the allocator has to be efficient and fast enough. TLSF executes a maximum of 168 processor instructions in a x86 architecture. Depending on the compiler version and optimisation flags, it can be slightly lower or higher.
  • Efficient Memory Use . Traditionally, real-time systems run for long periods of time and some (embedded applications), have strong constraints of memory size. Fragmentation can have a significant impact on such systems. It can increase dramatically, and degrade the system performance. A way to measure this efficiency is the memory fragmentation incurred by the allocator. TLSF has been tested in hundreds of different loads (real-time tasks, general purpose applications, etc.) obtaining an average fragmentation lower than 15 %. The maximum fragmentation measured is lower than 25%.

So put another way, this malloc is fast, bounded in its response, and memory-efficient! Apparently it has no downsides. Just what I like to hear!

That’s the crazy thing about malloc implementations. They all claim to be awesome in every way. So the burning question is: is this better than what I’m using today? I don’t know, but I’d sure like to!

I think that malloc() is an under-appreciated source of slowness in many applications these days. Apple’s fantastic profiler Shark even has a malloc() tracing mode to help discover malloc-related sources of slowness in a program. From the link above:

In today’s large and complex software applications, it is often informative to understand the scope and pattern of memory allocations and deallocations. If you understand how your software objects are being created and destroyed, you can often improve the overall performance of your application significantly, since memory allocation is a very expensive operation in Mac OS X, especially for large blocks of memory. If your program suffers from memory leaks, Malloc Trace is also a good way to identify locations in your program that may be the culprits.

I wish I had real data about this, but I strongly suspect that the reason that Firefox (for example) slows down the longer it’s open is because malloc() calls start taking longer and longer. Have you ever noticed that after you’ve had Firefox open for a long time, even just opening a new blank tab takes a second or two? What does opening a new tab have to do but allocate memory?

So personally I’d love to be running a real-time and O(1) malloc() on my desktop. And the web page claims that it’s not only real-time but “real fast,” which is even better.

This made me think of Jason Evans, author of jemalloc, which he wrote for FreeBSD to scale better across multiprocessors (since malloc() can be a source of contention). He was also working with the Firefox guys to have Firefox 3 use jemalloc, though I’m not sure if that happened or not.

In any case, Jason Evans seems to have come to the conclusion mid-last-year that using red-black trees (an O(lg n) data structure) in jemalloc was a mistake and that he should have gone with O(1) (as TLSF above does). And it sounds like he eventually changed jemalloc to use O(1) algorithms instead. From his blog entry Overzealous use of my red-black tree hammer:

Once Firefox was successfully using jemalloc for all memory allocation, I started doing performance tests. Memory usage differences were minor, but jemalloc was consistently slower than OS X’s allocator. It took a lot of profiling for me to finally accept the hard truth: jemalloc was spending way too much time manipulating red-black trees. My first experimental solution was to replace red-black trees with treaps. However, this made little overall difference. So, the problem was too many tree operations, not slow tree operations.

After a bit of code review, it became clear that when I fixed a page allocation bottleneck earlier this year, I was overzealous with the application of red-black trees. It is possible to use constant-time algorithms based on linear page map data structures for splitting/coalescing sequential runs of pages, but I had re-coded these operations entirely using red-black trees. So, I enhanced the page map data structures to support splitting/coalescing, and jemalloc became markedly faster. For example, Firefox sped up by as much as ~10% on JavaScript-heavy benchmarks. (As a side benefit, memory usage went down by 1-2%).

In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.

I’m not sure if the changes he made mean that jemalloc is now real-time, but I would be interested to know. I would also be quite interested to see benchmarks between all the mallocs floating around. The ones I know of are:

  • TLSF
  • jemalloc
  • dlmalloc (Doug Lea’s malloc, been around forever)
  • ptmalloc2 (Doug Lea’s malloc, extended to support per-thread arenas. Default allocator for glibc2.3?)
  • TCMalloc (Google’s malloc, claims to be 6x faster than ptmalloc2)
  • nedmalloc (claims to be faster than tcmalloc)
  • Hoard, also claims to be very fast

I think an open-source malloc benchmark that compared all of these under tons of different scenarios would be so so awesome to have. Everyone claims their malloc is the fastest. What is the truth? It could be that different implementations excel in different scenarios. After all, malloc performance has several dimensions:

  1. size of the allocations
  2. how the allocations are distributed across threads (for thread-safe allocators)
  3. current size of the heap
  4. max size the heap has had
  5. number of operations that have been performed (eg. how long has the app been open?)

And consequently there are several factors to measure:

  1. speed of each operation (obviously)
  2. worst-case performance
  3. memory fragmentation
  4. locality (if allocations that happen closely in time are also close in space, they will be more cache-friendly)

So dear college students who are interested in systems and want to do something awesome: create an awesome, open-source malloc benchmark suite!

Categories: Uncategorized Tags:

Static Lua

February 9th, 2009 Josh 7 comments

Following up on my last post, here’s what I’d like to do someday (read: in the future. not now).

I’d like to create a language that is as much like Lua as possible, but adds just enough to make it static and type-safe. Specifically:

  1. all variables must be defined before use.
  2. all variables and function parameters must have a type specified. though it would support “auto” that can infer the type from the initializer. there would be typenames for all of lua’s built-in types, and classes would have names too.
  3. there would be a “class” construct. just enough to let you specify the class name, parent class, maybe abstract interfaces you implement, and then a class definition inside curlies.
  4. inside a class definition, you would define your instance methods with the same syntax that you declare faux instance methods in Lua today.
  5. inside a class definition you can also put instance variables.
  6. class methods and instance variables can be public, private, protected, etc.
  7. objects would show up as a “userdata” in Lua, which would let you use them just like regular Lua faux classes, except that referencing nonexistent fields would throw an error.

The benefits would be:

  1. Static Lua and Lua could operate seamlessly.
  2. Static Lua would have much better performance because (1) type checks would be performed at compile time instead of runtime, (2) property accesses could be array offsets instead of hash table lookups, and (3) all types are known at compile time so the compiler could generate much more optimized code.
  3. Static Lua would have all the benefits of static languages, and Lua would have all the benefits of dynamic languages. But they would both be runtime-loadable, and could both be JITted using LLVM.

And to Java people who would say that I’m re-inventing the JVM, I would like to remind you that I happen to actually care about memory footprint, startup time (scroll all the way to the bottom to find Groovy), and having a lightweight (small, easy-to-distribute) runtime embeds nicely and plays well with others. Maybe the right implementation of Java would be what I’m looking for, but I’m not that interested in a joining a language community that has demonstrated for 15 years that it has no regard for such things.

I’m a die-hard minimalist, and will never be satisfied building on top of runtimes that do not have the same appreciation for minimalism. There’s a reason zlib (for example) is absolutely everywhere you look today. It’s the tiniest and most unobtrusive C library you could possibly imagine for doing compression. There is absolutely nothing about it that can make anybody say “it has great functionality, but [I only have 64k of RAM]/[I need to compress 1TB streams with much less than 1TB of RAM]/[I need to have control over all memory allocation]/[I need to be able to impose hard limits on its resource usage].”. No matter what you’re doing, zlib is there for you, because it imposes the absolutely minimal amount of requirements on whatever is embedding it.

I want the same to be true for Gazelle, my protobuf implementation, and pretty any software I’m writing for the ages.

Categories: Uncategorized Tags:

Disappointment

February 9th, 2009 Josh 2 comments

I have spent the entire weekend hacking on Gazelle and I have to say I’m somewhat disappointed in my results.

I was looking quite forward to Gazelle-hacking this weekend. There is so much I want to do — so many things completely designed in my head that only need to be typed out. Specifically:

I’ve been itching to work on all these things, and had the free time this weekend to do it.

So I sat down Friday night and looked at the first Google Code issue I intended to fix: Regex declaration are not taken into account if declared after they are used. When I thought about the right way to solve this problem, I realized that I needed to bind symbols lazily. If I’m parsing a grammar file and I see:

a -> b;

If “b” hasn’t been defined yet, I don’t know if it’s going to be a rule or a named regex. So I need to put a placeholder in “a” definition that I can bind later, once I know what “b” is.

And when I thought about ways to do that, an intuitive feeling came over me that I’ve reached a point with the Gazelle compiler that I need to start imposing a bit more structure and discipline. Don’t get me wrong, I’m proud of the compiler’s code in many ways. It’s got lots of useful comments and a pretty good design. But when you’re using a dynamic language like Lua that gives you functions but not classes, you find yourself writing things like this:

if fa.is_nonterm(self._transitions[1][1])
  -- ...
end

What is this table nested two deep? What does indexing into [1][1] mean? You just tend to get these tables where the knowledge about what the structure means is all implicit. Here’s the worst example I found (this one I’m *not* proud of). This expression occurs in the middle of a string concatenation:

gla_state.rtn_paths:to_array()[1].prediction[2].rtn.name

What are all of those intermediate data structures about? Well, if you search in the source file for “prediction” you find the answer:

obj.prediction = {predicted_edge, predicted_dest_state}

Ok, so this “prediction” thing is just a sort of ad-hoc tuple of (edge, dest_state), so the prediction[2] index above was pulling out the predicted dest state. But that wasn’t at all obvious from the call site.

Another thing that tends to happen with a language where adding properties is easy (like it is in Lua) is that you start attaching properties to objects in random places. Especially with the Gazelle compiler where there are all these graph data structures floating around that are related to each other in various ways, you’ll be writing some code and find that you just need a reference to this one other related object. Like you’ll have a handle to a state in a graph, and you want a reference to the graph itself. So you’ll find someplace else in the program where the state had a reference to that graph, and just slip in a quick little:

state.graph = graph

…and presto, your other code can now just read state.graph and you’ve got what you need. Everybody’s happy!

Except that having too much of this stuff accumulate can make your program hard to reason about. What are all the properties that these state objects ever have? Where are these properties added? When can you count on them being set? Is this the right list of properties for this object? Are the properties all orthogonal, or are some of them redundant? These questions are very difficult to answer when you have random code setting and reading properties in a dynamically-typed language like Lua.

For this reason I’ve got to respectfully disagree with Steve Yegge about the attractiveness of the properties pattern. The biggest bummer about it is that there’s no place that you write down the list of properties for each type and what they mean. Without that, it’s hard for other people to get the high-level picture about your program’s design, and it’s hard for you as the program’s author to think about the design of the program and how it should be refactored.

So I had an intuitive feeling that now was the time to introduce a little bit more structure. And my idea was to implement a very small subset of Ruby’s object model in Lua. Lua is a lovely language in that it gives you primitives that give you the tools you need to do a pretty convincing job of implementing whatever object-oriented model you like. And I’m a big fan of Ruby’s object model.

So I sat down on Friday night, thinking this would take me a few hours of hacking, after which I could move on to the work I really wanted to do (the list I mentioned above).

I’m a chronic underestimator, but this instance was pretty extreme. All in all, I spent about 20 hours on this problem this weekend. And while I finished what I set out to do, the performance is so disappointing that I’m not sure I’ll stick with it.

My primary goals were:

  1. encapsulation. I want to know that an object’s instance variables are only modified in methods of that object. as a program grows, you’ve got to have lines you can draw around chunks of code and say “I can reason about this code in isolation without worrying that its state can be mutated in unexpected ways.”
  2. inheritance. I was using the OO scheme that’s explained in the Lua book, but it’s really bizarre in some ways. for example, to derive from a class you create an instance of it, which becomes the subclass (!?), and then you add methods to it and use it to construct other instances.
  3. a well-defined set of properties for each object. I want this list somewhere so I can docment it and reason about it, and I want to prevent typos

With Ruby as my inspiration, I took a first go at it on Friday night until about 3AM. My first iteration supported properties even: if you said foo.bar = 5, that would turn into a foo.set_bar(5) call. I got it working, but it was super complex. I really wanted to keep the complexity down.

I went through about two more iterations before arriving at this:

object.lua (read the comments at the top of the file for more explanation and usage).

Rather happy with the interface at this point, I started porting a lot of the compiler’s classes to this new object system. This was an extremely long and painful process, and the final result was this diff:

700 lines of pain.

It would have all been worth it except for the horrible moment when I ran my unit test suite and saw this:

$ time make test
[...]
real 0m2.361s
user 0m1.889s
sys 0m0.070s

Compare this with almost the same test suite directly before the port:

$ time make test
[...]
real 0m0.633s
user 0m0.500s
sys 0m0.043s

A 4x slowdown is just more than I can swallow for this.

So I’m very sad to say that after an intense weekend of hacking, I’m pretty much back where I started, except for the experience of having written a small subset of Ruby’s object model in Lua. It’s very nice to use, but it’s just too slow. And a time profile doesn’t have too much low-hanging fruit. :(

I still need a way to manage complexity as the compiler grows. I think my next iteration may be to just create a list of properties for each object and prevent assignment to any property that’s not on the list. I won’t get encapsulation with this scheme, but at least I’ll get a list of properties for each object, which will give me more a way to reason about the design of my objects. For example, just looking at the latest version of compiler/fa.lua is illuminating, because all of the objects have attr_accessor/attr_reader/attr_writer for all the properties that they are ever assigned. Before this change, I would have had no easy way of telling you this information.

Still, the weekend has been a disappointment. I don’t so much enjoy any of this object system hacking for its own sake, especially when I have so much truly useful and ground-breaking work just on the tip of my tongue! I want to find a solution that I’m satisfied with, so I can get it over with and get back to real work.

I have to say, sometimes I really think that what I want is a lightweight, embeddable strongly typed language. Sure, it’s nice when you’re writing little programs not to have to specify the structure of your program with lots of classes, type, and prototype declarations. But when your program grows to a certain size, the ability to impose structure can be extremely freeing, because explicit structure implies predictability. Predictability that when you get called, you’ll get called with the right number of arguments of the right types. Predictability that the contents of your objects are what is specified in your class definition, no more and no less. Predictability that when you’re implementing an interface, you implemented all the methods that were pure virtual in the base class (because if you hadn’t the compiler would have yelled at you).

I’m using Lua because it’s a language that has a very small and efficient implementation, and also has implementations for other runtimes like the JVM. If there was a statically-typed language with these same characteristics, I would be tempted to use that instead. It could probably be more efficient too, because statically-typed languages can do offset-based member lookup instead of having everything be hash table lookups.

But like I said, I’d much rather stop talking about all of this and get back to parsing. But I want to find a strategy for managing the growing complexity of Gazelle’s compiler.

Categories: Uncategorized Tags: