Archive

Archive for February 9th, 2009

Static Lua

February 9th, 2009 Josh 7 comments

Following up on my last post, here’s what I’d like to do someday (read: in the future. not now).

I’d like to create a language that is as much like Lua as possible, but adds just enough to make it static and type-safe. Specifically:

  1. all variables must be defined before use.
  2. all variables and function parameters must have a type specified. though it would support “auto” that can infer the type from the initializer. there would be typenames for all of lua’s built-in types, and classes would have names too.
  3. there would be a “class” construct. just enough to let you specify the class name, parent class, maybe abstract interfaces you implement, and then a class definition inside curlies.
  4. inside a class definition, you would define your instance methods with the same syntax that you declare faux instance methods in Lua today.
  5. inside a class definition you can also put instance variables.
  6. class methods and instance variables can be public, private, protected, etc.
  7. objects would show up as a “userdata” in Lua, which would let you use them just like regular Lua faux classes, except that referencing nonexistent fields would throw an error.

The benefits would be:

  1. Static Lua and Lua could operate seamlessly.
  2. Static Lua would have much better performance because (1) type checks would be performed at compile time instead of runtime, (2) property accesses could be array offsets instead of hash table lookups, and (3) all types are known at compile time so the compiler could generate much more optimized code.
  3. Static Lua would have all the benefits of static languages, and Lua would have all the benefits of dynamic languages. But they would both be runtime-loadable, and could both be JITted using LLVM.

And to Java people who would say that I’m re-inventing the JVM, I would like to remind you that I happen to actually care about memory footprint, startup time (scroll all the way to the bottom to find Groovy), and having a lightweight (small, easy-to-distribute) runtime embeds nicely and plays well with others. Maybe the right implementation of Java would be what I’m looking for, but I’m not that interested in a joining a language community that has demonstrated for 15 years that it has no regard for such things.

I’m a die-hard minimalist, and will never be satisfied building on top of runtimes that do not have the same appreciation for minimalism. There’s a reason zlib (for example) is absolutely everywhere you look today. It’s the tiniest and most unobtrusive C library you could possibly imagine for doing compression. There is absolutely nothing about it that can make anybody say “it has great functionality, but [I only have 64k of RAM]/[I need to compress 1TB streams with much less than 1TB of RAM]/[I need to have control over all memory allocation]/[I need to be able to impose hard limits on its resource usage].”. No matter what you’re doing, zlib is there for you, because it imposes the absolutely minimal amount of requirements on whatever is embedding it.

I want the same to be true for Gazelle, my protobuf implementation, and pretty any software I’m writing for the ages.

Categories: Uncategorized Tags:

Disappointment

February 9th, 2009 Josh 2 comments

I have spent the entire weekend hacking on Gazelle and I have to say I’m somewhat disappointed in my results.

I was looking quite forward to Gazelle-hacking this weekend. There is so much I want to do — so many things completely designed in my head that only need to be typed out. Specifically:

I’ve been itching to work on all these things, and had the free time this weekend to do it.

So I sat down Friday night and looked at the first Google Code issue I intended to fix: Regex declaration are not taken into account if declared after they are used. When I thought about the right way to solve this problem, I realized that I needed to bind symbols lazily. If I’m parsing a grammar file and I see:

a -> b;

If “b” hasn’t been defined yet, I don’t know if it’s going to be a rule or a named regex. So I need to put a placeholder in “a” definition that I can bind later, once I know what “b” is.

And when I thought about ways to do that, an intuitive feeling came over me that I’ve reached a point with the Gazelle compiler that I need to start imposing a bit more structure and discipline. Don’t get me wrong, I’m proud of the compiler’s code in many ways. It’s got lots of useful comments and a pretty good design. But when you’re using a dynamic language like Lua that gives you functions but not classes, you find yourself writing things like this:

if fa.is_nonterm(self._transitions[1][1])
  -- ...
end

What is this table nested two deep? What does indexing into [1][1] mean? You just tend to get these tables where the knowledge about what the structure means is all implicit. Here’s the worst example I found (this one I’m *not* proud of). This expression occurs in the middle of a string concatenation:

gla_state.rtn_paths:to_array()[1].prediction[2].rtn.name

What are all of those intermediate data structures about? Well, if you search in the source file for “prediction” you find the answer:

obj.prediction = {predicted_edge, predicted_dest_state}

Ok, so this “prediction” thing is just a sort of ad-hoc tuple of (edge, dest_state), so the prediction[2] index above was pulling out the predicted dest state. But that wasn’t at all obvious from the call site.

Another thing that tends to happen with a language where adding properties is easy (like it is in Lua) is that you start attaching properties to objects in random places. Especially with the Gazelle compiler where there are all these graph data structures floating around that are related to each other in various ways, you’ll be writing some code and find that you just need a reference to this one other related object. Like you’ll have a handle to a state in a graph, and you want a reference to the graph itself. So you’ll find someplace else in the program where the state had a reference to that graph, and just slip in a quick little:

state.graph = graph

…and presto, your other code can now just read state.graph and you’ve got what you need. Everybody’s happy!

Except that having too much of this stuff accumulate can make your program hard to reason about. What are all the properties that these state objects ever have? Where are these properties added? When can you count on them being set? Is this the right list of properties for this object? Are the properties all orthogonal, or are some of them redundant? These questions are very difficult to answer when you have random code setting and reading properties in a dynamically-typed language like Lua.

For this reason I’ve got to respectfully disagree with Steve Yegge about the attractiveness of the properties pattern. The biggest bummer about it is that there’s no place that you write down the list of properties for each type and what they mean. Without that, it’s hard for other people to get the high-level picture about your program’s design, and it’s hard for you as the program’s author to think about the design of the program and how it should be refactored.

So I had an intuitive feeling that now was the time to introduce a little bit more structure. And my idea was to implement a very small subset of Ruby’s object model in Lua. Lua is a lovely language in that it gives you primitives that give you the tools you need to do a pretty convincing job of implementing whatever object-oriented model you like. And I’m a big fan of Ruby’s object model.

So I sat down on Friday night, thinking this would take me a few hours of hacking, after which I could move on to the work I really wanted to do (the list I mentioned above).

I’m a chronic underestimator, but this instance was pretty extreme. All in all, I spent about 20 hours on this problem this weekend. And while I finished what I set out to do, the performance is so disappointing that I’m not sure I’ll stick with it.

My primary goals were:

  1. encapsulation. I want to know that an object’s instance variables are only modified in methods of that object. as a program grows, you’ve got to have lines you can draw around chunks of code and say “I can reason about this code in isolation without worrying that its state can be mutated in unexpected ways.”
  2. inheritance. I was using the OO scheme that’s explained in the Lua book, but it’s really bizarre in some ways. for example, to derive from a class you create an instance of it, which becomes the subclass (!?), and then you add methods to it and use it to construct other instances.
  3. a well-defined set of properties for each object. I want this list somewhere so I can docment it and reason about it, and I want to prevent typos

With Ruby as my inspiration, I took a first go at it on Friday night until about 3AM. My first iteration supported properties even: if you said foo.bar = 5, that would turn into a foo.set_bar(5) call. I got it working, but it was super complex. I really wanted to keep the complexity down.

I went through about two more iterations before arriving at this:

object.lua (read the comments at the top of the file for more explanation and usage).

Rather happy with the interface at this point, I started porting a lot of the compiler’s classes to this new object system. This was an extremely long and painful process, and the final result was this diff:

700 lines of pain.

It would have all been worth it except for the horrible moment when I ran my unit test suite and saw this:

$ time make test
[...]
real 0m2.361s
user 0m1.889s
sys 0m0.070s

Compare this with almost the same test suite directly before the port:

$ time make test
[...]
real 0m0.633s
user 0m0.500s
sys 0m0.043s

A 4x slowdown is just more than I can swallow for this.

So I’m very sad to say that after an intense weekend of hacking, I’m pretty much back where I started, except for the experience of having written a small subset of Ruby’s object model in Lua. It’s very nice to use, but it’s just too slow. And a time profile doesn’t have too much low-hanging fruit. :(

I still need a way to manage complexity as the compiler grows. I think my next iteration may be to just create a list of properties for each object and prevent assignment to any property that’s not on the list. I won’t get encapsulation with this scheme, but at least I’ll get a list of properties for each object, which will give me more a way to reason about the design of my objects. For example, just looking at the latest version of compiler/fa.lua is illuminating, because all of the objects have attr_accessor/attr_reader/attr_writer for all the properties that they are ever assigned. Before this change, I would have had no easy way of telling you this information.

Still, the weekend has been a disappointment. I don’t so much enjoy any of this object system hacking for its own sake, especially when I have so much truly useful and ground-breaking work just on the tip of my tongue! I want to find a solution that I’m satisfied with, so I can get it over with and get back to real work.

I have to say, sometimes I really think that what I want is a lightweight, embeddable strongly typed language. Sure, it’s nice when you’re writing little programs not to have to specify the structure of your program with lots of classes, type, and prototype declarations. But when your program grows to a certain size, the ability to impose structure can be extremely freeing, because explicit structure implies predictability. Predictability that when you get called, you’ll get called with the right number of arguments of the right types. Predictability that the contents of your objects are what is specified in your class definition, no more and no less. Predictability that when you’re implementing an interface, you implemented all the methods that were pure virtual in the base class (because if you hadn’t the compiler would have yelled at you).

I’m using Lua because it’s a language that has a very small and efficient implementation, and also has implementations for other runtimes like the JVM. If there was a statically-typed language with these same characteristics, I would be tempted to use that instead. It could probably be more efficient too, because statically-typed languages can do offset-based member lookup instead of having everything be hash table lookups.

But like I said, I’d much rather stop talking about all of this and get back to parsing. But I want to find a strategy for managing the growing complexity of Gazelle’s compiler.

Categories: Uncategorized Tags: