…reads the latest comment on my last entry. It’s true, I’ve gone silent for a few months. Work on pbstream has languished. The answer to “what happened?” is a combination of a personal life that got crazy and a brick wall I hit in the design of pbstream. But I think I have resolved the latter of those at least and work is progressing again.
So what is the brick wall I hit with pbstream? I had conflicting goals that I couldn’t figure out how to reconcile. One goal was to include as little policy related to memory management as humanly possible. In other words, if you’re using pbstream to store fully-parsed protobufs in memory — I’ll use the SAX vs. DOM analogy again, where this is the DOM case — then I want pbstream to not define any semantics for how the memory for that tree is managed. Is it reference-counted, garbage-collected, or is it just allocated/deallocated in one big chunk? Every runtime already has its own answer to this question. I don’t want to include a memory management strategy, which adds to the code size and complexity, just to have it be an annoying thing that you have to integrate into whatever runtime you’re already using.
Or, as I said in a previous entry: Oh fantastic! Because definitely the one thing that my application doesn’t already have is a memory manager.
On the other hand, I realized that I really do want the ability to pass an in-memory protobuf representation between language implementations. For example, if there’s a C library that conjures up some complex protobuf, I want to be able to pass that parsed in-memory protobuf to Python without serializing/deserializing. I want different runtimes to be able to look at this memory and make sense of it.
So far so good. This alone doesn’t require memory management. But I also wanted two things that do pose a memory-management problem:
- I wanted to pass protobufs between language implementations without copying.
- I wanted one language implementation to be able to modify the protobuf that had been created by a different language implementation.
The first is a problem because if you take a protobuf you created in C and pass it to Python, then once the Python code returns, C has no idea whether Python has kept references to any of the protobuf’s data. If C frees the protobuf, Python could later try to read from the freed memory. If C doesn’t free the protobuf, the memory will leak. The fundamental problem is that C and Python have no way of cooperating to be sure the memory is freed only when the last reference that either of them holds is dropped.
The second is a problem for a similar reason. Suppose you take a protobuf you created in C and pass it to Python, and Python wants to mutate that protobuf. Suppose the protobuf has a submessage, which is implemented as a reference to that other message. If in Python you say:
message.submessage = other_submessage
…Python now has to decide what to do with the submessage that message already had. Should it free it or not? If you free it and C had a reference to it, C references freed memory. If you don’t free it and C doesn’t have a reference to it, it leaks.
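To make the dilemma concrete, here is a small Python simulation. Everything in it is hypothetical and only for illustration: `Block` stands in for a manually managed allocation with no reference count, and `run` plays out the assignment under both policies ("free the old submessage" or "don't") and both situations ("C kept a pointer" or "C didn't"). It is a sketch of the reasoning above, not pbstream code.

```python
class Block:
    """Stand-in for a manually managed allocation (no reference count)."""

    def __init__(self, label):
        self.label = label
        self.freed = False


def replace_submessage(message, new_sub, free_old):
    """Models what Python must do for `message.submessage = other_submessage`."""
    old = message["submessage"]
    message["submessage"] = new_sub
    if free_old:
        old.freed = True  # models calling free() on the old submessage
    return old


def run(c_keeps_pointer, free_old):
    """Return (dangling, leaked) for one combination of situation and policy."""
    old_sub = Block("old")
    message = {"submessage": old_sub}
    # C may or may not have kept its own pointer; Python cannot tell which.
    c_pointer = old_sub if c_keeps_pointer else None
    replace_submessage(message, Block("new"), free_old)
    dangling = c_pointer is not None and c_pointer.freed
    leaked = c_pointer is None and not old_sub.freed
    return dangling, leaked


outcomes = {
    (keeps, frees): run(keeps, frees)
    for keeps in (True, False)
    for frees in (True, False)
}
```

Each policy is safe in one situation and wrong in the other: freeing dangles when C kept a pointer, and not freeing leaks when C didn't. Without shared bookkeeping, Python has no way to know which situation it is in.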
I fretted over the question of what to do about this for weeks. I didn’t want to give up the ability to pass protobufs between languages without serializing/deserializing. But I was loath to include the slightest amount of memory management code because of the implied complexity, run-time overhead, and code size effects. So everything ground to a halt while I tried to reconcile these conflicting desires in my head.
At one point I convinced myself that the only way out was reference-counting. The in-memory protobufs themselves would contain one extra integer for the reference count, and the code would contain just these little extra reference count increments/decrements and checks. Then C and Python would both maintain reference counts to these shared data structures. For Python it would be a sort of two-stage reference counting, since the individual Python objects would be reference-counted, and when their reference count dropped to zero they would in turn decrement the reference count of the shared protobuf structure.
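For illustration, the two-stage scheme might look roughly like the following Python sketch. The class names `SharedMessage` and `PyMessage` are made up for this example, and the explicit `release()` call stands in for what a real extension type would do in its deallocator once the Python object's own reference count reaches zero.

```python
class SharedMessage:
    """Stand-in for the shared in-memory protobuf with one extra integer."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1   # the creator (C, in this scenario) holds one reference
        self.freed = False

    def ref(self):
        assert not self.freed, "ref() on a freed message"
        self.refcount += 1

    def unref(self):
        assert not self.freed, "unref() on a freed message"
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # real code would release the memory here


class PyMessage:
    """Python-side wrapper: the second stage of the two-stage scheme."""

    def __init__(self, shared):
        shared.ref()           # the wrapper owns one reference to the shared struct
        self._shared = shared

    def release(self):
        # Assumption: called when the wrapper itself is no longer referenced.
        if self._shared is not None:
            self._shared.unref()
            self._shared = None


msg = SharedMessage("m")   # created by C: refcount 1
w1 = PyMessage(msg)        # refcount 2
w2 = PyMessage(msg)        # refcount 3
msg.unref()                # C drops its reference: refcount 2
w1.release()               # refcount 1
alive_before_last_release = not msg.freed
w2.release()               # refcount 0, memory is freed
```

Even in this toy form, every holder on both sides has to pair each `ref()` with exactly one `unref()`, which is the bookkeeping burden described above.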
Eww, Eww, EWW!! Now maybe you can understand a bit better why I haven’t gotten anything done for a while. There’s no way I could be very happy implementing this scheme.
So quite recently I found a way to relax the sharing requirements just enough that I can get away with not implementing any memory management. My new strategy is:
- Language A can reference language B’s protobufs, but only for as long as B guarantees that all the references will continue to be valid. In most cases this just means that when B calls A and passes a protobuf, A can only reference B’s protobufs until it returns control to B.
- In general, languages cannot mutate each other’s protobufs unless they have some special arrangement with regard to memory management. I thought I really needed this capability in Gazelle, but I’m no longer convinced I do.
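The two rules above can be sketched as a read-only view that is only usable for the duration of one call. This is an illustration under assumptions, not pbstream's API: `BorrowedMessage` and `call_with_borrowed` are hypothetical names, and the runtime validity check is an addition of this sketch. The strategy as described is a convention between the two sides and does not require any such enforcement.

```python
class BorrowedMessage:
    """Read-only view of a message owned by another runtime (hypothetical)."""

    def __init__(self, fields):
        self._fields = fields
        self._valid = True

    def get(self, name):
        if not self._valid:
            raise RuntimeError("borrowed message used after control returned to owner")
        return self._fields[name]

    # Deliberately no setter: the borrower may not mutate the owner's protobuf.

    def _invalidate(self):
        self._valid = False
        self._fields = None


def call_with_borrowed(fields, callback):
    """Owner side (language B): lend the message for exactly one call."""
    view = BorrowedMessage(fields)
    try:
        return callback(view)
    finally:
        view._invalidate()  # after this, the owner is free to release the memory


stash = []


def handler(msg):
    stash.append(msg)      # holding on to the view past the call breaks the rule
    return msg.get("id")   # reading during the call is fine


result = call_with_borrowed({"id": 7}, handler)

try:
    stash[0].get("id")
    late_read_failed = False
except RuntimeError:
    late_read_failed = True
```

The owner never needs to know what the borrower did with the view, because nothing the borrower holds is usable once the call has returned. That is what lets both sides skip shared memory management entirely.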
So with that resolved, work is resuming, and I hope to have some results soon. I always hate making promises though, because I’m unwilling to press on when I don’t have a satisfactory answer to a question (as the last month demonstrates), and this can delay my progress long beyond my original intentions.