The Balkanization of Distributed Version Control

Posted by josh at June 3rd, 2007

I think it’s a shame that distributed version control systems are so fragmented. Today you’ve got darcs, mercurial (hg), git, monotone, codeville, bzr, and more to choose from. Fragmentation in this space is really bad because it’s really hard to interoperate between VCS’s. Since each tool has its own learning curve, interacting with a new project could mean having to learn yet another VCS tool.

What really bothers me about this situation is that this field (dVCS) is not very well understood yet. I think there are very few people out there who can really weigh the fundamental differences between these systems. So most people either choose to go with the first system they try, or they prefer one to the other based on strengths or weaknesses of the implementation — eg. git is too hard to understand, I can’t use git on windows, etc.

Even very visible and respected people who write about this only compare the fundamental differences in very vague terms; Ted Tso says Git has “more legs” (more potential) and Keith Packard is leery of Mercurial because Linux’s implementation of file truncation (which Mercurial uses for crash recovery) is racy.

Do you see the problem? We’re making big decisions about how we’re going to encode the history of our precious code based on shallow characteristics of the current implementations. But Mercurial and Git differ in a really fundamental way: Git stores the nodes (file revisions), Mercurial stores the edges (diffs between revisions). And I don’t think anyone fully understands all the implications of either choice.

I’d love to see some academic research that really exhaustively compares all of the dVCS approaches that are flying around, and spells out their fundamental advantages and disadvantages. But until that happens, I’m afraid we’re going to see more of the status quo, where people choose sides based on shallow reasoning and stick with their team until the end.

Posted in Uncategorized| 11 Comments | 

Wishlist for data compression formats: fast seeking

Posted by josh at April 25th, 2007

We take for granted that plain files give you constant-time seek. You call fseek(), the hard drive controller gets a command telling it to seek its head to a given offset, and you read some bytes.

With files compressed with gzip and the like, there is no such ability. Gzip is essentially a VBR compression scheme, so while you could easily seek to a specific byte in the compressed stream, there is know way to automatically know how to find an offset in the uncompressed stream.

This is no reason to admit defeat, though. A strategy used by the FLAC lossless audio compression format is to build seek tables. A seek table is a simple mapping of (uncompressed stream offset -> compressed stream offset) entries. For example, an entry (1055 -> 400) in the table would mean “byte 1055 in the uncompressed stream is encoded into a block starting at byte 400 in the compressed stream.” So if you want to seek to offset X in the uncompressed stream, you find the smallest seek table entry with uncompressed stream offset <= X, physically seek to the corresponding compressed stream offset, and uncompress that block.

If you have a regularly-occurring seek table entries, this gets you constant-time seek even within your compressed files!

You could probably using the existing zlib API to build an external seek table of this sort (eg. in a different file). But it would be much nicer if this was built into the format.

Posted in Uncategorized| 1 Comment | 

Transcoding adventures with C

Posted by josh at April 21st, 2007

Today I spent a lot of time figuring out the state of the art for character set support in C. I wanted to summarize my findings: hopefully distilling what I’ve learned will be useful to someone, as well as giving me the chance to run this by people who are really good at this stuff (I’m looking at you, Matt Brubeck :).

A little background: I am currently working on a parsing framework — something designed to be an improvement on the current generation of parser/lexer generation tools like flex, bison, and ANTLR. Naturally, such a thing designed in 2007 will have to be encoding aware to deserve even a cursory glance from today’s sophisticated and culturally-sensitive programmers, so I spent today figuring out how to do that using my chosen tool: C.

After a lot of investigating, I arrived at this mental model:

Transcoding

“Encoded text” would be bytes of data in UTF-8, ASCII, Shift JIS, or any other character encoding. It’s data you would send across the wire as part of an HTTP request or response (provided you tell your peer what encoding you’re using), store in a file, or process as a character string in a language like C.

C programmers can transcode encoded text between two encodings using the iconv library. Iconv is a transcoding API that first appeared in HP-UX, but which was subsequently standardized as part of the Single Unix Specification. There is a GNU implementation that is used in Linux and probably other free UNIXes also.

The interface to iconv is extremely simple: you give it encoded data in one encoding, and it will transcode that data to another encoding. Here is some sample code:

#include <iconv.h>
#include <stdio.h>

int main()
{
    char latin1_buf[] = "\370";
    char utf8_buf[10];

    // We have to set up a few pointers to this stuff, because iconv()
    // likes to return data by modifying these pointers in place.
    char *latin1_ptr = latin1_buf;
    char *utf8_ptr = utf8_buf;
    size_t inbytesleft = sizeof(latin1_buf) - 1;
    size_t outbytesleft = sizeof(utf8_buf) - 1;
   
    // Allocate a "conversion descriptor" for converting ISO-8859-1 to UTF-8.
    iconv_t iconv_cd = iconv_open("UTF-8", "ISO-8859-1");
    if(iconv_cd == (iconv_t)-1)
    {
        printf("Unable to create conversion description!\n");
        return 1;
    }

    // Perform the actual conversion.
    size_t chars = iconv(iconv_cd, &latin1_ptr, &inbytesleft,
                                   &utf8_ptr, &outbytesleft);

    // Print the results
    if(chars == -1)
        printf("An error occured!\n");
    else
    {
        printf("%zd chars converted\n", chars);
        printf("UTF-8 bytes: ");
        for(int i = 0; i < (sizeof(utf8_buf) - outbytesleft - 1); i++)
            printf("%02hhx ", utf8_buf[i]);
        printf("\n");
    }
}
 

There are a few small complications, but that’s basically it. One thing to note is that in many encodings, it is possible to construct invalid byte sequences — byte sequences that don’t follow the rules of the encoding — and so Iconv has a way of reporting errors of this sort to the user. Since this indicates invalid or corrupt data, your options are to discard the corrupt data and press on, or halt and compain.

One question is how you specify the source and destination encodings to Iconv. Iconv identifies encodings by text strings like “UTF-8,” and you can dump a list of all the encodings supported by you local Iconv installation by typing “iconv -l” (at least for GNU libiconv). But this doesn’t answer the question of how the names for these encodings are standardized — will different Iconv implementations refer to these encodings by the same names? The SUSv2 documentation for Iconv says that these encoding names are implementation dependent. But that is far from ideal — we want to write programs that will work the same across Iconv implementations.

There is one standardized list of character sets and names for them, which is published and maintained by IANA: IANA Character Sets. It’s not written anywhere, but I get the impression that GNU libiconv at least supports all the IANA-standard charset names. I have no idea what other Iconv implementations do, but I bet it’s inconsistent.

As far as I can see, the only option for C programs that want to portably provide transcoding between arbitrary character sets is to tell their users to use IANA-standard charset names, and consider any Iconv implementation that doesn’t support them broken.

Decoding bytes into characters

So far I’ve only addressed the problem of how to convert byte-oriented text between encodings. With this capability alone, we don’t have the tools necessary to:

  • count the number of characters in a string
  • iterate over the characters
  • get integer values for the character, so we can compare the character to other characters or ask questions like isupper()

Basically we’re missing the ability to break the encoded stream into characters. Where do we turn for this?

It turns out (and this surprised me a lot) a bunch of wide character support was added to C99 (a PDF of the spec is freely available online). This includes functions for decoding bytes into characters. Unfortunately it’s not as useful as it could be — I’ll get into this in a moment. With C99 wide character support, the world looks like this.

  • C99 defines the “basic character set” in which everything fits into a single byte (think ASCII, because that’s what it will be in most cases), and the “extended character set”, a superset of the basic character set that adds locale-specific members.
  • There are two new types.
    • wchar_t must be large enough to hold members of the extended character set for any supported locale.
    • wint_t is the same but also must be able to hold a special value WEOF that is distinct from any member of the extended character set.
  • There are wchar_t and wint_t specific versions of existing C library functions that operate on characters like is{alpha,number,digit}() called isw{alpha,number,digit}

C99 also includes the jackpot in this case: the functions mbrtowc() and wcrtomb() functions convert back and forth between a char* buffer and a wchar_t. Using these functions, we can break an encoded byte stream into integer character values and vice-versa. We can pass the character values to functions like iswalpha().

Unfortunately there is a downside. All these C99 functions are based on the idea of your current locale. This is a process-wide global (seriously! gross, I know) that is set by the setlocale() call. So if you want to decode characters from more than one encoding, you have to do a setlocale() each time, which affects all locale-specific functions in all threads of your program. Thankfully, glibc at least has separate versions of all these functions that take locale specifically as a parameter. For example, mbrtowc_l(). But these are not standard, and I have no idea how widely supported they are.

Another problem is that locale names are not the same as charset names. Locale names look something like en_US or en_US.UTF-8. As you can see, they are a combination of country, language, and charset, and the charset name may or may not be part of the locale name. There is a SUSv2 function nl_langinfo() that will return a text description of a locale’s charset, but the names it returns are not standardized, which led a guy named Bruno Haible to create libcharset, whose only purpose is to call nl_langinfo() and translate its weird, non-standard names into “canonical” names (by which I can only assume he means IANA-standard). Furthermore, most systems don’t have that many locales installed (you can check your system with locale -a) — your locales probably don’t support as many charsets as your iconv installation does.

As one final twist of the knife, GCC’s C99 suport matrix says that wchar support is “missing” and also is a “library issue.” I don’t know exactly what this means, and I hope to get an answer to this question soon.

An iconv() trick

Because of all the problems I just mentioned with C99’s mbrtowc() and wcrtomb functions, the glibc manual recommends not using them. Its alternative? Use Iconv with a special “WCHAR_T” charset specifier. I couldn’t find any official documentation about what “WCHAR_T” means to iconv, but this mailing list post from Ulrich Drepper claims that it means “the system dependent and locale dependent wide character encoding.” So you can effectively use Iconv to convert byte-oriented text into wchar_t-size characters. It’s a sort of roundabout way to achieve the same ends as the mbrtowc() and wcrtomb() functions.

Using this trick will allow you to break down encoded byte streams into characters. But it’s not clear to me whether Iconv’s interpretation of WCHAR_T would depend on your process’s current locale. That seems to be what Ulrich was saying, but to have iconv() behave in a locale-aware way seems really contrary to its normal behavior. Would it use the current locale when you call iconv_open(), or iconv(), or both?

A bigger problem is that this “pass WCHAR_T to iconv” strategy doesn’t appear to be very portable — my OS X 10.4.9 installation of iconv doesn’t support it. Sounds like it’s back to the drawing board.

Posted in Uncategorized| 3 Comments | 

XML and XSD are bad for distributed computing

Posted by josh at April 12th, 2007

Naturally I think my own ideas are brilliant. Unfortunately I don’t have Yegge-like powers of persuasive writing, so I don’t often convince people of anything. This is why I’m really happy when someone who is more persuasive than me expresses my exact opinion on something.

This was my reaction to reading Do we need a new kind of schema language?. I’ve argued for a while that XML and XSD really suck, but never quite as eloquently as this guy. My favorite part:

a data model based around XML and XSD is just not very good data model for general-purpose computing. The structures of XML (attributes, elements and text) are those of SGML and these come from the world of markup. Considered as general purpose data structures, they suck pretty badly. There’s a fundamental lack of composability. Why do we need both elements and attributes? Why can’t attributes contain elements? Why is the type of thing that can occur as the content of an element not the same as the type of thing that can occur as a document? Why do we still have cruft like processing instructions and DTDs? XSD makes a (misguided in my view) attempt to add a OO/programming language veneer on top. But it can’t solve the basic problems, and, in my view, this veneer ends up making things worse not better.

On this and related subjects, this guy is saying exactly what I’ve been trying to argue for years, but he says it better than I ever did.

I need to read and understand his proposal for TEDI better, but I think it’s the direction I would advocate for problems of the sort that web services are trying to solve.

Posted in Uncategorized| 1 Comment | 

Software always gets to decide what to do next

Posted by josh at February 18th, 2007

Several people have misinterpreted what I was saying in my previous blog entry, even to the point of thinking I have no idea what I’m talking about. Since these are smart people who I respect, this seems a clear indication that my writing wasn’t clear enough. I think I got a little too excited about shooting from the hip, and sacrificed the integrity of the message. I’ll have to adjust my approach a little.

Let me try a slightly different way of explaining my point. My point, again, was:

software always gets to decide what to do next

Here’s what I mean by that. Let’s say things are going a little crazy on your computer. You’re streaming a sound file from the internet, your shell is in the middle of doing a “cp -r ~ /mnt/external-hd”, and you’re moving your mouse around maniacally and clicking with wild abandon.

Your computer has a bunch of stimuli right now. We have interrupts firing from the sound card, the network card, the mouse, and the disk controller. We’re doing a bunch of I/O, bringing pages into and evicting pages out of the buffer cache. We’re taxing the CPU with MP3 decoding. All at the same time. Sounds like a recipe for thrashing, slowdown, and instability, right? Well, if you’re using a mainstream OS.

OK, now take this crazy situation and freeze. Whew, that feels better. Things aren’t so crazy any more. I have a second to think. By the way, this is what things are always like for your CPU. Even though it might seem crazy and hectic for you, a slow-thinking human, that interrupts are coming in hundreds of times a second, your CPU can execute a million or more cycles between each one. Plenty of time to do a little contemplating.

What sort of contemplating might you do? All kinds of contemplating! Here’s an example. “I’m copying a whole tree of files. Yes, it’s true that I usually put all the most recently read pages in the buffer cache, and evict old pages to make space. But these pages are not likely to be the next pages to be accessed, so I will not evict anything to make room for them.” (to be fair, Linux does support this particular capability with madvise(MADV_DONTNEED).

Here’s another example: “a mouse interrupt just came in. yes, it’s true that my mp3 decoder (a real-time task, since it’s feeding data to the sound card) is currently synthesizing a frame, but that will only take 20ms of CPU time, and I have a good 100ms of real time left before the next sound card deadline. I know that these mouse interrupts only take about 1ms to fully service, so I will schedule that code now, for the purpose of high GUI responsiveness.”

Or: “this real-time mp3 decoder just took 100ms of CPU time to synthesize this frame, which is longer than the amount of CPU time available between sound card interrupts. this real-time process is unsustainable — we should stop it and warn the user as much.”

It is totally possible to write software that performs this kind of reasoning. The software always decides what to do next, and this kind of reasoning just represents a particularly sophisticated decision function. This is in contrast to the much more primitive decision functions of mainstream OS’s, which operate with almost no real information about tasks except for a static priority and measurements of how much CPU or I/O the process has used in the past. Linux uses this information in very complicated formulas that try to figure out whether a process is “interactive” or not (or at least it used to — it may have been changed since this article was written):

The [Linux] scheduler goes to great lengths to try to identify interactive tasks and to boost their priority accordingly. This process involves numerous strange computations involving a number of magic constants; it is difficult to understand, much less improve. –Linux Weekly News

Current OS’s decide what to do next in ways that are sub-optimal. One reason is what I have already outlined, which is that they don’t have very much information about the tasks on the system except a static priority. Another is that any device driver that handles interrupts has the power to halt progress of the system if they run for too long or disable interrupts. I do not believe that current OS’s impose any limitations on device drivers and interrupt handlers, which (if true) would imply that any driver has the opportunity to halt progress of the system indefinitely.

The sweet spot is surely somewhere between current OS’s and the omniscient OS I was describing before. The solution is to give the OS enough information that it can make intelligent decisions about how to schedule access to scarce resources (CPU time, memory space, and I/O bandwidth), without needing to give it so much information that the interfaces are complicated and the coupling too tight.

Here’s an easy example: if I am an audio editor trying to playback 10 tracks in real-time, I should be able to tell the OS “starting now, I need X MB of disk bandwidth per second, and if you cannot sustain that I want to know now.” Also, I expect the OS to prioritize the X MB/s of bandwidth I reserved above everything else, but only up to X MB/s. Apparently XFS has filesystem-level support for this kind of bandwidth reservation, but no OS I know of supports this.

It looks like I’m not the first person to have these sorts of ideas. I haven’t read the paper in detail, but it looks like some guys from CMU had these ideas ten years ago, and published them in a paper called A Resource-Centric Approach To Multimedia Operating Systems (warning, PostScript). Because ultimately, OS’s are all about divvying up access to scarce resources.

A final thought: it’s often repeated that most of the worthwhile ideas in computer science were discovered decades ago, and that all the “new” things these days are just reworkings of the same old ideas. It would follow that what’s really valuable is compelling implementations of these ideas. This bodes well for me, I think, because I excel at implementations. I like to survey the landscape of ideas, separate the wheat from the chaff, and build systems based on the best ideas. That is what I hope lies ahead for me.

Posted in Uncategorized| 3 Comments | 

« Previous Postings | Next Postings »