Optimizing UTC → Unix Time Conversion For Size And Speed
How do you convert a UTC timestamp to Unix Time (seconds since the epoch)?
"2020-04-29 04:48:15" → 1588135695
Of course the right answer is “you use a standard library function.” But what if you don’t have one available? Or what if you’re the person implementing that library?
Converting the time portion is trivial. Unix Time pretends that leap seconds do not exist and makes every day exactly 86,400 seconds long. This is a fib on systems that implement UTC leap second insertion^{1}, but it makes the algorithm very simple:
time_t hms_to_time(int h, int m, int s) {
return (h * 3600) + (m * 60) + s;
}
But the calendar part is more challenging. Months have unequal lengths, and leap years complicate everything. Leap years insert an extra day at the end of February whenever:
 the year is divisible by four
 excluding years divisible by 100
 but including years divisible by 400
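Spelled out in C, the three rules above combine into a one-line predicate. This is a minimal sketch for illustration; the is_leap_year name is mine and is not part of any library discussed here:

```c
#include <stdbool.h>

/* True if `year` contains a February 29th, per the three rules above:
 * divisible by 4, excluding centuries, but including every 4th century. */
bool is_leap_year(int year) {
  return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}
```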
Surprisingly, no version of C includes a UTC → Unix Time conversion function in the standard library. There is a nonstandard function timegm(), but its use is discouraged. The Linux manpage says:

These functions are nonstandard GNU extensions that are also present on the BSDs. Avoid their use.

And on BSD:

The timegm() function is not specified by any standard; its function cannot be completely emulated using the standard functions described above.
I needed an algorithm to perform this UTC→UnixTime conversion for the JSON parser in upb. The JSON mapping for Protocol Buffers says that timestamps are formatted using strings like:
1972-01-01T10:00:20.021Z
Once we have parsed the individual numbers out of such a timestamp string, we need a way of translating to seconds since the Unix Epoch, which is the internal representation of the google.protobuf.Timestamp type.
Since upb is written in C and intended to be portable, I needed to roll my own. Since upb aims to be as small and fast as possible, I became very interested in the problem of how far this algorithm could be pushed in size and speed.
A Fortran Solution from 1968
Several leads pointed me to a Letter to the Editor published in Communications of the ACM in October 1968 (Volume 11, Number 10). In this letter Henry F. Fliegel and Thomas C. Van Flandern offered the following Fortran function for performing this conversion. This function uses integer math (all divisions round down):
JD (I, J, K) = K - 32075 + 1461*(I + 4800 + (J - 14)/12)/4
             + 367*(J - 2 - (J - 14)/12*12)/12 - 3
             *((I + 4900 + (J - 14)/12)/100)/4
JD here refers to the Julian Date, the number of days that have passed since the Julian Date Epoch. This epoch is very long ago, in 4713 BC, but it's a simple matter to convert it to the Unix epoch of 1970-01-01 by subtracting 2440588, the Julian Day number of the Unix Epoch. The variables I, J, and K are the year, month, and day, respectively.
Once we have mapped a Y/M/D to a total number of days, it is easy to multiply this by 86,400 seconds and add hms_to_time() to get a full Unix Time result.
The function above is remarkable. It does not use any lookup table, and yet it somehow encodes the irregular pattern of month lengths and leap year rules. It is trivial to convert this Fortran function to C using the int data type, which also rounds down for positive numbers (the algorithm ensures that all numbers being divided are positive, as long as year >= -4800).
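For illustration, here is one such transcription into C, shifted to the Unix epoch. The function names are mine; the arithmetic is copied from the letter:

```c
/* A direct transcription of the Fliegel & Van Flandern statement function.
 * I = year, J = month (1-12), K = day.  Fortran integer division truncates
 * toward zero, just like C's, so the arithmetic carries over unchanged. */
long julian_day(long I, long J, long K) {
  return K - 32075 + 1461 * (I + 4800 + (J - 14) / 12) / 4 +
         367 * (J - 2 - (J - 14) / 12 * 12) / 12 -
         3 * ((I + 4900 + (J - 14) / 12) / 100) / 4;
}

/* Shift to the Unix epoch: 2440588 is the Julian Day of 1970-01-01. */
long epoch_days_fortran(long y, long m, long d) {
  return julian_day(y, m, d) - 2440588;
}
```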
Improving Readability With A Lookup Table
In the interest of having an algorithm that is easier to understand than the Fortran algorithm above, I wrote the following algorithm for upb’s old JSON parser. All of the general techniques used here are published in various places, I just condensed them into the smallest and simplest algorithm I could. This uses the Unix epoch instead of a Julian Day.
/* https://github.com/protocolbuffers/upb/blob/22182e6e/upb/json/parser.rl#L1697
 * epoch_days(1970, 1, 1) == 1970-01-01 == 0 */
int epoch_days_table(int year, int month, int day) {
  static const uint16_t month_yday[12] = {0,   31,  59,  90,  120, 151,
                                          181, 212, 243, 273, 304, 334};
  uint32_t year_adj = year + 4800;  /* Ensure positive year, multiple of 400. */
  uint32_t febs = year_adj - (month <= 2 ? 1 : 0);  /* Februaries since base. */
  uint32_t leap_days = 1 + (febs / 4) - (febs / 100) + (febs / 400);
  uint32_t days = 365 * year_adj + leap_days + month_yday[month - 1] + day - 1;
  return days - 2472692;  /* Adjust to Unix epoch. */
}
This algorithm uses a lookup table that gives the number of days preceding each month. Like the Fortran algorithm, it requires that division rounds down. But C rounds towards zero^{2}, so like the Fortran algorithm above, we start by adding 4800 so the year is always positive.
This algorithm is nice and readable. It is small and fast. There is only one noticeable downside if we are being performance-obsessed: it relies on a lookup table, which adds a bit of latency in the best case, or a lot of latency if the lookup table is not in cache.
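Putting the pieces together, the timestamp from the introduction can be converted by combining epoch_days_table() with hms_to_time(). A sketch (both functions repeated so the example stands alone; the ymd_hms_to_unix wrapper name is mine):

```c
#include <stdint.h>
#include <time.h>

/* Repeated from above so this example stands alone. */
time_t hms_to_time(int h, int m, int s) {
  return (h * 3600) + (m * 60) + s;
}

int epoch_days_table(int year, int month, int day) {
  static const uint16_t month_yday[12] = {0,   31,  59,  90,  120, 151,
                                          181, 212, 243, 273, 304, 334};
  uint32_t year_adj = year + 4800;
  uint32_t febs = year_adj - (month <= 2 ? 1 : 0);
  uint32_t leap_days = 1 + (febs / 4) - (febs / 100) + (febs / 400);
  uint32_t days = 365 * year_adj + leap_days + month_yday[month - 1] + day - 1;
  return days - 2472692;
}

/* Full conversion: days since the epoch, times 86,400 seconds per day,
 * plus the seconds within the day. */
time_t ymd_hms_to_unix(int y, int mo, int d, int h, int mi, int s) {
  return (time_t)epoch_days_table(y, mo, d) * 86400 + hms_to_time(h, mi, s);
}
```

For example, ymd_hms_to_unix(2020, 4, 29, 4, 48, 15) yields 1588135695, matching the conversion at the top of the article.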
Eliminating the Lookup Table
More recently I came across the following algorithm published by Howard Hinnant. Like the Fortran algorithm, it avoids the lookup table. But its logic is a bit easier to follow, and Howard includes an extensive explanation for how it works.
// Returns number of days since civil 1970-01-01.  Negative values indicate
//    days prior to 1970-01-01.
// Preconditions:  y-m-d represents a date in the civil (Gregorian) calendar
//                 m is in [1, 12]
//                 d is in [1, last_day_of_month(y, m)]
//                 y is "approximately" in
//                   [numeric_limits<Int>::min()/366, numeric_limits<Int>::max()/366]
//                 Exact range of validity is:
//                 [civil_from_days(numeric_limits<Int>::min()),
//                  civil_from_days(numeric_limits<Int>::max()-719468)]
template <class Int>
constexpr
Int
days_from_civil(Int y, unsigned m, unsigned d) noexcept
{
    static_assert(std::numeric_limits<unsigned>::digits >= 18,
             "This algorithm has not been ported to a 16 bit unsigned integer");
    static_assert(std::numeric_limits<Int>::digits >= 20,
             "This algorithm has not been ported to a 16 bit signed integer");
    y -= m <= 2;
    const Int era = (y >= 0 ? y : y-399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);      // [0, 399]
    const unsigned doy = (153*(m + (m > 2 ? -3 : 9)) + 2)/5 + d-1;  // [0, 365]
    const unsigned doe = yoe * 365 + yoe/4 - yoe/100 + doy;         // [0, 146096]
    return era * 146097 + static_cast<Int>(doe) - 719468;
}
This algorithm replaces the month lookup table with the clever expression:
DayOfYear(adjusted_month) = (153 * adjusted_month + 2) / 5
This formula describes a line with slope 153/5, or 30.6, which falls between the 30- and 31-day month lengths. But it is cleverly chosen to exploit truncating integer division to step in the same irregular pattern that the months do.
My colleague Gerben Stavenga realized that there are many such expressions of this form that will calculate the same value. If we use a variant where the divisor is a power of two, the compiler can optimize the division to a right shift. Specifically we can use:
DayOfYear(adjusted_month) = (62719 * adjusted_month + 769) / 2048
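The two constants are easy to check against each other, since the adjusted (March-based) month index only takes twelve values. A quick sketch (the helper names are mine, purely for illustration):

```c
#include <stdint.h>

/* Day-of-year offset of March-based month m (March = 0, ..., February = 11),
 * computed with both constants.  The divisor 2048 compiles to a shift. */
uint32_t doy_div5(uint32_t m)    { return (153 * m + 2) / 5; }
uint32_t doy_div2048(uint32_t m) { return (62719 * m + 769) / 2048; }

/* Returns 1 if the two formulas agree for all twelve months. */
int month_formulas_agree(void) {
  for (uint32_t m = 0; m < 12; m++) {
    if (doy_div5(m) != doy_div2048(m)) return 0;
  }
  return 1;
}
```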
With some help from Gerben I arrived at this formulation for the current version of upb:
/* https://github.com/protocolbuffers/upb/blob/22182e6e/upb/json_decode.c#L982
 * epoch_days_fast(1970, 1, 1) == 1970-01-01 == 0. */
int epoch_days_fast(int y, int m, int d) {
  const uint32_t year_base = 4800;   /* Before min year, multiple of 400. */
  const uint32_t m_adj = m - 3;      /* March-based month. */
  const uint32_t carry = m_adj > m ? 1 : 0;
  const uint32_t adjust = carry ? 12 : 0;
  const uint32_t y_adj = y + year_base - carry;
  const uint32_t month_days = ((m_adj + adjust) * 62719 + 769) / 2048;
  const uint32_t leap_days = y_adj / 4 - y_adj / 100 + y_adj / 400;
  return y_adj * 365 + leap_days + month_days + (d - 1) - 2472632;
}
Note that although this algorithm has some ternary operators, it compiles into fully branchless code with one cmov in it. In fact, all the algorithms given in this article compile to branchless code (on x86-64 using recent versions of Clang).
Semantics and Correctness
Before we compare these algorithms for size and speed, a few notes about their semantics and correctness.
Supported Input Range
I verified all the algorithms in this article across a wide range of dates (-4712-01-01 to 22666-12-20, or Julian Date 1 to 10,000,000). For every day in this range I checked that the outputs match each other, and that they match the results of timegm(). There was a snag on macOS, as its timegm() function fails for all years prior to 1900, so I disabled it for those years.
The interval [-4712, 22666] is a very wide range of years, and likely to be far beyond what is needed in practice. That said, these algorithms effectively support much wider bounds than this. All of them support to at least the date 1000000-01-01 (year 1 Million).
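A verification loop along those lines can be sketched as follows, cross-checking the two upb-style functions against each other and making sure consecutive dates map to consecutive day numbers. This is my own test harness, not the one used for the article; both functions are repeated so the sketch compiles on its own:

```c
#include <stdint.h>

/* Repeated from above. */
int epoch_days_table(int year, int month, int day) {
  static const uint16_t month_yday[12] = {0,   31,  59,  90,  120, 151,
                                          181, 212, 243, 273, 304, 334};
  uint32_t year_adj = year + 4800;
  uint32_t febs = year_adj - (month <= 2 ? 1 : 0);
  uint32_t leap_days = 1 + (febs / 4) - (febs / 100) + (febs / 400);
  uint32_t days = 365 * year_adj + leap_days + month_yday[month - 1] + day - 1;
  return days - 2472692;
}

/* Repeated from above. */
int epoch_days_fast(int y, int m, int d) {
  const uint32_t year_base = 4800;
  const uint32_t m_adj = m - 3;
  const uint32_t carry = m_adj > (uint32_t)m ? 1 : 0;
  const uint32_t adjust = carry ? 12 : 0;
  const uint32_t y_adj = y + year_base - carry;
  const uint32_t month_days = ((m_adj + adjust) * 62719 + 769) / 2048;
  const uint32_t leap_days = y_adj / 4 - y_adj / 100 + y_adj / 400;
  return y_adj * 365 + leap_days + month_days + (d - 1) - 2472632;
}

static int days_in_month(int y, int m) {
  static const int len[12] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  int leap = (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0);
  return (m == 2 && leap) ? 29 : len[m - 1];
}

/* Returns 1 if both functions agree, and each day is exactly one more than
 * the previous day, over every date in [y_min-01-01, y_max-12-31]. */
int cross_check(int y_min, int y_max) {
  int prev = epoch_days_table(y_min, 1, 1) - 1;
  for (int y = y_min; y <= y_max; y++)
    for (int m = 1; m <= 12; m++)
      for (int d = 1; d <= days_in_month(y, m); d++) {
        int got = epoch_days_table(y, m, d);
        if (got != epoch_days_fast(y, m, d) || got != prev + 1) return 0;
        prev = got;
      }
  return 1;
}
```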
Most of the algorithms given here use the year -4800 as their base, and will return incorrect results for years before that. However the days_from_civil() algorithm is coded to support over 10 million years into the past (assuming 32-bit integers). I think any of these algorithms could likely be adjusted to support whatever window of years is desired. The year -4800 is frequently chosen as a base out of convention, as it is before the Julian Day epoch, which is before any recorded historical events.
Input Normalization
Unlike the algorithms in this article, the timegm()
function
normalizes its input. For example, if you pass a month
of 1
to
timegm()
, that indicates December of the previous year. The
algorithms given here can return incorrect results if the month is out
of range, or can even crash with outofbounds array access if you are
using the epoch_days_table()
function. So any normalization/checking
of the month value must happen prior to calling the function.
Similarly, timegm() will write the normalized values back to the input struct tm, including the tm_yday (day of year) and tm_wday (day of week) values that it calculated from the day, month, and year. The algorithms in this article do not calculate day of week or day of year.
I did not need these features for my use case, so for me these represent unnecessary overhead. But it is worth noting that this is not a completely apples-to-apples comparison. It could be an interesting exercise to add these features so that the algorithms given here could be a drop-in replacement for timegm().
Leap Seconds
When we use the hms_to_time() function above, these algorithms return correct results around leap seconds. The primary test case for this is to verify:

UnixTime(1998, 12, 31, 23, 59, 60) → 915148800
UnixTime(1999, 1, 1, 0, 0, 0) → 915148800
These values follow the table given in the Wikipedia Article about Unix Time around leap seconds. These algorithms handle leap seconds correctly without any special code. The correct answers for seconds=60 fall out of the algorithm “for free”, as we are simply summing seconds.
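This can be demonstrated concretely by wiring epoch_days_fast() to hms_to_time() (both repeated from above so the sketch stands alone; the unix_time wrapper name is mine):

```c
#include <stdint.h>
#include <time.h>

/* Repeated from above so the example stands alone. */
time_t hms_to_time(int h, int m, int s) {
  return (h * 3600) + (m * 60) + s;
}

int epoch_days_fast(int y, int m, int d) {
  const uint32_t year_base = 4800;
  const uint32_t m_adj = m - 3;
  const uint32_t carry = m_adj > (uint32_t)m ? 1 : 0;
  const uint32_t adjust = carry ? 12 : 0;
  const uint32_t y_adj = y + year_base - carry;
  const uint32_t month_days = ((m_adj + adjust) * 62719 + 769) / 2048;
  const uint32_t leap_days = y_adj / 4 - y_adj / 100 + y_adj / 400;
  return y_adj * 365 + leap_days + month_days + (d - 1) - 2472632;
}

/* The leap second 23:59:60 simply contributes one more second to the sum,
 * so it lands on the same Unix Time as 00:00:00 of the next day. */
time_t unix_time(int y, int mo, int d, int h, int mi, int s) {
  return (time_t)epoch_days_fast(y, mo, d) * 86400 + hms_to_time(h, mi, s);
}
```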
Size and Speed Benchmarks
Here are microbenchmarks of the algorithms above, using the google/benchmark framework. Code for these benchmarks is available here.
I added the hms_to_time() code to each algorithm so they would be more comparable to the timegm() function. I also put the functions in a separate .cc file from the benchmark loop so that they could not be inlined, to make the comparison with libc more fair. I also translated the 1968 algorithm from Fortran into C++.
On my Linux Desktop i7-8700K with glibc 2.30, I get:
Running bazel-bin/date-algorithms
Run on (12 X 4700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
--------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
--------------------------------------------------------------------
BM_YMDToUnix_Fortran              5 ns            5 ns    148069867
BM_YMDToUnix_Table                2 ns            2 ns    288071123
BM_YMDToUnix_DaysFromCivil        4 ns            4 ns    176474060
BM_YMDToUnix_Fast                 3 ns            3 ns    269155478
BM_timegm_libc                   46 ns           46 ns     15360235
All of the algorithms from this article are in the same general ballpark. The algorithm in libc is noticeably slower.
On my Mac Laptop running macOS 10.15.4, the disparity is even more stark:
Running bazel-bin/date-algorithms
Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 9437K (x1)
--------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
--------------------------------------------------------------------
BM_YMDToUnix_Fortran              6 ns            6 ns    116857534
BM_YMDToUnix_Table                3 ns            3 ns    213395767
BM_YMDToUnix_DaysFromCivil        5 ns            5 ns    147238851
BM_YMDToUnix_Fast                 3 ns            3 ns    211537241
BM_timegm_libc                 5421 ns         5413 ns       122749
I looked at a profile and cross-referenced the Darwin code in localtime.c. It appears that almost all of the time is attributable to the gmtsub() function, of which about 70% is mutex lock/unlock to lazily load time zone data for the "GMT" time zone. The timesub() function appears to take most of the remainder.
It is surprising that a lock would add this much overhead, as Jeff Dean's Latency Numbers Every Programmer Should Know tells us to expect only 25 ns for a mutex lock/unlock. Note that the benchmarks above are single-threaded; under contention timegm() on macOS degrades even more, while the others are unaffected (as they do not access any global state):
Running bazel-bin/date-algorithms
Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 9437K (x1)
------------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
------------------------------------------------------------------------------
BM_YMDToUnix_Fortran/threads:6              1 ns            6 ns    119878410
BM_YMDToUnix_Table/threads:6                1 ns            3 ns    206599374
BM_YMDToUnix_DaysFromCivil/threads:6        1 ns            5 ns    139911858
BM_YMDToUnix_Fast/threads:6                 1 ns            3 ns    208116546
BM_timegm_libc/threads:6                14795 ns        86855 ns         8178
YMDToUnix_Table and YMDToUnix_Fast are basically tied, though as mentioned before the table-based algorithm can slow down drastically (on the order of 100 ns) if the table is not in cache, and is vulnerable to crashes if the month is out of range.
For size profiling I used my tool Bloaty McBloatface. I display VM size only, so we see the parts of the binary that will actually be loaded into memory:
$ bloaty -d symbols,sections --source-filter='^YMD|yday' --domain=vm \
    bazel-bin/date-algorithms
    VM SIZE
--------------
 31.1%     236    YMDToUnix_Fortran()
   86.4%     204    .text
   10.2%      24    .eh_frame
    3.4%       8    .eh_frame_hdr
 27.7%     210    YMDToUnix_DaysFromCivil()
   81.0%     170    .text
   15.2%      32    .eh_frame
    3.8%       8    .eh_frame_hdr
 20.0%     152    YMDToUnix_Fast()
   78.9%     120    .text
   15.8%      24    .eh_frame
    5.3%       8    .eh_frame_hdr
 18.1%     137    YMDToUnix_Table()
   76.6%     105    .text
   17.5%      24    .eh_frame
    5.8%       8    .eh_frame_hdr
  3.2%      24    epoch_days_table()::month_yday
  100.0%      24    .rodata
100.0%     759    TOTAL
Filtering enabled (source_filter); omitted 128Ki of entries
All these functions have nearly the same .eh_frame overhead (this is unwind information, used for debugging and stack traces), so we can set that aside. When we account for the table size, YMDToUnix_Fast ends up smallest, narrowly beating YMDToUnix_Table. But these differences are very minor. All of these algorithms are very small and fast.

1. UTC days tick 86,401 seconds whenever a leap second occurs. Maintaining the fib that every day is 86,400 seconds long requires Unix Time to be discontinuous around a leap second. To avoid this complication, some systems have started "smearing" seconds around the leap second instead, so that every day is still exactly 86,400 seconds long, and it's merely the length of each second that changes. This deviates from UTC, but solves lots of practical problems. ↩

2. Round-toward-zero was standardized in C99. Before that, the results of integer division for negative numbers were implementation-defined. ↩