BCHS//kcgi: dates

Dates! No, not like going on a date. Dates: the fifteenth of March; November 22, 1963; etc. In an effort to reduce hand-rolled but otherwise-generic code in kcgi, I recently set out to convert date handling to use the system functions strftime(3), mktime(3), etc. Unfortunately, I quickly realised that the supported systems handle dates quite differently, and ended up doing the opposite.

introduction

As it stood up to version 0.12.0, kcgi's handling of dates mixed hand-rolled functions for converting between epoch values and broken-down time, and system functions for formatting. These are laid out in datetime.c for that version. The regression tests that already existed were spotty and failed to cover any corner cases in date handling.

I didn't choose to examine the date functions randomly: it was part of an ongoing process to convert all kcgi kutil utility functions to having a khttp prefix; and in doing so, to review the implementation and correctness of said functions. BSD.lv's new portability infrastructure has played no small part in casting light in the areas where the code can use more clarity and consistency.

At heart, date handling needs to convert freely between two string representations and two binary representations.

relationship between date representations

It's critically important that each transition is fully defined and correct, so I started by replacing hand-rolled binary conversions with system functions.

removing hand-rolled conversions

The system functions to convert between binary representations are timegm(3) and gmtime(3). These convert between UNIX epoch time (an integer of seconds before or after the start of 1970, UTC) and broken-down time values, which separately represent day, month, year, hour, etc. as integers. Switching to these functions significantly reduced code size, or rather, off-loaded the complexity to the C library.
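To make the two binary forms concrete, here is a minimal sketch of a round trip through the system functions. The helper names epoch_roundtrip_ok and roundtrip_example are made up for illustration and are not part of kcgi; timegm(3) is not in ISO C, so the feature macro at the top is an assumption about glibc.

```c
#define _DEFAULT_SOURCE /* assumption: lets glibc declare timegm(3) */
#include <assert.h>
#include <time.h>

/*
 * Convert epoch seconds to broken-down UTC time and back.
 * Returns 1 if the round trip reproduces the input, 0 otherwise.
 * (Hypothetical helper, not a kcgi function.)
 */
static int
epoch_roundtrip_ok(time_t t)
{
	struct tm tm, *res;

	if ((res = gmtime(&t)) == NULL)
		return 0;
	tm = *res; /* copy: gmtime(3) returns static storage */
	return timegm(&tm) == t;
}

/* Usage: one day after the epoch is 1970-01-02. */
static void
roundtrip_example(void)
{
	time_t t = 86400;
	struct tm *tm = gmtime(&t);

	assert(tm != NULL);
	/* tm_year counts from 1900 and tm_mon from 0. */
	assert(tm->tm_year == 70 && tm->tm_mon == 0 && tm->tm_mday == 2);
	assert(epoch_roundtrip_ok(t));
}
```

Note the offsets in struct tm: they are an easy source of off-by-one mistakes when mixing system and hand-rolled code.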

My development platform for this was OpenBSD, which has used clean 64-bit time_t values since a monumental effort in 2013. The time_t type is used by systems to represent UNIX epoch time. The fields of broken-down time (struct tm) are each a standard int. Neither is a fixed-width type.

I started with khttp_epoch2tms(3) and khttp_datetime2epoch(3), which convert between these two forms. While doing so, I also added regression tests for all of the converted functions. These regression tests probed the full range of possible input values. I committed the results and waited for our CI systems to test it on all other operating systems.

Result: immediate breakage.

The biggest breakage came from 32-bit time_t types on some systems, which I frankly didn't expect to be a problem any more. But it is. The problem arises because kcgi passes around explicitly-sized integers for the UNIX epoch, specifically int64_t, which allows for more values than a 32-bit time_t can handle. When these values were passed into the 32-bit systems' gmtime(3), they suffered a narrowing conversion, and anything outside the 32-bit range was truncated.
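The mismatch can be caught before the call rather than after. A minimal sketch, assuming nothing about kcgi's internals (both function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if a 64-bit epoch value fits a signed 32-bit time_t,
 * 0 if storing it there would truncate it.
 * (Hypothetical helper, not a kcgi function.)
 */
static int
epoch_fits_32(int64_t epoch)
{
	return epoch >= INT32_MIN && epoch <= INT32_MAX;
}

/* Usage: a signed 32-bit time_t ends at 2038-01-19T03:14:07Z. */
static void
epoch_fits_32_example(void)
{
	assert(epoch_fits_32(INT64_C(2147483647)));
	assert(!epoch_fits_32(INT64_C(2147483648)));
	assert(epoch_fits_32(INT64_C(-2147483648)));
	assert(!epoch_fits_32(INT64_C(-2147483649)));
}
```

A check like this only reports the problem; it does nothing to make the out-of-range dates usable on those systems.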

Then there were also some surprising results, such as converting from broken-down time with a year before 1900 to an epoch value. On FreeBSD, this inexplicably failed.

The API itself presented problems that simply slipped my mind: a 64-bit time_t can represent more than a 32-bit int year can encode, so converting large times to broken-down time failed. These are documentation problems, as the broken-down time representation cannot change.

I also needed to worry about representing (time_t)-1, which is both a legitimate representation of one second before the UNIX epoch and the error return value for timegm(3). Confusing!
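One way to tell the two meanings apart is to look at the broken-down time after the call. The sketch below assumes that timegm(3) normalises its argument on success, as mktime(3) does; timegm_checked is a hypothetical wrapper, not part of kcgi.

```c
#define _DEFAULT_SOURCE /* assumption: lets glibc declare timegm(3) */
#include <assert.h>
#include <string.h>
#include <time.h>

/*
 * Wrap timegm(3) so that a legitimate (time_t)-1 is not mistaken
 * for an error.  Returns 1 and stores the epoch in *out on success,
 * 0 on failure.  (Hypothetical helper, not a kcgi function.)
 */
static int
timegm_checked(struct tm *tm, time_t *out)
{
	time_t t;

	assert(tm != NULL && out != NULL);
	t = timegm(tm);
	if (t == (time_t)-1) {
		/*
		 * Assumption: on success *tm has been normalised, so
		 * a real -1 must read as 1969-12-31T23:59:59.
		 * Anything else is treated as an error.
		 */
		if (tm->tm_year != 69 || tm->tm_mon != 11 ||
		    tm->tm_mday != 31 || tm->tm_hour != 23 ||
		    tm->tm_min != 59 || tm->tm_sec != 59)
			return 0;
	}
	*out = t;
	return 1;
}

/* Usage: one second before the epoch is a valid result. */
static void
timegm_checked_example(void)
{
	struct tm tm;
	time_t t = 0;

	memset(&tm, 0, sizeof(tm));
	tm.tm_year = 69; /* years since 1900 */
	tm.tm_mon = 11;  /* December, counted from 0 */
	tm.tm_mday = 31;
	tm.tm_hour = 23;
	tm.tm_min = 59;
	tm.tm_sec = 59;
	assert(timegm_checked(&tm, &t));
	assert(t == (time_t)-1);
}
```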

In light of these issues, I quickly decided to change my approach and instead return to using private copies of the conversion functions.

re-rolling conversions

I started by merging appropriately-licensed implementations of timegm(3) and gmtime(3) from newlib that were small, easy to read, well-tested, and complete, and more importantly, 64-bit safe. Upon doing so, I was able to verify that all sane 64-bit input values converted properly to and from the given time values.

Using these imports also relieved the burden of pre-checking for (time_t)-1, since they never returned an error.

For symmetry, I also added khttp_datetime2epoch(3), which mirrors khttp_epoch2datetime(3) in that it works with int64_t broken-down time values instead of int. I then moved on to formatting functions.

re-rolling formatting

There are two formatted outputs handled by kcgi: ISO 8601 and RFC 822 (modernised as 5322). Prior to this effort, kcgi used the strftime(3) function for the latter and normal string handling for the former.

While the ISO 8601 date processing behaved equally well on all systems, there were some corner cases for RFC 5322 formatting: first, negative years; second, years with more than four digits. The RFC is mostly silent on how these are handled, but it's safe to assume that we should handle arbitrary dates in a sane way: allow negative years and print as many year digits as required.

I was then surprised that strftime(3) truncated year values on some SunOS systems, specifically Oracle Solaris. Moreover, since it accepts a struct tm, I knew that formatting was impossible for year values beyond the 32-bit barrier.

Fortunately, fixing this is easy: since khttp_epoch2datetime(3) is able to convert into all the necessary date components, it only took two string table lookups, for weekday and month names, followed by regular string handling. Solved with khttp_epoch2str(3).
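The table-lookup approach looks roughly like the following. This is a sketch under my own assumptions, not the code in khttp_epoch2str(3): the function name fmt_rfc5322, its argument order, and the choice of a trailing "GMT" are all illustrative.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

static const char *const wdays[7] = {
	"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
};
static const char *const months[12] = {
	"Jan", "Feb", "Mar", "Apr", "May", "Jun",
	"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
};

/*
 * Format already-computed UTC components without strftime(3).
 * wday is 0..6 with Sunday as 0; mon is 1..12.
 * The year is padded to four digits but never truncated, and a
 * negative year keeps its sign.
 * Returns what snprintf(3) returns, or -1 on a bad wday or mon.
 * (Hypothetical helper, not a kcgi function.)
 */
static int
fmt_rfc5322(char *buf, size_t sz, int64_t wday, int64_t day,
    int64_t mon, int64_t year, int64_t hour, int64_t min, int64_t sec)
{
	assert(buf != NULL);
	if (wday < 0 || wday > 6 || mon < 1 || mon > 12)
		return -1;
	return snprintf(buf, sz,
	    "%s, %.2" PRId64 " %s %.4" PRId64
	    " %.2" PRId64 ":%.2" PRId64 ":%.2" PRId64 " GMT",
	    wdays[wday], day, months[mon - 1], year, hour, min, sec);
}

/* Usage: the epoch itself fell on a Thursday. */
static void
fmt_rfc5322_example(void)
{
	char buf[64];

	fmt_rfc5322(buf, sizeof(buf), 4, 1, 1, 1970, 0, 0, 0);
	assert(strcmp(buf, "Thu, 01 Jan 1970 00:00:00 GMT") == 0);
}
```

Because every component is an int64_t, nothing here depends on the width of int or time_t on the host.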

performance problems

When testing for corner-case dates, such as those with years needing more than 32 bits, unexpected difficulties came from the conversion utility. Specifically, when computing the seconds from the year, the code stepped through each year from 1970 or so, accumulating seconds. For the valid year of 1 152 921 504 606 846 976, or 2⁶⁰, this computation would take quite some time.

Fortunately, this code is easily optimised since the number of days in 400-year blocks, with 1900 as a baseline, is fixed (146 097 days in the Gregorian calendar). It was trivial to change the code to step only to the nearest 400-year multiple, consume all remaining whole 400-year blocks with a single division, then compute the remainder.
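To show why the 400-year block helps, here is a sketch that compares a year-by-year loop with a constant-time version. It is a variation on the idea rather than the code in datetime.c: it uses year 0 of the proleptic Gregorian calendar as its baseline instead of 1900, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static int
is_leap(int64_t y)
{
	return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

/* Days from 1970-01-01 to 1 January of "year", one year at a time. */
static int64_t
days_to_year_slow(int64_t year)
{
	int64_t days = 0, y;

	for (y = 1970; y < year; y++)
		days += is_leap(y) ? 366 : 365;
	for (y = year; y < 1970; y++)
		days -= is_leap(y) ? 366 : 365;
	return days;
}

/*
 * Same result in constant time.  Every 400-year block holds
 * 146097 days, so whole blocks are consumed with one division and
 * only the remaining 0..399 years need leap-year arithmetic.
 * Note: "year - 399" overflows near INT64_MIN, so callers must
 * still restrict the input range.
 */
static int64_t
days_to_year_fast(int64_t year)
{
	/* Floor division, also for negative years. */
	int64_t era = (year >= 0 ? year : year - 399) / 400;
	int64_t yoe = year - era * 400; /* year of era, 0..399 */
	int64_t doe = yoe * 365;        /* days before that year */

	/* Leap years among the first yoe years of the block. */
	if (yoe > 0)
		doe += (yoe - 1) / 4 - (yoe - 1) / 100 + 1;
	/* 719528 is the day count from 0000-01-01 to 1970-01-01. */
	return era * 146097 + doe - 719528;
}

/* Usage: both versions agree, including for negative years. */
static void
days_to_year_example(void)
{
	int64_t y;

	assert(days_to_year_fast(1970) == 0);
	assert(days_to_year_fast(2000) == 10957);
	for (y = -1000; y <= 3000; y++)
		assert(days_to_year_fast(y) == days_to_year_slow(y));
}
```

The loop's cost grows with the distance from 1970, which is what made very large years so slow; the second version does the same amount of work for any year.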

conclusion and future steps

kcgi is now able to handle all representable 64-bit dates. A representable date is one that can convert between broken-down and epoch time without integer truncation. Truncation would occur, for example, when converting from a 64-bit epoch to a 32-bit year (the year might roll over) or from a 64-bit year to a 64-bit epoch (the epoch might roll over).

The results of this work were released in kcgi version 0.12.1. Most of this work was in datetime.c.

An important function that still needs adding is converting from formatted dates into epoch values. This is to prevent callers from using their system conversion functions, which may be limited in the ways described above. This won't be a difficult piece of code to write, and can wait for a future version.

Last but not least, it's important to remember that converting between UNIX epoch time and broken-down time will always be a source of error. Either the broken-down year will allow for more years than may be encoded in the epoch or vice versa. It's important that these functions specifically document the range of acceptable inputs!