My Experience Porting Python Dateutil's Date Parser to D

Over the past two weeks, I have been working to port the excellent date string capability of the dateutil library to D. I did this for a couple of reasons:

I wanted to use this functionality in a D project of mine
All of my work projects are in Python and I wished to see the work involved in porting a Python project to D. I wanted to know if it would be cost effective to put the work in to translate those projects

The parsing code in dateutil also represents a worse case scenario for porting a Python project to a type safe language. I will get into this more in the next section, but there are a lot of things in the parsing code that make it an outlier to a lot of Python code. Also, this article can act as a reference to anyone else who wished to port something from Python to D.

Before I go any further, here is the code and the docs.

The Hurdles to Conversion

Types

Many Python users like to describe Python as a "strongly typed" language. If that's true then the barrier to entry for strongly typed languages is extremely low. Sure, this happens:

>>> a = 1
>>> a += "2"
TypeError: unsupported operand type(s) for +=: 'int' and 'str'

Good, this avoids some subtle bugs. But Python still allows this:

>>> a = 1
>>> a = "1"
>>> a += "2"

This shouldn't be anything new to anyone who has used Python, but all variables in Python can become any other type, including None. So the checks on comparison operations and mathematical operations don't guarantee much. Especially since Python has no truly immutable data.

This has two obvious consequences. One, most Python functions can return two different types: the normal type or None. Some of the functions in dateutil return three or more, which is one of the reasons that porting this library to a type safe language with static type checking is rather difficult. Two, Python code is riddled with checks for None. In D, value types cannot be null, which presents a serious problem.

As a quick aside, this also has the consequence of the vast majority of libraries giving absolutely awful errors to the user when the wrong type is passed to a function. For example:

>>> from dateutil.parser import *
>>> parse(123455)

Giant stack trace omitted for brevity

AttributeError: 'int' object has no attribute 'read'

Wat?

Ranges vs Iterators

Another problem I ran into when porting is Python's generator semantics vs D range semantics. I HATE the Python iterator semantics. I don't think I could have come up with a worse way to do it.

The D language designers understood that there are three fundamental operations that users need at the basis of every iterator implementation in every language:

Getting the current value
Moving the sequence to the next value
Checking if the sequence is empty

Being type safe, and having a separation of concerns, D split these three operations into three functions, front, popFront, and empty. Python on the other hand, decided to put all of these operations into one super method called next (__next__ for people of the future).

This means that the only way to check if the sequence is empty is to get the next value and mutate the underlying data. So what do you do to check if the iterator is empty before calling next? You don't. You keep calling next until an exception is thrown. Yes, the way to signal that the iterator is empty in Python is by throwing an exception. This also means that in order to skip ahead in the sequence, if you don't care about certain values for some reason, you waste resources by always returning each value in the middle instead of just moving forward. Still worse, this is not exception safe. It's not uncommon for iterators to have some changing state that modifies the data in the sequence before being returned in different ways. Or, for the values of one iterator to be passed to another iterator. If one of those throws an exception, poof, the front value is gone forever because the iterator has already been moved to the next value.

The ES6 designers thought that all the above sounded great, and they managed to improve it by forcing you to allocate a new object on every next call. It manages to be very slightly better than Python by providing a done member of the object. But if that info is available then why not force the existence of an empty function on the object!?

GAH!

All this is to say that transforming Python iterators to type safe D ranges is not an easy task.

Translating

Because I have no experience with parsing/lexing tools, I decided to convert it all manually instead of making some tool. But I did have the help of D's library and language features to ease the transition.

Nullable To The Rescue

My first objective is to get the code compiling. Refactoring to idiomatic D can come later when the tests are passing.

Thank God for std.typecons.Nullable, which emulates a variable that can have a null state. This can provide the same functionality as Python variables while only allowing either a null state or one specific type. It remains type safe by throwing an unrecoverable error if the value is attempted to be access when the internal value if null while in debug mode. Having an analog to Python's variables made the conversion process much, much faster by not requiring a ton of refactoring just to get it to compile.

To get the value of a Nullable, the get method is used, which takes advantage of D's sub-typing with alias this. Observe:

import std.typecons;

void square(ref int x)
{
    x *= x;
}

void main()
{
    Nullable!int value;
    value.get(); // results in a error because value wasn't set

    // opAssign is overloaded to accept int's
    value = 1;

    // sub typing allows value to be passed as an int
    square(value);

    // the above is equivalent to this
    square(value.get());
}

Because the get method is aliased with alias get this in the Nullable code, Nullable can be treated as an int. This really helped to keep the code and the conversion process simple and not have overloads for int's and Nullable's.

I ♡ Dscanner

I just want to take a moment to mention how awesome Dscanner is. Running the --styleCheck on the first version of the compiling code automatically showed me things like:

Where const and immutable could be added without any extra effort
Unsafe things like subtracting from a length, e.g. auto someLegth = r.length - 1;. What happens when r.length is unsigned and zero?
When some variables were never used and can be removed

It's --syntaxCheck option also helped during the porting process. If DMD encounters too many errors in the syntax, it gives up before trying to parse the entire file. Dscanner on the other hand showed all of the errors at once to they could be fixed much faster.

Some Low Hanging Fruit

Python has no run time immutable data. This not only loses out on the safety of immutable data, but the speed advantages it gives as well. If a variable is immutable, then when the value is copied, the program can safely store it by reference instead of a bitwise copy because it knows that the value will never change. Several of the variables in the original code base never changed after they were assigned, so translating these was as simple as putting immutable in front of the variable declaration.

Python also doesn't have compile (or lex in Python's case?) time manifest constants. enums have several advantages, especially in D with its Compile Time Function Execution (CTFE), which allows any function that doesn't access OS APIs and don't rely on runtime data to be run during compilation. For example, dateutil was creating a dictionary of month and day names and abbreviations to their calendar number (e.g. "Mon" to 1) as class variables. With CTFE, these dictionaries are made at compile time.

Another example of the usefulness of manifest constants is use in state machines. In the dateutil code, a state machine was designed to use a string to hold its state, therefore every time the state changed, a new string was allocated and the old state string became garbage. With named manifest constants, the state can be an int when the code is compiled.

Current State Of The Code

So, it's compiling and somewhat works! Out of the 148 applicable tests that were translated from dateutil's test, 99 are passing. The code also isn't in idiomatic D. There are tons of tiny string allocations everywhere, the GC is used a lot, and a very small number of the functions accept lazy ranges, making the entire code base allocate eagerly. As a consequence, the code is only 3x as fast as the original dateulil parser (when compiled with all of DMD's optimization flags). It also compiles with LDC, but is about 30% slower from some reason, even with all of the optimizations turned on.

There is a ton of low hanging fruit in the code to make this even faster, I estimate that I can, with little effort, get this to 5x faster. But first, the 49 failing tests need to be green.

Update 2016-05-31: All tests have been green for a while now and dateparser is now 12x faster than the Python version when compiled with LDC.

Conclusion

My conclusion is that translating my work projects to D would be a cost effective move. Most of my current projects are Flask apps that are a lot less complex than the parser code, and can be translated to vibe.d easily. If I can make all of my projects even just twice as fast for 40 hours or less of work, then it is totally worth it.

Questions or comments? Feel free to contact me.