Over the past two weeks, I have been working to port the excellent date string capability of the dateutil library to D. I did this for a couple of reasons:
The parsing code in dateutil also represents a worse case scenario for porting a Python project to a type safe language. I will get into this more in the next section, but there are a lot of things in the parsing code that make it an outlier to a lot of Python code. Also, this article can act as a reference to anyone else who wished to port something from Python to D.
Before I go any further, here is the code and the docs.
Many Python users like to describe Python as a "strongly typed" language. If that's true then the barrier to entry for strongly typed languages is extremely low. Sure, this happens:
>>> a = 1
>>> a += "2"
TypeError: unsupported operand type(s) for +=: 'int' and 'str'
Good, this avoids some subtle bugs. But Python still allows this:
>>> a = 1
>>> a = "1"
>>> a += "2"
This shouldn't be anything new to anyone who has used Python, but all
variables in Python can become any other type, including None
.
So the checks on comparison operations and mathematical operations don't
guarantee much. Especially since Python has no truly immutable data.
This has two obvious consequences. One, most Python functions can return
two different types: the normal type or None
. Some of the
functions in dateutil return three or more, which is one of the reasons that
porting this library to a type safe language with static type checking is rather
difficult. Two, Python code is riddled with checks for None
.
In D, value types cannot be null
, which presents a serious
problem.
As a quick aside, this also has the consequence of the vast majority of libraries giving absolutely awful errors to the user when the wrong type is passed to a function. For example:
>>> from dateutil.parser import *
>>> parse(123455)
Giant stack trace omitted for brevity
AttributeError: 'int' object has no attribute 'read'
Wat?
Another problem I ran into when porting is Python's generator semantics vs D range semantics. I HATE the Python iterator semantics. I don't think I could have come up with a worse way to do it.
The D language designers understood that there are three fundamental operations that users need at the basis of every iterator implementation in every language:
Being type safe, and having a separation of concerns, D split these
three operations into three functions, front
,
popFront
, and empty
. Python on the other hand,
decided to put all of these operations into one super method called
next
(__next__
for people of the future).
This means that the only way to check if the sequence is empty is to get
the next value and mutate the underlying data. So what do you do
to check if the iterator is empty before calling next
? You
don't. You keep calling next
until an exception is thrown.
Yes, the way to signal that the iterator is empty in Python is by
throwing an exception. This also means that in order to
skip ahead in the sequence, if you don't care about certain values for some
reason, you waste resources by always returning each value in the middle
instead of just moving forward. Still worse, this is not exception safe.
It's not uncommon for iterators to have some changing state that modifies
the data in the sequence before being returned in different ways. Or, for
the values of one iterator to be passed to another iterator. If one of
those throws an exception, poof, the front value is gone forever because
the iterator has already been moved to the next value.
The ES6 designers thought that all the above sounded great, and they
managed to improve it by forcing you to allocate a new object on
every next
call. It manages to be very slightly better than
Python by providing a done
member of the object. But if that
info is available then why not force the existence of an empty
function on the object!?
GAH!
All this is to say that transforming Python iterators to type safe D ranges is not an easy task.
Because I have no experience with parsing/lexing tools, I decided to convert it all manually instead of making some tool. But I did have the help of D's library and language features to ease the transition.
My first objective is to get the code compiling. Refactoring to idiomatic D can come later when the tests are passing.
Thank God for std.typecons.Nullable
, which emulates a
variable that can have a null
state. This can provide the same
functionality as Python variables while only allowing either a
null
state or one specific type. It remains type safe by
throwing an unrecoverable error if the value is attempted to be access when
the internal value if null while in debug
mode. Having an
analog to Python's variables made the conversion process much, much faster
by not requiring a ton of refactoring just to get it to compile.
To get the value of a Nullable
, the get
method
is used, which takes advantage of D's sub-typing with alias this
. Observe:
import std.typecons;
void square(ref int x)
{
x *= x;
}
void main()
{
Nullable!int value;
value.get(); // results in a error because value wasn't set
// opAssign is overloaded to accept int's
value = 1;
// sub typing allows value to be passed as an int
square(value);
// the above is equivalent to this
square(value.get());
}
Because the get
method is aliased with alias get this
in the Nullable
code, Nullable can be treated as
an int
. This really helped to keep the code and the conversion
process simple and not have overloads for int
's and
Nullable
's.
I just want to take a moment to mention how awesome Dscanner is. Running
the --styleCheck
on the first version of the compiling code
automatically showed me things like:
const
and immutable
could be added
without any extra effortauto someLegth = r.length - 1;
.
What happens when r.length
is unsigned and zero?It's --syntaxCheck
option also helped during the porting
process. If DMD encounters too many errors in the syntax, it gives up
before trying to parse the entire file. Dscanner on the other hand showed
all of the errors at once to they could be fixed much faster.
Python has no run time immutable data. This not only loses out on the
safety of immutable data, but the speed advantages it gives as well. If a
variable is immutable, then when the value is copied, the program can
safely store it by reference instead of a bitwise copy because it knows
that the value will never change. Several of the variables in the original
code base never changed after they were assigned, so translating these was
as simple as putting immutable
in front of the variable
declaration.
Python also doesn't have compile (or lex in Python's case?) time
manifest constants. enum
s have several advantages, especially
in D with its Compile Time Function Execution (CTFE), which allows any
function that doesn't access OS APIs and don't rely on runtime data to be
run during compilation. For example, dateutil was creating a dictionary of
month and day names and abbreviations to their calendar number (e.g. "Mon"
to 1) as class variables. With CTFE, these dictionaries are made at compile
time.
Another example of the usefulness of manifest constants is use in state
machines. In the dateutil code, a state machine was designed to use a
string to hold its state, therefore every time the state changed, a new
string was allocated and the old state string became garbage. With named
manifest constants, the state can be an int
when the code is
compiled.
So, it's compiling and somewhat works! Out of the 148 applicable tests that were translated from dateutil's test, 99 are passing. The code also isn't in idiomatic D. There are tons of tiny string allocations everywhere, the GC is used a lot, and a very small number of the functions accept lazy ranges, making the entire code base allocate eagerly. As a consequence, the code is only 3x as fast as the original dateulil parser (when compiled with all of DMD's optimization flags). It also compiles with LDC, but is about 30% slower from some reason, even with all of the optimizations turned on.
There is a ton of low hanging fruit in the code to make this even faster, I estimate that I can, with little effort, get this to 5x faster. But first, the 49 failing tests need to be green.
Update 2016-05-31: All tests have been green for a while now and dateparser is now 12x faster than the Python version when compiled with LDC.
My conclusion is that translating my work projects to D would be a cost effective move. Most of my current projects are Flask apps that are a lot less complex than the parser code, and can be translated to vibe.d easily. If I can make all of my projects even just twice as fast for 40 hours or less of work, then it is totally worth it.
Questions or comments? Feel free to contact me.