D's Auto Decoding and You

Published May 17th, 2016

I'm going to keep this first part of this article objective in order to stay informative. I have put my personal opinions in the final section.

One feature of D that is confusing to a lot of new comers is the behavior of strings in relation to range based features like the foreach statement and range algorithms. For example, consider this D code:

import std.range;

void main()
{
    string s = "Hello, World!";

    static assert(is(typeof(s[0]) == immutable(char)));
    static assert(is(typeof(s.front) == dchar));
}

Why on Earth is the type of the first element and the type of the front range primitive different? Before we get into the details, there are a few things to understand about D.

Ranges and Arrays

First, it's important to understand that arrays are not ranges, at least not on their own. Try the code above, but comment out the import of std.range. You should get a compiler error, that's because front for dynamic arrays in D are defined in Phobos, specifically in std.array, and publicly imported in many Phobos modules. Let's examine the exact test for input ranges:

template isInputRange(R)
{
    enum bool isInputRange = is(typeof(
    (inout int = 0)
    {
        R r = R.init;     // can define a range object
        if (r.empty) {}   // can test for empty
        r.popFront();     // can invoke popFront()
        auto h = r.front; // can get the front of the range
    }));
}

By themselves, arrays only satisfy the first requirement. But, because of UFCS, built-in types can be treated as ranges if front, popFront, and empty free functions are defined and are visible from the same module as the range accepting functions. This allows dynamic arrays to be used without a wrapping struct. This will tie in later because for string types, front has special behavior.

D and Unicode

Second, all string types in D (string as UTF-8, wstring as UTF-16, dstring as UTF-32) are Unicode based. Base on the language specification, it's incorrect to store anything other than Unicode in the built-in string types. Unicode is the future, and there's no reason system languages cannot handle Unicode correctly. For non-Unicode encoded strings, you can use ubyte[] and call the functions in std.encoding to get Unicode. Of course, you could always use C style char*.

Also, strings in D are not library types like Java. They are just aliases to array types.

alias string = immutable(char)[];
alias wstring = immutable(wchar)[];
alias dstring = immutable(dchar)[];

Auto Decoding: What and How

The rest of the article will use Unicode terminology; if you're unaware of certain terms, there's a helpful glossary on the std.uni docs. The two most important ones are

Code Point

Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

-- std.uni docs

Code Unit

The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in the UTF-8 (char), 16-bit code units in the UTF-16 (wchar), and 32-bit code units in the UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.

-- std.uni docs

So, why is typeof(s.front) == dchar? Because of something called auto decoding. Auto decoding in D refers to the process of overloading std.array.front to special case string and wstring to decode UTF-8 and UTF-16 code units into UTF-32 code points. In plain English, this means when iterating over string's and wstring's' in D using the range interface, e.g. with all of the functions in std.algorithm and std.range and foreach, D will look ahead in the string and combine any code units that make up a single code point using std.utf.decode. E.g. for ë the code units C3 AB (for UTF-8) would turn into a single code point.

Because all code points fit into a single dchar, std.array.front for strings returns a dchar for all string types.

Why?

This was done in order simplify string handling in range code by making each element a character when iterating over strings rather than the individual code units. To quote Andrei Alexandrescu, who implemented auto decoding,

When I designed std.algorithm I recall I put the following options on the table:

  1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness. At some point I swear I've had a byDchar definition somewhere; I've searched the recent git history for it, no avail. (Note: this is now in Phobos)

  2. All algorithms would by default operate at code point level. That way correctness would be achieved by default, and certain algorithms would require specialization for efficiency. (Back then I didn't know about graphemes and normalization. I'm not sure how that would have affected the final decision.)

  3. Change the alias string, wstring etc. to be some type that requires explicit access for code units/code points etc. instead of implicitly mixing the two.

My fave was (3). And not mine only - several people suggested alternative definitions of the "default" string type. Back then however we were in the middle of the D1/D2 transition and one more aftershock didn't seem like a good idea at all. Walter opposed such a change, and didn't really have to convince me.

From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.

The technical, correctness, and performance implications of auto decoding are not immediately obvious, which leads into the next section.

The Debate of Auto Decoding's Merits

The practice of auto decoding has sparked a large amount of debates about its inclusion ever since Walter Bright (the creator of D) learned of its inclusion in the standard library. Many have called for its removal while others have defended the practice. It's important that as a D user, you understand the technical side effects of auto decoding and how it will change your program.

I would just like to preface this discussion by saying that personally, in context, this behavior is neither deal breaking nor that big of an issue, it's just an annoyance and can be confusing. I'm being completely genuine when I say that D's templates and compile time features more than make up for any string follies.

The Benefits

  • Auto decoding gives correct iteration by character for the plurality of popular text. The fact is that English is the current lingua franca, and the vast majority of programs either deal with English or some other European language. Auto decoding gives correct behavior for these languages.
  • Auto decoding allows those not familiar with Unicode specifics to use D without requiring them to choose an iteration method. For example, the following behavior is not clear to those who don't know about Unicode specifics
string s = "cassé";

// currently works as expected due to auto decoding,
// this wouldn't work if the string was iterated by code unit
assert(s.canFind!(x => x == 'é'));
  • Auto decoding lowers the barrier to entry for those who come from scripting languages where such behavior is commonplace.

The Downsides

The following analysis is mostly adapted from Walter Bright's comments in this post and was reproduced here with his permission.

  • The fact that D auto decodes is hard to convey through anything other than a D tutorial or book. This means people who learn the language in other ways will be unaware of this behavior until they either profile or some malformed string causes it to throw.
  • Auto decoding is slow, meaning strings are slow by default, and therefore D has the illusion of slow string handling vs other languages.
  • Auto decoding only iterates over characters for English and a small subset of European languages. Which in practice means that auto decoding fails to guarantee anything more than "you'll iterate over characters ... most of the time." Very few languages have a one to one mapping between code points and graphemes. Every other language has things like combining marks and scripts that take up many code points in order to make one "character". Some languages don't even have the concept of a character! That's why the Unicode standard calls them "Grapheme Clusters". For example, how many "characters" are in the string "Приве́т नमस्ते שָׁלוֹם"? There is no answer; the question is philosophical, not scientific.
  • Ranges of characters do not auto decode, but arrays of characters do. Iterating a char[] with C style for loops produces different results than foreach loops due to the auto decoding on range primitives. Finally, wrapping an array in a struct with an alias this to an array turns off auto decoding. These glaring inconsistencies are the cause of a lot of confusion for new comers and special casing in generic code. Which leads into,
  • Every time one wants a generic algorithm to work with both strings and ranges, you wind up special casing via static if-ing narrow strings to defeat the auto decoding, or to decode the ranges. Case in point. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unit testing because it is convenient to unit test only with arrays. Case in point.
  • Very few algorithms require decoding. For example, if you're searching for 'a' in a UTF-8 string, what does it matter if there are invalid encodings in that string?
  • Auto decoding has two choices when encountering invalid code units: throw, or produce an error dchar like std.utf.byUTF does. Currently, it throws a UTFException, meaning no algorithms using auto decode can be made nothrow. This means that a large portion of functions will be automatically marked as throwing simply by handling strings, unless you special case if possible as mentioned above. Losing the nothrow means that you get slightly worse code-gen because the compiler can leave out the exception handling code for a function if it knows that it will never throw.
  • Despite the fact that your function may never throw, Phobos uses the GC to throw exceptions with the new keyword. As a consequence, all uses of auto decoding in templates automatically marks the template as using the GC because it might throw. This is, again, viral. Any code that uses auto decoding is marked as GC allocating, and code that uses that code, and so on.
  • There are lots of places where invalid Unicode is either commonplace or legal, e.g. Linux file names, and therefore auto decoding cannot be used. It turns out in the wild that pure Unicode is not universal - there's lots of dirty Unicode that should remain unmolested because it's user data, and auto decoding does not play well with that mentality.
  • Auto decoding cannot be turned off, i.e. it isn't practical to avoid importing std.array because std.range and std.algorithm publicly import it.
  • Auto decoded arrays cannot be random access ranges and cannot have a length function that can be relied on (as they are variable length), losing out on any performance optimizations in algorithms that take advantage of random access ranges or length information. This causes a lot of unnecessary walking of entire strings in range functions, e.g. with std.range.walkLength to get the missing information.

My View

Auto decoding was a mistake; its benefits do not outweigh its downsides. Perhaps iterating by code point is a good compromise for the vast majority of situations, and I wouldn't care if the behavior was relegated to some String library type. But, the fact that char[] and friends are hijacked automatically is what bugs me.

Decoding strings into code points has the worst of both worlds: it's neither fast nor is it actually correct. What auto decoding gives is the illusion of correctness. For example, take a look at this file provided by Vladimir Panteleev.

If you download it and run it, the assert will fail. That's because the first é is encoded as U+65 U+301 whereas the second one is U+E9. Code point to code point comparison does not always give the correct behavior. In order to correct this, you have to have to know about and use the correct normalizing Unicode algorithm (of which, there are many) on the string before searching. What you actually want when searching for a character in a string is a substring search on a normalized string, which bypasses all grapheme and normalization issues. Again, code points give only the illusion of correctness; the situation is more nuanced than auto decoding would have you believe.

Also, it's not correct because, as mentioned above, only a few languages can have their graphemes fit into one code point. For every other language, you have to use std.uni.byGrapheme to correctly iterate over each character, which is even slower than decoding into code points and may allocate memory. For example, the Sanskrit character ddhrya consists of seven different code points. So people who deal with such languages have no advantage with auto decoding and have to explicitly state the manner in which the string should be iterated over.

Also, this article makes some good points as to why iterating over code units vs code points makes more sense and is more efficient in most cases.

I really don't see the benefit of the automatic behavior fulfilling this one specific corner case when you're going to make everyone else call a range generating function when they want to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration and save a lot of people the trouble!

One point about auto decoding that is a matter of opinion is that auto decoding is hidden from the user, which is bad. Automatically doing special processing without the programmer's express permission is contrary to the basic ideals of system languages. It would be like auto throwing whenever a number type under/overflows, you're trying to hide the reality of the situation behind special cases. Paired with the inability to turn it off, your effectively stuck with it unless you litter your code with calls to std.utf.byUTF and std.uni.byCodeUnit.

Also, the special casing you have to do in some areas is just ridiculous. For example,

import std.algorithm;
import std.utf;

void main()
{
    string a = "HelløWorld"; // encoded to UTF-8

    // try to split the string at 'W' using slices
    auto s1 = a[0 .. a.countUntil('W')]; // WRONG, will split ø in half
    auto s2 = a[0 .. a.byCodeUnit.countUntil('W')]; // right
}

Unicode is hard. Trying to hide Unicode specifics helps no one because it's going to bite you in the ass eventually.

Conclusion

Whether you agree with my opinions on auto decoding or not, D is essentially stuck with auto decoding until a clear and very easy migration path is shown. Perhaps the Java route of a Phobos String type would be the best option and a deprecation of the string front function. Perhaps you just deprecate the front function and have everyone who wants the current behavior use std.uni.byCodePoint. Perhaps a parallel D3 with no auto decoding could be made with a end of life for D2 announced.

However, in my opinion D is too far along to suddenly ask people to change up all of their string handling code Python 3 style and is not big enough to survive the damage that such a change would cause. A change like this without an automated conversion tool has the possibility to permanently stifle D and forever brand it as "That language that always breaks your code".

Questions or comments? Feel free to contact me.