I'm going to keep the first part of this article objective in order to stay informative. I have put my personal opinions in the final section.
One feature of D that confuses a lot of newcomers is the behavior of strings in relation to range-based features like the `foreach` statement and range algorithms. For example, consider this D code:
```d
import std.range;

void main()
{
    string s = "Hello, World!";
    static assert(is(typeof(s[0]) == immutable(char)));
    static assert(is(typeof(s.front) == dchar));
}
```
Why on Earth are the type of the first element and the type of the `front` range primitive different? Before we get into the details, there are a few things to understand about D.
First, it's important to understand that arrays are not ranges, at least not on their own. Try the code above, but comment out the import of `std.range`: you should get a compiler error. That's because `front` for dynamic arrays in D is defined in Phobos, specifically in `std.array`, and publicly imported in many Phobos modules. Let's examine the exact test for input ranges:
```d
template isInputRange(R)
{
    enum bool isInputRange = is(typeof(
    (inout int = 0)
    {
        R r = R.init;     // can define a range object
        if (r.empty) {}   // can test for empty
        r.popFront();     // can invoke popFront()
        auto h = r.front; // can get the front of the range
    }));
}
```
By themselves, arrays only satisfy the first requirement. But because of UFCS, built-in types can be treated as ranges if `front`, `popFront`, and `empty` free functions are defined and visible from the same module as the range-accepting functions. This allows dynamic arrays to be used as ranges without a wrapping `struct`. This will tie in later, because for string types `front` has special behavior.
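To make this concrete, here is a small sketch; importing `std.range` brings both `isInputRange` and the array range primitives into scope, so the checks below pass:

```d
import std.range; // brings front, popFront, empty, and isInputRange into scope

// arrays pass isInputRange only because front, popFront, and empty
// are visible as free functions, which UFCS lets us call like members
static assert(isInputRange!(int[]));
static assert(isInputRange!string);

void main()
{
    int[] r = [1, 2, 3];
    assert(r.front == 1); // lowered to front(r) via UFCS
    r.popFront();         // lowered to popFront(r)
    assert(!r.empty);
}
```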
Second, all string types in D (`string` as UTF-8, `wstring` as UTF-16, `dstring` as UTF-32) are Unicode-based. Based on the language specification, it's incorrect to store anything other than Unicode in the built-in string types. Unicode is the future, and there's no reason system languages cannot handle Unicode correctly. For non-Unicode encoded strings, you can use `ubyte[]` and call the functions in `std.encoding` to get Unicode. Of course, you could always use C style `char*`.
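As a rough sketch of that workflow, `std.encoding.transcode` converts between the encodings the module knows about and the built-in string types; here it round-trips through `Latin1String`, one of the non-Unicode string types `std.encoding` provides:

```d
import std.encoding : Latin1String, transcode;

void main()
{
    // convert a UTF-8 string to Latin-1 and back
    Latin1String latin;
    transcode("café", latin); // UTF-8 -> ISO-8859-1
    string back;
    transcode(latin, back);   // ISO-8859-1 -> UTF-8
    assert(back == "café");
}
```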
Also, strings in D are not library types as they are in Java. They are just aliases to array types.
```d
alias string = immutable(char)[];
alias wstring = immutable(wchar)[];
alias dstring = immutable(dchar)[];
```
The rest of the article will use Unicode terminology; if you're unaware of certain terms, there's a helpful glossary in the std.uni docs. The two most important ones are:
Code Point: Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.
-- std.uni docs

Code Unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in the UTF-8 (char), 16-bit code units in the UTF-16 (wchar), and 32-bit code units in the UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.
-- std.uni docs
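A quick way to see the distinction in code: `.length` on a narrow string counts code units, while walking the string as a range counts the decoded code points.

```d
import std.range : walkLength;

void main()
{
    string s = "ë";          // a single code point, U+00EB

    // .length counts code units: two bytes in UTF-8
    assert(s.length == 2);

    // walking the (auto decoded) range counts code points
    assert(s.walkLength == 1);

    // in UTF-32 a code unit is a code point, so the two agree
    dstring d = "ë";
    assert(d.length == 1);
}
```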
So, why is `typeof(s.front) == dchar`? Because of something called auto decoding. Auto decoding in D refers to the practice of overloading `std.array.front` to special case `string` and `wstring`, decoding their UTF-8 and UTF-16 code units into UTF-32 code points. In plain English, this means that when iterating over `string`s and `wstring`s in D through the range interface, e.g. with all of the functions in `std.algorithm` and `std.range` and with `foreach`, D will look ahead in the string and combine any code units that make up a single code point using `std.utf.decode`. E.g. for `ë`, the code units `C3 AB` (in UTF-8) are turned into a single code point. Because every code point fits into a single `dchar`, `std.array.front` returns a `dchar` for all string types.
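Here is, roughly, what that looks like when done by hand with `std.utf.decode`:

```d
import std.utf : decode;

void main()
{
    string s = "ë";             // stored as the code units C3 AB
    assert(s.length == 2);

    size_t index = 0;
    dchar c = decode(s, index); // combines both code units
    assert(c == 'ë');           // one code point: U+00EB
    assert(index == 2);         // decode consumed two code units
}
```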
This was done in order to simplify string handling in range code by making each element a character when iterating over strings, rather than an individual code unit. To quote Andrei Alexandrescu, who implemented auto decoding:
When I designed std.algorithm I recall I put the following options on the table:

1. All algorithms would by default operate on strings at char/wchar level (i.e. code unit). That would cause the usual issues and confusions I was aware of from C++. Certain algorithms would require specialization and/or the user using byDchar for correctness. At some point I swear I've had a byDchar definition somewhere; I've searched the recent git history for it, no avail. (Note: this is now in Phobos)

2. All algorithms would by default operate at code point level. That way correctness would be achieved by default, and certain algorithms would require specialization for efficiency. (Back then I didn't know about graphemes and normalization. I'm not sure how that would have affected the final decision.)

3. Change the alias string, wstring etc. to be some type that requires explicit access for code units/code points etc. instead of implicitly mixing the two.

My fave was (3). And not mine only - several people suggested alternative definitions of the "default" string type. Back then however we were in the middle of the D1/D2 transition and one more aftershock didn't seem like a good idea at all. Walter opposed such a change, and didn't really have to convince me.

From experience with C++ I knew (1) had a bad track record, and (2) "generically conservative, specialize for speed" was a successful pattern.
The technical, correctness, and performance implications of auto decoding are not immediately obvious, which leads into the next section.
The practice of auto decoding has sparked a large amount of debate ever since Walter Bright (the creator of D) learned of its presence in the standard library. Many have called for its removal, while others have defended the practice. It's important that, as a D user, you understand the technical side effects of auto decoding and how they will change your program.
I would just like to preface this discussion by saying that personally, in context, this behavior is neither deal-breaking nor that big of an issue; it's just an annoyance and can be confusing. I'm being completely genuine when I say that D's templates and compile-time features more than make up for any string follies.
For example, searching for a character whose encoding spans multiple code units works as one would expect:

```d
import std.algorithm : canFind;

void main()
{
    string s = "cassé";
    // currently works as expected due to auto decoding;
    // this wouldn't work if the string were iterated by code unit
    assert(s.canFind!(x => x == 'é'));
}
```
The following analysis is mostly adapted from Walter Bright's comments in this post and was reproduced here with his permission.
- What is the length of the string "Приве́т नमस्ते שָׁלוֹם"? There is no answer; the question is philosophical, not scientific.

- Iterating over `char[]` with C style `for` loops produces different results than `foreach` loops due to the auto decoding on range primitives (see the sketch after this list). Finally, wrapping an array in a `struct` with an `alias this` to an array turns off auto decoding. These glaring inconsistencies are the cause of a lot of confusion for newcomers and of special casing in generic code. Which leads into,

- Every time one wants an algorithm to work with both strings and ranges, you wind up `static if`-ing narrow strings to defeat the auto decoding, or decoding the ranges. Case in point. Having to constantly special case it makes for more special cases when plugging together components. These issues often escape detection when unit testing, because it is convenient to unit test only with arrays. Case in point.

- Very few algorithms actually require decoding. E.g. when searching for `'a'` in a UTF-8 string, what does it matter if there are invalid encodings in that string?

- Auto decoding has two choices when encountering invalid code units: throw, or produce an error `dchar` like `std.utf.byUTF` does. Currently, it throws a `UTFException`, meaning no algorithm using auto decoding can be made `nothrow`. This means that a large portion of functions will automatically be marked as throwing simply by handling strings, unless you special case where possible, as mentioned above. Losing `nothrow` means that you get slightly worse code-gen, because the compiler can only leave out the exception handling code for a function if it knows that it will never throw.

- Exceptions in D are allocated with the `new` keyword. As a consequence, all uses of auto decoding in templates automatically mark the template as using the GC, because it might throw. This is, again, viral: any code that uses auto decoding is marked as GC allocating, as is code that uses that code, and so on.

- Auto decoding cannot practically be avoided, because you cannot avoid importing `std.array`: `std.range` and `std.algorithm` publicly import it.

- Auto decoded strings lose their random access and `length` range properties, forcing generic code to use `std.range.walkLength` to get the missing information.
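To make the iteration inconsistency from the list above concrete, here is a minimal sketch:

```d
import std.range : walkLength;

void main()
{
    string s = "ë";          // one code point, two UTF-8 code units

    // a C-style loop over the indices sees the individual code units
    size_t units;
    for (size_t i = 0; i < s.length; i++)
        units++;
    assert(units == 2);

    // asking foreach for dchar elements decodes to code points
    size_t points;
    foreach (dchar c; s)
        points++;
    assert(points == 1);

    // the range primitives auto decode as well
    assert(s.walkLength == 1);
}
```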
to get the missing information.Auto decoding was a mistake; its benefits do not outweigh its downsides. Perhaps iterating by
code point is a good compromise for the vast majority of situations, and I wouldn't care if the
behavior was relegated to some String
library type. But, the fact that char[]
and friends are hijacked automatically is what bugs me.
Decoding strings into code points has the worst of both worlds: it is neither fast nor actually correct. What auto decoding gives is the illusion of correctness. For example, take a look at this file provided by Vladimir Panteleev.
If you download and run it, the assert will fail. That's because the first `é` is encoded as U+0065 U+0301, whereas the second one is U+00E9. Code point to code point comparison does not always give the correct behavior. To correct this, you have to know about and apply one of the Unicode normalization algorithms (of which there are several) to the string before searching. What you actually want when searching for a character in a string is a substring search on a normalized string, which bypasses all grapheme and normalization issues. Again, code points give only the illusion of correctness; the situation is more nuanced than auto decoding would have you believe.
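As a sketch of the extra work actually required, `std.uni.normalize` can bring both encodings of `é` into the same normalization form (NFC here) before comparing:

```d
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "caf\u00E9"; // é as a single code point
    string combining = "cafe\u0301";  // 'e' plus a combining acute accent

    // both render identically, yet code point comparison says otherwise
    assert(precomposed != combining);

    // normalizing to NFC first gives the comparison meaning
    assert(normalize!NFC(combining) == precomposed);
}
```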
Also, it's not correct because, as mentioned above, only some languages have graphemes that fit into a single code point. For every other language, you have to use `std.uni.byGrapheme` to correctly iterate over each character, which is even slower than decoding into code points and may allocate memory. For example, the Sanskrit character ddhrya consists of seven different code points. So people who deal with such languages gain no advantage from auto decoding and have to explicitly state the manner in which the string should be iterated anyway.
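A brief illustration of the gap between code points and graphemes:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301"; // 'e' plus a combining acute accent: "é"

    // code point iteration sees two elements
    assert(s.walkLength == 2);

    // grapheme iteration sees the single user-perceived character
    assert(s.byGrapheme.walkLength == 1);
}
```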
Also, this article makes some good points as to why iterating over code units rather than code points makes more sense and is more efficient in most cases.
I really don't see the benefit of automatic behavior that serves this one specific corner case when everyone else still has to call a range generating function to iterate over code units or graphemes. Just make everyone call a range generating function to specify the type of iteration, and save a lot of people the trouble!
One point about auto decoding that is a matter of opinion is that it is hidden from the user, which is bad. Automatically doing special processing without the programmer's express permission is contrary to the basic ideals of systems languages. It would be like automatically throwing whenever a numeric type under/overflows: you're trying to hide the reality of the situation behind special cases. Paired with the inability to turn it off, you're effectively stuck with it unless you litter your code with calls to `std.utf.byUTF` and `std.utf.byCodeUnit`.
Also, the special casing you have to do in some areas is just ridiculous. For example:

```d
import std.algorithm;
import std.utf;

void main()
{
    string a = "HelløWorld"; // encoded as UTF-8

    // try to split the string at 'W' using slices
    auto s1 = a[0 .. a.countUntil('W')]; // WRONG: a code point index, will split ø in half
    auto s2 = a[0 .. a.byCodeUnit.countUntil('W')]; // right: a code unit index
}
```
Unicode is hard. Trying to hide Unicode specifics helps no one because it's going to bite you in the ass eventually.
Whether you agree with my opinions on auto decoding or not, D is essentially stuck with it until a clear and very easy migration path is shown. Perhaps the Java route, a Phobos `String` type plus a deprecation of the string `front` overload, would be the best option. Perhaps you just deprecate the `front` overload and have everyone who wants the current behavior use `std.uni.byCodePoint`. Perhaps a parallel D3 with no auto decoding could be made, with an end of life for D2 announced.
However, in my opinion D is too far along to suddenly ask people to change all of their string handling code Python 3 style, and it is not big enough to survive the damage that such a change would cause. A change like this without an automated conversion tool could permanently stifle D and forever brand it as "that language that always breaks your code".
Questions or comments? Feel free to contact me.