Update [2016-12-21]: The decision was made not to put ndslice in the standard library due to its quick development cycle and changing API. Instead, please use Mir, which is the exact same code.
Disclosure: I am writing this article from a biased perspective. I have been writing Python for six years, three professionally, and have written a book on Python. But, I have been writing D for the past eight months and four months ago I started contributing to D's standard library. I also served as the review manager for std.ndslice's inclusion into the standard library.
Today, the new addition to D's standard library, std.ndslice, was merged into master, and will be included in the next D release (v2.070, which is due this month). std.ndslice is a multidimensional array implementation, not unlike Numpy, with very low overhead, as it's based on D's concept of ranges, which avoids a lot of copying and allows for lazy generation of data. In this article, I will show some of the advantages std.ndslice has over Numpy and why you should consider D for your next numerical project.
This article is written for Numpy users who might be interested in using D. So while it will cover some D basics, D veterans can learn something about std.ndslice as well.
Simply put, if you write your numerical code in D, it will be much, much faster while retaining code readability and programmer productivity.
This section is mainly for D newbies. If you already know D, I suggest you skim the code and then head straight for the Getting Hands On section.
To give you a quick taste of the library before diving in, the following code will take the numbers 0 through 999 using the iota function (which acts like Python's xrange) and return a 5x5x40 three dimensional range.
import std.range : iota;
import std.experimental.ndslice;
void main() {
    auto slice = sliced(iota(1000), 5, 5, 40);
}
D is statically typed, but for the sake of simplicity, this article will use D's type deduction with auto. The sliced function is just a factory function that returns a multidimensional slice. The sliced factory function can also accept regular arrays, as they are ranges as well. So now we have a 5x5x40 cube with the numbers 0 through 999.
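For example, a regular D array can be viewed through a slice without any copying. Here is a minimal sketch (the array contents are arbitrary):
import std.experimental.ndslice;
import std.stdio : writeln;
void main() {
    // a plain D array is itself a range, so it can back a slice directly
    int[] arr = [1, 2, 3, 4, 5, 6];
    // view the same memory as a 2x3 matrix; no copy is made
    auto matrix = sliced(arr, 2, 3);
    writeln(matrix); // prints the 2x3 view, e.g. [[1, 2, 3], [4, 5, 6]]
}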
A range is a common abstraction over any sequence of values. A range is any type (a class or struct) that provides three members: front, which returns the value at the head of the sequence; popFront, which advances the sequence to the next value; and empty, which returns a boolean indicating whether the sequence has been exhausted. Ranges can either generate their values as they are asked for (lazy), or hold a sequence of values already and simply provide an interface to those values (eager).
For a more in-depth look at ranges, see the official D tutorial's section on ranges.
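To make the interface concrete, here is a minimal hand-written input range that lazily counts from a starting value up to (but not including) an end value, similar in spirit to iota (the names Counter, current, and end are my own, for illustration only):
import std.stdio : writeln;
// a lazy input range: no values are stored, each one is produced on demand
struct Counter {
    int current;
    int end;
    @property bool empty() { return current >= end; }
    @property int front() { return current; }
    void popFront() { ++current; }
}
void main() {
    // foreach understands the range interface, so this just works
    foreach (n; Counter(0, 5)) {
        writeln(n); // prints 0, 1, 2, 3, 4
    }
}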
Look Ma, no allocations! This is because iota returns a lazy input range, and sliced returns a Slice (the struct that lies at the heart of std.ndslice) that acts as a wrapper around the iota range and transforms the underlying data as it's accessed. So, when the data in the slice is accessed, the Slice asks iota for the values, which are generated lazily, and determines which dimension each value belongs to and how it will be returned to the user.
iota -> slice -> user accessing the data
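You can see the laziness for yourself by indexing a single element: nothing is computed or allocated until that element is asked for. A small sketch (the indices here are arbitrary):
import std.range : iota;
import std.stdio : writeln;
import std.experimental.ndslice;
void main() {
    auto slice = sliced(iota(1000), 5, 5, 40);
    // the value is generated by iota only at the moment of access:
    // index [2, 1, 3] in a 5x5x40 cube maps to 2*200 + 1*40 + 3
    writeln(slice[2, 1, 3]); // 443
}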
So, std.ndslice is a bit different in concept than Numpy. Numpy creates its own type of arrays while std.ndslice provides a view of existing data. The composition of ranges to create something completely new is the basis of range-based code, and is one of the reasons D is so powerful. It allows you to make programs where the values returned are like parts on an assembly line, going from station to station, only to be assembled at the very end to avoid unnecessary allocations. This will be important to remember when the performance benchmarks are compared.
The classic example of this is the following code, which reads input from stdin, takes only the unique lines, sorts them, and writes them back to stdout:
import std.stdio;
import std.array;
import std.algorithm;
void main() {
    stdin
        // get stdin as a range
        .byLine(KeepTerminator.yes)
        .uniq
        // stdin is immutable, so we need a copy
        .map!(a => a.idup)
        .array
        .sort
        // stdout.lockingTextWriter() is an output range, meaning values can be
        // inserted into it, which in this case will be sent to stdout
        .copy(stdout.lockingTextWriter());
}
For an advanced look at lazy generation with ranges, see H. S. Teoh's article Component programming with ranges, in which he writes a calendar program with ranges (that sits entirely on the stack!).
Because slice is three dimensional, it is a range which returns ranges of ranges. This can easily be seen by looping over the values:
import std.range : iota;
import std.stdio : writeln;
import std.experimental.ndslice;
void main() {
    auto slice = sliced(iota(1000), 5, 5, 40);
    foreach (item; slice) {
        writeln(item);
    }
}
Which outputs something like this (shortened for brevity):
[[0, 1, ... 38, 39], [40, 41, ... 78, 79], [80, 81, ... 118, 119], [120, 121, ... 158, 159], [160, 161, ... 198, 199]]
...
[[800, 801, ... 838, 839], [840, 841, ... 878, 879], [880, 881, ... 918, 919], [920, 921, ... 958, 959], [960, 961, ... 998, 999]]
The foreach loop in D is much like the for loop in Python. The difference is that D gives you the option of C-style loops and Python-style loops (using for and foreach respectively) without having to use workarounds like enumerate or xrange in the loop.
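As a quick illustration of that point (the array here is arbitrary):
import std.stdio : writeln;
void main() {
    auto langs = ["Python", "D", "C"];
    // Python style: foreach hands you the index and the element directly,
    // no enumerate needed
    foreach (i, lang; langs) {
        writeln(i, ": ", lang);
    }
    // a numeric loop needs no xrange equivalent
    foreach (n; 0 .. 3) {
        writeln(n);
    }
    // and the C-style loop is still there when you want it
    for (size_t i = 0; i < langs.length; i++) {
        writeln(langs[i]);
    }
}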
Using Uniform Function Call Syntax (UFCS), the original example can be rewritten as the following:
import std.range : iota;
import std.experimental.ndslice;
void main() {
    auto slice = 1000.iota.sliced(5, 5, 40);
}
UFCS transforms the call a.func(b) into func(a, b) if a doesn't have a method named func.
UFCS makes generative range-based code easier to follow, so it will be used in the rest of the examples in this article. For a primer on UFCS and why it was made, see this article by Walter Bright, D's creator.
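As a tiny sketch of the rule (square is a made-up free function here, not part of any library):
import std.stdio : writeln;
// a free function, not a member of double
double square(double x) {
    return x * x;
}
void main() {
    double a = 3.0;
    writeln(square(a)); // the normal call: 9
    writeln(a.square);  // the same call written with UFCS: 9
}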
If you don't want to follow along with the code and play around with std.ndslice, then skip to the next section. There are two ways to get your hands on std.ndslice: use digger to download and build the master branch of DMD, the reference D compiler, or use dub, D's official package manager/build system, to download the dub version.
This article will cover the dub path, as using digger to get the latest executable is well explained on its GitHub page. Download dub from the above link, or use the instructions on the same page to get it using your package manager of choice.
Once you have dub, create a new directory with a new file called dub.json, which is dub's config file. I will not explain the dub.json format here; there is a tutorial for that here. If you just want to follow along, copy and paste the following code:
{
"name": "test",
"sourcePaths": ["."],
"dependencies": {
"dip80-ndslice": "~>0.8.7"
},
"targetType": "executable"
}
This configuration tells dub that your project, named test and located in the current directory, will be compiled to an executable and requires the package "dip80-ndslice" (a DIP is a D Improvement Proposal, much like a PEP). Now, in a new file called main.d, we can import std.ndslice:
import std.experimental.ndslice;
void main() {
}
Why the std.experimental? For those of you who are not familiar with the process, all new modules in the D standard library must wait in a staging area, std.experimental, before going into the main namespace. This is to allow people to test new modules and find any bugs that were overlooked during the review process while signaling that the code is not quite ready for prime time.
To build and run this project, use dub with no arguments:
$ dub
std.ndslice has many of the same functions that Numpy has. In the following two sections, I could just provide some simple examples of Numpy and translate them, but halfway through writing that I realized anyone could find that out themselves by reading the documentation, so this section is designed to whet your appetite. To read the docs for std.ndslice and see the function equivalents, click here.
Translating multidimensional slicing from Numpy to std.ndslice is very simple. The example
a = numpy.arange(1000).reshape((5, 5, 40))
b = a[2:-1, 1, 10:20]
is equivalent to
auto a = 1000.iota.sliced(5, 5, 40);
auto b = a[2 .. $ - 1, 1, 10 .. 20];
The main difference is D's use of $ as a symbol for the range's length. Any Numpy slicing code can be translated to std.ndslice with no problem.
So let's look at something a bit more involved. Let's take a 2D array and get an array of the means of each of its columns.
Python
import numpy
data = numpy.arange(100000).reshape((100, 1000))
means = numpy.mean(data, axis=0)
D
import std.range;
import std.algorithm.iteration;
import std.experimental.ndslice;
import std.array : array;
void main() {
    auto means = 100_000.iota
        .sliced(100, 1000)
        .transposed
        .map!(r => sum(r) / r.length)
        .array;
}
To make this comparison apples to apples, I forced execution of the result in order to get a D array at the end by appending array. If I had not done that, the final D result would be a lazy input range rather than a D array, which would be unfair to Numpy, as the Numpy code outputs an array at the end. In a normal D program, however, the results would not be computed until they are used by another part of the program. Also, D doesn't have any stats functions in its standard library (yet; it's being worked on), so this example uses a simple lambda function for the mean. In the map function call, you may have noticed the ! in front of the parentheses. This denotes a compile time argument rather than a run time argument: the compiler generates the map function's code based on the lambda you pass it.
As a quick aside, this example also illustrates something Walter Bright said about D in his 2014 Dconf talk:
No [explicit] loops. Loops in your program are bugs.
The reason the D code is more verbose than the Python is that the map function with the mean lambda works on any sequence of values that conforms to the concept of a finite input range (duck typing), whereas the Python version uses a special Numpy function that only works on Numpy arrays. I will elaborate on this point, and on why I believe the D version is better, in the section titled Numpy's Main Problem, and How D Avoids It.
But despite the D code's length, it is way faster.
Update [2016-01-04]: After much discussion on Hacker News and a conversation with std.ndslice's author, I have realized that this benchmark wasn't exactly fair to Numpy, as the above code is not accessing memory. The iota function dynamically creates the values passed to the sliced function, so the D code never has to touch memory until the final allocation. Also, the D code returned an array of longs while the Numpy code created an array of doubles. Finally, I increased the data size to a million rather than 10,000. Here is an updated version, combined with the code I used to benchmark the original function:
import std.range : iota;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.conv : to;
import std.stdio;
import std.experimental.ndslice;
enum test_count = 10_000;
double[] means;
int[] data;
void f0() {
    means = data
        .sliced(100, 10000)
        .transposed
        .map!(r => sum(r, 0L) / cast(double) r.length)
        .array;
}
void main() {
    data = 1_000_000.iota.array;
    auto r = benchmark!(f0)(test_count);
    auto f0Result = to!Duration(r[0] / test_count);
    f0Result.writeln;
}
These numbers were recorded on a 2015 MacBook Pro with a 2.9 GHz Intel Core Broadwell i5. For Python, I used IPython's %timeit functionality to get a fair time. I made sure to only test the numpy.mean line in the Python code in order to not measure Numpy's known slow initialization times. For the D code, I used std.datetime.benchmark with 10,000 tests and took the mean of the results. The code was compiled with LDC, the LLVM-based D compiler, v0.17.0 alpha 1 (built against LLVM 3.6), using ldmd2 -release -inline -boundscheck=off -O. For those of you using dub, that is equivalent to dub --build=release-nobounds --compiler=ldmd2.
Python: 145 µs
LDC: 5 µs
D is 29x faster
Results from 2016-01-04:
Python: 1.43 msec
LDC: 628 μs
D is 2.27x faster
Not bad, considering the above D code uses the often loathed D GC in order to allocate the new array, and the fact that the vast majority of Numpy is written in C. To quote Walter Bright once again:
There really is no reason your D code can't be as fast or faster than C or C++ code.
Numpy is fast. Compared to regular array handling in Python, Numpy is several orders of magnitude faster. But therein lies the problem: normal Python code and Numpy code don't mix well.
Numpy lives in its own world with its own functions and its own ways of handling values and types.
For example, when using a non-Numpy API, or functions that don't use Numpy and return regular arrays, you either have to use the normal Python functions (slow) or use np.asarray, which copies the data into a new variable (also slow).
A quick search on GitHub shows just how widespread this issue is, with 139,151 results. Granted, some of those are misuses of Numpy, where array literals could be passed directly to the function in order to avoid copies, but many aren't. And this is just open source code! I have seen this pattern many times in closed source projects, where it can't be avoided short of rewriting large parts of an existing code base.
Another example of this problem is the number of Python standard library functions that had to be rewritten in Numpy to take advantage of the type information. Examples include:
No, not all of the functions in those links have standard library equivalents, but there are enough of them to start asking questions about DRY.
The problem with this duplication is, again, the context switching between Python and Numpy code. Accidentally write sum(a) instead of a.sum() and whoops, your code is 10x slower.
The root cause of the above problems is that Python code can only be made so fast, so the Numpy developers tried to make an unholy match of Python and a type system using a ton of C code.
D is a compiled, statically and strongly typed language to begin with. Its code generation already takes advantage of type information with arrays and ranges. With std.ndslice, you can use the entire std.algorithm and std.range libraries with no problems. No code had to be reworked or rewritten to accommodate std.ndslice. And, as a testament to D's code generation abilities with its templates, std.ndslice is entirely a library solution: there were no changes to the compiler or runtime for std.ndslice, and only D code is used.
Using the sum example above:
import std.range : iota;
import std.algorithm.iteration : sum;
import std.experimental.ndslice;
void main() {
    auto slice = 1000.iota.sliced(5, 5, 40);
    auto result = slice
        // sum expects an input range of numerical values, so to get one
        // we call std.experimental.ndslice.byElement to get the unwound
        // range
        .byElement
        .sum;
}
This code is using the same sum function that every other piece of D code uses, in the same way you use it every other time.
As another example, the Pythonic way to get a list of a specified length that is initialized with a certain value is to write
a = [0] * 1000
but Numpy has a special function for that
a = numpy.zeros((1000))
and if you don't use it, your code is four times slower (in this case), not even counting the copying you would have to do with numpy.asarray in the first example. In D, to get a range of a specified length initialized with a certain value, you write
auto a = repeat(0, 1000).array;
and to get an ndslice of that:
auto a = repeat(0, 1000).array.sliced(5, 5, 40);
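And since sliced accepts ranges, not just arrays, the allocation can likely be skipped entirely here as well. A sketch, assuming repeat's result (a random-access range with a known length, like iota's) is accepted by sliced in the same way:
import std.range : repeat;
import std.experimental.ndslice;
void main() {
    // lazy version: no array is ever allocated, values come from repeat on demand
    auto a = repeat(0, 1000).sliced(5, 5, 40);
}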
Where Numpy really shines is the large number of libraries that are built with it. Numpy is used in tons of open source financial and machine learning libraries, so if you just use those libraries, you can write fast numerical programs in Python. Numpy also has tons of tutorials, books, and examples on the Internet for people to learn from.
But this isn't exactly a fair comparison in my opinion, as it could be argued that std.ndslice isn't actually released yet, since it's still in std.experimental. Also, this is already starting to change, as ndslice's author, Ilya Yaroshenko, has stated that his next project is writing a std.blas for D, completely in D using std.ndslice.
The following example and explanation was written by Ilya Yaroshenko, the author of std.ndslice, who was gracious enough to let me include it in this article. I have reworded and expanded in some places. This example uses more complicated D code, so don't worry if you don't understand everything.
Now that you have a more thorough understanding of the library, this will be a more advanced example.
This code is a median image filter, as well as the command line interface for the resulting program. The function movingWindowByChannel can also be used with other filters that use a sliding window as the argument, in particular with convolution matrices such as the Sobel operator.
movingWindowByChannel iterates over an image in sliding window mode. Each window is transferred to a filter, which calculates the value of the pixel that corresponds to the given window.
This function does not calculate border cases, in which a window partially overlaps the image. However, the function can still be used to carry out such calculations. That can be done by creating an enlarged image, with the edges reflected from the original image, and then applying the given function to the new image.
/**
Params:
    filter = unary function; a 2D window is the argument
    image = image with dimensions `(h, w, c)`,
        where c is the number of channels in the image
    nr = number of rows in the window
    nc = number of columns in the window
Returns:
    image with dimensions `(h - nr + 1, w - nc + 1, c)`,
    where c is the number of channels in the image.
    Dense data layout is guaranteed.
*/
Slice!(3, C*) movingWindowByChannel(alias filter, C)
(Slice!(3, C*) image, size_t nr, size_t nc)
{
    // local imports in D work much like Python's local imports,
    // meaning if your code never runs this function, these will
    // never be imported because this function wasn't compiled
    import std.algorithm.iteration: map;
    import std.array: array;

    // 0. 3D
    // The last dimension represents the color channel.
    auto wnds = image
        // 1. 2D composed of 1D
        // Packs the last dimension.
        .pack!1
        // 2. 2D composed of 2D composed of 1D
        // Splits image into overlapping windows.
        .windows(nr, nc)
        // 3. 5D
        // Unpacks the windows.
        .unpack
        // 4. 5D
        // Brings the color channel dimension to the third position.
        .transposed!(0, 1, 4)
        // 5. 3D Composed of 2D
        // Packs the last two dimensions.
        .pack!2;
    return wnds
        // 6. Range composed of 2D
        // Gathers all windows in the range.
        .byElement
        // 7. Range composed of pixels
        // 2D to pixel lazy conversion.
        .map!filter
        // 8. `C[]`
        // The only memory allocation in this function.
        .array
        // 9. 3D
        // Returns slice with corresponding shape.
        .sliced(wnds.shape);
}
A function that calculates the median of a range is also necessary. This function was designed more for simplicity than for speed, and could be optimized heavily.
/**
Params:
    r = input range
    buf = buffer with length no less than the number of elements in `r`
Returns:
    median value over the range `r`
*/
T median(Range, T)(Range r, T[] buf)
{
    import std.algorithm.sorting: sort;
    size_t n;
    foreach (e; r) {
        buf[n++] = e;
    }
    buf[0 .. n].sort();
    immutable m = n >> 1;
    return n & 1 ? buf[m] : cast(T)((buf[m - 1] + buf[m]) / 2);
}
The main function:
void main(string[] args)
{
    import std.conv: to;
    import std.getopt: getopt, defaultGetoptPrinter;
    import std.path: stripExtension;

    // In D, getopt is part of the standard library
    uint nr, nc, def = 3;
    auto helpInformation = args.getopt(
        "nr", "number of rows in window, default value is " ~ def.to!string, &nr,
        "nc", "number of columns in window, default value is equal to nr", &nc);
    if (helpInformation.helpWanted)
    {
        defaultGetoptPrinter(
            "Usage: median-filter [<options...>] [<file_names...>]\noptions:",
            helpInformation.options);
        return;
    }
    if (!nr) nr = def;
    if (!nc) nc = nr;
    auto buf = new ubyte[nr * nc];
    foreach (name; args[1 .. $])
    {
        import imageformats; // can be found at code.dlang.org
        IFImage image = read_image(name);
        auto ret = image.pixels
            .sliced(cast(size_t)image.h, cast(size_t)image.w, cast(size_t)image.c)
            .movingWindowByChannel
                !(window => median(window.byElement, buf))
                (nr, nc);
        write_image(
            name.stripExtension ~ "_filtered.png",
            ret.length!1,
            ret.length!0,
            (&ret[0, 0, 0])[0 .. ret.elementsCount]);
    }
}
I hope any Python users who have read this found std.ndslice tempting, or at least interesting. If you feel the need to learn more about D, then I highly suggest the official D tutorial here.
And I would suggest that any D users reading this consider moving any Numpy code they have written to std.ndslice.
Questions or comments? Feel free to contact me.
Discussion on Hacker News and Reddit