Using D and std.ndslice as a Numpy Replacement

Update [2016-12-21]: The decision was made to not put ndslice in the standard library due to its quick development cycle and changing API. Instead, please use Mir, which is the exact same code.

Disclosure: I am writing this article from a biased perspective. I have been writing Python for six years, three professionally, and have written a book on Python. But I have been writing D for the past eight months, and four months ago I started contributing to D's standard library. I also served as the review manager for std.ndslice's inclusion into the standard library.

Today, the new addition to D's standard library, std.ndslice, was merged into master, and will be included in the next D release (v2.070, which is due this month). std.ndslice is a multidimensional array implementation, not unlike Numpy, with very low overhead, as it's based on D's concept of ranges, which avoids a lot of copying and allows for lazy generation of data. In this article, I will show some of the advantages std.ndslice has over Numpy and why you should consider D for your next numerical project.

This article is written for Numpy users who might be interested in using D. So while it will cover some D basics, D veterans can learn something about std.ndslice as well.

As a Python Programmer, Why Should I Care?

Simply put, if you write your numerical code in D, it will be much, much faster while retaining code readability and programmer productivity.

The Basics

This section is mainly for D newbies. If you already know D, I suggest you skim the code and then head straight for the Getting Hands On section.

To give you a quick taste of the library before diving in, the following code will take the numbers 0 through 999 using the iota function (which acts like Python's xrange) and return a 5x5x40 three-dimensional range.

import std.range : iota;
import std.experimental.ndslice;

void main() {
    auto slice = sliced(iota(1000), 5, 5, 40);
}

D is statically typed, but for the sake of simplicity, this article will use D's type deduction with auto. The sliced function is a factory function that returns a multidimensional slice; it can also accept regular arrays, as they are ranges as well. So now we have a 5x5x40 cube containing the numbers 0 through 999.
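For instance, here is a minimal sketch of sliced applied to a plain D array (the numbers and the 2x3 shape are just for illustration):

import std.experimental.ndslice;

void main() {
    // a plain D array reshaped into a 2x3 matrix; no data is copied,
    // the slice is just a view over the array's memory
    auto matrix = [1, 2, 3, 4, 5, 6].sliced(2, 3);
    assert(matrix[1, 2] == 6); // row 1, column 2
}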

Side Note: What's a Range?

A range is a common abstraction over any sequence of values. A range is any type (a class or struct) that provides the functions front, which returns the first value of the remaining sequence; popFront, which advances the sequence to the next value; and empty, which returns a boolean telling whether the sequence has been exhausted. Ranges can either generate their values lazily, as they are requested, or wrap a sequence of values that already exists and simply provide an eager interface to those values.

For a more in-depth look at ranges, see the official D tutorial's section on ranges.
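To make the definition concrete, here is a minimal hand-written range. This is only a sketch; the CountDown type is made up purely for illustration, but anything with these three members works with foreach and std.algorithm:

// a minimal input range: counts down from n to 1
struct CountDown {
    int n;
    @property bool empty() const { return n <= 0; }
    @property int front() const { return n; }
    void popFront() { --n; }
}

void main() {
    import std.algorithm.iteration : sum;
    // the values are generated lazily, one per popFront call
    assert(CountDown(3).sum == 6); // 3 + 2 + 1
}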

Look Ma, no allocations! This is because iota returns a lazy input range, and sliced returns a Slice (the struct that lies at the heart of std.ndslice) that acts as a wrapper around the iota range and transforms the underlying data as it's accessed. So, when the data is accessed, the Slice pulls the values from iota, which generates them lazily, determines which dimension each value falls in, and decides how it will be returned to the user.

iota -> slice -> user accessing the data
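As a small illustration of that laziness (a sketch; the index arithmetic assumes the row-major 5x5x40 layout described above), nothing below allocates, and the value is only computed from iota at the moment the slice is indexed:

import std.range : iota;
import std.experimental.ndslice;

void main() {
    auto slice = sliced(iota(1000), 5, 5, 40);
    // for a 5x5x40 shape, element [i, j, k] maps to i*200 + j*40 + k,
    // which iota produces on demand when we index into the slice
    assert(slice[1, 2, 3] == 1 * 200 + 2 * 40 + 3); // 283
}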

So, std.ndslice is a bit different in concept from Numpy. Numpy creates its own type of arrays while std.ndslice provides a view of existing data. The composition of ranges to create something completely new is the basis of range-based code, and is one of the reasons D is so powerful. It allows you to write programs where the values are like parts on an assembly line, going from station to station, only to be assembled at the very end in order to avoid unnecessary allocations. This will be important to remember when the performance benchmarks are compared.

The classic example of this is the following code, which reads lines from stdin, sorts them, removes the duplicates, and writes the result back to stdout:

import std.stdio;
import std.array;
import std.algorithm;

void main() {
    stdin
        // get stdin as a range of lines
        .byLine(KeepTerminator.yes)
        // byLine reuses its buffer, so we need a copy of each line
        .map!(a => a.idup)
        .array
        .sort()
        // after sorting, uniq removes all duplicate lines
        .uniq
        // stdout.lockingTextWriter() is an output range, meaning values can be
        // inserted into it, which in this case will be sent to stdout
        .copy(stdout.lockingTextWriter());
}

For an advanced look at lazy generation with ranges, see H. S. Teoh's article Component programming with ranges, in which he writes a calendar program with ranges (one that sits entirely on the stack!).

Because slice is three dimensional, it is a range which returns ranges of ranges. This can easily be seen by looping over the values:

import std.range : iota;
import std.stdio : writeln;
import std.experimental.ndslice;

void main() {
    auto slice = sliced(iota(1000), 5, 5, 40);

    foreach (item; slice) {
        writeln(item);
    }
}

Which outputs something like this (shortened for brevity):

[[0, 1, ... 38, 39], [40, 41, ... 78, 79], [80, 81, ... 118, 119], [120, 121, ... 158, 159], [160, 161, ... 198, 199]]

...

[[800, 801, ... 838, 839], [840, 841, ... 878, 879], [880, 881, ... 918, 919], [920, 921, ... 958, 959], [960, 961, ... 998, 999]]

The foreach loop in D is much like the for loop in Python. The difference is that D gives you both C-style loops and Python-style loops (using for and foreach, respectively) without having to resort to workarounds like enumerate or xrange in the loop.
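For example, here is a quick sketch of the loop styles side by side (the data array is just a placeholder):

import std.stdio : writeln;

void main() {
    auto data = [10, 20, 30];

    // Python-style: iterate over the values directly
    foreach (value; data)
        writeln(value);

    // index and value in one go, no enumerate needed for arrays
    foreach (i, value; data)
        writeln(i, ": ", value);

    // the C-style loop is still available when you want it
    for (size_t i = 0; i < data.length; ++i)
        writeln(data[i]);
}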

Using Uniform Function Call Syntax (UFCS), the original example can be rewritten as the following:

import std.range : iota;
import std.experimental.ndslice;

void main() {
    auto slice = 1000.iota.sliced(5, 5, 40);
}

UFCS transforms the call

a.func(b)

to

func(a, b)

if a doesn't have a method named func.

UFCS makes generative range-based code easier to follow, so it will be used in the rest of the examples in this article. For a primer on UFCS and why it was made, see this article by Walter Bright, D's creator.
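As a quick sketch of what that means in practice, the two lines below are the same computation written without and with UFCS:

import std.range : iota;
import std.algorithm.iteration : map, sum;

void main() {
    // without UFCS: nested calls read inside-out
    auto a = sum(map!(x => x * x)(iota(4)));
    // with UFCS: the same chain reads left to right
    auto b = iota(4).map!(x => x * x).sum;
    assert(a == b && a == 14); // 0 + 1 + 4 + 9
}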

Getting Hands On

If you don't want to follow along with the code and play around with std.ndslice, then skip to the next section. There are two ways to get your hands on std.ndslice: use digger to download and build the head of the master branch of DMD, the reference D compiler, or use dub, D's official package manager/build system, to download the dub package.

This article will cover the dub path, as using digger to get the latest executable is well explained on its GitHub page. Download dub from the above link, or use the instructions on the same page to get it with the package manager of your choice.

Once you have dub, create a new directory containing a new file called dub.json, which is dub's config file. I will not explain the dub.json format here; there is a tutorial for that here. If you just want to follow along, copy and paste the following code:

{
    "name": "test",
    "sourcePaths": ["."],
    "dependencies": {
        "dip80-ndslice": "~>0.8.7"
    },
    "targetType": "executable"
}

This configuration tells dub that your project, named test, lies in the current directory, will be compiled to an executable, and requires the package "dip80-ndslice" (a DIP is a D Improvement Proposal, much like a PEP). Now, in a new file called main.d, we can import std.ndslice:

import std.experimental.ndslice;

void main() {
}

Why the std.experimental? For those of you who are not familiar with the process, all new modules in the D standard library must wait in a staging area, std.experimental, before going into the main namespace. This is to allow people to test new modules and find any bugs that were overlooked during the review process while signaling that the code is not quite ready for prime time.

To build and run this project, run dub with no arguments:

$ dub

Examples

std.ndslice has many of the same functions that Numpy has. For the following two sections, I could just provide some simple Numpy examples and their translations, but halfway through writing that I realized anyone could find those by reading the documentation, so this section is instead designed to whet your appetite. To read the docs for std.ndslice and see the function equivalents, click here.

Slicing

Translating multidimensional slicing from Numpy to std.ndslice is very simple. The example

a = numpy.arange(1000).reshape((5, 5, 40))
b = a[2:-1, 1, 10:20]

is equivalent to

auto a = 1000.iota.sliced(5, 5, 40);
auto b = a[2 .. $ - 1, 1, 10 .. 20];

The main difference is D's use of $ as a symbol for the range's length. Any Numpy slicing code can be translated to std.ndslice without a problem.
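A couple more translations, as a rough sketch; Numpy's bare : (a whole axis) becomes 0 .. $ in D:

import std.range : iota;
import std.experimental.ndslice;

void main() {
    auto a = 1000.iota.sliced(5, 5, 40);

    auto plane = a[0 .. $, 2, 0 .. $]; // like a[:, 2, :]  -> a 5x40 view
    auto row   = a[4, 4, 0 .. $];      // like a[4, 4, :]  -> a 1D view of 40 elements
}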

A Basic Example With A Benchmark

So let's look at something a bit more involved. Let's take a 2D array and compute an array containing the mean of each of its columns.

Python
import numpy

data = numpy.arange(100000).reshape((100, 1000))
means = numpy.mean(data, axis=0)
D
import std.range;
import std.algorithm.iteration;
import std.experimental.ndslice;
import std.array : array;

void main() {
    auto means = 100_000.iota
        .sliced(100, 1000)
        .transposed
        .map!(r => sum(r) / r.length)
        .array;
}

To make this comparison apples to apples, I forced execution of the result in order to get a D array at the end, by appending the array call. If I had not done that, the final D result would be a lazy input range rather than a D array, which would be unfair to Numpy, as the Numpy code outputs an array at the end. In a normal D program, however, the results would not be computed until they are used by another part of the program. Also, D doesn't have any stats functions in its standard library (yet; they're being worked on), so this example uses a simple lambda function for the mean. In the map call, you may have noticed the ! before the argument: it denotes a compile-time (template) argument rather than a run-time argument. The compiler generates the map function code based on the lambda function.
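As a tiny illustration of the ! syntax on its own, separate from ndslice (the doubling lambda is just a placeholder):

import std.algorithm.iteration : map;
import std.array : array;

void main() {
    // the lambda before the parentheses is a compile-time (template) argument;
    // the array it is applied to is the ordinary run-time argument
    auto doubled = [1, 2, 3].map!(x => x * 2).array;
    assert(doubled == [2, 4, 6]);
}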

As a quick aside, the mean example above also illustrates something Walter Bright said about D in his 2014 DConf talk:

No [explicit] loops. Loops in your program are bugs.

The reason the D code is more verbose than the Python is that the map call with the mean lambda works on any sequence of values that conforms to the concept of a finite input range (duck typing), whereas the Python version uses a special Numpy function that only works on Numpy arrays. I will elaborate on this point, and on why I believe the D version is better, in the section titled Numpy's Main Problem, and How D Avoids It.
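To show that flexibility, here is the same mean computation lifted into a small function template. This is my own sketch (mean is not a standard library function here); it works on a plain array or a lazy range alike:

import std.algorithm.iteration : sum;
import std.range : iota;

// accepts anything that is a finite range of numbers with a length
auto mean(Range)(Range r) { return sum(r) / r.length; }

void main() {
    assert(mean([1, 2, 3, 4]) == 2); // a regular D array
    assert(mean(iota(5)) == 2);      // a lazy range; nothing is allocated
}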

The Benchmark

But despite the D code's length, it is way faster.

Update [2016-01-04]: After much discussion on Hacker News and a conversation with std.ndslice's author, I have realized that this benchmark wasn't exactly fair to Numpy, as the above code never accesses memory. The iota function generates the values passed to sliced on the fly, so the D code never has to touch memory until the final allocation. Also, the D code returned an array of longs while the Numpy code created an array of doubles. Finally, I increased the data size to a million elements rather than 100,000. Here is an updated version, combined with the code I used to benchmark the original function:

import std.range : iota;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.conv : to;
import std.stdio;
import std.experimental.ndslice;

enum test_count = 10_000;

double[] means;
int[] data;

void f0() {
    means = data
        .sliced(100, 10000)
        .transposed
        .map!(r => sum(r, 0L) / cast(double) r.length)
        .array;
}

void main() {
    data = 1_000_000.iota.array;

    auto r = benchmark!(f0)(test_count);
    auto f0Result = to!Duration(r[0] / test_count);
    f0Result.writeln;
}

These numbers were recorded on a 2015 MacBook Pro with a 2.9 GHz Intel Broadwell Core i5. For Python, I used IPython's %timeit functionality to get a fair time. I made sure to only time the numpy.mean line in the Python code, in order not to measure Numpy's known slow initialization times. For the D code, I used std.datetime.benchmark with 10,000 tests and took the mean of the results. The D code was compiled with LDC, the LLVM-based D compiler, v0.17.0 alpha 1 (built with LLVM 3.6): ldmd2 -release -inline -boundscheck=off -O. For those of you using dub, that is equivalent to dub --build=release-nobounds --compiler=ldmd2.

Original Results:
Python: 145 µs
LDC:      5 µs

D is 29x faster

Results From 2016-01-04:
Python: 1.43 ms
LDC:    628 µs

D is 2.27x faster

Not bad, considering that the above D code uses the often-loathed D GC to allocate the new array, and that the vast majority of Numpy is written in C. To quote Walter Bright once again:

There really is no reason your D code can't be as fast or faster than C or C++ code.

Numpy's Main Problem, and How D Avoids It

Numpy is fast. Compared to regular array handling in Python, Numpy is several orders of magnitude faster. But therein lies the problem: normal Python and Numpy code don't mix well.

Numpy lives in its own world with its own functions and its own ways of handling values and types. For example, when using an API that doesn't use Numpy and returns regular arrays, you either have to process them with the normal Python functions (slow), or convert them with np.asarray, which copies the data into a new variable (also slow). A quick search on GitHub shows just how widespread this issue is: 139,151 results. Granted, some of those are mis-uses of Numpy, where array literals could be passed directly to the function in order to avoid copies, but many aren't. And this is just open source code! I have seen this pattern many times in closed source projects, where it can't be avoided, save rewriting large parts of an existing code base.

Another example of this problem is the number of Python standard library functions that had to be rewritten in Numpy to take advantage of the type information; Numpy's reference documentation is full of them (sum, min, max, sort, and so on).

No, not all of those functions have standard library equivalents, but there are enough of them to start asking questions about DRY.

The problem with this duplication is, once again, having to switch contexts between Python and Numpy code. Accidentally writing

sum(a)

instead of

a.sum()

and whoops, your code is 10x slower.

The root cause of the above problems is that Python code can only be made so fast, so the Numpy developers tried to make an unholy match of Python and a type system using a ton of C code.

D Doesn't Have This Problem.

D is a compiled, statically and strongly typed language to begin with. Its code generation already takes advantage of type information with arrays and ranges. With std.ndslice, you can use the entire std.algorithm and std.range libraries with no problems; no code had to be reworked or rewritten to accommodate std.ndslice. And, as a testament to D's code generation abilities with its templates, std.ndslice is entirely a library solution: there were no changes to the compiler or runtime, and only D code is used.

Using the sum example above:

import std.range : iota;
import std.algorithm.iteration : sum;
import std.experimental.ndslice;

void main() {
    auto slice = 1000.iota.sliced(5, 5, 40);
    auto result = slice
        // sum expects an input range of numerical values, so to get one
        // we call std.experimental.ndslice.byElement to get the unwound
        // range
        .byElement
        .sum;
}

This code is using the same sum function that every other piece of D code uses, in the same way you use it every other time.

As another example, the Pythonic way to get a list of a specified length that is initialized with a certain value is to write

a = [0] * 1000

but Numpy has a special function for that

a = numpy.zeros((1000))

and if you don't use it, your code is four times slower (in this case), not even counting the copying you would have to do with numpy.asarray in the first example. In D, to get a range of a specified length initialized with a certain value, you write

auto a = repeat(0, 1000).array;

and to get the ndslice of that

auto a = repeat(0, 1000).array.sliced(5, 5, 40);
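And because that slice is only a view over the array we just allocated, writing through it modifies the underlying array. A small sketch:

import std.array : array;
import std.range : repeat;
import std.experimental.ndslice;

void main() {
    auto a = repeat(0, 1000).array.sliced(5, 5, 40);

    // the slice is a view over the freshly allocated array,
    // so assignments through it land in that array
    a[1, 2, 3] = 42;
    assert(a[1, 2, 3] == 42);
}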

Where Numpy Comes Out on Top

Where Numpy really shines is the large number of libraries that are built with it. Numpy is used in tons of open source financial and machine learning libraries, so if you just use those libraries, you can write fast numerical programs in Python. Numpy also has tons of tutorials, books, and examples on the Internet for people to learn from.

But this isn't exactly a fair comparison in my opinion, as it could be argued that std.ndslice isn't actually released yet, since it's still in std.experimental. Also, this is already starting to change: ndslice's author, Ilya Yaroshenko, has stated that his next project is writing a std.blas for D, completely in D, using std.ndslice.

Image Processing Example

The following example and explanation were written by Ilya Yaroshenko, the author of std.ndslice, who was gracious enough to let me include them in this article. I have reworded and expanded them in some places. This example uses more complicated D code, so don't worry if you don't understand everything.

Now that you have a more thorough understanding of the library, this will be a more advanced example. This code is a median image filter, along with the command line interface for the resulting program. The function movingWindowByChannel can also be used with other filters that take a sliding window as their argument, in particular with convolution matrices such as the Sobel operator.

movingWindowByChannel iterates over an image in sliding window mode. Each window is transferred to a filter, which calculates the value of the pixel that corresponds to the given window.

This function does not handle border cases in which a window only partially overlaps the image. However, the function can still be used to carry out such calculations. That can be done by creating an enlarged image, with the edges reflected from the original image, and then applying the given function to that new image.

/**
Params:
    filter = unary function; a 2D window is its argument
    image = image of dimensions `(h, w, c)`,
        where c is the number of channels in the image
    nr = number of rows in the window
    nc = number of columns in the window

Returns:
    image of dimensions `(h - nr + 1, w - nc + 1, c)`,
        where c is the number of channels in the image.
        Dense data layout is guaranteed.
*/
Slice!(3, C*) movingWindowByChannel(alias filter, C)
(Slice!(3, C*) image, size_t nr, size_t nc)
{
    // local imports in D work much like Python's local imports:
    // because this is a template function, if it is never instantiated,
    // these modules are never imported or compiled
    import std.algorithm.iteration: map;
    import std.array: array;

    // 0. 3D
    // The last dimension represents the color channel.
    auto wnds = image
        // 1. 2D composed of 1D
        // Packs the last dimension.
        .pack!1
        // 2. 2D composed of 2D composed of 1D
        // Splits image into overlapping windows.
        .windows(nr, nc)
        // 3. 5D
        // Unpacks the windows.
        .unpack
        // 4. 5D
        // Brings the color channel dimension to the third position.
        .transposed!(0, 1, 4)
        // 5. 3D Composed of 2D
        // Packs the last two dimensions.
        .pack!2;

    return wnds
        // 6. Range composed of 2D
        // Gathers all windows in the range.
        .byElement
        // 7. Range composed of pixels
        // 2D to pixel lazy conversion.
        .map!filter
        // 8. `C[]`
        // The only memory allocation in this function.
        .array
        // 9. 3D
        // Returns slice with corresponding shape.
        .sliced(wnds.shape);
}

A function that calculates the median of a range is also necessary. This function was designed more for simplicity than for speed, and could be heavily optimized.

/**
Params:
    r = input range
    buf = buffer with length no less than the number of elements in `r`
Returns:
    median value over the range `r`
*/
T median(Range, T)(Range r, T[] buf)
{
    import std.algorithm.sorting: sort;

    size_t n;

    foreach (e; r) {
        buf[n++] = e;
    }

    buf[0 .. n].sort();
    immutable m = n >> 1;
    return n & 1 ? buf[m] : cast(T)((buf[m - 1] + buf[m]) / 2);
}

The main function:

void main(string[] args)
{
    import std.conv: to;
    import std.getopt: getopt, defaultGetoptPrinter;
    import std.path: stripExtension;

    // In D, getopt is part of the standard library
    uint nr, nc, def = 3;
    auto helpInformation = args.getopt(
        "nr", "number of rows in window, default value is " ~ def.to!string, &nr,
        "nc", "number of columns in window, default value is equal to nr", &nc);

    if (helpInformation.helpWanted)
    {
        defaultGetoptPrinter(
            "Usage: median-filter [<options...>] [<file_names...>]\noptions:",
            helpInformation.options);
        return;
    }

    if (!nr) nr = def;
    if (!nc) nc = nr;

    auto buf = new ubyte[nr * nc];

    foreach (name; args[1 .. $])
    {
        import imageformats; // can be found at code.dlang.org

        IFImage image = read_image(name);

        auto ret = image.pixels
            .sliced(cast(size_t)image.h, cast(size_t)image.w, cast(size_t)image.c)
            .movingWindowByChannel
                !(window => median(window.byElement, buf))
                 (nr, nc);

        write_image(
            name.stripExtension ~ "_filtered.png",
            ret.length!1,
            ret.length!0,
            (&ret[0, 0, 0])[0 .. ret.elementsCount]);
    }
}
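Assuming the program is built under the name median-filter (as the usage string above suggests; with a dub.json like the earlier one, the executable takes the project's name), a run over a single image with a 5x5 window would look something like this, where photo.png is just a placeholder file name:

$ ./median-filter --nr 5 photo.png

Per the write_image call above, this produces photo_filtered.png next to the original.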

Wrapping Up

I hope any Python users who have read this found std.ndslice tempting, or at least interesting. If you feel the need to learn more about D, then I highly suggest the official D tutorial here.

And I would suggest that any D users reading this consider moving any Numpy code they have written to std.ndslice.

Questions or comments? Feel free to contact me.


Discussion on Hacker News and Reddit