published: January 3, 2013 — tags: c, d — 4 comments

A look at the D programming language

The D programming language is an interesting alternative to C++, since it is also compiled to native code. Let us look at some of D's constructs from a C++ perspective to show their usefulness. After that come a few words about D's garbage collector and why it is still considered a controversial feature.

I think by now there's a programming language for every letter of the alphabet. There are hundreds of languages in various stages of development. There are not so many, though, if you only count the natively compiled ones that could be used instead of C++ and whose code looks similar. D is the best fit in this category, if not the only fit.¹

Why switch from C++ to D? Obviously, with a big C++ code base, not many would consider rewriting it, but for new projects it's a valid option. Usually libraries are part of what makes a great programming language, and that is why Python is great. There were some problems in the past, when Phobos, the standard library of D, was far behind in development, but thankfully that is no longer the case. You may have second thoughts if you are tied to some existing library. If it has a C interface, D will handle most of it, but you will have to do some work converting the header files and taking care of variables and calls. If it's a C++ library, there's more work involved and you may need a C wrapper, so that may be a deal breaker.
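As a rough illustration of the C case, here is a minimal sketch that calls two functions from the C standard library. The declarations are written by hand as a stand-in for a converted header file (D's runtime already ships them in core.stdc.string and core.stdc.stdlib), and the helper name cLength is made up for this example.

```d
import std.string : toStringz;

// Hand-written equivalents of the C prototypes, standing in for a
// converted header file.
extern (C) size_t strlen(const(char)* s);
extern (C) int abs(int n);

// Hypothetical helper: measures a D string with the C function.
size_t cLength(string s)
{
    // D strings are not zero-terminated, so convert before crossing over.
    return strlen(s.toStringz);
}

void main()
{
    import std.stdio : writeln;

    writeln(cLength("hello"));  // prints 5
    writeln(abs(-42));          // prints 42
}
```

The "taking care of variables and calls" part shows up in cLength: the conversion to a zero-terminated string is the kind of glue a real binding needs at every boundary.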

You may wonder who uses D in the real world. Will knowing D land you a job? Well, the demand is much smaller than for C++, C# or Java, but so is the supply. It depends on being well informed and being able to work remotely, but even then there's not much liquidity in this market. You would however benefit from D if you write code for yourself, are a researcher or one-man developer. Then there's nothing holding you back from checking all the possible options, D among them, and finding out that D brings some great things to the table.

What are those good parts? Expressiveness of code, no need for header files or forward class declarations, safer code by default (including threads), the ability to write code using multiple paradigms, Unicode-aware strings, fast compilation and, probably most importantly, templates and algorithms done right.

Safety and debugging

Let us start with the simplest program, where nothing could go wrong (wishful thinking). Here's a typical greeting program that should be called with a name as the argument:

import std.stdio;
void main(string[] args) {
    writeln("Hello ", args[1]);
}

Notice that the code is flawed. It should check that there are at least 2 elements in the args array, so instead of args[1] we could write args.length > 1 ? args[1] : "stranger". But we didn't do that, we compiled the code with dmd -g hello.d, and then some uninformed lunatic executed the program without any arguments. What happened then? The program exited, printing the following information:

core.exception.RangeError@hello(3): Range violation
----------------
0x0040E48C in char[][] core.sys.windows.stacktrace.StackTrace.trace()
0x0040E317 in core.sys.windows.stacktrace.StackTrace core.sys.windows.stacktrace.StackTrace.__ctor()
0x0040466C in onRangeError
0x0040202F in _Dmain at C:\data\dmd2\test\hello.d(3)
0x00402A48 in extern (C) int rt.dmain2.main(int, char**).void runMain()
0x00402A82 in extern (C) int rt.dmain2.main(int, char**).void runAll()
0x004026A4 in main
0x004187AD in mainCRTStartup
0x75853677 in BaseThreadInitThunk
0x77709F42 in RtlInitializeExceptionChain
0x77709F15 in RtlInitializeExceptionChain
----------------

This is quite informative. On the other hand, a replica of this program in C++ printed "Hello ", as if nothing bad had happened, giving the programmer a false feeling of accomplishment. When the array index was increased from 1 to 1000 we finally got some reaction: this time the program crashed, but the only information given to the user was Segmentation fault (core dumped). This is a joke... except that it's not!

Development in D is a much better experience than in C++ because you are better informed about errors, they aren't ignored, and are reported in a consistent way. You also get better tools built right into the language to help verify that the code does what you intended it to do. Those are: asserts, static asserts, unit tests, input-output contracts.
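To make those four tools concrete, here is a small sketch built around the greeting program. The helper names nameToGreet and clampPercent are made up for illustration, and the short in (...) / out (...) contract syntax assumes a reasonably recent compiler; compilers from the time of writing need the longer block form of contracts.

```d
import std.stdio : writeln;

// Checked at compile time: a failure stops the build.
static assert(int.sizeof == 4, "int is always 32 bits in D");

// Hypothetical helper: the name to greet, or "stranger" when none was given.
string nameToGreet(string[] args)
in (args.length >= 1, "args[0] should hold the program name")
{
    return args.length > 1 ? args[1] : "stranger";
}

// Hypothetical helper with an output contract on its result.
int clampPercent(int value)
out (result; result >= 0 && result <= 100)
{
    return value < 0 ? 0 : (value > 100 ? 100 : value);
}

// Runs before main() when the program is compiled with -unittest.
unittest
{
    assert(nameToGreet(["hello"]) == "stranger");
    assert(nameToGreet(["hello", "Walter"]) == "Walter");
    assert(clampPercent(150) == 100);
}

void main(string[] args)
{
    writeln("Hello ", nameToGreet(args));
}
```

Contracts and asserts are checked in ordinary builds and dropped when compiling with -release, so they cost nothing in the shipped binary.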

Basic types, arrays, slices and strings

The strength of D lies in its type system. First of all, in the spirit of RAII (resource acquisition is initialization), one doesn't end up with random garbage in variables that were not initialized. Integer types always initialize to 0, floats to NaN (not a number) and, probably most importantly, pointers are initialized to null. This can sometimes save your skin.
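A short sketch of what that means in practice; the variable names are arbitrary. It also shows the explicit opt-out (= void) for the rare case where the initialization cost matters.

```d
import std.math : isNaN;

void main()
{
    int count;      // starts as 0
    double ratio;   // starts as NaN, so a forgotten assignment poisons the result
    int* ptr;       // starts as null
    char c;         // starts as 0xFF, an invalid UTF-8 code unit

    assert(count == 0);
    assert(isNaN(ratio));
    assert(ptr is null);
    assert(c == 0xFF);

    // Opting out has to be spelled out:
    int scratch = void;   // deliberately left uninitialized
    scratch = 1;          // must be written before it is read
    assert(scratch == 1);
}
```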

Secondly, arrays have been made first-class citizens, unlike in C, where they are in general just a pointer in disguise and account for many buffer overruns. There are static fixed-size arrays, like double[3], which are passed by value and are useful in calculations. Then there are dynamic arrays, like int[], that know their current size and capacity. There are even associative arrays, like int[string], in which a string key maps to an integer value, and which are useful when counting words, for example:

int[string] words;
// ... a loop or something where we read someWord ...
words[someWord]++;

Notice that we do not have to check if the item words[someWord] exists and write: if (someWord in words) words[someWord]++; else words[someWord] = 1;. We don't have to do that thanks to int and other types having a default initializer!
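For completeness, here is one way the fragment above could look as a whole program. The function name countWords and the sample sentence are assumptions made for this sketch.

```d
import std.array : split;
import std.stdio : writeln;

// Hypothetical helper: counts how often each whitespace-separated word occurs.
int[string] countWords(string text)
{
    int[string] words;
    foreach (word; text.split())
        words[word]++;   // a missing key starts from int.init, i.e. 0
    return words;
}

void main()
{
    auto counts = countWords("to be or not to be");
    assert(counts["to"] == 2);
    assert(counts["or"] == 1);
    assert("maybe" !in counts);   // keys are only created by the increment
    writeln(counts.length, " distinct words");
}
```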

Coming back to dynamic arrays: since they consist of a pointer to a block of memory and a number representing its length, it is possible to create an array that points inside a bigger array. That is what D does with slices. They are very useful when processing big chunks of data that would be inefficient to copy around. The C way would be to use 2 pointers, or a pointer and a size. Slices, on the other hand, hide the fact that we're dealing with a view of another array. They behave exactly like an array, except when you try to grow them: then either a copy of the viewed data has to be made, or the runtime has to be sure that there is still enough room to expand the view in place. Slices are one of the main reasons D has a garbage collector: the bigger array must not be freed when it goes out of scope while slices of it still exist.
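The view-versus-copy behaviour can be checked with a few asserts. This is a minimal sketch with arbitrary numbers; it relies on the slice ending in the middle of the bigger array, so that growing it cannot happen in place.

```d
void main()
{
    int[] data = [0, 1, 2, 3, 4, 5, 6, 7];
    int[] view = data[2 .. 5];    // elements 2, 3, 4; nothing is copied

    view[0] = 42;                 // writes through to the bigger array
    assert(data[2] == 42);
    assert(view.length == 3);

    view ~= 99;                   // growing in place would overwrite data[5]...
    assert(data[5] == 5);         // ...so the runtime copied the view instead
    assert(view == [42, 3, 4, 99]);
    assert(view.ptr !is data.ptr + 2);   // the view now lives in its own block

    view[0] = 7;                  // from here on the two are independent
    assert(data[2] == 42);
}
```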

Then there are strings. They are arrays of char, wchar or dchar (8, 16 and 32-bit respectively), but let's stick with the basic char[]. In D char and byte are separate types, so you can't confuse a string with an array of numbers. That's one good thing. The other is D's handling of Unicode. 8-bit strings are in UTF-8, and the standard library, including std.regex, knows that and handles it properly. This is a big deal if you are using a language other than English. Regular expressions just work. There's no outside dependency on ICU like in Boost. If you process text in UTF-8, you will enjoy what D has to offer. Using ranges to process a string also has its benefits. C++'s std::string is far less efficient for this kind of work.
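A small sketch of what UTF-8 awareness looks like at the language level. The sample word is an arbitrary Polish one ("turtle"), and the source file is assumed to be saved as UTF-8.

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string word = "żółw";

    assert(word.length == 7);       // UTF-8 code units (bytes)
    assert(word.walkLength == 4);   // code points, decoded while walking the range

    // Asking for dchar makes foreach decode the UTF-8 sequence on the fly.
    foreach (dchar c; word)
        writeln(c);

    // Slicing works on code units and copies nothing: the first two
    // letters take two bytes each.
    assert(word[0 .. 4] == "żó");
}
```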

Many of the problems with C++ are inherited from C. Arrays being unbounded pointers and zero-terminated strings both come from C. C++ does offer proper dynamic arrays (std::vector), but its std::string is less than optimal because of compatibility with C. As a result you'll find that D programs doing string parsing outperform their C++ counterparts, especially thanks to slices and ranges. C++ strings are doomed by the design choice that std::string must keep a C zero-terminated string as its storage, which results in lots of copying around and poor performance. The recent C++11 standard at least mitigates some of the copying pain by introducing rvalue references. Still, it takes much more effort to do efficient and Unicode-aware processing in C++, while in D the programmer gets everything that is needed in the standard library.

Speaking about arrays in general, there are some other design choices that affect programming. C++ uses the + sign for concatenation and, for reasons unknown to me, introduces two stream operators, << and >>. The problem is that all three of those operators also have other meanings, not to mention that >> can appear when using templates, for example hash_map<int, set<int>>, which until recently (before C++11) was considered a syntax error. D doesn't use angle brackets for templates and has a special ~ operator reserved for array concatenation, which saves people and compilers from the confusion C++ may cause:

cout << 1 << 3 << endl;  // it's "13" and not "8"!
cout << (1 << 3) << endl;  // now it's "8"
cout << string("a ") + "test" << endl;  // prints "a test"
cout << "a " + "test" << endl;  // doesn't compile
cout << string("it's ") + "a " + "test" << endl;  // prints "it's a test"
cout << string() + 8 << endl;  // doesn't compile!
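For comparison, a rough D counterpart of the lines above. It is only a sketch; the conversion through std.conv.to is one of several ways to turn a number into text.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    writeln(1 << 3);                         // << is only ever a shift: prints 8
    writeln("a " ~ "test");                  // prints: a test
    writeln("it's " ~ "a " ~ "test");        // prints: it's a test
    writeln("number " ~ to!string(8));       // prints: number 8

    // The same operator works for any array type.
    int[] xs = [1, 2] ~ [3];
    xs ~= 4;                                 // append in place
    assert(xs == [1, 2, 3, 4]);
}
```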

Algorithms, Ranges and code generation

In my opinion one of the best things to happen to C++ was the introduction of the STL (Standard Template Library) by Alexander Stepanov. It provides commonly used algorithms and data containers while keeping algorithms and data separate, without losing type information. That was a major leap forward: it made it possible to write generic code, where you do not have to develop n implementations of an algorithm for n types of containers.

In plain C, generic code comes at the cost of type identity. When you use qsort (the sorting routine from the standard library) you have to cast from type* to void* and back, which is asking for trouble. C++ brings type safety, but there's one design drawback. It uses iterators to move over data collections. They bring C pointers to a higher level, but are basically the same thing, except that they are (supposedly) much harder to implement in a new class. Quite often you need not one but two of them in an algorithm: the current position and the end. Dealing with two iterators is hard, because you can't easily pass them around or return them from a function without creating a special structure to hold both of them. The creators of D's library embraced the idea of having such structures, or even interfaces, and called them Ranges. This quite simple concept enabled some very powerful constructs.
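To show how little it takes to write such a structure, here is a minimal sketch of a hand-made range. The Countdown type is invented for this example; the three members it defines are what Phobos checks for in an input range.

```d
import std.algorithm : equal, map;
import std.range : isInputRange;

// Hypothetical example type: counts down from a start value to 1.
struct Countdown
{
    int current;

    @property bool empty() const { return current <= 0; }
    @property int front() const { return current; }
    void popFront() { --current; }
}

// empty, front and popFront are all it takes to be an input range.
static assert(isInputRange!Countdown);

void main()
{
    import std.stdio : writeln;

    foreach (n; Countdown(3))
        writeln(n);   // prints 3, 2, 1 on separate lines

    // A single value to pass around, and the standard algorithms accept it.
    assert(Countdown(3).map!(n => n * n).equal([9, 4, 1]));
}
```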

A range abstracts away from iterators and the simple structs holding them, so it doesn't share their limitations and can even be infinite, becoming a sort of generator. repeat, iota, sequence and recurrence are good examples of standard functions that create such generator ranges, with sequence and recurrence (and repeat, when called without a count) being infinite.
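A quick sketch of three of those generators, with take used to cut the infinite ones down to size. The values are arbitrary; the sequence formula is the odd-numbers one from the Phobos documentation.

```d
import std.algorithm : equal;
import std.range : iota, repeat, sequence, take;

void main()
{
    // Finite: begin, end (exclusive), step.
    assert(iota(0, 10, 3).equal([0, 3, 6, 9]));

    // Finite when given a count, infinite without one.
    assert(repeat(7, 3).equal([7, 7, 7]));
    assert(repeat(7).take(2).equal([7, 7]));

    // Infinite: a[0] and a[1] are the two values passed in, n is the index.
    auto odds = sequence!("a[0] + n * a[1]")(1, 2);
    assert(odds.take(4).equal([1, 3, 5, 7]));
}
```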

Add to that the power of templates and generated code and you get terse code that is almost magical. The following example taken straight from Phobos documentation presents the power of recurrence:

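import std.range, std.stdio;  // needed by the statements below, which belong inside a function such as main()

// a[0] = 1, a[1] = 1, and compute a[n+1] = a[n-1] + a[n]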
auto fib = recurrence!("a[n-1] + a[n-2]")(1, 1);
// print the first 10 Fibonacci numbers
foreach (e; take(fib, 10)) { writeln(e); }

// print the first 10 factorials
foreach (e; take(recurrence!("a[n-1] * n")(1), 10)) { writeln(e); }

Isn't it impressive? The last line could have been written using a lambda: recurrence!((a, n) => a[n-1] * n)(1), but that is a longer and more explicit form than writing recurrence!"a[n-1] * n"(1) – yes, you can even drop the parentheses after the exclamation mark if it's clear where the actual function parameters start.

There are a lot of things about D that I haven't covered here, most notably templates, as one could write (and someone has written) a book about them alone. But let me say a few words about what still needs improvement: the garbage collector.

Garbage Creation, err... Collection

This is a tough call: may a systems programming language use a garbage collector? D does. There's a problem with GCs: they have been good on paper since the eighties, but their implementations are usually quite poor, for one of two reasons: not enough time and resources, or a design barrier. We've seen the first kind in Mono. For a long time it only used the conservative Boehm GC, and only recently did it start to offer SGen, a generational GC, though not yet a precise one, while the original Microsoft .NET GC has been generational and compacting pretty much forever. Compacting is good for long-running programs, which otherwise keep expanding their heap due to fragmentation.

The second reason is software design. There have been a few excellent garbage collectors written for Java, but their appearance was made possible by the existence and the limitations of the Java Virtual Machine. D does not use the Boehm GC like standard Mono does; its GC is a custom one written in D, but it is conservative as well, and it faces the triangle of impossibility: programs can't be fast, memory efficient and close to the metal at the same time. Some people say that having a GC is very bad in low-latency applications like games. Yes, but in games even malloc is often considered inefficient, and people end up using special pools of memory and preallocated resources. One can do this in C++ as well as in D; in D it is still somewhat awkward, but it's doable. Let me show you a complete example using malloc and defining custom _new and _delete:

import std.stdio, std.conv, core.stdc.stdlib;

class C {
    int a;
    this(int a) {
        this.a = a;
        writefln("C created");
    }
    ~this() {
        writefln("C destroyed");
    }
}

T _new(T, Args...) (Args args) {
    size_t objSize = __traits(classInstanceSize, T);
    void* tmp = core.stdc.stdlib.malloc(objSize);
    if (!tmp) throw new Exception("Memory allocation failed");
    void[] mem = tmp[0..objSize];
    T obj = emplace!(T, Args)(mem, args);
    return obj;
}

void _delete(T)(T obj) {
    destroy(obj);  // runs the destructor; older compilers called this clear()
    core.stdc.stdlib.free(cast(void*)obj);
}

void main() {
    C x = _new!C(100);
    _delete(x);
}

You can replace malloc and free with something else, if you wish; only make sure there is proper alignment. Using this and structs you could almost switch GC off completely if you stay away from constructs and routines that rely on the existence of a garbage collector.

Looking at those constructs brings us to the point where we actually might appreciate GC being in the runtime. First of all it eliminates a whole class of problems (double free, forgot to free, don't know who owns the object), then there are slices which bring great performance to string parsing programs.

GC gives such possibilities, but there's a catch. The garbage collector in D isn't very fast and it stops the world (halts other threads). It is also imprecise: it can't always tell whether a group of bytes it encounters while scanning is a pointer or not, and assumes that it is. The consequences can be catastrophic. If such a false pointer points to an object allocated by the GC, that object won't be freed. This is easy to observe when the allocated object occupies a lot of RAM on a 32-bit system. The probability that some double word on the stack, in static data or in run-time type information, when treated as a pointer, points somewhere inside that object becomes high enough to prevent it from being freed. Observe how memory constantly grows when running the following program, until it crashes with an OutOfMemoryError:

import std.stdio, core.memory, core.thread;
class C { 
    byte[] s;
    this() { s = new byte[80 * 1024 * 1024]; }
}
void main(string[] args) {
    for (int i=0, j=0; i < 1000; i++) {
        auto x = new C();
        Thread.sleep(dur!"msecs"(500));
        // delete x.s;
    }
}

The only way to free the byte array is by telling the garbage collector to do so using delete (uncomment the appropriate line). It's an annoyance that you can't perform that deletion inside the destructor, because destructors cannot allocate or delete objects using the GC (it is not reentrant). For similar reasons, while a writeln() for debugging purposes works in a destructor, concatenating strings using ~ will end with a fatal exception. Unfortunately delete is being deprecated. Hopefully it will remain available and usable until a better, more precise GC is ready. If you happen to be an expert in garbage collectors (outside of the Java world), maybe you could help make that happen?
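If delete is not available, the same effect can probably be had through the core.memory module that the program above already imports. The following is only a sketch under that assumption: the release method is a made-up helper, and GC.addrOf is used because GC.free ignores pointers that do not point at the start of a block, which can be the case for the .ptr of a large array.

```d
import core.memory : GC;

class C
{
    byte[] s;

    this() { s = new byte[80 * 1024 * 1024]; }

    // Hypothetical helper: hands the buffer back to the GC right away.
    // It must be called from ordinary code, not from the destructor.
    void release()
    {
        GC.free(GC.addrOf(s.ptr));
        s = null;   // don't keep a dangling slice around
    }
}

void main()
{
    foreach (i; 0 .. 10)
    {
        auto x = new C();
        x.release();
    }
}
```

As with delete, this is manual memory management with all its risks: any other slice still pointing into the freed buffer becomes invalid.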

¹ When you consider Go, which is also marketed as a systems language, it doesn't really provide as fine a replacement for C++ as D does. First of all, the syntax is different: in variable declarations, for example, the type comes last, which in my case results in a mental block. But, more importantly, Go lacks type inheritance, method overloading and pointer arithmetic, which makes the transition from C++ code more difficult.

This article was discussed at reddit on January 7, 2013. Have a look.


Comments

A good overview! One inaccuracy - D currently uses a conservative custom GC written in D, not Boehm. Also there are efforts to make it partially precise.

Thanks for pointing that out.

Yeah, I've read that there is already an almost-precise GC implementation for Visual-D. I'm not using Visual Studio myself, so I'm eagerly waiting until it gets merged with the main branch.

Great read! BTW, when it comes to building reliable safe programs, what do you think of the new Ada 2012? http://www.ada2012.org

I remember when D had comparison charts and I could easily tell how D has all the features while other languages are lacking. Now D has removed comparison charts and also the language shootout guy removed D for no good reason. Ada has some comparison charts though http://www.ada2012.org/comparison.html - how would these relate to D?

You mean http://dlang.org/comparison.html had columns with other languages in the past and now they're gone? They probably got lots of letters like: "what do you mean: no unit testing? We have dozens of 3rd party libraries for unit testing" or "you're wrong, we have CTFE in the beta 2 version of our latest compiler you never heard about", so they've dropped it.

Yes, it's sad that D is gone from the computer language benchmarks game. You can still download all the source files, but by not seeing D listed on the site, one could assume the language is irrelevant.

I don't know Ada, new or old, and I'm quite allergic to the begin and end keywords, but I definitely support ideas that make writing easier, like conditional and case expressions, a string encoding package, or safer (when required): task-safe queues, pre- and postconditions, to name a few things from that chart.
