
A Struct of Arrays Type in C++14

In data-oriented design the efficient layout of data in memory is the primary driver for designing software. Main memory accesses are now so slow compared to L1/2/3 caches that cache misses can become an overriding factor in the speed of any system that deals with large arrays of items.

I was reading the excellent articles by BitSquid and was motivated to change my own component design in my stardazed engine. Now, unlike some people in game dev, I'm not averse to C++ and/or templates. I do use them sparingly and not to the heights found in some Boost libraries. I've got another post brewing about this, but I digress.

When I did some prototype test conversions of my existing types, I noticed a usage pattern that kept recurring. It is related to one of the main tenets of data-oriented design: instead of an Array of Structs, you use a Struct of Arrays. For example, given a struct like:

struct Transform {
    Vec3 position;
    Quat rotation;
    Vec3 scale;
    Mat4 modelMatrix;
};

If you look at the data layout of an array of Transforms in memory, it looks like this:

[position] [rotation] [scale] [modelMatrix] [position] [rotation] [scale] [modelMatrix] ...

and when you iterate over the contents of the array while only looking at the matrices, the CPU will still pull the entire struct into the cache; the other fields might as well not be there. Because cache space is wasted on data that is not needed at the moment, you will likely incur a cache miss, and thus a delay, on every iteration.

Conversely, in a Struct of Arrays setup the layout is as follows:

struct Transform {
    Vec3* positions;
    Quat* rotations;
    Vec3* scales;
    Mat4* modelMatrices;
};

Each field is separated out into its own contiguous array. Where things get a bit tricky is in the requirement to have all the arrays allocated together in a single allocation. The principle is not hard: you allocate a single buffer that can hold N items of each array, then calculate the base pointers for each array and from there on just use those. E.g.:

size_t numItems = 16384;
size_t combinedSize = sizeof(Vec3) + sizeof(Quat) + sizeof(Vec3) + sizeof(Mat4);
void* instanceData = allocator.alloc(combinedSize * numItems);

auto positionBase = static_cast<Vec3*>(instanceData);
auto rotationBase = reinterpret_cast<Quat*>(positionBase + numItems);
auto scaleBase = reinterpret_cast<Vec3*>(rotationBase + numItems);
auto modelMatrixBase = reinterpret_cast<Mat4*>(scaleBase + numItems);

// ... read/write data to offsets from base pointers
// somewhere down the line implement resizing the block with multiple memcpys, etc.

Additionally, with dynamic data blocks you will at some point run out of space and need to expand all the arrays together. The setup and resizing of this data is grunt work, and when done manually each time you implement a different component it is error prone.

Enter the inventively named MultiArrayBuffer, a variadic template type that allocates and resizes the buffer and gives you properly typed base pointers on demand. Like I said, going overboard with template stuff is not what I usually do, but I value the service MAB gives me in the code that uses it over the chicanery I had to perform to make it work, all of which is contained in the implementation of a very low-level helper type. This is how the above setup code looks using MAB:

MultiArrayBuffer<Vec3, Quat, Vec3, Mat4> instanceData{ allocator, 16384 };

auto positionBase = instanceData.elementsBasePtr<0>();
auto rotationBase = instanceData.elementsBasePtr<1>();
auto scaleBase = instanceData.elementsBasePtr<2>();
auto modelMatrixBase = instanceData.elementsBasePtr<3>();

// ... read/write data to offsets from base pointers
// somewhere down the line:
instanceData.resize(32768);

The elementsBasePtr<N>() method returns a typed pointer to the Nth array in the buffer. It is also cheap enough that you don't need to cache the base pointers, which can simplify usage code.
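
For illustration, here is a condensed sketch of how such a method can be implemented. This is my sketch of the idea, not the exact stardazed internals, and the member names data_ and capacity_ are shorthand:

#include <cstddef>
#include <cstdint>
#include <tuple>

// Summed sizeof of the element types before index N; the byte offset of
// array N is then simply capacity * that sum.
template <size_t N, typename T, typename... Ts>
struct ElementOffset {
    static constexpr size_t sum = sizeof(T) + ElementOffset<N - 1, Ts...>::sum;
};

template <typename T, typename... Ts>
struct ElementOffset<0, T, Ts...> {
    static constexpr size_t sum = 0;
};

template <typename... Ts>
class MultiArrayBuffer {
    void* data_ = nullptr;   // the single combined allocation
    size_t capacity_ = 0;    // number of elements per array

public:
    template <size_t N>
    auto elementsBasePtr() const {
        using T = std::tuple_element_t<N, std::tuple<Ts...>>;
        auto base = static_cast<uint8_t*>(data_);
        return reinterpret_cast<T*>(base + capacity_ * ElementOffset<N, Ts...>::sum);
    }
};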

Implementation

MAB being a variadic template means having to deal with the somewhat awkward recursion of types using templated helpers[1]. I will highlight a few things in this article, but read the full implementation if you're interested.

The first thing I had to calculate was the total byte size of each parameterised type, to mirror the sequence of sizeof(T)s in the manual version. Here's how I did it[2]:

template <typename T>
constexpr size_t elementSumSize(T*) {
    return sizeof(T);
}

template <typename T, typename... Ts>
constexpr size_t elementSumSize(T*, Ts*... ts) {
    return sizeof(T) + elementSumSize(ts...);
}

// ... in the class definition:

template <typename... Ts>
class MAB {
    static constexpr size_t sumSize = elementSumSize(static_cast<Ts*>(nullptr)...);
};

Quite straightforward, with the notable exception of the annoying dummy pointer parameters I had to pass along to make things compile. Omitting the parameters is not possible with this method: called for a single type, the two overloads become ambiguous, as there are no arguments left for overload resolution to order them by. I also tried this with structs but hit the same limitation; I use functions as they are less verbose.
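
As an aside, C++14's relaxed constexpr rules permit an alternative that avoids both the overload pair and the dummy parameters: expand the pack into a local array and sum it in a loop. This is not what stardazed uses, just a sketch of the option:

// No recursion, no dummy parameters, no ambiguity.
// (Requires a non-empty pack, which MAB implies anyway.)
template <typename... Ts>
constexpr size_t elementSumSizeAlt() {
    size_t sizes[] = { sizeof(Ts)... };
    size_t total = 0;
    for (auto s : sizes)
        total += s;
    return total;
}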

The main challenge of this type came when I needed to implement resizing the buffer. To do so, I have to iterate over each base pointer and copy data to a new buffer, which combines the compile-time templated functions with run-time actions in a not entirely intuitive way. Here is the code I used to handle it:

// -- Call a callback with base pointer and element size info for each T

using ItFn = std::function<void(void*, size_t)>;

template <typename T>
void eachArrayBasePtr(void* basePtr, size_t /*capacity*/, const ItFn& fn, T*) {
    fn(basePtr, sizeof(T));
}

template <typename T, typename... Ts>
void eachArrayBasePtr(void* basePtr, size_t capacity, const ItFn& fn, T*, Ts*... ts) {
    fn(basePtr, sizeof(T));

    auto bytePtr = static_cast<uint8_t*>(basePtr);
    eachArrayBasePtr(bytePtr + (capacity * sizeof(T)), capacity, fn, ts...);
}

// ... at the usage site:

newData = /* newly allocated buffer */;
newCapacity = /* number of items in the new arrays */;
data_ = /* current buffer */;
capacity_ = /* number of items in the current arrays */;
count_ = /* count of used elements in the arrays */;

eachArrayBasePtr(data_, capacity_,
    [newDataPtr = (uint8_t*)newData, newCapacity, usedCount = count_]
    (void* basePtr, size_t elementSizeBytes) mutable {
        memcpy(newDataPtr, basePtr, usedCount * elementSizeBytes);
        newDataPtr += elementSizeBytes * newCapacity;
    },
    static_cast<Ts*>(nullptr)...);

The eachArrayBasePtr() helper essentially recurses over the Ts and bumps a base pointer to the next array using sizeof(T) and the capacity that the caller has to supply. At each step, the callback is invoked with the current array base pointer and the size of the currently pointed-at element. Type information is eschewed in this setup as it was not necessary and would have complicated things quite a bit. What that does mean is that MultiArrayBuffer can only be safely used with trivially copyable types. In fact, ideally you would use (almost) completely trivial types.
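
If you want the compiler to enforce that restriction, a small all-of trait plus a static_assert in the class will do. This is my own addition here, not a quote from the library:

#include <type_traits>

// All-of check over the pack; a C++17 fold expression would shrink this to one line.
template <typename T, typename... Ts>
struct AllTriviallyCopyable {
    static constexpr bool value =
        std::is_trivially_copyable<T>::value && AllTriviallyCopyable<Ts...>::value;
};

template <typename T>
struct AllTriviallyCopyable<T> {
    static constexpr bool value = std::is_trivially_copyable<T>::value;
};

// ... inside MultiArrayBuffer<Ts...>:
static_assert(AllTriviallyCopyable<Ts...>::value,
    "MultiArrayBuffer only supports trivially copyable types");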

Other containers in my library properly accommodate non-trivial types, but given the intended usage of this container it is reasonable not to support them. If you have a large array of tens or hundreds of thousands of elements, it is simply not wise to use types with user-specified destructors. POD types are the way to go here; put the logic in the manager class maintaining the instance data instead. It also follows the philosophical idea of SoA: these are trivial types that are initialized and processed en masse by other functions. Individual constructors and destructors at that scale would annihilate the efficiency gains of this data layout.

If you have comments or see horrible mistakes or misconceptions, drop me a line on Twitter.

  1. Haters, I understand why you don't like this, I really do. I'm just not that bothered by it. Each tool has its use.
  2. In the real type, I use 32-bit sizes and some other helper functions, but the essence is the same.

Memoize This

I was watching the WWDC 2014 videos and in the Advanced Swift session they illustrated its generic programming power by implementing a memoize function that can take any (A) -> B function and then memoize it, even if that function is recursive.

I was inspired to try my hand at converting their memoize function to C++14 to see how close I could get. The Swift version is quite elegant in that the syntax for creating a memoized function is, as they say, very cruft-free.

In their example they use factorial, but I'll use the fibonacci function here instead. First, the Swift version:

func memoize<T: Hashable, U>(body: ((T)->U, T) -> U) -> (T)->U {
    var memo = Dictionary<T, U>()
    var result: ((T)->U)!
    result = { x in
        if let q = memo[x] { return q }
        let r = body(result, x)
        memo[x] = r
        return r
    }
    return result
}

let fib = memoize {
    fib, n in
    return n < 2 ? Double(n) : fib(n - 2) + fib(n - 1)
}

The memoize function is taken straight from the presentation. The code is by Dave Abrahams; watch the whole video, it's excellent[1].

The only weird bit here is that he had to declare the result variable before he could assign the closure to it; otherwise it's pretty clean. I like the if let construct for testing and unwrapping an optional value a lot; it's a concise and safe idiom.

As you can see, the fib closure can reference itself in its own parameters to allow the recursive calls to also use the memoized version of the function instead of the original one. Additionally, you do not have to explicitly specify the types used by the function anywhere. On the user end it's quite clean.

Without further ado, here's my current go at the C++14 version:

template <typename T, typename U>
auto memoize(std::function<U(T)>&& f) {
    return [memo = std::unordered_map<T, U>(), f](const T& t) mutable {
        auto have = memo.find(t);
        if (have != memo.end())
            return have->second;
        U val = f(t);
        memo[t] = val;
        return val;
    };
}

std::function<double(int)> fib = memoize<int, double>([&fib](int n) {
    return n < 2 ? double(n) : fib(n - 2) + fib(n - 1);
});

Implementation Notes

The Swift version takes the closure itself as an argument whereas in C++14 the lambda is captured as a std::function to refer to itself. I've read some articles about lambdas referring to themselves and it's a major pain so I decided not to spend a lot of time "fixing" this. std::function is sometimes avoided because of the feared extra overhead over a lambda, but the compiler (clang 3.4) completely eliminates this calling overhead in optimized builds in the tests I've done so far.

Consequently, I chose to make the parameter to memoize a std::function<U(T)> instead of having another type parameter F and taking an F&&. The optimizer completely eliminates any overhead, and until we have concepts this allows the compiler to check the parameter and give better error messages; it is also clearer in general.

The lambda is mutable because the memo unordered_map can't be const, of course. I also make use of the extended capture syntax for lambdas to directly initialize the memo member in the lambda.

The fib result has to be typed explicitly because I'm capturing it in its own definition. The memoize function returns the generated function as a lambda, but here it is captured in a std::function.

C++ can't automatically deduce T and U from the invocation as those two types are only exposed in memoize inside the std::function<U(T)> parameter. Perhaps it is possible to extract these in some clever way but I haven't tried that hard yet.
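
One direction that might work is pattern-matching on the closure's call operator. Here is a rough, untested sketch of the idea; CallableTraits and memoizeAny are hypothetical names, not part of the code above:

#include <functional>
#include <type_traits>
#include <utility>

// Peel the argument and return type off a non-generic lambda's operator().
template <typename F>
struct CallableTraits : CallableTraits<decltype(&F::operator())> {};

template <typename C, typename U, typename T>
struct CallableTraits<U (C::*)(T) const> {
    using Arg = T;
    using Ret = U;
};

template <typename F>
auto memoizeAny(F&& f) {
    using Traits = CallableTraits<std::decay_t<F>>;
    using T = std::decay_t<typename Traits::Arg>;
    using U = typename Traits::Ret;
    return memoize<T, U>(std::function<U(T)>(std::forward<F>(f)));
}

This would only cover non-generic, non-mutable lambdas, and the recursive fib still has to be declared as a std::function up front, so the gain would be limited.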

The end result is that the C++14 version of the fib definition suffers not only from having to declare the in and out types explicitly, but from having to do so twice. But…

Speed

In the presentation he also uses fibonacci as an example of a function to memoize. The constant φ (the Golden Ratio) can be approximated as follows:

let φ = fib(45) / fib(44)

The running time to calculate φ is mentioned to be 11 seconds for the naive recursive version and 0.1 seconds for the memoized version. That by itself is pretty good, a ~100x speedup. I timed my C++14 version with the following function:

template <typename F>
void timeTest(F&& f) {
    namespace k = std::chrono;
    auto t0 = k::high_resolution_clock::now();
    f();
    auto t1 = k::high_resolution_clock::now();
    std::cout << "time: " << k::duration_cast<k::microseconds>(t1 - t0).count() << "µs\n";
}

I then did a quick run of both methods and took an average of a couple of runs. Like so:

double fibslow(int n) {
    return n < 2 ? double(n) : fibslow(n - 2) + fibslow(n - 1);
}

...

timeTest([] {
    std::cout << "φ = " << fibslow(45) / fibslow(44) << "\n";
});

And, using the fib lambda above:

timeTest([&fib] {
    std::cout << "φ = " << fib(45) / fib(44) << "\n";
});

Note that these time tests were far from exhaustive or scientific, but given the low precision of the times reported in the slide, this would suffice for now. All tests were run with -O3 -fno-exceptions settings.

My home machine took about 13 seconds to run the non-memoized test, but the memoized version completed in ~55µs — and that's micro, not milliseconds — or a ~235,000x speedup. Now, when I see low timings like this I get suspicious that the compiler just eliminated the whole function, but using the excellent Hopper Disassembler I confirmed that it was really still running. It does make sense, as the memoized version really is not a lot more than a couple of dictionary inserts and lookups.

So, though more verbose, the C++14 memoized version is on the order of 1500–2000x faster than Swift's version. FACT[2]

Also, more hilariously, when I tried the Swift version in an Xcode 6 beta playground, just the presence of the let fib = ... statement made it very unhappy. Beta IDE is beta.

Adding Constraints

I think that with more finagling, and probably by uglifying the memoize function, the call-site code can be simplified, but I'm actually quite content with this version. With C++11 and now 14, the language has gotten more terse. When the Concepts TS is ready, we can also constrain the T typename like so:

template <Hashable T, typename U>
auto memoize(std::function<U(T)>&& f);

Or something equivalent; I'm assuming Hashable will be a standard constraint.

I'm also wondering if a concept like Callable could exist such that the argument type and return type are retrievable from the concept itself so that the explicit passing of the type parameters becomes unnecessary. Right now, I lack the knowledge to really investigate this.

Source

I uploaded a simple source file with all the code as a gist, so if I've piqued your interest I encourage you to play around with it.

  1. In fact, watch all 3 Swift videos to get a good feel for the whole language from the basics to generic functional features.
  2. Not actually a fact. The level of rigorousness and overall scienceworthyness of this article is low to very low. But no one reads footnotes anyway. Yay C++!!

Writing a Modern C++ JSON Reader

Does the world need another JSON parser? Perhaps not, but I made one anyway.

The nice thing about a JSON engine, especially one written in a non-scripting language, is that it presents a couple of nice programming problems on a small enough scale that you can experiment with the implementation quickly.

For my implementation, named krystal, I chose a handful of focus points, each of which I will elaborate on in coming posts.

Motivation

I was looking at writing a simple static blog generator and first made a version that is serverless, but which generates the site largely on the client in JavaScript. This is the site you're looking at now.

Moving things to a C++ generator would require me to work with JSON data that stores e.g. site generation settings and post metadata, so I needed a C++ JSON DOM parser. It wasn't long before I found the json_benchmark project, which compares a couple of the popular JSON engines on parsing speed.

rapidjson is by far the fastest of them all, and to get that speed it pulls no punches: it creates objects in place inside the JSON text, all objects are custom and use a memory pool allocator, it has SSE3- and SSE4-optimized white-space skipping functions, etc.

Now, I like performance; I wrote an NES emulator in PowerPC assembly. I am down with low-level stuff. Right now though, I'm trying to learn me some C++.

I came back to C++ recently, and for now I like to keep things as "pure" as possible so as not to let old coding styles and habits take hold in my new projects. There's a lot of "C+" code out there: C and C++ concepts mixed together with a ton of MACROS thrown in for good measure. That's not for me, so my goal became clear:

To make a JSON parser that gets as close as possible to rapidjson speed-wise while only using standard library storage and string classes. I required C++11, to use things like std::unordered_map, lambdas, move semantics and other new goodies.

The Result

I will start at the end and show how you might use krystal. This will set the stage and hopefully make things clearer later on.

auto file = std::ifstream{"game_data.json"};
auto doc = krystal::parse_stream(file);

if (! doc.is_null()) {
    auto delay = doc["levels"][0]["zombie spawn delay"].number();
    auto name = doc["levels"][0]["level name"].string();

    for (auto kv : doc["levels"]) {
        auto level_index = kv.first.number_as<int>();
        auto& level_data = kv.second;
        ...
    }
}

parse_stream returns a document, which is the root of the tree of values of the JSON data. Each value is a variant type. JSON supports 7 kinds of values: null, true, false, number, string, array and object. The first 5 are pure values and the last 2 are containers. Arrays map ints to values; objects map strings to values.

krystal allows direct indexing into arrays and objects. It also allows iteration over the contents of arrays and objects. Because krystal exposes begin and end methods for values, you can use C++11 range-based for loops. For consistency, each iteration returns a std::pair<value,value&>. The first value is the key (a number or a string) and the second is a reference to the actual value.
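
Iterating an object looks exactly like iterating an array, except the keys come through as strings. A hypothetical snippet (the "settings" key is made up for illustration):

for (auto kv : doc["settings"]) {
    if (kv.first.is_string() && kv.second.is_number())
        std::cout << kv.first.string() << " = " << kv.second.number() << "\n";
}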

Each value has value kind[1] tests and data extraction methods. Calling the wrong data method will throw an exception.

// variant kind tests
value_kind type() const;
bool is_a(const value_kind type) const;
bool is_null() const;
bool is_false() const;
bool is_true() const;
bool is_bool() const;
bool is_number() const;
bool is_string() const;
bool is_array() const;
bool is_object() const;
bool is_container() const { return is_object() || is_array(); }

// data extractors
bool boolean();
double number();
template <typename T> T number_as() const;
std::string string();

Finally, there are methods that deal with containers. Again, calling container or array/object specific methods on values of a different kind will yield little but exceptions.

size_t size() const; // number of items in container
bool contains(const std::string& key) const; // object key check

const value& operator[](const std::string& key) const; // objects
const value& operator[](const size_t index) const; // arrays
iterator begin() const; // all iterators are const_iterators
iterator end() const;

By only having a std::string key parameter and no const char* alternative, I avoid the gotcha present in rapidjson where the zeroth array item has to be indexed like val[size_t{0}] to avoid ambiguities with NULL. Type Win!

Please have a look at the project on github and, if you use Clang, try it out for yourself.

  1. I first named it value_type but changed this after I moved to templated code for reasons you may understand.

Random

At the local casino the roulette tables have a big sign next to them showing the last 20 winning numbers, colour-coded black and red. It entices people to join, because with all that knowledge of past wins, your odds must be so much better! "8 black numbers in a row… why not try your hand at red?"

In reality, of course, those numbers have no effect whatsoever on your chance to win — each spin is independent — but I am not xkcd and will not try to prove that here. It is a fact, though, that if that sign gave people any significant advantage, it would not be there. The casino is not your friend.

Tricks like these play on people's notion of random numbers. True randomness is not what most people consider random at all. If I were to pick a random number between 1 and 1000000 inclusive, most people would give me a funny look if I picked 1 rather than something like 838417.
My “flawed” method of picking the number is the point: people expect “random numbers” to conform to how a human mind would come up with an arbitrary number, which must be “interesting” and have multiple digits, preferably all unique.

1 -> not random
123321 -> kinda random
486137 -> random
3871264059 -> really random
1234567890 -> not random

In my first real job I worked at a company making made-to-order screen savers for Windows. Many of these were made up of a series of animated sequences, played continuously in random order. It wasn't long before the comments came in that the scene selection wasn't really random. A scene had played twice in a row; clearly something was wrong.

After some sniggering about this honest but just plain wrong statement, we added the following function[1]:

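// rangeRandom(min, max) is assumed to return a uniform random int in [min, max]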
int customerRandom(int min, int max) {
    static int lastVal = 0;
    int val = lastVal;

    while (val == lastVal)
        val = rangeRandom(min, max);
    return val;
}

It was a minimal solution to the problem — a scene could still occur twice in a three-scene time span — but customers all agreed that this was random enough.

A better solution is to use a series of permutations of the possible items. This just so happens to be the algorithm newer Tetris games use, as mandated by The Tetris Company, a.k.a. Henk Rogers.

The original Tetris games used simple pseudo-RNGs[2] that did little, if anything, to make things easier for the player. In Ecstasy of Order they call a long period of time without a long bar a "drought", which is often fatal for less experienced players[3]. The simple RNG is the cause; with it, droughts are inevitable.

So Henk's team sat down one day and came up with a better way to dispense Tetris blocks, which they creatively titled the “Random Generator”. It conceptually works like this:

A. There are 7 distinct pieces in Tetris, number them 1 through 7 and put one of each in a metaphorical bag.

1 2 3 4 5 6 7

B. One by one, pick a random number out of the bag and add it to a list that we'll call the permuted list.

Bag:
1 2 3 4 5 6 7 -> 1 2 3 4 6 7 -> 1 3 4 6 7 -> 1 4 6 7 -> 1 6 7 -> 1 7 -> 7 ->

Permuted list:
-> 5 -> 5 2 -> 5 2 3 -> 5 2 3 4 -> 5 2 3 4 6 -> 5 2 3 4 6 1 -> 5 2 3 4 6 1 7

In the diagram, the arrows -> indicate a number being picked out of the bag and appended to the permuted list[4].

C. Whenever the game needs a new piece, pop the first number off the permuted list and return that.

5 2 3 4 6 1 7 -> 2 3 4 6 1 7 -> 3 4 6 1 7 -> 4 6 1 7 -> 6 1 7 -> 1 7 -> 7 ->

D. If the permuted list is empty, go back to A and the sequence starts anew with another permutation of the 7 pieces.

This algorithm is simple, but it guarantees equal availability of all the pieces and a maximum drought of 12 pieces. This change in how Tetris worked was of course criticised as making the game easier for n00bs.
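
In code, such a bag randomizer comes down to very little. A minimal sketch (my illustration here, not code from any shipping Tetris implementation):

#include <algorithm>
#include <array>
#include <cstddef>
#include <random>

class BagRandomizer {
    std::array<int, 7> bag_{{ 1, 2, 3, 4, 5, 6, 7 }};
    std::size_t next_ = 7; // start past the end so the first call refills the bag
    std::mt19937 rng_{ std::random_device{}() };

public:
    int nextPiece() {
        if (next_ == bag_.size()) {
            // the bag is empty: generate the next permutation of the 7 pieces
            std::shuffle(bag_.begin(), bag_.end(), rng_);
            next_ = 0;
        }
        return bag_[next_++];
    }
};

Every 7 draws form one permutation, so each piece appears exactly once per bag and the 12-piece drought bound follows directly.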

In early builds of my Tetris clone Quadrae I used a pure RNG for piece selection and I sometimes felt it was just unfair. “4 squares in a row? You've got to be kidding me!” What was wrong with this thing?

I quickly added the Henk-approved piece selection method and the game immediately felt more balanced and, dare I say, reasonable. Though the purely random piece selection certainly elicited more extreme reactions from me, this was just plain better. Is it easier? Yes, but Tetris has enough stress-inducing qualities already; getting a few guarantees makes it less extreme, not any less of a game.

Numerical sequences in games need to be neither too obvious nor too random. Obvious repetition or "duplicates" will distract the player and can make things boring. When the player's life directly depends on the next random number, though, you should add grouping or result bias to your RNG so that players will not rage-quit and trash your app.

Leave numerical honesty to Hard Mode.

  1. The real version was in Pascal because we used Delphi, but bollocks to Pascal.
  2. RNG is Random Number Generator. It's pseudo because pure software RNGs are incapable of generating true random number sequences.
  3. The Tetris championships still use the NES version of Tetris, which has a simple RNG and therefore, droughts.
  4. When looking at the permuted list, did/do you feel a sense that I could have come up with a more random permutation? I rest my case.