Exploring Tagged Unions

In this post we are going to take a closer look at tagged unions. What they are, how they work at a low level and what some of the trade-offs are.

A brief look at C unions

Before we look at the “tag” thing I want to briefly talk about how a union works by looking at a very simple C example. The example below defines two types. One ExampleStruct and one ExampleUnion. I annotated the code with the memory layout of the fields to illustrate what a union does.

struct ExampleStruct {
    int a; // <-- base address + 0
    int b; // <-- base address + 4
};

union ExampleUnion {
    int a; // <-- base address + 0
    int b; // <-- base address + 0
};

As you can see the struct will layout its fields in a linear fashion therefore resulting in a total size of 8 bytes. When you use the ExampleStruct you can access fields a and b independently (different locations in memory). The union on the other hand will “put” each field at the same base address, which allows you to refer to the same memory address using different names. Writing to a in this case will also modify b. This also means that the total size of ExampleUnion is only 4 bytes. The size of a union is equal to the size of the largest field inside the union.

Adding some tags

Now that we have a way to refer to the same memory location using different names we need some way to know at runtime what name we should actually use. In the simple example above it does not really matter as both fields are 4 byte integers, but in practice those fields could have different sizes and a completely different meaning. That is where the “tag” comes into play. A tag is just a number identifying the type of data we are talking about. We can check the tag at runtime to decide what data inside the union we want to read/write.

Example implementation in C

OK, enough talk, let’s look at a concrete example. The following illustrates how you might implement processing of user input events using a tagged union in C. The math operations inside the process_event function don’t have any meaning, I just wanted to illustrate the point that some kind of processing is going on and have an easy way do identify the different blocks in the compiled assembly code (more on that later).

enum EventTag {
    EventTag_KeyPress,
    EventTag_MouseClick,
    EventTag_MouseMove,
};

struct Event {
    enum EventTag tag;
    union {
        int key_code; // KeyPress
        struct { int x_pos; int y_pos; }; // MouseClick
        struct { int x_delta; int y_delta; }; // MouseMove
    };
};

int process_event(struct Event *evt) {
    switch (evt->tag) {
    case EventTag_KeyPress:
        return evt->key_code * 1234;
    case EventTag_MouseClick:
        return evt->x_pos + evt->y_pos * 457;
    case EventTag_MouseMove:
        return evt->x_delta + evt->y_delta * 987;
    }
    return 0;
}

The union now got a bit more complicated than our original example. We are using the EventTag enum to identify the different even types and the union contains a data entry for each event type. Remember how unions layout the memory? In this example key_code, x_pos and x_delta are all located at the same address.

The process_event function implements the core “tagged union” logic. It first checks the event tag and then processes the data corresponding to the tag. That’s it. We have now implemented a tagged union in C. Pretty simple, right?

And here an example on how you might use the process_event function.

#include <stdio.h>

int main() {
    struct Event events[] = {
        {.tag = EventTag_KeyPress, {.key_code = 100}},
        {.tag = EventTag_MouseClick, {.x_pos = 10, .y_pos = 20}},
        {.tag = EventTag_MouseMove, {.x_delta = 3, .y_delta = 1}},
    };

    // C arrays are just pointers without a length :(,
    // so we calculate the array count here
    int count = sizeof(events) / sizeof(events[0]);

    for (int i = 0; i < count; i++) {
        int result = process_event(&events[i]);
        printf("Event Result: %d\n", result);
    }

    return 0;
}

Even if you are not very familiar with C I hope its pretty clear what is happening here. We create a static array of events, loop over them and call our process_event function for each of the events.

Memory

I already mentioned that the size of the union is determined by the largest field inside the union. Let’s see what that means in practice for our example. Here is the definition of our tagged union annotated with some memory size infos.

struct Event { // total size: 12 bytes
    enum EventTag tag; // 4 bytes
    union {
        int key_code; // 4 bytes
        struct { int x_pos; int y_pos; }; // 8 bytes
        struct { int x_delta; int y_delta; }; // 8 bytes
    };
};

As we can see, the largest elements in the union are the 2 8-byte structs (2x 4-byte integer). Add that to the 4 byte tag and we get a total size of 12 bytes. You might be saying now, that this is wasting 4 bytes in case of the key press event where we just need the 4 byte key_code. And you would be correct. This is one of the trade-offs of using a union in C. Depending on how big the different data fields are, you might have some wasted/unused memory. This isn’t usually a problem in practice but if it becomes a problem there are different solutions with different trade-offs. You could choose not to use a union and just tightly pack pairs of an EventTag and the corresponding data into a linear block of memory (looses the ability to easily index into the array). Or you could store larger event types into a separate data structure (now iterating over all events becomes more complicated). The thing about trade-offs is, there are always two sides. Using a union with potentially some unused memory gives us something in return.

Performance

I’m not going to go too deep into the performance rabbit hole in this article. If you are interested in this topic check out “Clean” Code, -Horrible Performance by Casey Muratori and see if you can spot the tagged unions ;). Here just some high level points. As shown in the example code we can put a bunch of events into a simple C style array which effectively is just a linear block of memory. This is great for fast iteration due to good use of the CPU Cache. Indexing into an array of events is also super easy and fast. Because events are all the same size we can just use a regular array index like we do in the example above (events[i]).

Maintainability

OK, here we will see some of the weaknesses of C. But first let’s start with the positive. I would argue that once you are familiar with C syntax the tagged union approach to processing a bunch of related data is very easy to read and modify. Adding a new event just means adding a new entry to the EventTag enum, adding the corresponding data to the union and adding a new case to the switch inside the process_event function. We even get some help from the compiler (at least using clang on macOS). Adding a new EventTag without handling the new event in the switch will result in a compiler warning.

Enumeration value 'EventTag_YourNewEventName' not handled in switch

That’s nice, thanks! From a architecture and maintenance perspective this all sounds really good and if our example wasn’t using C, I could stop here. But we are talking about C so of course the compiler will let me do stuff which does not make any sense for our problem. I can for example access evt->y_pos inside the EventTag_KeyPress case which is kind of bad. Best case, the value is 0, worst case, we read a garbage value because the memory was not initialised. We can also create events that don’t even exist by setting the events tag field to a random integer like 500.

Alternative in Rust

I hope you now have a good understanding of what tagged unions are and how they can be implemented in a somewhat low level language like C. These days there are of course many alternatives to C if you still care about low level access to your hardware and performance but also appreciate a few more high level language features.

Let’s look at the same example in Rust, because…, well of course we will ;). Nice syntax for tagged unions is however by no means a rust exclusive. In fact they have been around in functional programming languages forever.

pub enum Event {
    KeyPress(i32),
    MouseClick(i32, i32),
    MouseMove(i32, i32),
}

pub fn process_event(evt: &Event) -> i32 {
    match *evt {
        Event::KeyPress(key_code) => key_code * 1234,
        Event::MouseClick(x_pos, y_pos) => x_pos + y_pos * 457,
        Event::MouseMove(x_delta, y_delta) => x_delta + y_delta * 987,
    }
}

The syntax looks a bit different but I hope it is clear that this code is doing the same as the C example earlier. It’s worth mentioning though that the concept of a tagged union is built-in to the rust enum type. You can just use regular enums only containing integers like in C, but you can also directly associate some data. Which is what we are doing here.

And here is the example on how you might use the function.

fn main() {
    let events = vec![
        Event::KeyPress(100),
        Event::MouseClick(10, 20),
        Event::MouseMove(3, 1),
    ];

    for evt in events.iter() {
        let result = process_event(evt);
        println!("Event Result: {}", result);
    }
}

That is a bit shorter than the C version, but more importantly the compiler is a lot more helpful if we do a mistake. Adding a new event will result in a compiler error until we handle it in the match statement. We can’t access y_pos from within the KeyPress case because we have to explicitly match an event and its fields. We also can’t create an event that does not exist. The best part is that we don’t even have to sacrifice performance for those higher level language constructs. The assembly is pretty much the same (at least for our example). Check out the Compiler Explorer Example for more details.

Closing thoughts

I initially just wanted to write a small post about using tagged unions in TypeScript. I got a bit sidetracked along the way, as might be obvious by the lack of TypeScript in the article. If you are interested in using the same concept in TypeScript, stay tuned for the next post.

Generally speaking I’m a big fan of tagged unions, especially if something like the Rust compiler can provide super helpful error messages. I especially enjoy the fact that I can look at the definition of the tagged union and immediately know all the events that can possibly flow through the system. Compare that to an object oriented approach where each event might be a separate class probably in a separate file.

I’m guessing there is a lot more to say on this topic but that’s it from me for now. Hope you enjoyed the article and maybe learned something.