Offsetof
Parsing structured data into structured data in C.
A common problem in programming is parsing different key-value pairs. These keys and values could easily be read into a map of some sort and each value lookup-ed when needed. But often these key-value pairs follow a schema. They have a known structure to them.
{
"instrument_id": 10,
"name": "ACME",
"long_name": "ACME Corporation"
}
C has a way to represent structured data. Its the common struct.
struct InstrumentInfo { // (1)
i64 instrument_id;
String name,
String long_name,
};
The problem however is how to write the structured JSON data into the structured C data. Without any meaningful type introspection in C it becomes quite hard to map the keys to the fields in the struct. You could have a bunch of if-statements to check but that gets old quite quickly especially when you consider that you probably have multiple different structures to parse.
Enter offsetof(3).
NAME
offsetof - offset of a structure member
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <stddef.h>
size_t offsetof(type, member);
DESCRIPTION
The macro offsetof() returns the offset of the field member from the start
of the structure type.
...
offsetof is great. offsetof(InstrumentInfo, name) can be used to find the the byte offset of the field name in the struct InstrumentInfo. But it does not get us all the way there. offsetof cannot be used on an arbitrary string at runtime. It only calculates the offsets for compile-time constants.
Let’s declare some structs.
struct Schema;
struct Key {
// The name of the key and field.
String key;
// Offset of the field in the struct given by offsetof.
i64 offset;
// A schema for the corresponding value.
Schema *schema;
};
enum Type {
INTEGER, OBJECT, STRING, // ...
};
struct Schema {
Type type;
// An array of keys.
Key *keys;
i64 num_keys;
// The size of the struct representing the object.
i64 size;
};
Each Schema has a list of keys and a size of the struct it represents. Each Key has the name of the key, its offset into the struct, and a separate schema describing its paired value.
Let’s define some schemas.
// Schemas for "primitive" types
Schema i64_schema = { .type = INTEGER };
Schema String_schema = { .type = STRING };
// Schema for InstrumentInfo
Key InstrumentInfo_keys[] = {
make_key3(InstrumentInfo, instrument_id, &i64_schema),
make_key3(InstrumentInfo, name, &String_schema),
make_key3(InstrumentInfo, long_name, &String_schema),
};
make_schema(InstrumentInfo);
In the section above some primitive schemas are defined and a schema InstrumentInfo_schema is defined from the InstrumentInfo struct. An array of keys is defined using the macro make_key3 which uses the stringizing operator # to convert the field name to a string constant and calls offsetof on the given field. The make_schema macro defines InstrumentInfo_schema with the defined keys IntrumentInfo_keys using the token-pasting operator ##.
#define make_key3(Struct, property, json_schema) \
{ \
.key = S(#property), \
.offset = offsetof(Struct, property), \
.schema = json_schema, \
}
#define make_schema(Struct) \
Schema Struct ## _schema = { \
.type = OBJECT, \
.keys = Struct ## _keys, \
.num_keys = countof(Struct ## _keys), \
.size = sizeof(Struct), \
}
But this is quite silly. Why are we defining a Schema for InstrumentInfo when the schema already exist? We wrote the schema when we declared the struct in (1)! We should use that.
A key insight is that InstrumentInfo_keys and InstrumentInfo_schema can and should be automatically generated at compile-time. If the schemas are written into a header file schemas.h then a simple parser (emphasis on simple) can be executed using the header file as input at build time. This parser would extract all of the structs and their fields in the header and output a separate file containing the definitions for _keys and _schema.
Parsing Values
Let’s consider parsing values.
// Read into (compile-time) known struct.
InstrumentInfo info = {};
parse_value(stream, &InstrumentInfo_schema, (char*) &info);
// Read into arbitrary struct given by schema.
char *buffer = (char*) malloc(schema->size);
parse_value(stream, schema, buffer);
parse_value takes an input stream of tokens, a schema, and a buffer as parameters. An underlying assumption is that the data is structured and it is known what is expected, therefore the parse function is given a schema as a parameter. The buffer is a pointer to a location that is at least schema->size large and the parsed values are written to it.
void parse_value(Stream *stream, Schema *schema, char *buffer) {
switch (schema->type) {
case INTEGER: return parse_integer(stream, schema, buffer);
case OBJECT: return parse_object(stream, schema, buffer);
case STRING: return parse_string(stream, schema, buffer);
default: error();
}
}
parse_value simply switches on schema type.
void parse_integer(Stream *stream, Schema *schema, char *buffer) {
// Expect integer value from the input stream.
i64 integer = expect_integer(stream);
i64 *field = (i64*) buffer;
*field = integer;
}
In the case of integers, strings, and other primitives the value is just read from the input stream and written directly into the given buffer. It checks that the read value actually is an integer and that the read data conforms to the expected schema.
void parse_object(Stream *stream, Schema *schema, char *buffer) {
// Loop over the input stream for as long as the object is open
while (!object_ended(stream)) {
// Read key from stream and lookup in schema->keys
Key *key = lookup(expect_key(stream), schema);
if (!key) error();
// This is where the parsed value should be stored.
char *field = &buffer[key->offset];
// Call parse_value with the key's schema and field.
parse_value(stream, key->schema, field);
}
}
When parsing objects it reads the key and parses the next value according to the key’s schema. The storage for the value is inside the given buffer. Then continue with the next key-value pair.
The example can easily be extended to cover nested objects.
struct Price {
String price;
i64 decimals;
};
struct PriceInfo {
Price last;
Price open;
Price close;
i64 tick_timestamp;
};
struct UnderlyingDetails {
InstrumentInfo info;
PriceInfo price_info;
};
// Define the _keys and _schema variables...
Demo
Check out demo with git clone https://mvidell.se/offsetof.git
. Build it with make.
The demo features
- two large JSON files parsed using different schemas.
- more types including arrays.
- both input and output types for on-the-fly conversions, e.g. a floating point number in JSON can be stored as a String in the struct.
- a flag for each key whether the value should be stored directly or as a pointer.
- a simple recursive descent parser for header files that generates the schemas at build time.
- an object equality function according to schema in a function similar to
parse_value. - pdjson, a streaming JSON parser (more of a lexer), by Chris Wellons.
- memory arenas inspired by Casey Muratori’s N+2 programmer and Chris Wellons’s blog.
Possible Extensions
- Use macro magic to define
_keysand_schemasinstead of the parser. - Expected keys, i.e. keys that must be in the read object or its an error.
- Default values for keys and a flag to check whether the key was found in the object.
- Arrays that support more than one type of values.
- Infer the schema automatically from JSON examples.
Conclusion
offsetof can clearly be used for the purpose of parsing structured data into structured data.
Are there any downsides? The structs in the demo can be quite large, some nearing 1 KiB in size. The sizes can be somewhat reduced by storing the data indirectly using pointers but on the other hand, should you? The data has to be stored somewhere and storing it as a map probably uses an even larger memory footprint, with worse locality, and with more expensive lookups compared to structs. Using and chasing too many pointers will convert the struct to a linked list with linked list performance (serial cache misses).
One worry is that there is some hidden footgun or undefined behaviour somewhere but it seems fine. UBSan complained initially about misaligned pointers but that was solved by tweaking the arena memory allocations.
On the other hand we should wonder why it is so hard to read structured data into structured data. We should expect programming languages for the 21st century to be able to solve this much easier using type introspection, reflective programming, and/or metaprogramming.