Revised mjlib serialization design (diagnostics part 2)
As discussed previously, I recently made significant revisions to the serialization format used by the mjbots quad A1, drawing on experience from previous professional domains and on a study of newer external projects like Apache Avro. Here I’ll describe the design of the serialized representation, which is defined more completely at: mjlib/telemetry/README.md
Refresher and definitions
As a brief refresher, this serialization format is intended primarily for recording telemetry from embedded systems, where that telemetry data may be persisted on disk for a long time. Secondarily, it can be used to inspect the results of a live system. The primitive it operates on is a “record”, which is logically a structure of elements that is emitted at some interval over time. Each record is logically split into a “schema” portion and a “data” portion. The schema describes which types of elements are present in the structure, along with their names and relationships. The “data” portion contains the minimum amount of information necessary to communicate one instance of the structure, assuming that the receiver already has a copy of the schema.
Schemas
A schema consists of one “type”. There exist a number of “primitive” types which map directly, or nearly directly, to machine storage. An abbreviated subset, for instance:
- boolean: can be true or false
- float64: is a 64 bit floating point value
- fixeduint: is an unsigned integer of size 1, 2, 4, or 8
- varuint: is an unsigned integer of dynamic encoding length
- string: is a sequence of UTF-8 characters
- bytes: is a sequence of arbitrary bytes
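To make "dynamic encoding length" concrete, here is a minimal sketch of one common variable-length scheme: base-128, least significant group first, with the high bit of each byte marking continuation. This post does not say which scheme varuint actually uses, so treat the byte layout as an assumption; the README is the authority. The function names are hypothetical.

```python
def encode_varuint(value: int) -> bytes:
    """Encode a non-negative integer 7 bits per byte, low bits first.

    Assumed scheme (not confirmed by the post): the high bit of each
    byte is set when more bytes follow.
    """
    if value < 0:
        raise ValueError("a varuint cannot represent negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)  # final byte
            return bytes(out)


def decode_varuint(data: bytes) -> tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("truncated varuint")
```

Under this scheme small values, which dominate things like array sizes and enum values, cost a single byte, while larger values grow one byte per additional 7 bits.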
After that, there are “complex” types, which consist of:
- object: is a list of fields, each with its own type
- enum: is an unsigned integer, along with a mapping from those integers to strings
- array: is a variable length array of some other type
- fixedarray: is a fixed length array of some other type
- map: is a mapping from strings to another type
- union: is an index discriminated union between multiple types
Data
The data associated with each type is a direct mapping for the primitive types. For the “complex” types, the associated data is as follows:
- object: the data from each field, in order
- enum: a single unsigned integer
- array: a size, followed by that many instances of the element type’s data
- fixedarray: the element type’s data, repeated the number of times given in the schema
- map: the keys and values from the map
- union: a single unsigned integer index, followed by the selected type’s data
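The rules for complex types can be sketched as a function that walks a schema and lists the primitive values in the order the data portion would carry them. This is an illustration of the ordering only, not of the byte encoding. The nested-dict schema shape, the "size" and "options" keys, and the name `flatten` are assumptions made for the example; map is left out because the exact framing of its entries is not described here.

```python
def flatten(schema, value):
    """List the primitive values of `value` in data-portion order."""
    if isinstance(schema, str):
        return [value]  # a bare type name is treated as a primitive
    kind = schema["type"]
    if kind == "object":
        out = []
        for field in schema["fields"]:  # fields appear in schema order
            out += flatten(field, value[field["name"]])
        return out
    if kind == "enum":
        return [value]  # just the unsigned integer
    if kind == "array":
        out = [len(value)]  # size first, then each element
        for item in value:
            out += flatten(schema["items"], item)
        return out
    if kind == "fixedarray":
        if len(value) != schema["size"]:
            raise ValueError("fixedarray length must match the schema")
        out = []  # no size: the schema already records it
        for item in value:
            out += flatten(schema["items"], item)
        return out
    if kind == "union":
        index, selected = value  # index picks which type follows
        return [index] + flatten(schema["options"][index], selected)
    return [value]  # any other type is treated as a primitive


schema = {"type": "object", "fields": [
    {"name": "mode", "type": "enum"},
    {"name": "samples", "type": "array", "items": "float64"},
    {"name": "gains", "type": "fixedarray", "size": 2, "items": "float64"},
    {"name": "result", "type": "union", "options": ["boolean", "string"]},
]}
record = {"mode": 2, "samples": [0.5, 1.5, 2.5],
          "gains": [1.0, 2.0], "result": (1, "ok")}
flat = flatten(schema, record)
```

Note that no field names appear in the flattened output, and that only the variable length array carries a size. Everything else needed to interpret the values comes from the schema.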
Encoding
Two encodings are defined for both the schema and the data: a JSON* one and a binary one. The JSON data encoding is what would traditionally be exchanged in JavaScript applications. It is not completely minimal, since field names and the object and list delimiters are present. For example, a simple object type consisting of a boolean, a string, and an array of fixedint might have a data representation in JSON like:
{
  "field1" : true,
  "field2" : "my string data",
  "field3" : [4, 5, 6],
}
The JSON schema encoding contains the entirety of the information from the schema. For the above record it might look like:
{
  "type" : "object",
  "name" : "MyObject",
  "aliases" : ["AnOldName"],
  "fields" : [
    { "name" : "field1", "type" : "boolean" },
    { "name" : "field2", "type" : "string" },
    { "name" : "field3", "type" : "array", "items" : "fixedint32" }
  ],
}
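To show how the two JSON documents relate, here is a small sketch that checks the example data against the example schema. Both are written as Python literals, since Python's standard `json` module rejects the trailing commas used above. The function `matches` is hypothetical, it handles only the types in this example, and reading "fixedint32" as a signed 32 bit integer is an assumption based on the name.

```python
SCHEMA = {
    "type": "object",
    "name": "MyObject",
    "aliases": ["AnOldName"],
    "fields": [
        {"name": "field1", "type": "boolean"},
        {"name": "field2", "type": "string"},
        {"name": "field3", "type": "array", "items": "fixedint32"},
    ],
}
DATA = {"field1": True, "field2": "my string data", "field3": [4, 5, 6]}


def matches(schema, value):
    """Return True if `value` is a valid instance of `schema`."""
    if isinstance(schema, str):
        schema = {"type": schema}  # bare type name, as used by "items"
    kind = schema["type"]
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "string":
        return isinstance(value, str)
    if kind == "fixedint32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and -2**31 <= value < 2**31)
    if kind == "array":
        return (isinstance(value, list)
                and all(matches(schema["items"], v) for v in value))
    if kind == "object":
        names = {f["name"] for f in schema["fields"]}
        return (isinstance(value, dict) and set(value) == names
                and all(matches(f, value[f["name"]])
                        for f in schema["fields"]))
    raise ValueError(f"type not handled in this sketch: {kind}")
```

Calling `matches(SCHEMA, DATA)` accepts the record above, while a record with a missing field or a string inside "field3" is rejected.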
A binary encoding is defined for both the schema and the data as well. The schema encoding is straightforward, if uninteresting, and can be found in the README. For primitive types with direct machine analogs, the data encoding is the little-endian machine representation. The binary representation of an object’s data is merely the concatenation of the data of each of its fields. This makes it possible to construct record definitions that exactly match a useful set of in-memory structures, so that serializing those structures is a no-op.
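The no-op property can be demonstrated with a rough sketch in Python, using `ctypes` to stand in for an in-memory C structure. The record here is hypothetical: a 4 byte fixeduint, a 4 byte fixedint, and a float64, chosen so that natural alignment introduces no padding. Encoding each field as little-endian and concatenating yields exactly the bytes already sitting in memory.

```python
import ctypes
import struct


class Sample(ctypes.LittleEndianStructure):
    # Hypothetical record: fixeduint (size 4), fixedint (size 4), float64.
    # Offsets are 0, 4, and 8, so the layout has no padding bytes.
    _fields_ = [
        ("count", ctypes.c_uint32),
        ("offset", ctypes.c_int32),
        ("value", ctypes.c_double),
    ]


def encode_fields(count, offset, value):
    """Concatenate the little-endian encoding of each field in order."""
    return (struct.pack("<I", count)
            + struct.pack("<i", offset)
            + struct.pack("<d", value))


sample = Sample(7, -2, 1.5)
in_memory = bytes(sample)  # raw copy of the structure's memory
encoded = encode_fields(sample.count, sample.offset, sample.value)
```

Here `in_memory` and `encoded` are the same 16 bytes. A structure whose layout did include padding, or one containing variable length members, would not have this property and would need a real serialization step.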
Next steps
In the next post in this series, I’ll describe the C++ API for serializing and deserializing objects.
*Actually JSON5, which supports comments and final trailing commas among other improvements for human readability.