What is Diamond?
An explanation of Diamond's purpose, importance, and how it integrates with the Property Graph data model.
Diamond is an SDK for producing highly compressed, binary encodings of Labeled Property Graphs (LPGs). By combining schema inference, database-style normalization and custom compression, Diamond routinely shrinks large labeled property graphs to a fraction of their original size—outperforming state-of-the-art graph representations.
When Diamond was applied to PrimeKG (a popular biomedical knowledge graph describing 17,080 diseases with 4,050,249 relationships), the resulting file was up to 29.8x smaller, consuming just 8.9% of the size of the original text format and 3.4% of a pure JSONL dump.
The name Diamond comes from compressing graph(ite)s into a single binary file.
The Property Graph Data Model
A Labeled Property Graph is built from nodes and edges, each of which may carry any number of labels (to classify the entity or relationship) and an arbitrary set of key-value properties. This flexibility has made LPGs the de facto data model for modern graph databases such as Neo4j and ArangoDB, even though each system applies its own variant of the model. For example, Neo4j permits multiple labels on a single node but only one label per edge, while ArangoDB enforces exactly one label on both nodes and edges.
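To make the model concrete, here is a minimal Python sketch of these building blocks. The Node and Edge classes are purely illustrative and not part of Diamond's API; they just show how an entity carries a list of labels and a free-form property map.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    labels: list[str] = field(default_factory=list)              # zero or more labels
    properties: dict[str, object] = field(default_factory=dict)  # arbitrary key-value pairs

@dataclass
class Edge:
    source: int  # id of the source node
    target: int  # id of the target node
    labels: list[str] = field(default_factory=list)
    properties: dict[str, object] = field(default_factory=dict)

# A node may carry several labels (as Neo4j allows) ...
alice = Node(id=101, labels=["Person", "Student"], properties={"name": "Alice", "age": 15})
bob = Node(id=102, labels=["Person"], properties={"name": "Bob"})
# ... while an edge classifies the relationship between two nodes.
likes = Edge(source=101, target=102, labels=["LIKES"], properties={"since": 2015})
```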
To unify these differences, the Property Graph Exchange (PG) Format specification defines a single, text-based syntax for exporting and sharing LPGs across tools and platforms. Diamond consumes any export that conforms to the PG spec—whether it comes from a .pg file, a newline-delimited JSONL dump, a DOT diagram or any other PG-compatible source—and transforms it into a purpose-built binary archive.
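For reference, a tiny graph in the PG text syntax might look like the following. This toy example follows the conventions of the PG format specification: labels are prefixed with a colon, properties are written as key:value pairs, -> denotes a directed edge, and -- an undirected one.

```
# Nodes: id, labels, properties
101 :Person :Student name:Alice age:15
102 :Person name:Bob

# Edges: source, arrow, target, labels, properties
101 -> 102 :LIKES since:2015
101 -- 102 :SAME_SCHOOL
```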
The PG Format is a superset of both the Neo4j and ArangoDB graph models, so you can use Diamond with files exported from either database.
Why Diamond Matters
When your labeled property graph grows to millions of nodes and relationships, verbose text files become increasingly problematic: parsing performance degrades, storage requirements expand dramatically, and downstream analytics become inefficient. Diamond solves these challenges through a sophisticated, fully lossless, multi-stage pipeline:
- Data Ingestion: Read the labeled property graph in its original text-based form (PG, JSONL, DOT, etc.), loading nodes, edges, labels, and properties with a chunking mechanism (a chunked-reading sketch follows this list).
- Schema Inference: Infer a schema, represented as a minimal metagraph containing all node types and the edges that connect them (sketched below).
- Normalization: Replace every repeated field (labels, property keys, literal values) with a compact integer ID. This step mirrors relational database normalization, eliminating redundancy at the source (sketched below). Note that edges are segregated into distinct Parquet tables.
- Binary Encoding: Compress each table, whether it represents nodes, edge lists, or lookup dictionaries, into efficient binary records.
- Bundling: Combine all Parquet tables into a single on-disk tar-compressed file with a .diamond extension (the final sketch below illustrates both this step and the binary encoding).
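To make the ingestion step concrete, here is a minimal chunked-reading sketch for the JSONL case. The function name read_jsonl_chunks and the 10,000-line default are assumptions for illustration, not part of Diamond's API.

```python
import itertools
import json

def read_jsonl_chunks(path, chunk_size=10_000):
    """Yield the file as lists of parsed records, chunk_size lines at a time."""
    with open(path, encoding="utf-8") as f:
        while chunk := list(itertools.islice(f, chunk_size)):
            yield [json.loads(line) for line in chunk]

# Each chunk can be normalized and flushed before the next one is read,
# so the whole graph never has to sit in memory at once.
for records in read_jsonl_chunks("graph.jsonl"):  # hypothetical input file
    ...
```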
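The schema-inference step can be pictured as follows: treat each distinct label combination as a node type, and record which edge labels connect which types. This is a simplified sketch, not Diamond's actual algorithm.

```python
def infer_metagraph(nodes, edges):
    """Derive node types and the edge types that connect them (illustrative only)."""
    # A node's "type" is its sorted label combination.
    node_type = {n["id"]: tuple(sorted(n["labels"])) for n in nodes}
    meta_nodes = set(node_type.values())
    # Each observed (source type, edge labels, target type) triple is a metagraph edge.
    meta_edges = {
        (node_type[e["source"]], tuple(sorted(e["labels"])), node_type[e["target"]])
        for e in edges
    }
    return meta_nodes, meta_edges

nodes = [
    {"id": 101, "labels": ["Person", "Student"]},
    {"id": 102, "labels": ["Person"]},
]
edges = [{"source": 101, "target": 102, "labels": ["LIKES"]}]

print(infer_metagraph(nodes, edges))
# ({('Person',), ('Person', 'Student')},
#  {(('Person', 'Student'), ('LIKES',), ('Person',))})
```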
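Normalization amounts to dictionary encoding: each distinct string is stored once in a lookup table and replaced by an integer ID wherever it recurs. The Dictionary class below is a hypothetical minimal sketch; keeping the reverse lookup table alongside the data is what makes the step lossless.

```python
class Dictionary:
    """Assigns a stable integer ID to each distinct value (illustrative only)."""
    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # First occurrence gets the next free ID; repeats reuse it.
        return self._ids.setdefault(value, len(self._ids))

    def decode_table(self):
        # The reverse lookup stored alongside the data keeps the pipeline lossless.
        return {i: v for v, i in self._ids.items()}

labels = Dictionary()
encoded = [
    (101, [labels.encode(l) for l in ["Person", "Student"]]),
    (102, [labels.encode(l) for l in ["Person"]]),
]
print(encoded)                # [(101, [0, 1]), (102, [0])]
print(labels.decode_table())  # {0: 'Person', 1: 'Student'}
```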
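Finally, the binary encoding and bundling stages can be approximated with off-the-shelf tools: pyarrow writes each normalized table as a compressed Parquet file, and the standard-library tarfile module bundles them into one archive. The file names, table layout, and compression codecs below are assumptions for illustration, not Diamond's actual on-disk format.

```python
import tarfile
import pyarrow as pa
import pyarrow.parquet as pq

# One table per entity kind, with label strings already dictionary-encoded
# to integer IDs as in the normalization sketch above.
node_table = pa.table({"id": [101, 102], "label_ids": [[0, 1], [0]]})
edge_table = pa.table({"source": [101], "target": [102], "label_ids": [[2]]})

# Binary encoding: columnar Parquet files with built-in compression.
pq.write_table(node_table, "nodes.parquet", compression="zstd")
pq.write_table(edge_table, "edges.parquet", compression="zstd")

# Bundling: a single tar-compressed archive with a .diamond extension.
with tarfile.open("graph.diamond", "w:gz") as archive:
    archive.add("nodes.parquet")
    archive.add("edges.parquet")
```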
How to Use This Guide