A high-performance data serialization format optimized for AI/ML applications
Every API call to GPT-5.2 costs $1.75 per million input tokens. Every context window has limits. Every byte transmitted adds latency.
HEDL (Hierarchical Entity Data Language) solves the fundamental tradeoff between token efficiency and data comprehension. While CSV is compact but loses structure, and JSON is expressive but verbose, HEDL delivers both: 62.2% LLM comprehension (nearly matching JSON's 65%) while using half the tokens.
The result? When LLMs process HEDL, they get 93% more correct answers per token than JSON. For high-volume AI applications, this isn't just an optimization—it's the difference between viable and prohibitively expensive.
HEDL combines CSV-style tabular efficiency with hierarchical structure, schema validation, and first-class support for references and relationships. It's what you'd design if you started from "how do LLMs actually parse data?" instead of "how did we do this in 1999?"
The Efficiency Story: Tested across GPT-5.1, Mistral Large 3, and DeepSeek v3.2 with 65 questions about structured data in different formats, HEDL delivers 23.89 correct answers per 1,000 tokens. JSON? 12.36. That's 93% more value per token. At scale, this compounds into dramatic cost savings.
The Accuracy Story: HEDL achieves 62.2% comprehension accuracy—only 2.7 percentage points behind JSON's 65%. But here's the key: HEDL does this with half the tokens (2,605 vs 5,253). It's not about choosing between accuracy and efficiency anymore.
The Developer Story: Schema validation catches errors before they reach production. LSP integration means autocomplete, validation, and hover docs in your editor. Type-safe references prevent broken relationships. It's the tooling you'd expect from a modern format, not a verbose interchange format from the '90s.
The Ecosystem Story: Parse, validate, lint, canonicalize, and convert to JSON, YAML, XML, CSV, Parquet, Neo4j Cypher, and TOON. Streaming for multi-GB files. FFI bindings for C/C++/Python. WASM for browsers. MCP server for AI agents. Built for real systems, not toy examples.
Install: Add HEDL to your project in 30 seconds:
```toml
[dependencies]
hedl = "1.2.0"

# Or with all format converters
hedl = { version = "1.2.0", features = ["all-formats"] }
```

Use: Four core operations get you 90% of the way:
```rust
use hedl::{parse, to_json, canonicalize, validate};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let hedl_text = r#"
%VERSION: 1.0
%STRUCT: Product: [id, name, price, category]
---
products: @Product
| laptop, ThinkPad X1, 1299.99, electronics
| mouse, Wireless Mouse, 29.99, accessories
| keyboard, Mechanical Keyboard, 149.99, accessories
store_name: Tech Depot
location: Amsterdam
"#;

    // Parse to AST (~1.5µs per record)
    let doc = parse(hedl_text)?;

    // Validate schema (0.5% overhead)
    validate(hedl_text)?;

    // Convert to JSON for existing APIs
    let json = to_json(&doc)?;

    // Canonicalize for version control
    let canonical = canonicalize(&doc)?;

    Ok(())
}
```

That's it. Parse, validate, convert, format. Everything else builds on these primitives.
Here's the same data in HEDL and JSON. Notice how HEDL uses structured types and table syntax to eliminate JSON's repetitive key names:
HEDL:

```hedl
%VERSION: 1.0
%STRUCT: User: [id, name, email, age]
%STRUCT: Post: [id, author, title, tags]
%STRUCT: Event: [id, date, type, location]
---
# Users with type-scoped IDs
users: @User
| alice, Alice Smith, alice@example.com, 28
| bob, Bob Johnson, bob@example.com, 35

# Posts with references to users
posts: @Post
| p1, @User:alice, First Post, [tech, rust]
| p2, @User:bob, Another Post, [programming, web]

# Ditto operator for repeated values
events: @Event
| e1, 2024-01-15, conference, Berlin
| e2, ^, workshop, ^
| e3, 2024-01-16, meetup, ^
```

JSON:

```json
{
  "users": [
    {"id": "alice", "name": "Alice Smith", "email": "alice@example.com", "age": 28},
    {"id": "bob", "name": "Bob Johnson", "email": "bob@example.com", "age": 35}
  ],
  "posts": [
    {"id": "p1", "author": {"@ref": "User:alice"}, "title": "First Post", "tags": ["tech", "rust"]},
    {"id": "p2", "author": {"@ref": "User:bob"}, "title": "Another Post", "tags": ["programming", "web"]}
  ],
  "events": [
    {"id": "e1", "date": "2024-01-15", "type": "conference", "location": "Berlin"},
    {"id": "e2", "date": "2024-01-15", "type": "workshop", "location": "Berlin"},
    {"id": "e3", "date": "2024-01-16", "type": "meetup", "location": "Berlin"}
  ]
}
```

Token Savings: 373 tokens (HEDL) vs 557 tokens (JSON) = 33% reduction for this example
Notice what's eliminated: In JSON, you repeat `"id":`, `"name":`, `"email":` for every user. HEDL declares the structure once with `%STRUCT`, then each row is just values. The ditto operator (`^`) removes repetition in the events table. References use clean `@Type:id` syntax instead of verbose `{"@ref": "..."}` objects.
Across real datasets, HEDL saves 46.7% tokens on average vs JSON (see Performance section)
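To see those mechanics concretely, you can parse a ditto-heavy snippet and convert it; a minimal sketch below reuses the `parse` and `to_json` calls from the quick-start (the exact JSON shape may differ from what the crate emits):

```rust
use hedl::{parse, to_json};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ditto operators (^) are resolved during conversion, so JSON
    // consumers downstream never see HEDL-specific syntax.
    let hedl_text = r#"
%VERSION: 1.0
%STRUCT: Event: [id, date, type, location]
---
events: @Event
| e1, 2024-01-15, conference, Berlin
| e2, ^, workshop, ^
"#;

    let doc = parse(hedl_text)?;
    // e2 inherits "2024-01-15" and "Berlin" from the row above.
    println!("{}", to_json(&doc)?);
    Ok(())
}
```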
HEDL isn't a walled garden. Your data probably exists in multiple formats already. Your systems speak different protocols. You need a format that converts seamlessly without losing information.
| Format | Read | Write | Streaming | When You Need It |
|---|---|---|---|---|
| JSON | ✅ | ✅ | ✅ | REST APIs, web services, JavaScript frontends |
| YAML | ✅ | ✅ | ❌ | Kubernetes configs, CI/CD pipelines, human-edited files |
| XML | ✅ | ✅ | ✅ | SOAP services, enterprise systems, regulatory formats |
| CSV | ✅ | ✅ | ❌ | Excel exports, data analysis, simple tabular data |
| Parquet | ✅ | ✅ | ❌ | Data lakes, analytics pipelines, columnar storage |
| Neo4j Cypher | ✅ | ✅ | ✅ | Loading graph databases, relationship-heavy data |
| TOON | ✅ | ✅ | ❌ | Maximum token efficiency for LLM contexts |
Convert in both directions. Preserve semantics. Stream when you can, batch when you need to. The format adapts to your workflow.
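In code, round-tripping might look like the sketch below. Only `parse` and `to_json` appear earlier in this README; `hedl_yaml::to_yaml` and `hedl_json::from_json` are assumed names based on the crate list, so check the API docs for the real signatures:

```rust
use hedl::{parse, to_json};
// Assumed converter entry points; actual function names may differ.
use hedl_json::from_json;
use hedl_yaml::to_yaml;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = parse("%VERSION: 1.0\n---\nkey: value\n")?;

    // Fan out to whatever your downstream systems speak.
    let json = to_json(&doc)?;
    let yaml = to_yaml(&doc)?;
    println!("json: {json}\nyaml: {yaml}");

    // Pull external JSON back into HEDL without losing structure.
    let round_tripped = from_json(&json)?;
    let _ = round_tripped;
    Ok(())
}
```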
Install once, use everywhere:
```bash
cargo install hedl-cli
```

Validation in CI/CD: Catch schema errors before they reach production. `hedl validate` returns non-zero exit codes for invalid documents—perfect for pre-commit hooks and CI pipelines.

```bash
hedl validate config/*.hedl && echo "✓ All configs valid"
```

Format Conversion at Scale: Converting 1,000 files from HEDL to JSON? Batch operations maintain 98.6% efficiency:

```bash
hedl to-json data/*.hedl -o json/
```

Deterministic Formatting: Code review diffs are noisy when everyone's editor formats differently. Canonicalization solves this:

```bash
hedl format document.hedl -o canonical.hedl
# Same input always produces identical output
```

Linting Best Practices: Catch unused schemas, inconsistent naming, and anti-patterns:

```bash
hedl lint document.hedl
# Warning: unused struct definition 'OldUser'
# Warning: unqualified reference 'alice' (use @User:alice)
```

Quick Stats: How big is this document? How many entities? What's the nesting depth?

```bash
hedl stats large-file.hedl
# 1,247 entities, 3 levels deep, 15 struct types
```

HEDL is written in Rust, but you're not locked into the Rust ecosystem. Use it from any language, any platform, any environment.
Your Python service needs to parse HEDL. Your C++ backend needs to export HEDL. FFI overhead is 3.65%, negligible in real workloads.
#include "hedl.h"
#include <stdio.h>
int main() {
const char* hedl_text = "%VERSION: 1.0\n---\nkey: value\n";
HedlDocument* doc = NULL;
char* json = NULL;
// Parse HEDL (returns error code)
if (hedl_parse(hedl_text, -1, 1, &doc) != HEDL_OK) {
fprintf(stderr, "Parse error: %s\n", hedl_get_last_error());
return 1;
}
// Convert to JSON
if (hedl_to_json(doc, 0, &json) != HEDL_OK) {
fprintf(stderr, "Conversion error: %s\n", hedl_get_last_error());
hedl_free_document(doc);
return 1;
}
// Process JSON in your existing C/C++/Python code
printf("%s\n", json);
// Clean up memory
hedl_free_string(json);
hedl_free_document(doc);
return 0;
}No memory leaks. Thread-safe. Production-tested.
Your web app needs client-side HEDL parsing. Your Node.js service needs format conversion without spawning processes.
```javascript
import init, { parse, toJson } from 'hedl-wasm';

await init();
const doc = parse(hedlText);
const json = toJson(doc);
// Use in React, Vue, Angular, or vanilla JS
```

WASM module loads in milliseconds. Zero-copy where possible. Same parser as the native Rust implementation.
You're editing a 5,000-line HEDL config file. You mistype a struct name. You want autocomplete for entity IDs. You need to jump to a definition.
HEDL's LSP server gives you:
- Syntax highlighting: Distinguish structs, references, and values at a glance
- Auto-completion: Type `@Us` and get `@User:` suggestions
- Real-time validation: Red squiggles on invalid references before you save
- Go-to-definition: Click `@User:alice` to jump to alice's definition
- Hover documentation: See struct schemas without scrolling
- Quick fixes: "Unqualified reference—add @User prefix?"
Configure your editor (VSCode, Neovim, Emacs, Helix) to use `hedl-lsp` for `.hedl` files. Latency under 10ms for typical operations.
You're building an AI agent that needs to read, transform, and validate structured data. MCP makes HEDL a first-class citizen in LLM tool use.
```bash
hedl-mcp --port 8080
```

Your agent can now:
- Parse and validate HEDL documents
- Convert between formats (HEDL ↔ JSON/YAML/CSV)
- Infer schemas from untyped data
- Query and transform structured data
MCP server handles 50+ concurrent requests. 2ms average latency. Cache hit rate 85%.
Performance isn't just about raw speed—it's about scalability, predictability, and real-world throughput. HEDL is designed for production workloads where both latency and efficiency matter:
Core Operations:
| Operation | Throughput | Latency (Small Doc) |
|---|---|---|
| Parsing | 54.6 MB/s | 142 µs |
| Canonicalization | N/A | 30 µs |
| Linting | 72-931 MB/s | 3.67 µs |
Format Conversion (HEDL → Other):
| Target Format | Throughput | Latency |
|---|---|---|
| JSON | 1,549 MB/s | 291.73 µs |
| YAML | 246 MB/s | 1,834 µs |
| XML | 2,964 MB/s | 153 µs |
| CSV | Fast | Low overhead |
Format Conversion (Other → HEDL):
| Source Format | Throughput | Latency |
|---|---|---|
| JSON | 2,883 MB/s | 442 µs |
| YAML | 377 MB/s | 3,011 µs |
| XML | 953 MB/s | 1,130 µs |
| CSV | Fast | Low overhead |
We tested 6 formats across 3 leading LLMs (GPT-5.1, Mistral Large 3, DeepSeek v3.2) with 65 questions about structured data. The results reveal a clear pattern: accuracy costs tokens, but HEDL breaks that tradeoff.
The Results:
| Format | Avg Accuracy | Tokens/Question | Accuracy per 1k tokens |
|---|---|---|---|
| JSON | 64.95% 🥇 | 5,253 | 12.36 |
| HEDL | 62.23% 🥈 | 2,605 | 23.89 🥇 |
| YAML | 61.53% | 5,367 | 11.46 |
| TOON | 58.79% | 2,904 | 20.24 |
| XML | 50.42% | 4,599 | 10.96 |
| CSV | 23.56% | 1,188 | 19.83 |
What This Means:
JSON wins on raw accuracy (64.95%), but pays a steep token tax. HEDL achieves 62.2% accuracy—only 2.7 points behind—while using less than half the tokens. The efficiency metric tells the real story: 23.89 vs 12.36 correct answers per 1k tokens. That's 93% better efficiency.
TOON, the previous token-efficiency champion, uses slightly more tokens than HEDL (2,904 vs 2,605) but scores 3.4 points lower on accuracy. YAML and XML are verbose without improving comprehension. CSV is compact but structurally impoverished—only 23.6% accuracy.
HEDL vs TOON detailed comparison:
| LLM Model | HEDL Accuracy | TOON Accuracy | HEDL Advantage |
|---|---|---|---|
| GPT-5.1 | 71.8% ± 3.2% | 68.2% ± 0.7% | +3.6 points 🥇 |
| Mistral Large 3 | 51.8% ± 0.7% | 45.1% ± 1.5% | +6.7 points 🥇 |
| DeepSeek v3.2 | 63.1% ± 0.0% | 63.1% ± 0.0% | Tie |
| Average | 62.2% | 58.8% | +3.4 points 🥇 |
Why HEDL wins: HEDL's structured format (typed columns, references, hierarchy) enables LLMs to parse and comprehend data more reliably than minimalist formats, while maintaining exceptional token efficiency.
Real data from 12 production-style datasets, tokenized with tiktoken (OpenAI's tokenizer). The savings compound across different data structures:
| Dataset Type | HEDL Tokens | JSON Tokens | Token Savings |
|---|---|---|---|
| users_flat | 3,409 | 6,478 | 47.4% |
| products_flat | 3,106 | 6,219 | 50.1% |
| blog_nested | 5,201 | 9,738 | 46.6% |
| orders_nested | 835 | 1,600 | 47.8% |
| config_simple | 237 | 476 | 50.2% |
| Average (all 12 datasets) | - | - | 46.7% |
Size Efficiency - Storage and bandwidth:
| Dataset Type | HEDL Bytes | JSON Bytes | Size Savings | Ratio |
|---|---|---|---|---|
| users_flat | ~81KB | ~180KB | 55.0% | 2.2x smaller |
| products_flat | ~89KB | ~200KB | 55.5% | 2.2x smaller |
| blog_nested | ~71KB | ~155KB | 54.5% | 2.2x smaller |
| Average (all 12 datasets) | - | - | 57.7% | 2.4x smaller |
Conversion Performance (bidirectional):
| Direction | Throughput | Latency |
|---|---|---|
| HEDL → JSON | 1,549 MB/s | 291.73 µs avg |
| JSON → HEDL | 2,883 MB/s | 442.43 µs avg |
The Cost Calculation: At GPT-5.2 pricing ($1.75/1M input tokens), a 1M token context in JSON costs $1.75. That same data in HEDL? $0.93. Scale that across millions of API calls, and token efficiency isn't academic—it's bottom-line impact. For a service processing 1B tokens monthly, HEDL saves approximately $820/month compared to JSON.
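As a sanity check, here is the arithmetic behind those figures (the rates are the numbers quoted above):

```rust
fn main() {
    let price_per_token = 1.75 / 1_000_000.0; // GPT-5.2 input pricing, $/token
    let savings = 0.467;                      // average HEDL token reduction vs JSON

    // A 1M-token JSON payload vs its HEDL equivalent.
    let json_cost = 1_000_000.0 * price_per_token;                   // $1.75
    let hedl_cost = 1_000_000.0 * (1.0 - savings) * price_per_token; // ~$0.93
    println!("JSON: ${json_cost:.2}, HEDL: ${hedl_cost:.2}");

    // At 1B JSON tokens/month, the delta is the ~$820/month quoted above.
    let monthly_savings = 1_000_000_000.0 * savings * price_per_token; // ~$817
    println!("Monthly savings: ~${monthly_savings:.0}");
}
```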
- Linear Scaling: O(n) in document size, O(depth) for nesting - no exponential blowup
- Zero-Copy Optimizations: 5,550 allocations saved (33% reduction) for simple strings
- Parallel Processing: 6.19x speedup @ 8 threads, 98.6% batch efficiency
- Streaming Support: 1.2-2.1x faster than full parse for incremental processing
- Peak Throughput: 78.8 MB/s, LSP latency <10ms, MCP latency 2ms
HEDL is built as 19 specialized crates, not a monolith. Need JSON conversion but not XML? Only pay for what you use. Building a web service? Skip the CLI. Embedding in Python? Just the FFI layer.
Core Components - Start here:
- `hedl-core`: The parser. Zero dependencies beyond Rust std. Parse to AST in ~1.5µs per record.
- `hedl`: High-level API. Parse, validate, convert. Most users import only this.
- `hedl-stream`: Streaming parser for multi-GB files. 1.2-2.1x faster than full-document parsing (see the sketch below).
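The streaming path might look like the sketch below; the `hedl_stream::Reader` API shown is an assumption about the interface shape, not the crate's actual surface:

```rust
use std::fs::File;
use std::io::BufReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical event-style reader: yields one record at a time
    // instead of materializing a multi-GB AST in memory.
    let file = BufReader::new(File::open("huge-dataset.hedl")?);
    let mut reader = hedl_stream::Reader::new(file);

    let mut count = 0usize;
    while let Some(record) = reader.next_record()? {
        // Handle each record as soon as it is parsed; memory stays flat.
        let _ = record;
        count += 1;
    }
    println!("processed {count} records");
    Ok(())
}
```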
Format Converters - Pick your targets:
- `hedl-json`, `hedl-yaml`, `hedl-xml`: Bidirectional conversion for web standards
- `hedl-csv`: Export tables to Excel-compatible formats
- `hedl-parquet`: Columnar storage for analytics pipelines
- `hedl-neo4j`: Generate Cypher for graph database loading (0.7ms per 1K nodes; see the sketch below)
- `hedl-toon`: Token-optimized output for LLM contexts
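For the graph path, a sketch of generating Cypher; `hedl_neo4j::to_cypher` is likewise an assumed name, so consult the crate docs for the real call:

```rust
use hedl::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = parse(r#"
%VERSION: 1.0
%STRUCT: User: [id, name]
---
users: @User
| alice, Alice Smith
"#)?;

    // Hypothetical converter: emits CREATE/MERGE statements, mapping
    // typed @Type:id references to graph relationships.
    let cypher = hedl_neo4j::to_cypher(&doc)?;
    println!("{cypher}");
    Ok(())
}
```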
Developer Tools - Catch errors early:
- `hedl-c14n`: Deterministic formatting. Same input → identical output. Git-friendly diffs.
- `hedl-lint`: Best-practice enforcement. 1% false positive rate. Sub-microsecond per rule.
- `hedl-cli`: Command-line Swiss Army knife. Validate, convert, lint, format, analyze.
- `hedl-test`: Shared test fixtures and property testing utilities.
Cross-Language Integration - Use everywhere:
- `hedl-ffi`: C ABI for Python/C++/Go/Node.js. 3.65% overhead. Zero memory leaks.
- `hedl-wasm`: Browser and Node.js. Same parser, compiles to WebAssembly.
- `hedl-lsp`: Editor integration. <10ms latency. Works with VSCode, Vim, Emacs.
- `hedl-mcp`: AI agent protocol. 50+ concurrent requests. 85% cache hit rate.
Quality Assurance - Trust but verify:
- `hedl-bench`: Criterion-based benchmarks. Regression detection. All numbers in this README are from these benchmarks.
Start here: Language Specification - Complete HEDL syntax with examples and rationale for design decisions.
Going deeper:
- API Documentation - Rust API reference with examples
- CLI Reference - All command-line flags and batch operations
- FFI Guide - Integrate with C/C++/Python/Go
- WASM Guide - Use HEDL in browsers and Node.js
Questions? Open an issue or join the discussion.
You're building a RAG system that processes millions of API calls monthly. Every retrieval includes structured metadata—user context, document attributes, relationship graphs. With JSON, you're burning tokens on repetitive key names. With HEDL, you cut token usage by 46.7% while maintaining near-JSON comprehension (62.2% vs 65%). The efficiency gain (93% more answers per token) means your service scales further before hitting cost constraints.
Your ETL pipeline processes streaming data from multiple sources, converts formats, validates schemas, and exports to Neo4j and Parquet. HEDL's streaming parser is 1.2-2.1x faster than full document parsing. Batch processing maintains 98.6% efficiency at 100x scale. Parallel processing delivers 6.19x speedup on 8 cores. Convert to Neo4j Cypher at 0.7ms for 1,000 nodes. The format adapts to your infrastructure, not the other way around.
You're building a graph database from heterogeneous sources. HEDL's typed references (@Type:id syntax) and first-class relationship support make entity resolution straightforward. Reference resolution processes 40 refs/ms with linear scaling—no exponential blowup as your graph grows. Export directly to Neo4j Cypher without intermediate transformations.
Your service config spans multiple environments with complex nested structures. HEDL's schema validation catches typos and type errors before deployment. LSP integration provides autocomplete and hover docs in your editor. The ditto operator (^) eliminates repetition in tabular config. Deterministic canonicalization makes diffs readable. It's infrastructure-as-code that doesn't fight your workflow.
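As an illustration (the schema and values here are invented for this example), a multi-environment config where the ditto operator keeps rows terse:

```hedl
%VERSION: 1.0
%STRUCT: Env: [id, region, replicas, log_level]
---
service_name: checkout-api
environments: @Env
| dev, eu-west-1, 1, debug
| staging, ^, 2, info
| prod, ^, 8, ^
```

Here staging and prod inherit eu-west-1 from the rows above, prod also inherits info, and schema validation still checks every expanded row.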
Maximum accuracy is non-negotiable: JSON scores 2.7 percentage points higher on LLM comprehension. If those 2.7 points matter more than 50% token savings, stick with JSON.
Ecosystem lock-in matters: JSON has decades of tooling, libraries, and developer familiarity. Every language has multiple battle-tested JSON parsers. HEDL is new. If you need maximum ecosystem compatibility today, JSON is the safe choice.
HEDL is open source (Apache 2.0) and welcomes contributions. Found a bug? Have an idea for a new format converter? Want to optimize the parser?
Quick start:
```bash
git clone https://github.com/dweve-ai/hedl.git
cd hedl
cargo build --all-features
cargo test --all-features
```

Quality bar: We care about correctness, performance, and maintainability. Every PR goes through:
- Unit tests (all public APIs tested)
- Integration tests (cross-crate workflows)
- Property tests (fuzz testing with `proptest`)
- Benchmark regression checks (no performance regressions)
- Security review (input validation, resource limits)
Areas we need help:
- Format converters for your favorite format
- Performance optimization (SIMD, zero-copy, cache efficiency)
- Language bindings (Dart, Ruby, Zig, etc.)
- LSP features (refactoring, code actions)
- Documentation and examples
See an issue labeled "good first issue" or "help wanted"? That's a great place to start.
Maintained by Dweve B.V. and the open-source community.
Licensed under Apache License 2.0. Copyright © 2025 Dweve IP B.V. and contributors.
Find us: Homepage · GitHub · Crates.io · API Docs · Issues
Built on: Rust stdlib · serde · quick-xml · parquet · criterion · tower-lsp
Built with Rust 🦀 · Optimized for AI 🤖