Making room for intelligence at the edge

14 Apr, 2026

A Bytecode VM for liblognorm

And why it unlocks a new generation of threat detection at the edge

Jérémie Jourdin – CTO, Advens – March 2026

Normalize first, normalize at the edge

Before diving into benchmarks and bytecode, let me address a question that some readers may have in mind: why bother normalizing logs at ingest time?

The industry largely agrees that normalization matters. The days of dumping raw syslog into flat files and grepping through them are behind us. Splunk pioneered “schema on read” – store the raw event, extract fields at search time – but even Splunk has been drifting toward ingest-time extraction for years. Elastic built an entire ecosystem around it: Logstash, ingest pipelines, Filebeat modules, and ECS (Elastic Common Schema) are all designed to produce structured, typed, schema-compliant documents before they hit the index. The data lake crowd (raw-to-S3, query with Athena/Trino) still defers to query time, but for security operations, the consensus is clear: normalize early, normalize once.

The real question is where.

Elastic normalizes on centralized infrastructure – Logstash nodes and Elasticsearch ingest pipelines, JVM-based, horizontally scaled. The logs leave the source site as raw text and get structured somewhere in the middle. Elastic Agent can apply lightweight processors at the source, but ECS mapping, grok parsing, and enrichment run on Elasticsearch ingest nodes. This works, but it means every byte of raw log crosses the network before anyone knows what’s in it. Enrichment – GeoIP, CTI reputation, CMDB correlation – happens centrally too, if it happens at all.

Rainer Gerhards, creator of rsyslog, took a different bet over fifteen years ago. His thesis – Efficient Normalization of IT Log Messages under Realtime Conditions (2016) – formalized what he had been building since rsyslog’s early days: a normalization engine fast enough to structure every log line at the edge, on the collector itself, in C, in real time, before the log ever crosses the network. The result was liblognorm v2 and its PDAG (Parse Directed Acyclic Graph) architecture, designed to achieve near-O(1) parsing regardless of rulebase size.

This is not a historical curiosity. It is an architectural choice that compounds. When you normalize at the edge, you know that 10.0.0.1 is an IP address – not a string that happens to look like one – before the log leaves the site. Which means you can perform GeoIP lookups, CTI reputation checks, and asset correlation right there, on the cheapest possible hardware, at ingest time. The enriched, structured, schema-compliant document is what crosses the network – not the raw text that someone will have to re-parse later.

rsyslog and liblognorm got this right. The question was never whether to normalize, or even when, but whether the normalization engine could keep up with the volume and leave enough headroom for what comes after.

Because normalization alone is not the end goal. Once you have structured fields at the edge, the natural next step is enrichment: GeoIP resolution on every IP field, CTI reputation lookups, CMDB asset correlation, ECS field mapping… Each of these is an additional action in the rsyslog pipeline – and a production configuration for a complex log source like a Stormshield SNS firewall can easily reach 30 actions per message, spread across hundreds of lines of scripting. At that point, the edge collector is no longer just parsing – it is running a full detection-grade enrichment pipeline on constrained hardware.

And that is where we hit the ceiling. Not because rsyslog was the wrong tool – it was the only tool that could even attempt this at the edge – but because the sheer volume of per-message work exposed performance characteristics in the normalization layer that had never been a problem when parsing was the only job.

So I did what any engineer should do first: I profiled it.

Profiling the hot path

I profiled liblognorm processing one million Stormshield SNS firewall logs – a representative, complex production workload with around 50 fields per message. What I was looking for was straightforward: where does the CPU time actually go?

| Category | CPU % | What it does |
|---|---|---|
| JSON field construction | 42.6% | Hash inserts, strcmp, strdup for every field |
| Memory allocation | 18.9% | malloc/free per field, per message |
| JSON serialization | 18.2% | Incremental buffer growth with realloc |
| Actual parsing work | 11.8% | Scanning fields, matching delimiters |
| Everything else | 8.5% | Recursive dispatch, string creation |

Only 11.8% of CPU time goes to actual parsing. The rest – over 80% – is spent building, populating, and serializing JSON objects.

Now, this is not a flaw in liblognorm. When Rainer designed the PDAG walker, JSON was the natural output format. The library’s job is to produce structured data, and json-c objects are structured data. The walker threads a json_object pointer through its recursive traversal, and each parser emits fields directly into that object. This coupling made perfect sense: the parsing and the output representation were the same concern.

But when you want to build an enrichment pipeline on top of the parser – when the normalized fields need to feed into GeoIP lookups, CTI matching, CMDB correlation, and ECS mapping before they reach the indexer – that coupling becomes a bottleneck. Every downstream module that needs to read a parsed field must navigate the json-c hash table: acquire a mutex, compute a hash, walk a linked list, copy a string. And every module that adds enrichment data must insert into the same tree. With 30+ enrichment actions per message, the json-c tree becomes the central contention point of the entire pipeline.

The insight is not “JSON is slow.” The insight is that parsing and output representation are two different concerns, and they should be decoupled. The parser’s job is to identify field boundaries and types in the input stream. The output’s job is to present those fields in whatever format the downstream consumer needs. When these two are fused – when the parser must build a json-c tree as it works – you cannot optimize either one independently.

That is the problem I set out to solve.

Decoupling parsing from representation

The solution has two parts, designed together:

A new execution model for parsing – the bytecode VM – that replaces the recursive PDAG walker with a flat instruction loop. No recursion, no function pointer dispatch, no json_object threaded through the call chain.

A new output format – the flat result structure – that stores parse results as an arena-allocated array of (name, value) pointer pairs. No malloc per field, no hash table, no key duplication. JSON is only produced if and when something explicitly asks for it.

These two are co-designed. The flat result is not just “a faster JSON alternative” – it is the API through which downstream modules (including the enrichment engine I’ll introduce later) can access parsed fields at near-zero cost. And the bytecode VM is not just “a faster PDAG walker” – it is an execution model that naturally produces flat results, supports SIMD acceleration, and enables hardware-aware optimizations that a recursive tree walker cannot.
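
To make the flat result concrete, here is a minimal sketch of what such a structure and its zero-copy lookup could look like. All names, sizes, and layout details are illustrative assumptions for this article, not liblognorm's actual turbo API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FLAT_MAX_FIELDS 128
#define FLAT_INLINE_LEN 48   /* short values stored inline, no allocation */

typedef struct {
    const char *name;             /* points into the compiled rulebase */
    size_t      name_len;
    const char *value;            /* points into the arena or inline buffer */
    size_t      value_len;
    char        inline_buf[FLAT_INLINE_LEN];
} flat_field;

typedef struct {
    flat_field fields[FLAT_MAX_FIELDS];
    size_t     count;
} flat_result;

/* Append a field: short values are copied inline; longer ones keep the
 * pointer (in the real design it would reference the message or arena). */
static int flat_add(flat_result *r, const char *name, const char *value)
{
    if (r->count == FLAT_MAX_FIELDS) return -1;
    flat_field *f = &r->fields[r->count++];
    f->name      = name;
    f->name_len  = strlen(name);
    f->value_len = strlen(value);
    if (f->value_len < FLAT_INLINE_LEN) {
        memcpy(f->inline_buf, value, f->value_len + 1);
        f->value = f->inline_buf;
    } else {
        f->value = value;
    }
    return 0;
}

/* Zero-copy lookup: a linear scan over a small contiguous array --
 * no mutex, no hashing, no key duplication. */
static const char *flat_get(const flat_result *r, const char *name)
{
    size_t len = strlen(name);
    for (size_t i = 0; i < r->count; i++)
        if (r->fields[i].name_len == len &&
            memcmp(r->fields[i].name, name, len) == 0)
            return r->fields[i].value;
    return NULL;
}
```

The point of the sketch: a downstream consumer reads a field with one cache-friendly scan, and the whole structure is reclaimed with a single reset between messages.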

The restaurant analogy

Imagine you’re a chef in a restaurant. The legacy approach is like reading the recipe book from scratch for every single order – flipping pages, cross-referencing, going back and forth. Even if you’ve made the same dish 10,000 times today, you still pay the full page-flipping overhead.

The bytecode VM approach is a prep sheet. Before the restaurant opens, the head chef reads all the recipes once and writes a flat, numbered checklist. No more page-flipping. Just follow the numbered steps from top to bottom, with occasional jumps. The information is the same, but the format is optimized for fast, sequential execution.

Compile once, execute many

Compilation (once, at startup): The PDAG tree is traversed once and flattened into a linear bytecode stream. Custom type references become CALL / RET subroutine pairs. Backtracking alternatives become FORK instructions with a static save / restore stack. The compiler is a single DFS pass – straightforward and deterministic.

Execution (per message): A flat dispatch loop reads instructions from a program counter. Results go into the arena-backed flat array. No recursion, no malloc, no JSON. The instruction set has 42 opcodes across 8 categories (control flow, literal matching, typed field extraction, whitespace skipping, metadata, nested context, assertions, protocol-specific parsers). Each instruction is exactly 64 bytes – one CPU cache line.
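
The "one instruction per cache line" property can be expressed directly in C. The field names below are hypothetical, not the actual turbo encoding; what matters is the compile-time size guarantee:

```c
#include <stdint.h>

/* Illustrative layout of one cache-line-sized VM instruction. */
typedef struct {
    uint8_t  opcode;        /* one of the instruction-set opcodes */
    uint8_t  flags;
    uint16_t literal_len;   /* length of the inline literal, if any */
    uint32_t next_pc;       /* fall-through target */
    uint32_t fail_pc;       /* branch target on match failure */
    uint32_t field_id;      /* index of the field name in a string table */
    char     literal[48];   /* short literals stored inline */
} vm_insn;

/* The whole point: one instruction == one 64-byte cache line. */
_Static_assert(sizeof(vm_insn) == 64, "vm_insn must be exactly 64 bytes");
```

Fetching an instruction then touches exactly one cache line, and sequential execution becomes a predictable linear walk through memory.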

The dispatch loop uses computed goto (GCC / Clang extension) for maximum throughput: each opcode handler jumps directly to the next, with a unique branch site per handler for better CPU branch prediction. On compilers without this extension, the code falls back to a standard switch / case. The dispatch table (256 entries × 8 bytes = 2 KB) fits in L1 data cache.
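
A minimal sketch of the computed-goto pattern, with the switch fallback, using a toy two-opcode program (the opcodes and semantics are invented for illustration, not the real instruction set):

```c
#include <stddef.h>

enum { OP_SKIP_SPACES, OP_COUNT_WORD, OP_HALT };

/* Runs a tiny bytecode program over `msg`; returns the length of the
 * first word after leading spaces. */
static size_t run_program(const unsigned char *prog, const char *msg)
{
    size_t pos = 0, word_len = 0;
#if defined(__GNUC__)
    /* GCC/Clang "labels as values": each handler ends with its own
     * indirect jump, giving the branch predictor one site per opcode. */
    static const void *dispatch[] = {
        [OP_SKIP_SPACES] = &&do_skip,
        [OP_COUNT_WORD]  = &&do_word,
        [OP_HALT]        = &&do_halt,
    };
    const unsigned char *pc = prog;
    #define NEXT() goto *dispatch[*pc++]
    NEXT();
do_skip:
    while (msg[pos] == ' ') pos++;
    NEXT();
do_word:
    while (msg[pos] && msg[pos] != ' ') { pos++; word_len++; }
    NEXT();
do_halt:
    return word_len;
    #undef NEXT
#else
    /* Portable fallback: a single switch, one shared branch site. */
    for (const unsigned char *pc = prog;;) {
        switch (*pc++) {
        case OP_SKIP_SPACES: while (msg[pos] == ' ') pos++; break;
        case OP_COUNT_WORD:
            while (msg[pos] && msg[pos] != ' ') { pos++; word_len++; }
            break;
        case OP_HALT: return word_len;
        }
    }
#endif
}
```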

SIMD: processing 16 bytes at once

Log parsing is fundamentally about scanning strings for delimiters: find the next space, the next equals sign, the next quote. The scalar approach checks one byte at a time. SIMD (Single Instruction, Multiple Data) examines 16 bytes in a single CPU instruction.

The turbo VM includes architecture-specific SIMD implementations for all performance-critical primitives: character search, character-set matching, whitespace skipping, word extraction, and delimiter-bounded field extraction. On x86-64, it uses SSE4.2’s PCMPISTRI hardware string instructions. On ARM64, where no equivalent exists, I implemented the nibble-parallel technique – the same algorithm used by Intel’s Hyperscan regex engine and the simdjson JSON parser – using vqtbl1q_u8 table lookups.

The ARM NEON implementation reaches 83% of SSE4.2 throughput despite having no dedicated string search instruction – and 8× the scalar baseline. Every SIMD path has a pure C scalar fallback, so the code compiles and runs correctly on any architecture.
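
The shape of a SIMD delimiter scan with a scalar fallback can be sketched as follows. This uses plain SSE2 compare-and-movemask for portability of illustration; the actual turbo VM uses SSE4.2 PCMPISTRI on x86-64 and NEON table lookups on ARM64:

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Find the first occurrence of `delim` in buf[0..len): 16 bytes per
 * iteration where SSE2 is available, one byte at a time otherwise. */
static size_t find_delim(const char *buf, size_t len, char delim)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8(delim);
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask)                          /* bit k set => match at i+k */
            return i + (size_t)__builtin_ctz(mask);
    }
#endif
    for (; i < len; i++)                   /* scalar tail / fallback */
        if (buf[i] == delim)
            return i;
    return len;                            /* not found */
}
```

The scalar loop doubles as both the tail handler and the pure-C fallback, which is the same structure that keeps the turbo code compiling on any architecture.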

Arena allocator: zero malloc, zero free

Instead of calling malloc and free for every extracted field, the VM uses an arena allocator: a pre-allocated 16 KB buffer (fits entirely in L1 cache) with O(1) bump-pointer allocation and O(1) bulk reset between messages. Field values shorter than 48 bytes are stored inline in the result struct – no allocation at all. Between messages, a single pointer reset reclaims all memory. Backtracking uses mark/restore to snapshot and roll back the arena in O(1).
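
A bump-pointer arena with mark/restore fits in a few lines of C. This is a sketch of the idea described above, with illustrative names and a fixed 16 KB buffer:

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (16 * 1024)    /* small enough to stay in L1 cache */

typedef struct {
    uint8_t buf[ARENA_SIZE];
    size_t  used;
} arena;

/* O(1) bump-pointer allocation; returns NULL when the arena is full. */
static void *arena_alloc(arena *a, size_t n)
{
    n = (n + 7) & ~(size_t)7;     /* keep allocations 8-byte aligned */
    if (a->used + n > ARENA_SIZE) return NULL;
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* Snapshot / rollback for backtracking: both O(1). */
static size_t arena_mark(const arena *a)        { return a->used; }
static void   arena_restore(arena *a, size_t m) { a->used = m; }

/* Between messages: one pointer reset reclaims everything. */
static void   arena_reset(arena *a)             { a->used = 0; }
```

Rolling back a failed parse branch is just restoring one integer, which is why backtracking stays O(1) regardless of how many fields the abandoned branch had extracted.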

Benchmark results

Important: absolute numbers depend on hardware. What matters are the relative comparisons between the legacy engine and the turbo VM on identical hardware with identical workloads.

Raw parsing: 7.2× faster

Measured on 1 million Stormshield SNS firewall logs using the lognormalizer CLI (no rsyslog overhead):

| Metric | Legacy | Turbo VM |
|---|---|---|
| User CPU time | 22.37 s | 3.12 s |
| Throughput | ~45K msg/s | ~320K msg/s |
| Speedup | – | 7.2× |

The speedup factors compound roughly multiplicatively: eliminating per-field malloc (~2×), removing JSON object construction (~1.5×), computed goto dispatch with prefetch (~1.4×), and SIMD acceleration (~1.3×). These are consistent with published literature on interpreter optimization (LuaJIT 5–10×, PyPy 5–50×).

In a full rsyslog pipeline: Amdahl’s law bites

Here is where it gets honest. In a full rsyslog pipeline (TCP ingestion, parsing, template rendering, file output), the 7.2× raw parsing speedup translates to only 1.09× end-to-end. Parsing is roughly 30% of total pipeline time; the remaining 70% (I/O, template rendering, queue management) is unaffected.
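
As a sanity check, Amdahl’s law says that speeding up a fraction p of the work by a factor s yields an overall speedup of 1 / ((1 − p) + p/s). One plausible reading of the numbers: if parsing were 30% of wall-clock time, a 7.2× parsing speedup would predict about 1.35× end to end; the measured 1.09× suggests parsing accounts for closer to 10% of wall clock in this I/O-heavy configuration, with the ~30% figure reflecting its share of CPU work. A minimal sketch of the formula:

```c
/* Amdahl's law: a fraction p of total time is sped up by factor s. */
static double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```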

If the turbo VM’s only contribution were faster parsing, this would be a disappointing result.

But remember: the point was never just faster parsing. The point was decoupling parsing from representation – and the payoff of that decoupling shows up downstream.

The real unlock: what the decoupling enables

The turbo VM’s flat result structure, its snapshot mechanism, and its zero-copy field access API are not just performance tricks. They are a new interface contract between the parser and everything that comes after it.

In the legacy pipeline, every module that needs parsed data must go through the json-c tree: mutex lock, hash lookup, string copy. A typical production pipeline for a complex firewall like Stormshield SNS requires around 30 distinct enrichment actions per message: many GeoIP lookups across multiple databases, many CMDB and reputation table lookups, complex conditional logic per IP field, and a template that accesses 34 individual fields one by one. Each action pays the json-c tax.

With the turbo VM, a downstream module can read any parsed field directly from the flat snapshot: a scan of a small, contiguous, cache-resident array – no mutex, no allocation, no JSON. This is what made it possible to build mmthreat.

mmthreat: 30 actions collapsed into 1

mmthreat is a new rsyslog module – a single action that replaces the entire enrichment chain: CTI reputation lookups, GeoIP resolution, CMDB asset correlation, and ECS field mapping. It reads field values directly from the turbo snapshot, performs all lookups against a pre-compiled threat database, and writes enrichment results back to the message in a single pass.

This design is only possible because the turbo VM decoupled parsing from json-c. In the legacy pipeline, there is no way to access a parsed field without navigating the json-c tree. mmthreat depends on the snapshot’s flat, immutable, zero-copy field array – an interface that did not exist before.

mmthreat will be the subject of a dedicated article. For now, what matters is the result.

$!all-json: 34 template accesses collapsed into 1

The same decoupling enables a switch from per-field ECS templates (34 individual property lookups, each with its own mutex lock and hash navigation) to rsyslog’s $!all-json property, which serializes the entire message tree in a single call. Combined with the turbo VM’s lazy JSON materialization (the json-c tree is only built when something explicitly needs it), this eliminates the template rendering bottleneck.
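
In rsyslog configuration terms, the switch looks roughly like this. The field names and template names are illustrative; %$!all-json% is the standard rsyslog property that serializes the entire message tree in one call:

```
# Legacy: one property lookup per ECS field (34 in the real pipeline)
template(name="ecs_legacy" type="list") {
    constant(value="{\"source\":{\"ip\":\"")
    property(name="$!src-ip")
    constant(value="\"},\"destination\":{\"ip\":\"")
    property(name="$!dst-ip")
    constant(value="\"}}\n")
}

# Optimized: a single serialization of the whole tree
template(name="ecs_turbo" type="string" string="%$!all-json%\n")
```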

Production pipeline: 2.44× end-to-end

When all three optimizations combine – turbo parsing, mmthreat enrichment, and $!all-json serialization – the production pipeline benchmark on 1 million Stormshield SNS logs tells a different story:

| Metric | Legacy Pipeline | Optimized Pipeline |
|---|---|---|
| Actions per message | ~31 | 2 |
| Throughput (8 threads) | 51,867 msg/s | 126,342 msg/s |
| Processing time (1M logs) | 19.28 s | 7.92 s |
| End-to-end speedup | – | 2.44× |
| Config complexity | ~200 lines | ~25 lines |

The most telling number: the optimized pipeline at 1 CPU core (97,847 msg/s) already outperforms the legacy pipeline at 8 cores (51,867 msg/s). Same hardware, same logs, same output. Equivalent throughput with a fraction of the CPU cost.

Where the savings come from

| Stage | Legacy | Optimized | Savings |
|---|---|---|---|
| Parsing | ~3,000 ns | ~400 ns | 2,600 ns |
| Enrichment (30 vs 1 action) | ~8,000 ns | ~510 ns | 7,490 ns |
| Template rendering | ~2,800 ns | ~700 ns | 2,100 ns |
| TCP I/O (unchanged) | ~5,071 ns | ~5,071 ns | 0 |

The enrichment module (mmthreat) contributes the largest single chunk – 39% of total savings. The turbo VM’s parsing speedup contributes 21%. The $!all-json template optimization contributes 17%. The remaining savings come from reduced overhead across the pipeline (fewer actions, fewer queue transitions, less contention).

Crucially, the enrichment and template optimizations depend on the turbo VM’s architecture. mmthreat uses the flat snapshot API. The lazy JSON materialization only works because the parser no longer forces a json-c tree into existence. The 2.44× result is not three independent optimizations added together – it is a single architectural change (decoupling parsing from representation) whose benefits compound across the pipeline.

Design principles

Additive, not replacement. The turbo VM does not modify, replace, or deprecate any existing liblognorm code. The legacy PDAG walker remains the default. Turbo mode is opt-in at both compile time (--enable-turbo) and runtime (turbo="on" in mmnormalize). If a rulebase contains unsupported parsers (3 out of 31: repeat, alternative, interpret), the compiler logs a warning and falls back transparently. No existing configuration breaks.
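
A minimal opt-in sketch, assuming the turbo="on" action parameter proposed here (the rulebase path is hypothetical):

```
module(load="mmnormalize")

# Turbo is opt-in; if the rulebase uses an unsupported parser,
# the compiler warns and falls back to the legacy PDAG walker.
action(type="mmnormalize"
       rulebase="/etc/rsyslog.d/stormshield.rb"
       turbo="on")
```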

Minimal footprint in core liblognorm. The turbo code lives in 22 isolated source files (~8,200 lines) with 3 public entry points. Changes to core liblognorm: a 20-line wrapper in liblognorm.c, one struct field in lognorm.h, build system integration. The rsyslog integration uses opaque void * callbacks on the message struct – msg.h has zero dependency on liblognorm headers.

Portable by design. Every SIMD path has a scalar fallback. Architecture detection is compile-time via preprocessor. No runtime CPUID. The code compiles on x86-64 (SSE4.2), ARM64 (NEON), and any other architecture with the pure C scalar path.

Honest about trade-offs. The VM + SIMD layer represents ~80% of implementation effort for ~20% of additional raw speedup beyond what the arena + flat result alone would provide. But the VM architecture is what enabled mmthreat and the $!all-json optimization. Without it, the production gain would have been ~1.5×, not 2.44×.

Upstream contribution

This work is being proposed for upstream integration into liblognorm. A merge request is in progress. The turbo VM for liblognorm/mmnormalize is the first piece being submitted; mmthreat will follow separately.

The design was deliberately shaped for upstream acceptance: additive, self-contained, gated behind a configure flag, passing all existing tests unchanged. This is not a fork. It is a contribution back to the project that made this work possible in the first place.

If you use rsyslog and liblognorm at scale, this is directly relevant to your log pipeline performance.

What’s next

mmthreat deep dive. A dedicated article will cover the threat intelligence enrichment engine that builds on the turbo VM: native MISP integration, multi-source IOC matching, GeoIP resolution, CMDB correlation, ECS field mapping, and research directions around context correlation and identity mapping at the edge.

Large-payload benchmarks. EDR logs (2–8 KB), WAF logs with full HTTP bodies, JSON-structured cloud telemetry. These are where SIMD’s 16-bytes-at-a-time advantage truly compounds. The benchmarks presented here have a conservative bias – short, uniform log lines that favor the legacy engine.

FreeBSD TCP stack optimization. The current benchmarks are I/O-bound at ~197K msg/s. The next round targets a highly optimized TCP stack to push the I/O ceiling and expose more of the turbo VM’s throughput advantage.

Conclusion

Fifteen years ago, Rainer Gerhards bet that normalizing logs at ingest time was the right architecture. He was right. liblognorm and rsyslog gave the world a normalization engine fast enough for real-time processing, and an ecosystem that thousands of production pipelines depend on.

What I’ve built is not a replacement for that foundation. It is the next layer: a high-performance field access architecture that decouples parsing from JSON representation, eliminates json-c from the hot path, and provides a zero-copy API for downstream modules to access parsed fields at near-zero cost.

The raw parsing speedup is 7.2×. But the pitch is not “faster parsing”.

The pitch is: “because parsing and representation are now decoupled, we can build an enrichment engine (mmthreat) and a serialization path ($!all-json) that together deliver 2.44× end-to-end production speedup – with zero changes to existing configurations, and a 10× reduction in pipeline complexity.”

For SOC operators, this means same hardware, double the capacity – or same capacity, half the CPU. For the rsyslog community, it means a new foundation for building real-time intelligence into the log pipeline.

The turbo VM is the engine. What we build on top of it – starting with mmthreat – defines the real-world impact.

Jérémie Jourdin – CTO, Advens

liblognorm: https://github.com/rsyslog/liblognorm

rsyslog: https://www.rsyslog.com

Rainer Gerhards’ thesis: https://www.researchgate.net/publication/310545144_Efficient_Normalization_of_IT_Log_Messages_under_Realtime_Conditions