The problem
Hardware data has two competing demands — and most formats fail both
Organizations dealing with physical hardware data constantly wrestle with two competing challenges: storage and latency. As physical systems grow more complex, the number of sensors and their polling frequencies will only increase. The effect compounds: more data streams, each sampled more often, lead to rapidly growing log files.
For hardware engineering and compliance, this data is the ultimate asset. When dealing with costly hardware and expensive testing environments, organizations cannot afford to waste the telemetry generated during every run. This data must be collected and stored efficiently, which means minimizing storage bloat while keeping query latency low enough that engineers actually use it.
Dataset used in this study: 504,123 samples (~504 seconds at 1,000 Hz), 12 sensor channels + time. Source data generated via asammdf in MF4 format, then converted to Parquet. Two variants tested: realistic low-entropy drive-cycle data and random high-entropy noise representing worst-case conditions.
The current data landscape
A fragmented world of proprietary formats — optimized for writing, not reading
Historically, different hardware domains have siloed themselves into specialized, proprietary log formats optimized primarily for embedded writing rather than analytical reading. These formats create severe bottlenecks when engineers attempt to analyze thousands of test runs.
| Industry | Common Log Formats |
|---|---|
| Aerospace & Automotive | MF4, TDMS, CCSDS |
| Robotics & UAVs | PX4 logs, DataFlash, MCAP, ROS bags, ULog |
| Test Stands / Hardware Labs | MF4, MAT, TDMS, CSV |
To demonstrate why formats like MF4 create analytical bottlenecks, we conducted a rigorous study comparing MF4 + asammdf against Apache Parquet paired with the DuckDB analytical engine. Identical query workloads were run on both low-entropy and high-entropy data variants.
Storage efficiency
The suitcase analogy — how formats pack your data
When it comes to file size, the format you choose dictates how efficiently you can pack your data. Think of it like packing clothes for a trip. Legacy formats like MF4 place every single item in its own rigid box, leaving you with a massive suitcase. Parquet rolls and vacuum-packs similar items together, vastly reducing the footprint.
The takeaway: For realistic sensor data, Parquet's built-in compression shrinks a 50.0 MB file to just 1.8 MB, roughly a 28× reduction. Crucially, this is a lossless conversion: every single original value is preserved exactly as recorded. For standard deep tech applications, these storage savings compound over time across hundreds of test runs.
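To get a feel for why the low-entropy variant compresses so much better than random noise, here is a minimal sketch using only Python's standard library. It is not Parquet: `zlib` stands in for Parquet's encodings and codecs, the signal shapes and sample count are invented for illustration, and the ratios it prints are not expected to match the 50.0 MB to 1.8 MB figure above.

```python
import math
import random
import zlib
from array import array

N = 50_000  # hypothetical sample count, far smaller than the study's 504,123

def ratio(values) -> float:
    """Raw float64 size divided by zlib-compressed size (higher = more compressible)."""
    raw = array("d", values).tobytes()
    return len(raw) / len(zlib.compress(raw, 6))

# Low-entropy stand-in: a slowly varying, speed-like curve quantized to 0.1,
# so consecutive samples frequently repeat the same value.
smooth = [round(60 + 40 * math.sin(i / 5000), 1) for i in range(N)]

# High-entropy stand-in: uniform random floats with no repetition to exploit.
rng = random.Random(0)
noise = [rng.random() for _ in range(N)]

smooth_ratio = ratio(smooth)
noise_ratio = ratio(noise)
print(f"low-entropy ratio:  {smooth_ratio:.1f}x")
print(f"high-entropy ratio: {noise_ratio:.1f}x")
```

The repeated values in the smooth signal are what a compressor feeds on; the random signal gives it almost nothing to work with, which is why the noise variant is treated as the worst case.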
Query design
Five queries that mirror the daily reality of a hardware engineer
Saving storage space is necessary, but it doesn't matter if engineers have to wait minutes for a dashboard to load. We designed five standard queries to reflect real analytical workflows on hardware telemetry.
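As a rough illustration, the five queries named in the results tables could be written for DuckDB along the following lines. This is a sketch, not the study's actual code: the file name `run.parquet`, the column names `timestamps` and `ch01` to `ch04`, and the exact shape of each query (for example, whether the time window is inclusive) are all assumptions.

```python
# Hypothetical source and column names; substitute your own file and schema.
SOURCE = "read_parquet('run.parquet')"

QUERIES = {
    "row_count": f"SELECT count(*) FROM {SOURCE}",
    "time_window": f"SELECT * FROM {SOURCE} WHERE timestamps BETWEEN 10 AND 20",
    "single_channel_sum": f"SELECT sum(ch01) FROM {SOURCE}",
    "four_channel_sum": (
        f"SELECT sum(ch01), sum(ch02), sum(ch03), sum(ch04) FROM {SOURCE}"
    ),
    "per_second_average": (
        f"SELECT floor(timestamps) AS sec, avg(ch01) AS mean_ch01 "
        f"FROM {SOURCE} GROUP BY sec ORDER BY sec"
    ),
}

def run_all(threads: int = 4) -> dict:
    """Run every query with DuckDB and return the fetched rows per query name."""
    import duckdb  # third-party; imported lazily so the definitions load without it

    con = duckdb.connect()
    con.execute(f"SET threads TO {threads}")
    return {name: con.execute(sql).fetchall() for name, sql in QUERIES.items()}
```

Calling `run_all()` requires the `duckdb` package and a Parquet file matching the assumed schema. Note that only the time-window query touches every column; the others name the columns they need.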
Tests ran on macOS 14.6, Apple Silicon (8 CPU cores), Python 3.13, asammdf 8.8.13, DuckDB 1.5.3. DuckDB used 4 threads (vectorized multithreading). 3 cold repetitions + 5 warm repetitions per query — medians reported.
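The cold/warm protocol above can be sketched as a small timing harness. How the study isolated its cold runs is not stated, so this version makes an assumption: a cold repetition rebuilds the query from scratch (reopening the file or connection) inside the timed region, while warm repetitions reuse one query after an untimed priming call. It does not flush the operating system's page cache. The names `bench`, `timed`, and `make_toy_query` are hypothetical.

```python
import statistics
import time
from typing import Callable

Query = Callable[[], object]

def timed(fn: Query) -> float:
    """Wall-clock seconds for a single call of fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def bench(make_query: Callable[[], Query], cold_reps: int = 3, warm_reps: int = 5) -> dict:
    """Return median cold and warm timings for the query built by make_query."""
    # Cold: setup plus the first execution are timed together, once per repetition.
    cold = [timed(lambda: make_query()()) for _ in range(cold_reps)]
    # Warm: build once, prime once (untimed), then time repeated executions.
    query = make_query()
    query()
    warm = [timed(query) for _ in range(warm_reps)]
    return {
        "cold_median_s": statistics.median(cold),
        "warm_median_s": statistics.median(warm),
    }

def make_toy_query() -> Query:
    """Toy workload standing in for 'open a log file, then run a query on it'."""
    data = list(range(100_000))
    return lambda: sum(data)

result = bench(make_toy_query)
print(result)
```

Reporting the median rather than the mean keeps a single slow repetition (for example, one hit by background activity) from skewing the result.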
Benchmark results
In almost every scenario, DuckDB on Parquet wins — and it isn't close
We tested speed on both low-entropy and high-entropy data, in both cold and warm modes. Times below are medians across repetitions. DuckDB on Parquet was faster on every query, with speedups ranging from 2.4× to 48×.
Low-entropy data, cold runs
| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
|---|---|---|---|
| Row count | 0.006 s | 0.019 s | 2.9× faster |
| Time window (10–20 s) | 0.009 s | 0.044 s | 5.1× faster |
| Single channel sum | 0.008 s | 0.020 s | 2.6× faster |
| Four channel sum | 0.009 s | 0.084 s | 9.1× faster |
| Per-second average | 0.010 s | 0.053 s | 5.5× faster |
Low-entropy data, warm runs
| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
|---|---|---|---|
| Row count | 0.0002 s | 0.009 s | 48× faster |
| Time window (10–20 s) | 0.002 s | 0.033 s | 16× faster |
| Single channel sum | 0.001 s | 0.009 s | 7.8× faster |
| Four channel sum | 0.002 s | 0.068 s | 27× faster |
| Per-second average | 0.004 s | 0.039 s | 10.8× faster |
High-entropy data, cold runs
| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
|---|---|---|---|
| Row count | 0.007 s | 0.022 s | 3.3× faster |
| Time window (10–20 s) | 0.009 s | 0.046 s | 5.2× faster |
| Single channel sum | 0.008 s | 0.020 s | 2.4× faster |
| Four channel sum | 0.012 s | 0.104 s | 8.9× faster |
| Per-second average | 0.011 s | 0.049 s | 4.5× faster |
High-entropy data, warm runs
| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
|---|---|---|---|
| Row count | 0.0003 s | 0.010 s | 31× faster |
| Time window (10–20 s) | 0.002 s | 0.033 s | 13× faster |
| Single channel sum | 0.002 s | 0.009 s | 4.4× faster |
| Four channel sum | 0.006 s | 0.086 s | 14× faster |
| Per-second average | 0.004 s | 0.038 s | 9× faster |
Notice that DuckDB's lead widens on multi-channel queries: in every table above, the four-channel sum shows a several-times larger speedup than the single-channel sum. Because Parquet stores each channel as a separate column on disk, DuckDB reads only the columns a query requests. With MF4, asammdf has to navigate the file's internal structure channel by channel.
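A back-of-the-envelope calculation shows how much data column pruning can skip. The sketch below assumes every value is stored as an uncompressed 8-byte float, which the study does not state, and it says nothing about how asammdf actually reads MF4; it only counts the bytes a columnar reader needs for a given number of columns.

```python
ROWS = 504_123       # samples in the study's dataset
COLUMNS = 13         # 12 sensor channels + time
BYTES_PER_VALUE = 8  # assumption: float64, uncompressed

def bytes_scanned(columns_read: int) -> int:
    """Bytes a reader must touch if it loads only `columns_read` columns."""
    return ROWS * columns_read * BYTES_PER_VALUE

full = bytes_scanned(COLUMNS)
for name, needed in [("single channel sum", 1), ("four channel sum", 4)]:
    pruned = bytes_scanned(needed)
    print(f"{name}: {pruned / 1e6:.1f} MB of {full / 1e6:.1f} MB ({pruned / full:.0%})")
```

Under these assumptions the full table comes to 52,428,792 bytes, which is almost exactly 50.0 MiB and so in line with the uncompressed file size quoted earlier. A four-channel query needs only 4 of the 13 columns, and compression shrinks the bytes actually read from Parquet further still.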
The bottom line
Legacy formats are a tax on your engineering velocity
For organizations building the next generation of physical systems, sticking to legacy log formats means accepting bloated storage costs and sluggish engineering workflows. The numbers in this study point one way: converting realistic sensor logs to Parquet cut a 50.0 MB file to 1.8 MB, and DuckDB answered every benchmark query between 2.4× and 48× faster than asammdf on MF4.
Embracing these modern formats is a necessary shift away from outdated paradigms — laying the groundwork for true Deep Tech Data Infrastructure that scales alongside your hardware's ambition.
In this series
More in this series
This benchmark is the first in a series unpacking the infrastructure layer that modern hardware engineering deserves.