Engineering · Hardware Data

Why Parquet is the right format to store hardware data

Legacy formats choke hardware engineering. Discover how modern data formats solve storage and latency with zero data loss.

96.4% storage reduction on realistic sensor data
48× faster row count queries in a warm session
27× faster multi-channel analytics

Hardware data has two competing demands — and most formats fail both

Organizations dealing with physical hardware data constantly wrestle with two competing challenges: storage and latency. As physical systems grow more complex, the number of sensors — and their polling frequencies — will only increase. This creates a compounding effect: more data streams per unit of time inevitably leads to exploding log file sizes.

For hardware engineering and compliance, this data is a core asset. With costly hardware and expensive test environments, organizations cannot afford to waste the telemetry generated during each run. It must be collected and stored efficiently: storage bloat kept to a minimum, and query latency low enough that engineers can work with the data interactively.

Dataset used in this study: 504,123 samples (~504 seconds at 1,000 Hz), 12 sensor channels + time. Source data generated via asammdf in MF4 format, then converted to Parquet. Two variants tested: realistic low-entropy drive-cycle data and random high-entropy noise representing worst-case conditions.
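As a rough sizing check on this dataset, the uncompressed payload can be estimated from the sample and column counts. The sketch below assumes every value is stored as an 8-byte float64, which the study does not state; `raw_size_bytes` is a hypothetical helper, not part of asammdf or DuckDB.

```python
def raw_size_bytes(samples: int, columns: int, bytes_per_value: int = 8) -> int:
    """Uncompressed payload size, ignoring headers and metadata."""
    return samples * columns * bytes_per_value


# Figures from the study: 504,123 samples, 12 sensor channels + time.
SAMPLES = 504_123
COLUMNS = 12 + 1

# ASSUMPTION: 8-byte float64 values; the source does not state the dtype.
size = raw_size_bytes(SAMPLES, COLUMNS)
print(f"{size:,} bytes = {size / 2**20:.1f} MiB")  # 52,428,792 bytes = 50.0 MiB
```

Under that assumption the estimate comes to about 50 MiB, in line with the 50 MB MF4 file size reported in the storage comparison.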


A fragmented world of proprietary formats — optimized for writing, not reading

Historically, different hardware domains have siloed themselves into specialized, proprietary log formats optimized primarily for embedded writing rather than analytical reading. These formats create severe bottlenecks when engineers attempt to analyze thousands of test runs.

Industry | Common Log Formats
Aerospace & Automotive | MF4, TDMS, CCSDS
Robotics & UAVs | PX4 logs, DataFlash, MCAP, ROS bags, ULog
Test Stands / Hardware Labs | MF4, MAT, TDMS, CSV

To demonstrate why formats like MF4 create analytical bottlenecks, we conducted a rigorous study comparing MF4 + asammdf against Apache Parquet paired with the DuckDB analytical engine. Identical query workloads were run on both low-entropy and high-entropy data variants.


The suitcase analogy — how formats pack your data

When it comes to file size, the format you choose dictates how efficiently you can pack your data. Think of it like packing clothes for a trip. Legacy formats like MF4 place every single item in its own rigid box, leaving you with a massive suitcase. Parquet rolls and vacuum-packs similar items together, vastly reducing the footprint.

Dataset | MF4 | Parquet | Size reduction | Notes
Realistic Sensor Data · Low Entropy | 50 MB | 1.8 MB | 96.4% smaller | 27× compression, lossless
Random Noise · High Entropy | 50 MB | 44.4 MB | 11.3% smaller | Similar size; entropy limits compression

The takeaway: For realistic sensor data, Parquet's built-in compression shrinks a 50.0 MB file to just 1.8 MB. Crucially, this is a lossless conversion — every single original value is preserved exactly as recorded. For standard deep tech applications, these storage savings compound dramatically over time across hundreds of test runs.
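The gap between the two variants comes down to entropy, and a small experiment can illustrate it. The sketch below uses Python's standard-library `zlib` as a stand-in for Parquet's encodings and codecs (an assumption; it is not what the study ran), and both signals are made-up shapes rather than the study's data.

```python
import math
import random
import zlib
from array import array


def compressed_fraction(values: list[float]) -> float:
    """Compressed size divided by raw size for float64-encoded values."""
    raw = array("d", values).tobytes()
    return len(zlib.compress(raw)) / len(raw)


N = 50_000

# Low entropy: a slowly varying signal quantized to whole units,
# loosely imitating an engine-speed channel (hypothetical shape).
smooth = [float(round(2000 + 500 * math.sin(i / 5000))) for i in range(N)]

# High entropy: uniform random values, the worst case for a compressor.
rng = random.Random(0)
noise = [rng.random() for _ in range(N)]

print(f"smooth signal: {compressed_fraction(smooth):.1%} of raw size")
print(f"random noise:  {compressed_fraction(noise):.1%} of raw size")
```

The smooth signal should shrink to a small fraction of its raw size while the noise barely compresses, mirroring the pattern in the comparison above. The exact percentages will differ from Parquet's.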


Five queries that mirror the daily reality of a hardware engineer

Saving storage space is necessary, but it doesn't matter if engineers have to wait minutes for a dashboard to load. We designed five standard queries to reflect real analytical workflows on hardware telemetry.

01. Row Count: "How many total sensor samples are recorded in this file?"
02. Time Window (10–20 s): "What was the average, minimum, and maximum engine speed between the 10-second and 20-second marks?"
03. Single Channel Sum: "Add up all the engine speed readings across the entire recording."
04. Four Channel Sum: "Read and process engine speed, vehicle speed, coolant temperature, and oil pressure simultaneously."
05. Per-Second Average: "For every whole second of the drive, what was the average engine speed?"

Tests ran on macOS 14.6, Apple Silicon (8 CPU cores), Python 3.13, asammdf 8.8.13, DuckDB 1.5.3. DuckDB used 4 threads (vectorized multithreading). 3 cold repetitions + 5 warm repetitions per query — medians reported.
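For concreteness, the five queries might be written as SQL along the following lines. This is a sketch, not the study's actual code: the column names (`timestamps`, `engine_speed`, and so on), the half-open 10 to 20 s window, and the `run_queries` helper are all assumptions made for illustration.

```python
# ASSUMPTION: column names are hypothetical; the study does not list
# its channel names or its exact SQL.
QUERIES = {
    "row_count": "SELECT COUNT(*) FROM {src}",
    "time_window": (
        "SELECT AVG(engine_speed), MIN(engine_speed), MAX(engine_speed) "
        "FROM {src} WHERE timestamps >= 10 AND timestamps < 20"
    ),
    "single_channel_sum": "SELECT SUM(engine_speed) FROM {src}",
    "four_channel_sum": (
        "SELECT SUM(engine_speed), SUM(vehicle_speed), "
        "SUM(coolant_temp), SUM(oil_pressure) FROM {src}"
    ),
    "per_second_average": (
        "SELECT CAST(FLOOR(timestamps) AS BIGINT) AS whole_second, "
        "AVG(engine_speed) FROM {src} "
        "GROUP BY whole_second ORDER BY whole_second"
    ),
}


def run_queries(con, src: str) -> dict:
    """Run each query on a DB-API style connection and collect the rows.

    With DuckDB, `con` would come from `duckdb.connect()` and `src` would be
    something like "read_parquet('run.parquet')" (hypothetical file name).
    """
    return {
        name: con.execute(sql.format(src=src)).fetchall()
        for name, sql in QUERIES.items()
    }
```

Keeping the data source as a placeholder lets the same query text run against a Parquet file or an in-memory table.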

Mode A, Cold Start (fresh process, cleared cache): A new Python process for every repetition, with the memory cache purged. Reflects the "I double-click a file and ask one question" scenario.
Mode B, Warm Session (file and engine stay open): One process, one untimed practice query, then 5 timed repetitions. Reflects an analyst running many queries in one active session.
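The two modes can be approximated with a small timing harness. The sketch below is not the study's harness: `time_warm` and `time_cold` are hypothetical names, the cold variant times the script from inside a fresh interpreter, and it does not purge the operating system's file cache, which is platform specific.

```python
import statistics
import subprocess
import sys
import time
from typing import Callable


def time_warm(query: Callable[[], object], warmup: int = 1, reps: int = 5) -> float:
    """Median wall time in seconds over `reps` runs inside one live process."""
    for _ in range(warmup):
        query()  # untimed practice run
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def time_cold(script: str, reps: int = 3) -> float:
    """Median wall time in seconds, one fresh interpreter per repetition.

    `script` must be top-level Python statements. The child process times
    itself and prints the elapsed seconds as its last line of output.
    NOTE: the OS file cache is NOT purged here.
    """
    wrapper = (
        "import time\n"
        "_start = time.perf_counter()\n"
        f"{script}\n"
        "print(time.perf_counter() - _start)\n"
    )
    samples = []
    for _ in range(reps):
        result = subprocess.run(
            [sys.executable, "-c", wrapper],
            check=True, capture_output=True, text=True,
        )
        samples.append(float(result.stdout.strip().splitlines()[-1]))
    return statistics.median(samples)


print(f"warm median: {time_warm(lambda: sum(range(100_000))):.6f} s")
```

Reporting the median rather than the mean keeps a single slow repetition from skewing the result.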

In every scenario tested, DuckDB on Parquet wins, and it isn't close

We tested speed on both low-entropy and high-entropy data in both test modes. Times below are medians across repetitions. DuckDB on Parquet was faster in all 20 measurements, by between 2.4× (single channel sum, high entropy, cold start) and 48× (row count, low entropy, warm session).

Low-Entropy · Cold Start: Query Speed (seconds, lower is better)
Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
Row count | 0.006 s | 0.019 s | 2.9× faster
Time window (10–20 s) | 0.009 s | 0.044 s | 5.1× faster
Single channel sum | 0.008 s | 0.020 s | 2.6× faster
Four channel sum | 0.009 s | 0.084 s | 9.1× faster
Per-second average | 0.010 s | 0.053 s | 5.5× faster

Low-Entropy · Warm Session: Query Speed (seconds, lower is better)
Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
Row count | 0.0002 s | 0.009 s | 48× faster
Time window (10–20 s) | 0.002 s | 0.033 s | 16× faster
Single channel sum | 0.001 s | 0.009 s | 7.8× faster
Four channel sum | 0.002 s | 0.068 s | 27× faster
Per-second average | 0.004 s | 0.039 s | 10.8× faster

High-Entropy · Cold Start: Query Speed (seconds, lower is better)
Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
Row count | 0.007 s | 0.022 s | 3.3× faster
Time window (10–20 s) | 0.009 s | 0.046 s | 5.2× faster
Single channel sum | 0.008 s | 0.020 s | 2.4× faster
Four channel sum | 0.012 s | 0.104 s | 8.9× faster
Per-second average | 0.011 s | 0.049 s | 4.5× faster

High-Entropy · Warm Session: Query Speed (seconds, lower is better)
Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
Row count | 0.0003 s | 0.010 s | 31× faster
Time window (10–20 s) | 0.002 s | 0.033 s | 13× faster
Single channel sum | 0.002 s | 0.009 s | 4.4× faster
Four channel sum | 0.006 s | 0.086 s | 14× faster
Per-second average | 0.004 s | 0.038 s | 9× faster

48×: row count, warm, low entropy
27×: 4-channel sum, warm, low entropy
9.1×: 4-channel sum, cold, low entropy
16×: time window, warm, low entropy

Notice how the gap widens on multi-channel queries. In the warm low-entropy runs, going from one channel to four takes asammdf from 0.009 s to 0.068 s, while DuckDB goes from 0.001 s to 0.002 s. Because Parquet stores each channel as a separate column on disk, DuckDB reads only the columns a query requests. MF4 forces the reader to navigate its internal structure channel by channel.
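The effect of column-wise storage can be illustrated with a toy comparison in plain Python. This is a generic sketch of record-oriented versus column-oriented layouts, with made-up channel names and values; it does not reproduce the actual on-disk structure of MF4 or Parquet.

```python
import struct
from array import array

CHANNELS = ("timestamps", "engine_speed", "vehicle_speed", "coolant_temp")

# Tiny synthetic recording: 1,000 samples of 4 float64 channels (made-up values).
records = [(i * 0.001, 1000.0 + i, 50.0, 90.0) for i in range(1000)]

# Record-oriented layout: all channels of one sample stored back to back.
row_blob = b"".join(struct.pack("<4d", *rec) for rec in records)

# Column-oriented layout: each channel stored contiguously on its own.
columns = {name: array("d", vals) for name, vals in zip(CHANNELS, zip(*records))}


def sum_from_records(blob: bytes, channel: str) -> float:
    """Sum one channel by stepping through every full record."""
    idx = CHANNELS.index(channel)
    return sum(rec[idx] for rec in struct.iter_unpack("<4d", blob))


def sum_from_columns(cols: dict, channel: str) -> float:
    """Sum one channel by reading only that channel's array."""
    return sum(cols[channel])


col = columns["engine_speed"]
print("record layout scans:", len(row_blob), "bytes")
print("column layout scans:", len(col) * col.itemsize, "bytes")
assert sum_from_records(row_blob, "engine_speed") == sum_from_columns(columns, "engine_speed")
```

Both functions return the same sum, but the column layout touches a quarter of the bytes here. With 13 columns, as in the study's dataset, a single-channel query would touch roughly one thirteenth of the uncompressed data.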

Legacy formats are a tax on your engineering velocity

For organizations building the next generation of physical systems, sticking to legacy log formats means accepting bloated storage costs and sluggish engineering workflows. The numbers are unambiguous: in this benchmark, converting realistic sensor logs to Parquet and querying them with an analytical engine like DuckDB produced files 96.4% smaller and queries 2.6× to 48× faster.

Embracing these modern formats is a necessary shift away from outdated paradigms — laying the groundwork for true Deep Tech Data Infrastructure that scales alongside your hardware's ambition.

More in this series

This benchmark is the first in a series unpacking the infrastructure layer that modern hardware engineering deserves.

Part 2 (upcoming): Cost Impact of Parquet with DuckDB
Part 3 (upcoming): How do Parquet and DuckDB actually work?