Why Parquet is the right format to store hardware data

Legacy formats choke hardware engineering. Discover how modern data formats solve storage and latency with zero data loss.

96.4% storage reduction on realistic sensor data

48× faster row count queries in a warm session

27× faster multi-channel analytics

The problem

Hardware data has two competing demands — and most formats fail both

Organizations dealing with physical hardware data constantly wrestle with two competing challenges: storage and latency. As physical systems grow more complex, the number of sensors and their polling frequencies will only increase. The effect compounds: more data streams per unit of time inevitably lead to exploding log file sizes.

For hardware engineering and compliance, this data is the ultimate asset. When dealing with costly hardware and expensive testing environments, organizations cannot afford to waste the telemetry generated during every run. This data must be collected and stored efficiently, minimizing storage bloat while keeping query latency low enough that engineers can work with it interactively.

Dataset used in this study: 504,123 samples (~504 seconds at 1,000 Hz), 12 sensor channels + time. Source data generated via asammdf in MF4 format, then converted to Parquet. Two variants tested: realistic low-entropy drive-cycle data and random high-entropy noise representing worst-case conditions.
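
As a quick sanity check on these figures, the sketch below works out the recording duration and the raw payload size. It assumes every value is stored as an 8-byte float, which the study does not state. Under that assumption the raw payload comes to almost exactly 50 MiB, which is in line with the MF4 file size reported in the storage section.

```python
# Back-of-the-envelope size of the benchmark dataset.
SAMPLES = 504_123        # from the study
SAMPLE_RATE_HZ = 1_000   # from the study
COLUMNS = 12 + 1         # 12 sensor channels + time
BYTES_PER_VALUE = 8      # ASSUMPTION: float64 per value (not stated in the study)

duration_s = SAMPLES / SAMPLE_RATE_HZ
raw_bytes = SAMPLES * COLUMNS * BYTES_PER_VALUE
raw_mib = raw_bytes / 2**20

print(f"duration:    {duration_s:.3f} s")
print(f"raw payload: {raw_bytes:,} bytes ({raw_mib:.2f} MiB)")
```

File metadata and container overhead are ignored here, so this is only an estimate of the uncompressed sample data.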

The current data landscape

A fragmented world of proprietary formats — optimized for writing, not reading

Historically, different hardware domains have siloed themselves into specialized, proprietary log formats optimized primarily for embedded writing rather than analytical reading. These formats create severe bottlenecks when engineers attempt to analyze thousands of test runs.

Industry	Common Log Formats
Aerospace & Automotive	MF4, TDMS, CCSDS
Robotics & UAVs	PX4 logs, DataFlash, MCAP, ROS bags, ULog
Test Stands / Hardware Labs	MF4, MAT, TDMS, CSV

To demonstrate why formats like MF4 create analytical bottlenecks, we conducted a rigorous study comparing MF4 + asammdf against Apache Parquet paired with the DuckDB analytical engine. Identical query workloads were run on both low-entropy and high-entropy data variants.

Storage efficiency

The suitcase analogy — how formats pack your data

When it comes to file size, the format you choose dictates how efficiently you can pack your data. Think of it like packing clothes for a trip. Legacy formats like MF4 place every single item in its own rigid box, leaving you with a massive suitcase. Parquet rolls and vacuum-packs similar items together, vastly reducing the footprint.
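
To make the "vacuum-packing" idea concrete, here is a toy run-length encoder applied to a single channel. It is only an illustration of why grouping similar values helps: Parquet's encodings include dictionary and run-length techniques, but its real implementation is more elaborate and is not reproduced here. The channel values are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in pairs for _ in range(n)]


# Hypothetical coolant-temperature channel: the reading changes slowly,
# so a 1,000 Hz log repeats the same value many times in a row.
coolant_c = [88] * 400 + [89] * 350 + [90] * 250

encoded = rle_encode(coolant_c)
print(encoded)                           # [(88, 400), (89, 350), (90, 250)]
print(rle_decode(encoded) == coolant_c)  # True: the round trip is lossless
```

One thousand samples collapse into three pairs, and decoding restores every original value, which is the sense in which this kind of packing is lossless.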

Dataset	MF4	Parquet	Result
Realistic sensor data · low entropy	50 MB	1.8 MB	96.4% smaller (27× compression, lossless)
Random noise · high entropy	50 MB	44.4 MB	11.3% smaller (similar size; entropy limits compression)

The takeaway: For realistic sensor data, Parquet's built-in compression shrinks a 50.0 MB file to just 1.8 MB. Crucially, this is a lossless conversion: every original value is preserved exactly as recorded. Random noise, by contrast, shrinks by only 11.3%, so the savings depend on how compressible the signals are. For typical sensor logs, these storage savings compound over time across hundreds of test runs.
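
The gap between the two datasets can be reproduced in miniature. The sketch below compresses a slowly varying, quantised signal and a random one with `zlib` from the Python standard library. `zlib` is a stand-in chosen so the example runs anywhere; it is not the codec or the method used in the study, and the signals are synthetic.

```python
import math
import random
import struct
import zlib

N = 50_000  # samples per channel; kept small so the sketch runs quickly

# Low entropy: a slowly varying engine speed, quantised to whole rpm.
low = [float(round(1800 + 1000 * math.sin(2 * math.pi * i / 60_000)))
       for i in range(N)]

# High entropy: uniformly random values, the worst case for a compressor.
rng = random.Random(0)
high = [rng.random() for _ in range(N)]


def compressed_fraction(values):
    """Return compressed size / raw size for a column of float64 values."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return len(zlib.compress(raw, 6)) / len(raw)


low_fraction = compressed_fraction(low)
high_fraction = compressed_fraction(high)
print(f"low entropy:  {low_fraction:.1%} of raw size")
print(f"high entropy: {high_fraction:.1%} of raw size")
```

The exact percentages will differ from the study's figures, but the pattern should match: repetitive sensor-like data shrinks to a small fraction of its raw size, while random data barely shrinks at all.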

Query design

Five queries that mirror the daily reality of a hardware engineer

Saving storage space is necessary, but it doesn't matter if engineers have to wait minutes for a dashboard to load. We designed five standard queries to reflect real analytical workflows on hardware telemetry.

Row Count

"How many total sensor samples are recorded in this file?"

Time Window (10–20 s)

"What was the average, minimum, and maximum engine speed between the 10-second and 20-second marks?"

Single Channel Sum

"Add up all the engine speed readings across the entire recording."

Four Channel Sum

"Read and process engine speed, vehicle speed, coolant temperature, and oil pressure simultaneously."

Per-Second Average

"For every whole second of the drive, what was the average engine speed?"

Tests ran on macOS 14.6, Apple Silicon (8 CPU cores), Python 3.13, asammdf 8.8.13, DuckDB 1.5.3. DuckDB used 4 threads (vectorized multithreading). 3 cold repetitions + 5 warm repetitions per query — medians reported.
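
For readers who want to see the shape of the five queries described above, here is a minimal SQL sketch. It runs against an in-memory `sqlite3` table so that it needs only the Python standard library; `sqlite3` is a stand-in, not the engine used in the study. The table name, column names, and data are hypothetical. In DuckDB, the `FROM` clause would typically point at the Parquet file instead, for example through its `read_parquet` function. Whether the study's time window includes both end points is not stated, so the closed interval here is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE telemetry ("
    "time_s REAL, engine_speed REAL, vehicle_speed REAL, "
    "coolant_temp REAL, oil_pressure REAL)"
)
# 30 s of synthetic data at 10 Hz (hypothetical values, not the study's dataset).
rows = [(i / 10, 1000.0 + i, 0.1 * i, 90.0, 3.5) for i in range(300)]
con.executemany("INSERT INTO telemetry VALUES (?, ?, ?, ?, ?)", rows)

QUERIES = {
    "row_count": "SELECT COUNT(*) FROM telemetry",
    # ASSUMPTION: both end points of the window are included.
    "time_window": (
        "SELECT AVG(engine_speed), MIN(engine_speed), MAX(engine_speed) "
        "FROM telemetry WHERE time_s BETWEEN 10 AND 20"
    ),
    "single_channel_sum": "SELECT SUM(engine_speed) FROM telemetry",
    "four_channel_sum": (
        "SELECT SUM(engine_speed), SUM(vehicle_speed), "
        "SUM(coolant_temp), SUM(oil_pressure) FROM telemetry"
    ),
    # SQLite truncates on this cast; other engines may round instead,
    # so an explicit floor function is the safer choice when porting.
    "per_second_average": (
        "SELECT CAST(time_s AS INTEGER) AS sec, AVG(engine_speed) "
        "FROM telemetry GROUP BY sec ORDER BY sec"
    ),
}

results = {name: con.execute(sql).fetchall() for name, sql in QUERIES.items()}
print(results["row_count"])    # [(300,)]
print(results["time_window"])  # [(1150.0, 1100.0, 1200.0)]
```

Note that the first four queries touch at most four of the channels, which is why a columnar file that can skip the unused columns has an advantage.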

Mode A — Cold Start

Fresh process, cleared cache

New Python process every repetition, memory cache purged. Reflects the "I double-click a file and ask one question" scenario.

Mode B — Warm Session

File and engine stay open

One process, one practice query (untimed), then 5 timed repetitions. Reflects an analyst running many queries in one active session.
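
The two modes can be approximated with a small timing harness like the one below. This is a sketch of the protocol as described, not the study's actual benchmark code. The function names and the placeholder workload are hypothetical, and the operating-system cache purge used in cold mode is platform-specific and left out.

```python
import statistics
import subprocess
import sys
import time


def warm_median(query, timed_reps=5):
    """Warm mode: one untimed practice call, then the median of timed calls."""
    query()  # practice run, not timed
    samples = []
    for _ in range(timed_reps):
        start = time.perf_counter()
        query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def cold_median(script, reps=3):
    """Cold mode: run `script` in a brand-new Python process per repetition.

    The OS file-cache purge is deliberately omitted. Timing from the parent
    also includes interpreter start-up, which the study may have excluded.
    """
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", script], check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def workload():
    """Placeholder standing in for a real query."""
    return sum(range(100_000))


warm = warm_median(workload)
cold = cold_median("sum(range(100_000))")
print(f"warm median: {warm:.6f} s, cold median: {cold:.6f} s")
```

Reporting the median rather than the mean keeps a single slow repetition, for example one interrupted by background activity, from skewing the result.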

Benchmark results

In every scenario tested, DuckDB on Parquet wins, and it isn't close

We tested speed on both low-entropy and high-entropy data in both test modes. Times below are medians across repetitions. DuckDB on Parquet was faster in all 20 combinations of query, dataset, and mode, with speedups ranging from 2.4× to 48×.

Low-Entropy · Cold Start — Query Speed (seconds, lower is better)

Query	DuckDB (Parquet)	asammdf (MF4)	Speedup
Row count	0.006 s	0.019 s	2.9× faster
Time window (10–20 s)	0.009 s	0.044 s	5.1× faster
Single channel sum	0.008 s	0.020 s	2.6× faster
Four channel sum	0.009 s	0.084 s	9.1× faster
Per-second average	0.010 s	0.053 s	5.5× faster

Low-Entropy · Warm Session — Query Speed (seconds, lower is better)

Query	DuckDB (Parquet)	asammdf (MF4)	Speedup
Row count	0.0002 s	0.009 s	48× faster
Time window (10–20 s)	0.002 s	0.033 s	16× faster
Single channel sum	0.001 s	0.009 s	7.8× faster
Four channel sum	0.002 s	0.068 s	27× faster
Per-second average	0.004 s	0.039 s	10.8× faster

High-Entropy · Cold Start — Query Speed (seconds, lower is better)

Query	DuckDB (Parquet)	asammdf (MF4)	Speedup
Row count	0.007 s	0.022 s	3.3× faster
Time window (10–20 s)	0.009 s	0.046 s	5.2× faster
Single channel sum	0.008 s	0.020 s	2.4× faster
Four channel sum	0.012 s	0.104 s	8.9× faster
Per-second average	0.011 s	0.049 s	4.5× faster

High-Entropy · Warm Session — Query Speed (seconds, lower is better)

Query	DuckDB (Parquet)	asammdf (MF4)	Speedup
Row count	0.0003 s	0.010 s	31× faster
Time window (10–20 s)	0.002 s	0.033 s	13× faster
Single channel sum	0.002 s	0.009 s	4.4× faster
Four channel sum	0.006 s	0.086 s	14× faster
Per-second average	0.004 s	0.038 s	9× faster

48× · Row count, warm, low entropy

27× · 4-channel sum, warm, low entropy

9.1× · 4-channel sum, cold, low entropy

16× · Time window, warm, low entropy

Notice how DuckDB's advantage grows on multi-channel queries: in a warm session on low-entropy data, the four-channel sum is 27× faster, compared with 7.8× for a single channel. Because Parquet stores each channel as a separate column on disk, DuckDB reads only the columns a query requests. MF4 forces the system to navigate its internal structure channel by channel.
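
The effect of reading only the requested columns can be shown with a toy comparison. The sketch below stores the same synthetic data in two layouts and counts how many bytes must be scanned to sum one channel. The row-oriented layout is a hypothetical stand-in for "everything interleaved"; it is not a model of MF4 internals or of how asammdf reads files.

```python
import struct
from array import array

N_SAMPLES, N_CHANNELS = 1_000, 13  # toy dataset: time + 12 channels

# Synthetic values: channel c at sample i holds i + c.
table = [[float(i + c) for c in range(N_CHANNELS)] for i in range(N_SAMPLES)]

# Row layout: one record per sample, all channels interleaved.
record = struct.Struct(f"<{N_CHANNELS}d")
row_blob = b"".join(record.pack(*row) for row in table)

# Column layout: one contiguous block per channel.
col_blobs = [array("d", (row[c] for row in table)).tobytes()
             for c in range(N_CHANNELS)]


def sum_from_rows(channel):
    """Sum one channel by walking every record; returns (total, bytes scanned)."""
    total = sum(record.unpack_from(row_blob, off)[channel]
                for off in range(0, len(row_blob), record.size))
    return total, len(row_blob)


def sum_from_columns(channel):
    """Sum one channel by reading only that channel's block."""
    values = array("d")
    values.frombytes(col_blobs[channel])
    return sum(values), len(col_blobs[channel])


print(sum_from_rows(1))     # (500500.0, 104000)
print(sum_from_columns(1))  # (500500.0, 8000)
```

Both layouts give the same answer, but the columnar one scans 13 times fewer bytes for a single channel. Real engines add compression, vectorised execution, and multithreading on top, so this ratio is not a prediction of the benchmark numbers.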

The bottom line

Legacy formats are a tax on your engineering velocity

For organizations building the next generation of physical systems, sticking to legacy log formats means accepting bloated storage costs and sluggish engineering workflows. The numbers are unambiguous: for typical vehicle or hardware sensor logs, converting to Parquet and querying with an analytical engine like DuckDB produced files up to 96.4% smaller and queries 2.4× to 48× faster in this benchmark.

Embracing these modern formats is a necessary shift away from outdated paradigms — laying the groundwork for true Deep Tech Data Infrastructure that scales alongside your hardware's ambition.

In this series

More in this series

This benchmark is the first in a series unpacking the infrastructure layer that modern hardware engineering deserves.