# Why Parquet is the right format to store hardware data

Category: Engineering
Published: May 2026
Read time: 12 min read
URL: /resources/parquet-hardware-data

Legacy formats choke hardware engineering. Discover how modern data formats solve storage and latency with zero data loss.

---
**96.4%** storage reduction on realistic sensor data

**48×** faster row count queries in a warm session

**27×** faster multi-channel analytics

 The problem

## Hardware data has two competing demands — and most formats fail both

 Organizations dealing with physical hardware data constantly wrestle with two competing challenges: **storage** and **latency**. As physical systems grow more complex, the number of sensors — and their polling frequencies — will only increase. This creates a compounding effect: more data streams per unit of time inevitably leads to exploding log file sizes.
For hardware engineering and compliance, this data is a core asset. Hardware and test environments are expensive, so organizations cannot afford to waste the telemetry generated during each run. The data must be collected and stored efficiently, minimizing storage bloat while keeping query latency low enough for engineers to work with it interactively.
 **Dataset used in this study:** 504,123 samples (~504 seconds at 1,000 Hz), 12 sensor channels + time. Source data generated via asammdf in MF4 format, then converted to Parquet. Two variants tested: *realistic low-entropy* drive-cycle data and *random high-entropy* noise representing worst-case conditions.

 The current data landscape

## A fragmented world of proprietary formats — optimized for writing, not reading

 Historically, different hardware domains have siloed themselves into specialized, proprietary log formats optimized primarily for embedded writing rather than analytical reading. These formats create severe bottlenecks when engineers attempt to analyze thousands of test runs.

Industry | Common Log Formats
--- | ---
Aerospace & Automotive | MF4, TDMS, CCSDS
Robotics & UAVs | PX4 logs, DataFlash, MCAP, ROS bags, ULog
Test Stands / Hardware Labs | MF4, MAT, TDMS, CSV

 To demonstrate why formats like MF4 create analytical bottlenecks, we conducted a rigorous study comparing MF4 + asammdf against Apache Parquet paired with the DuckDB analytical engine. Identical query workloads were run on both low-entropy and high-entropy data variants.

 Storage efficiency

## The suitcase analogy — how formats pack your data

 When it comes to file size, the format you choose dictates how efficiently you can pack your data. Think of it like packing clothes for a trip. Legacy formats like MF4 place every single item in its own rigid box, leaving you with a massive suitcase. Parquet rolls and vacuum-packs similar items together, vastly reducing the footprint.
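One way to picture the "vacuum-packing": a columnar file keeps each channel's values next to each other, so long runs of identical readings can be collapsed. Run-length encoding is one of the encodings Parquet can apply. The sketch below is a simplified pure-Python illustration of that idea, not Parquet's implementation, and the `coolant_temp` values are made up.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical coolant-temperature channel: it changes slowly, so values repeat.
coolant_temp = [88] * 400 + [89] * 350 + [90] * 250

encoded = rle_encode(coolant_temp)
print(len(coolant_temp), "values ->", len(encoded), "runs")  # 1000 values -> 3 runs

# Decoding restores every original value, so nothing was lost.
assert rle_decode(encoded) == coolant_temp
```

A row-oriented log interleaves channels, which breaks up these runs; grouping by column is what makes them visible to the encoder.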
Dataset | MF4 | Parquet | Reduction | Note
--- | --- | --- | --- | ---
Realistic sensor data (low entropy) | 50 MB | 1.8 MB | 96.4% smaller | 27× compression, lossless
Random noise (high entropy) | 50 MB | 44.4 MB | 11.3% smaller | Similar size; entropy limits compression

 **The takeaway:** For realistic sensor data, Parquet's built-in compression shrinks a 50.0 MB file to just 1.8 MB. Crucially, **this is a lossless conversion** — every single original value is preserved exactly as recorded. For standard deep tech applications, these storage savings compound dramatically over time across hundreds of test runs.
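The two claims here, that compression is lossless and that entropy limits how far it can go, can be checked on a small scale with the standard library alone. The sketch below uses `zlib` as a stand-in for Parquet's codecs (an assumption, not the study's setup) and two made-up signals: a slowly varying quantized channel and uniform random noise.

```python
import math
import random
import struct
import zlib

N = 50_000  # far fewer samples than the study, to keep the sketch quick


def pack(values):
    """Serialize floats as little-endian 64-bit doubles, one column's worth."""
    return struct.pack(f"<{len(values)}d", *values)


# Low entropy: a slowly varying signal quantized to whole units (hypothetical).
low = [float(round(800 + 50 * math.sin(i / 1000))) for i in range(N)]

# High entropy: uniform random noise, the worst case for any compressor.
rng = random.Random(0)
high = [rng.random() for _ in range(N)]


def ratio_and_roundtrip(values):
    """Compress one column, verify the round trip, return raw/compressed size."""
    raw = pack(values)
    compressed = zlib.compress(raw, 6)
    # Lossless means decompression returns the exact original bytes.
    assert zlib.decompress(compressed) == raw
    return len(raw) / len(compressed)


low_ratio = ratio_and_roundtrip(low)
high_ratio = ratio_and_roundtrip(high)
print(f"low entropy: {low_ratio:.1f}x, high entropy: {high_ratio:.2f}x")
```

The exact ratios depend on the codec and the signal, so they will not match the 50 MB to 1.8 MB figure; the point is the gap between the two cases and the byte-for-byte round trip.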

 Query design

## Five queries that mirror the daily reality of a hardware engineer

 Saving storage space is necessary, but it doesn't matter if engineers have to wait minutes for a dashboard to load. We designed five standard queries to reflect real analytical workflows on hardware telemetry.
1. **Row Count:** "How many total sensor samples are recorded in this file?"
2. **Time Window (10–20 s):** "What was the average, minimum, and maximum engine speed between the 10-second and 20-second marks?"
3. **Single Channel Sum:** "Add up all the engine speed readings across the entire recording."
4. **Four Channel Sum:** "Read and process engine speed, vehicle speed, coolant temperature, and oil pressure simultaneously."
5. **Per-Second Average:** "For every whole second of the drive, what was the average engine speed?"

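The five questions above map naturally onto SQL. The sketch below is one possible phrasing, not the study's actual queries. The column names (`time_s`, `engine_speed`, and so on), the half-open 10 to 20 s window, and the sample data are all assumptions. To stay runnable with only the standard library it executes against an in-memory SQLite table; in DuckDB the same statements could target a Parquet file by replacing `telemetry` with `read_parquet('run.parquet')` (a hypothetical file name).

```python
import math
import sqlite3

# Hypothetical column names; the study does not list its channel names.
QUERIES = {
    "row_count": "SELECT COUNT(*) FROM telemetry",
    "time_window": (
        "SELECT AVG(engine_speed), MIN(engine_speed), MAX(engine_speed) "
        "FROM telemetry WHERE time_s >= 10 AND time_s < 20"
    ),
    "single_channel_sum": "SELECT SUM(engine_speed) FROM telemetry",
    "four_channel_sum": (
        "SELECT SUM(engine_speed), SUM(vehicle_speed), "
        "SUM(coolant_temp), SUM(oil_pressure) FROM telemetry"
    ),
    "per_second_avg": (
        "SELECT FLOOR(time_s) AS second_bucket, AVG(engine_speed) "
        "FROM telemetry GROUP BY second_bucket ORDER BY second_bucket"
    ),
}

conn = sqlite3.connect(":memory:")
# FLOOR is not compiled into every SQLite build, so register it explicitly.
conn.create_function("FLOOR", 1, math.floor)
conn.execute(
    "CREATE TABLE telemetry (time_s REAL, engine_speed REAL, "
    "vehicle_speed REAL, coolant_temp REAL, oil_pressure REAL)"
)

# 30 s of made-up data at 10 Hz instead of the study's 1,000 Hz.
rows = [(i / 10, 800.0 + i, 0.1 * i, 90.0, 3.5) for i in range(300)]
conn.executemany("INSERT INTO telemetry VALUES (?, ?, ?, ?, ?)", rows)

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
print(results["row_count"])  # [(300,)]
print(results["time_window"])
```

Note that queries 1 to 3 and 5 touch at most two columns, which is where a columnar file has the most to gain.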
 Tests ran on macOS 14.6, Apple Silicon (8 CPU cores), Python 3.13, asammdf 8.8.13, DuckDB 1.5.3. DuckDB used 4 threads (vectorized multithreading). 3 cold repetitions + 5 warm repetitions per query — medians reported.
**Mode A: Cold Start (fresh process, cleared cache).** A new Python process for every repetition, with the memory cache purged. Reflects the "I double-click a file and ask one question" scenario.

**Mode B: Warm Session (file and engine stay open).** One process and one untimed practice query, then 5 timed repetitions. Reflects an analyst running many queries in one active session.
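A timing harness for the two modes might look like the sketch below. It is not the study's harness: the workload is a stand-in, the function names are hypothetical, the cold path times the work inside each child process (an assumption about how the study measured), and it does not purge the OS file cache, since that step is platform specific.

```python
import statistics
import subprocess
import sys
import time


def time_warm(query_fn, reps=5):
    """One untimed practice call, then the median of `reps` timed calls."""
    query_fn()  # warm-up, not timed
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        query_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Script run in a fresh interpreter; it times its own work and prints seconds.
COLD_SCRIPT = """
import time
start = time.perf_counter()
sum(range(100_000))  # stand-in for opening the file and running one query
print(time.perf_counter() - start)
"""


def time_cold(script, reps=3):
    """Median in-process time across `reps` brand-new Python processes."""
    samples = []
    for _ in range(reps):
        out = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, check=True,
        )
        samples.append(float(out.stdout.strip()))
    return statistics.median(samples)


def workload():
    """Stand-in for one of the five queries."""
    return sum(range(100_000))


warm_median = time_warm(workload)
cold_median = time_cold(COLD_SCRIPT)
print(f"warm: {warm_median:.6f} s, cold: {cold_median:.6f} s")
```

Reporting the median rather than the mean keeps a single slow repetition from skewing the result, which matters with only 3 or 5 samples.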

 Benchmark results

## In almost every scenario, DuckDB on Parquet wins — and it isn't close

We tested speed across both low-entropy and high-entropy data in both test modes. Times below are medians across repetitions. DuckDB on Parquet was faster on every query, with speedups ranging from 2.4× to 48×.
Low-Entropy · Cold Start: Query Speed (seconds, lower is better)

Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
--- | --- | --- | ---
Row count | 0.006 s | 0.019 s | 2.9× faster
Time window (10–20 s) | 0.009 s | 0.044 s | 5.1× faster
Single channel sum | 0.008 s | 0.020 s | 2.6× faster
Four channel sum | 0.009 s | 0.084 s | 9.1× faster
Per-second average | 0.010 s | 0.053 s | 5.5× faster

Low-Entropy · Warm Session: Query Speed (seconds, lower is better)

Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
--- | --- | --- | ---
Row count | 0.0002 s | 0.009 s | 48× faster
Time window (10–20 s) | 0.002 s | 0.033 s | 16× faster
Single channel sum | 0.001 s | 0.009 s | 7.8× faster
Four channel sum | 0.002 s | 0.068 s | 27× faster
Per-second average | 0.004 s | 0.039 s | 10.8× faster
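The speedup column is the ratio of the asammdf median to the DuckDB median. Recomputing it from the rounded times in the warm low-entropy table gives values close to, but not identical to, the published ones. The likely reason (an assumption) is that the published ratios were computed from unrounded medians: a DuckDB time shown as 0.002 s could be anywhere from 0.0015 s to 0.0025 s.

```python
# (DuckDB median s, asammdf median s, published speedup), warm low-entropy run.
table = {
    "Row count": (0.0002, 0.009, 48.0),
    "Time window (10-20 s)": (0.002, 0.033, 16.0),
    "Single channel sum": (0.001, 0.009, 7.8),
    "Four channel sum": (0.002, 0.068, 27.0),
    "Per-second average": (0.004, 0.039, 10.8),
}

# Speedup = slower time / faster time, using only the rounded table values.
recomputed = {name: mf4 / duck for name, (duck, mf4, _) in table.items()}

for name, (_, _, published) in table.items():
    print(f"{name:24s} recomputed {recomputed[name]:5.1f}x  published {published}x")
```

For sub-millisecond timings, the rounded table values carry only one significant digit, so the published speedups are the more reliable figures.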

High-Entropy · Cold Start: Query Speed (seconds, lower is better)

Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
--- | --- | --- | ---
Row count | 0.007 s | 0.022 s | 3.3× faster
Time window (10–20 s) | 0.009 s | 0.046 s | 5.2× faster
Single channel sum | 0.008 s | 0.020 s | 2.4× faster
Four channel sum | 0.012 s | 0.104 s | 8.9× faster
Per-second average | 0.011 s | 0.049 s | 4.5× faster

High-Entropy · Warm Session: Query Speed (seconds, lower is better)

Query | DuckDB (Parquet) | asammdf (MF4) | Speedup
--- | --- | --- | ---
Row count | 0.0003 s | 0.010 s | 31× faster
Time window (10–20 s) | 0.002 s | 0.033 s | 13× faster
Single channel sum | 0.002 s | 0.009 s | 4.4× faster
Four channel sum | 0.006 s | 0.086 s | 14× faster
Per-second average | 0.004 s | 0.038 s | 9× faster

**48×** Row count (warm, low entropy)

**27×** 4-channel sum (warm, low entropy)

**9.1×** 4-channel sum (cold, low entropy)

**16×** Time window (warm, low entropy)

> Notice how the gap widens for multi-channel queries: in the warm low-entropy session, the speedup grows from 7.8× for one channel to 27× for four. Because Parquet stores each channel as a separate column on disk, DuckDB reads only the columns a query requests. MF4 forces the reader to navigate its internal structure channel by channel.
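The effect of column pruning can be modeled in a few lines. The sketch below is a deliberately simplified model, not MF4's actual layout or asammdf's read path: it stores the same made-up data once as interleaved records and once as per-channel blocks, then counts how many bytes a four-channel sum has to touch in each case. Channel names are hypothetical.

```python
import struct

CHANNELS = ["time_s"] + [f"ch{i}" for i in range(12)]  # hypothetical names
N = 1_000
RECORD_FMT = f"<{len(CHANNELS)}d"  # 13 float64 values per record

rows = [[float(r * 13 + c) for c in range(len(CHANNELS))] for r in range(N)]

# Row-oriented: each record stores all 13 values side by side.
row_blob = b"".join(struct.pack(RECORD_FMT, *row) for row in rows)

# Column-oriented: each channel is one contiguous block.
col_blobs = {
    name: struct.pack(f"<{N}d", *(row[i] for row in rows))
    for i, name in enumerate(CHANNELS)
}


def sum_row_layout(wanted):
    """Scan every record; all 13 values are read to reach the wanted ones."""
    idx = [CHANNELS.index(w) for w in wanted]
    totals = dict.fromkeys(wanted, 0.0)
    for record in struct.iter_unpack(RECORD_FMT, row_blob):
        for name, i in zip(wanted, idx):
            totals[name] += record[i]
    return totals, len(row_blob)


def sum_col_layout(wanted):
    """Read only the requested channels' blocks."""
    totals, bytes_read = {}, 0
    for name in wanted:
        blob = col_blobs[name]
        totals[name] = sum(struct.unpack(f"<{N}d", blob))
        bytes_read += len(blob)
    return totals, bytes_read


wanted = ["ch0", "ch1", "ch2", "ch3"]
row_totals, row_bytes = sum_row_layout(wanted)
col_totals, col_bytes = sum_col_layout(wanted)

assert row_totals == col_totals  # same answer either way
print(row_bytes, col_bytes)  # 104000 32000
```

In this model the columnar read touches 4 of 13 columns, so about 31% of the bytes; with compression applied per column, the gap on disk would typically be larger still.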

 The bottom line

## Legacy formats are a tax on your engineering velocity

For organizations building the next generation of physical systems, sticking to legacy log formats means accepting bloated storage costs and sluggish engineering workflows. The numbers are unambiguous: for typical vehicle or hardware sensor logs, converting to Parquet and querying with an analytical engine like DuckDB yielded **files 96.4% smaller** and **queries 2.4× to 48× faster** in this benchmark.
 Embracing these modern formats is a necessary shift away from outdated paradigms — laying the groundwork for true Deep Tech Data Infrastructure that scales alongside your hardware's ambition.
 In this series

## More in this series

 This benchmark is the first in a series unpacking the infrastructure layer that modern hardware engineering deserves.
Part 2 (upcoming): Cost Impact of Parquet with DuckDB

Part 3 (upcoming): How do Parquet and DuckDB actually work?