# Why Parquet is the right format to store hardware data

Category: Engineering · Published: May 2026 · Read time: 12 min read · URL: /resources/parquet-hardware-data

Legacy formats choke hardware engineering. Discover how modern data formats solve storage and latency with zero data loss.

---

- **96.4%** storage reduction on realistic sensor data
- **48×** faster row count queries in a warm session
- **27×** faster multi-channel analytics

The problem

## Hardware data has two competing demands, and most formats fail both

Organizations dealing with physical hardware data constantly wrestle with two competing challenges: **storage** and **latency**. As physical systems grow more complex, the number of sensors and their polling frequencies will only increase. This creates a compounding effect: more data streams per unit of time inevitably leads to exploding log file sizes.

For hardware engineering and compliance, this data is the ultimate asset. When dealing with costly hardware and expensive testing environments, organizations cannot afford to waste the telemetry generated during every run. This data must be collected and stored efficiently, minimizing storage bloat while keeping query latency low enough for engineers to use it without friction.

**Dataset used in this study:** 504,123 samples (~504 seconds at 1,000 Hz), 12 sensor channels + time. Source data generated via asammdf in MF4 format, then converted to Parquet. Two variants tested: *realistic low-entropy* drive-cycle data and *random high-entropy* noise representing worst-case conditions.
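Why test two variants at all? Lossless compression can only remove redundancy, so a smooth, repetitive signal and pure noise sit at opposite ends of what any format can achieve. The sketch below illustrates that effect and nothing more: it is not the study's pipeline, it uses Python's built-in `zlib` as a stand-in for a real columnar codec, and the signal shapes, the reduced sample count, and the `compression_ratio` helper are all assumptions made for illustration.

```python
import math
import random
import zlib
from array import array

N_SAMPLES = 50_000      # assumption: small stand-in for the study's 504,123 samples
SAMPLE_RATE_HZ = 1_000  # matches the 1,000 Hz rate of the study's dataset


def compression_ratio(values):
    """Return raw size / compressed size for values stored as 8-byte floats."""
    raw = array("d", values).tobytes()
    return len(raw) / len(zlib.compress(raw, 6))


# Low entropy (assumed shape): a slowly varying engine speed quantized to
# whole rpm, so neighbouring samples are frequently identical.
low_entropy = [
    float(round(2000 + 800 * math.sin(2 * math.pi * 0.05 * i / SAMPLE_RATE_HZ)))
    for i in range(N_SAMPLES)
]

# High entropy: uniform random noise from a seeded generator (worst case).
rng = random.Random(0)
high_entropy = [rng.uniform(0.0, 8000.0) for _ in range(N_SAMPLES)]

low_ratio = compression_ratio(low_entropy)
high_ratio = compression_ratio(high_entropy)
print(f"low-entropy ratio:  {low_ratio:.1f}x")
print(f"high-entropy ratio: {high_ratio:.1f}x")
```

The exact ratios depend on the codec and the signal, so they should not be read as predictions for Parquet. The ordering is the point: the quantized, slowly changing channel compresses many times over, while the random channel barely shrinks.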
The current data landscape

## A fragmented world of proprietary formats, optimized for writing, not reading

Historically, different hardware domains have siloed themselves into specialized, proprietary log formats optimized primarily for embedded writing rather than analytical reading. These formats create severe bottlenecks when engineers attempt to analyze thousands of test runs.

| Industry | Common Log Formats |
| --- | --- |
| Aerospace & Automotive | MF4, TDMS, CCSDS |
| Robotics & UAVs | PX4 logs, DataFlash, MCAP, ROS bags, ULog |
| Test Stands / Hardware Labs | MF4, MAT, TDMS, CSV |

To demonstrate why formats like MF4 create analytical bottlenecks, we ran a benchmark comparing MF4 + asammdf against Apache Parquet paired with the DuckDB analytical engine. Identical query workloads were run on both low-entropy and high-entropy data variants.

Storage efficiency

## The suitcase analogy: how formats pack your data

When it comes to file size, the format you choose dictates how efficiently you can pack your data. Think of it like packing clothes for a trip. Legacy formats like MF4 place every single item in its own rigid box, leaving you with a massive suitcase. Parquet rolls and vacuum-packs similar items together, vastly reducing the footprint.

| Dataset | MF4 | Parquet | Reduction | Note |
| --- | --- | --- | --- | --- |
| Realistic sensor data (low entropy) | 50 MB | 1.8 MB | 96.4% smaller | 27× compression, lossless |
| Random noise (high entropy) | 50 MB | 44.4 MB | 11.3% smaller | Similar size; entropy limits compression |

**The takeaway:** For realistic sensor data, Parquet's built-in compression shrinks a 50.0 MB file to just 1.8 MB. Crucially, **this is a lossless conversion**: every original value is preserved exactly as recorded. For standard deep tech applications, these storage savings compound over time across hundreds of test runs.

Query design

## Five queries that mirror the daily reality of a hardware engineer

Saving storage space is necessary, but it counts for little if engineers have to wait minutes for a dashboard to load.
We designed five standard queries to reflect real analytical workflows on hardware telemetry.

1. **Row Count**: "How many total sensor samples are recorded in this file?"
2. **Time Window (10–20 s)**: "What was the average, minimum, and maximum engine speed between the 10-second and 20-second marks?"
3. **Single Channel Sum**: "Add up all the engine speed readings across the entire recording."
4. **Four Channel Sum**: "Read and process engine speed, vehicle speed, coolant temperature, and oil pressure simultaneously."
5. **Per-Second Average**: "For every whole second of the drive, what was the average engine speed?"

Tests ran on macOS 14.6, Apple Silicon (8 CPU cores), Python 3.13, asammdf 8.8.13, DuckDB 1.5.3. DuckDB used 4 threads (vectorized multithreading). Each query ran 3 cold repetitions and 5 warm repetitions; medians are reported.

- **Mode A, Cold Start (fresh process, cleared cache):** A new Python process for every repetition, with the memory cache purged. Reflects the "I double-click a file and ask one question" scenario.
- **Mode B, Warm Session (file and engine stay open):** One process, one untimed practice query, then 5 timed repetitions. Reflects an analyst running many queries in one active session.

Benchmark results

## In every scenario tested, DuckDB on Parquet wins, and it isn't close

We tested speed across both low-entropy and high-entropy data with both test modes. Times below are medians across repetitions. DuckDB on Parquet was faster in all 20 query, dataset, and mode combinations, with speedups ranging from 2.4× to 48×.
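The five queries and the warm-session protocol described above can be sketched in a few lines. This is a hedged illustration, not the benchmark harness: Python's built-in `sqlite3` stands in for DuckDB so the sketch runs with the standard library alone, and the `telemetry` table, its column names, the synthetic rows, the half-open 10 to 20 second window, and the `time_warm` helper are all hypothetical. SQL dialects differ in details such as float-to-integer casting, so treat the statements as illustrative rather than as the study's exact SQL.

```python
import sqlite3
import statistics
import time

# Hypothetical schema: one time column plus the four channels named in the queries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telemetry (time_s REAL, engine_speed REAL, "
    "vehicle_speed REAL, coolant_temp REAL, oil_pressure REAL)"
)
# Synthetic data (assumption): 30 seconds at 1,000 Hz with a repeating rpm ramp.
conn.executemany(
    "INSERT INTO telemetry VALUES (?, ?, ?, ?, ?)",
    [(i / 1000.0, 1500.0 + (i % 1000), 60.0, 90.0, 3.5) for i in range(30_000)],
)

QUERIES = {
    "row_count": "SELECT COUNT(*) FROM telemetry",
    # Assumption: the window includes 10 s and excludes 20 s.
    "time_window": (
        "SELECT AVG(engine_speed), MIN(engine_speed), MAX(engine_speed) "
        "FROM telemetry WHERE time_s >= 10 AND time_s < 20"
    ),
    "single_channel_sum": "SELECT SUM(engine_speed) FROM telemetry",
    "four_channel_sum": (
        "SELECT SUM(engine_speed), SUM(vehicle_speed), "
        "SUM(coolant_temp), SUM(oil_pressure) FROM telemetry"
    ),
    # SQLite truncates on CAST; other engines may round, so use floor() there.
    "per_second_avg": (
        "SELECT CAST(time_s AS INTEGER) AS sec, AVG(engine_speed) "
        "FROM telemetry GROUP BY sec ORDER BY sec"
    ),
}


def time_warm(connection, sql, timed_reps=5):
    """Warm protocol: one untimed practice run, then the median of timed runs."""
    connection.execute(sql).fetchall()  # practice query, not timed
    samples = []
    for _ in range(timed_reps):
        start = time.perf_counter()
        connection.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


for name, sql in QUERIES.items():
    print(f"{name}: {time_warm(conn, sql):.6f} s (median of 5 warm runs)")
```

A cold-start measurement cannot be reproduced inside one process like this; following the description above, it would need a fresh Python process and a purged cache for every repetition. Reporting the median rather than the mean keeps a single slow repetition from skewing the result.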
**Low-Entropy · Cold Start: query speed (seconds, lower is better)**

| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
| --- | --- | --- | --- |
| Row count | 0.006 s | 0.019 s | 2.9× faster |
| Time window (10–20 s) | 0.009 s | 0.044 s | 5.1× faster |
| Single channel sum | 0.008 s | 0.020 s | 2.6× faster |
| Four channel sum | 0.009 s | 0.084 s | 9.1× faster |
| Per-second average | 0.010 s | 0.053 s | 5.5× faster |

**Low-Entropy · Warm Session: query speed (seconds, lower is better)**

| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
| --- | --- | --- | --- |
| Row count | 0.0002 s | 0.009 s | 48× faster |
| Time window (10–20 s) | 0.002 s | 0.033 s | 16× faster |
| Single channel sum | 0.001 s | 0.009 s | 7.8× faster |
| Four channel sum | 0.002 s | 0.068 s | 27× faster |
| Per-second average | 0.004 s | 0.039 s | 10.8× faster |

**High-Entropy · Cold Start: query speed (seconds, lower is better)**

| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
| --- | --- | --- | --- |
| Row count | 0.007 s | 0.022 s | 3.3× faster |
| Time window (10–20 s) | 0.009 s | 0.046 s | 5.2× faster |
| Single channel sum | 0.008 s | 0.020 s | 2.4× faster |
| Four channel sum | 0.012 s | 0.104 s | 8.9× faster |
| Per-second average | 0.011 s | 0.049 s | 4.5× faster |

**High-Entropy · Warm Session: query speed (seconds, lower is better)**

| Query | DuckDB (Parquet) | asammdf (MF4) | Speedup |
| --- | --- | --- | --- |
| Row count | 0.0003 s | 0.010 s | 31× faster |
| Time window (10–20 s) | 0.002 s | 0.033 s | 13× faster |
| Single channel sum | 0.002 s | 0.009 s | 4.4× faster |
| Four channel sum | 0.006 s | 0.086 s | 14× faster |
| Per-second average | 0.004 s | 0.038 s | 9× faster |

Times are rounded for display, so the speedup column may differ slightly from the ratio of the rounded values.

- **48×** row count (warm, low entropy)
- **27×** four channel sum (warm, low entropy)
- **9.1×** four channel sum (cold, low entropy)
- **16×** time window (warm, low entropy)

> Notice how the gap widens on multi-channel queries: in the low-entropy runs, the four channel sum is 9.1× faster cold and 27× faster warm, against 2.6× and 7.8× for a single channel. Because Parquet stores each channel as a separate column on disk, DuckDB only reads the specific columns requested.
MF4 forces the system to navigate its internal structure channel by channel.

The bottom line

## Legacy formats are a tax on your engineering velocity

For organizations building the next generation of physical systems, sticking to legacy log formats means accepting bloated storage costs and sluggish engineering workflows. In this benchmark, converting typical vehicle or hardware sensor logs to Parquet and querying them with an analytical engine like DuckDB yielded **dramatically smaller files** (96.4% smaller on the realistic dataset) and **much faster analytics** (2.4× to 48× faster across the five queries).

Embracing these modern formats is a necessary shift away from outdated paradigms, laying the groundwork for true Deep Tech Data Infrastructure that scales alongside your hardware's ambition.

## More in this series

This benchmark is the first in a series unpacking the infrastructure layer that modern hardware engineering deserves.

- Part 2 (upcoming): Cost Impact of Parquet with DuckDB
- Part 3 (upcoming): How do Parquet and DuckDB actually work?