Skip to content

Latest commit

 

History

History
53 lines (36 loc) · 2.4 KB

File metadata and controls

53 lines (36 loc) · 2.4 KB

struct Module for Efficient Binary Data Packing

Problem

Python programs serializing/deserializing large volumes of structured binary data (network packets, file formats, sensor readings) using Python objects. The metric was throughput (records per second). Converting between Python objects and binary formats using ad-hoc byte manipulation or JSON/pickle is slow and memory-heavy.

What Worked

Using Python's struct module to pack/unpack fixed-format binary data directly. struct.pack/unpack translate between Python values and C-style binary representations in a single call, with minimal overhead.

For parsing 10M binary records (each: uint32 timestamp, float64 value, uint8 flags = 13 bytes), struct.unpack was 8.5x faster than equivalent manual byte slicing and 45x faster than JSON deserialization of the same data.

Key patterns:

  1. struct.Struct pre-compiled format: Compile the format string once, reuse the Struct object for repeated pack/unpack — avoids re-parsing the format string.
  2. struct.iter_unpack (Python 3.4+): Iterates over a buffer, unpacking repeated structures without Python loop overhead.

Experiment Data

Method Records/sec (M) Memory/record
JSON parse 0.22 ~400 bytes
Manual byte slicing 1.18 ~200 bytes
struct.unpack (per-record) 4.80 13 bytes
struct.Struct.iter_unpack 10.05 13 bytes

Code Example

import struct

# Define format once (uint32, float64, uint8)
record_fmt = struct.Struct('<I d B')  # 13 bytes per record, little-endian

# Pack: Python values → bytes
data = record_fmt.pack(1679500000, 98.6, 0x01)

# Unpack single record
timestamp, value, flags = record_fmt.unpack(data)

# Bulk unpack: iterate over a large buffer with zero Python loop overhead
buffer = read_binary_file('sensor_data.bin')  # 130MB = 10M records
for timestamp, value, flags in record_fmt.iter_unpack(buffer):
    process(timestamp, value, flags)

# Even faster: unpack entire buffer into NumPy array
import numpy as np
dt = np.dtype([('timestamp', '<u4'), ('value', '<f8'), ('flags', '<u1')])
records = np.frombuffer(buffer, dtype=dt)  # zero-copy view, instant

Environment

Python 3.4+ (for iter_unpack). struct is a stdlib C extension. For maximum performance on large homogeneous data, np.frombuffer with a custom dtype is even faster than struct because it creates a zero-copy view.