Fast parallel random access to bzip2 and gzip files in Python

mxmlnkn/indexed_bzip2

Parallel Random Access to bzip2 and gzip

This repository contains the code for the indexed_bzip2 and rapidgzip Python modules. Both are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.
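As a rough illustration of what block-parallel decoding with prefetching and caching can look like, here is a minimal toy sketch. It is not the code used by either module: the class `PrefetchingBlockReader`, its parameters, and the use of independently zlib-compressed blocks are all assumptions made for this example. Real gzip and bzip2 streams first require finding block boundaries, which this sketch skips.

```python
import zlib
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingBlockReader:
    """Toy reader over independently compressed blocks (hypothetical example)."""

    def __init__(self, blocks, parallelization=4, cache_size=16):
        self._blocks = blocks                      # list of zlib-compressed byte strings
        self._prefetch = parallelization           # blocks requested per access
        self._cache_size = max(cache_size, parallelization)
        self._cache = OrderedDict()                # block index -> Future, oldest first
        self._pool = ThreadPoolExecutor(max_workers=parallelization)

    def _request(self, index):
        # Start decoding in the background unless the block is already cached.
        if index not in self._cache:
            self._cache[index] = self._pool.submit(zlib.decompress, self._blocks[index])
        self._cache.move_to_end(index)             # mark as most recently used

    def read_block(self, index):
        if not 0 <= index < len(self._blocks):
            raise IndexError(index)
        # Request the wanted block and the following ones, so that a
        # sequential reader finds the next blocks already decoded.
        last = min(index + self._prefetch, len(self._blocks))
        for i in range(index, last):
            self._request(i)
        data = self._cache[index].result()         # wait only for the wanted block
        while len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)        # evict least recently used
        return data

    def close(self):
        self._pool.shutdown()
```

The cache stores futures rather than decoded bytes, so a block that is still being decoded is never submitted twice.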

rapidgzip

This module provides:

  • a rapidgzip command line tool for parallel decompression of gzip files with a similar command line interface to gzip so that it can be used as a replacement.
  • a rapidgzip.open Python method for reading and seeking inside gzip files using multiple threads, for a speedup of up to 21 over the built-in gzip module on a 12-core processor (see the table below).

The random seeking support is similar to the one provided by indexed_gzip and the parallel capabilities are effectively a working version of pugz, which is only a concept and only works with a limited subset of file contents, namely non-binary (ASCII characters 0 to 127) compressed files.
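A minimal sketch of reading a byte range from the middle of a gzip file could look as follows. The helper name `read_range` is made up for this example, and the fallback to the built-in gzip module is only there so that the snippet also runs when rapidgzip is not installed. The thread count passed as `parallelization` is an arbitrary choice.

```python
import gzip
import os

try:
    import rapidgzip  # optional dependency, e.g. installed from PyPI
except ImportError:
    rapidgzip = None


def read_range(path, offset, size):
    """Return `size` decompressed bytes starting at decompressed `offset`.

    Hypothetical helper for illustration, not part of rapidgzip.
    """
    if rapidgzip is not None:
        # Decode in parallel with one thread per available core.
        file = rapidgzip.open(path, parallelization=os.cpu_count() or 1)
    else:
        # Single-threaded fallback with the same file-like interface.
        file = gzip.open(path, "rb")
    with file:
        file.seek(offset)  # offsets refer to the decompressed stream
        return file.read(size)
```

Both branches return a file-like object, so code written against the built-in gzip module should need few changes.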

| Module | Bandwidth / (MB/s) | Speedup |
|---|---|---|
| gzip | 250 | 1 |
| rapidgzip with parallelization = 1 | 488 | 1.9 |
| rapidgzip with parallelization = 2 | 902 | 3.6 |
| rapidgzip with parallelization = 12 | 4463 | 17.7 |
| rapidgzip with parallelization = 24 | 5240 | 20.8 |

See here for the extended Readme.

There also exists a dedicated repository for rapidgzip here. It was created for visibility reasons and in order to keep indexed_bzip2 and rapidgzip releases separate. The main development will take place in this repository while the rapidgzip repository will be updated at least for each release. Issues regarding rapidgzip should be opened at its repository.

A paper describing the implementation details and showing the scaling behavior with up to 128 cores has been accepted at ACM HPDC'23, The 32nd International Symposium on High-Performance Parallel and Distributed Computing. If you use this software for your scientific publication, please cite it as stated here. The author's version can be found here and the accompanying presentation here.

indexed_bzip2

This module provides:

  • an ibzip2 command line tool to decompress bzip2 files in parallel with a similar command line interface to bzip2 so that it can be used as a replacement.
  • an indexed_bzip2.open Python method for reading and seeking inside bzip2 files using multiple threads for a speedup of 6 over the built-in bz2 module using a 12-core processor.

The parallel decompression capabilities are similar to lbzip2 but with a more permissive license and with support to be used as a library with random seeking capabilities similar to seek-bzip2.
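Used as a library, the module can be combined with ordinary file-like code. The sketch below seeks to an offset and then streams the rest of the file in chunks while computing a CRC32. The helper name `checksum_from` and the chunk size are assumptions for this example, and the fallback to the built-in bz2 module only exists so that the snippet runs without indexed_bzip2 installed.

```python
import bz2
import os
import zlib

try:
    import indexed_bzip2 as ibz2  # optional dependency
except ImportError:
    ibz2 = None


def checksum_from(path, offset=0, chunk_size=1024 * 1024):
    """CRC32 of the decompressed data from `offset` to the end of the file.

    Hypothetical helper for illustration, not part of indexed_bzip2.
    """
    if ibz2 is not None:
        # Decode blocks in parallel with one thread per available core.
        file = ibz2.open(path, parallelization=os.cpu_count() or 1)
    else:
        file = bz2.open(path, "rb")  # single-threaded fallback
    crc = 0
    with file:
        file.seek(offset)  # offset into the decompressed stream
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc
```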

| Module | Runtime / s | Bandwidth / (MB/s) | Speedup |
|---|---|---|---|
| bz2 | 386 | 5.2 | 1 |
| indexed_bzip2 with parallelization = 1 | 472 | 4.2 | 0.8 |
| indexed_bzip2 with parallelization = 2 | 265 | 7.6 | 1.5 |
| indexed_bzip2 with parallelization = 12 | 64 | 31.4 | 6.1 |
| indexed_bzip2 with parallelization = 24 | 63 | 31.8 | 6.1 |

See here for the extended Readme.

License

Licensed under either of

  • Apache-2.0 (LICENSE-APACHE)
  • MIT (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.