Public bug reported:

It seems that read_csv() suffers from the same problem as, e.g., the early Boost implementations; see https://svn.boost.org/trac/boost/ticket/3853 for details. A bz2 file can be composed of many concatenated bz2 blocks, which have to be treated as one continuous stream.
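The behaviour described above can be illustrated with Python's standard bz2 module alone, without pandas. This is only a minimal sketch, not the pandas code path: the helper name decompress_all_streams and the tiny in-memory CSV fragments are made up for illustration. What this report calls "blocks" are independent compressed streams in bzip2 terms; a single BZ2Decompressor stops at the end of the first one and leaves the remaining bytes in unused_data.

```python
import bz2


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every concatenated bz2 stream in `data` (hypothetical helper)."""
    chunks = []
    while data:
        # A BZ2Decompressor handles exactly one stream, so use a fresh one each time.
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bz2 stream")
        # Bytes after the end of this stream belong to the next stream, if any.
        data = decomp.unused_data
    return b"".join(chunks)


# Two separately compressed pieces joined together, like `cat *.bz2 > joined.bz2`.
part_a = b"a,b\n1,2\n"
part_b = b"3,4\n"
joined = bz2.compress(part_a) + bz2.compress(part_b)

# One decompressor returns only the first stream: the symptom reported here.
first_only = bz2.BZ2Decompressor().decompress(joined)

# Looping over the streams recovers the whole file.
everything = decompress_all_streams(joined)
```

In Python 3.3 and later, bz2.decompress() and bz2.BZ2File accept multi-stream input directly, so a loop like this is mainly needed when driving BZ2Decompressor by hand or on older interpreters.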
How to test: create a large CSV file, much larger than 900k, and compress it with pbzip2 (each process creates one bz2 block). Alternatively, create many such CSV files, bzip2 them individually, and then run cat *.bz2 >joined.bz2. read_csv() will uncompress and read only the first block.

Note that this is a severe bug, since parallel bzip2 is getting increasingly common on multi-core systems.

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: python-pandas 0.17.1-3ubuntu2
ProcVersionSignature: Ubuntu 4.8.0-42.45-generic 4.8.17
Uname: Linux 4.8.0-42-generic x86_64
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: amd64
CurrentDesktop: XFCE
Date: Mon Apr 17 18:42:52 2017
InstallationDate: Installed on 2014-10-21 (909 days ago)
InstallationMedia: Ubuntu 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.2)
PackageArchitecture: all
SourcePackage: pandas
UpgradeStatus: Upgraded to yakkety on 2016-10-20 (179 days ago)

** Affects: pandas (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug yakkety

** Summary changed:

- read_csv on bzip2 file unzips only the first bucket
+ read_csv on bzip2 file unzips only the first block

https://bugs.launchpad.net/bugs/1683428

Title:
  read_csv on bzip2 file unzips only the first block