Public bug reported:

It seems that read_csv() suffers from the same problem as, e.g., early
Boost implementations; see https://svn.boost.org/trac/boost/ticket/3853
for details. A bz2 file can be composed of many concatenated bz2 blocks,
which have to be treated as one continuous stream.

How to test: create a large CSV file, much larger than 900k. Compress it
with pbzip2 (each process creates one bz2 block). Alternatively, create
many such CSV files, bzip2 them individually, and then run
cat *.bz2 >joined.bz2
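
The second recipe can also be sketched without the command-line tools. This is only an illustration using Python's standard bz2 module; the helper name make_joined_bz2 and the sample rows are made up for the example and are not part of pandas.

```python
import bz2


def make_joined_bz2(parts):
    """Compress each part on its own and concatenate the results.

    This mimics `bzip2` on individual files followed by
    `cat *.bz2 > joined.bz2` (hypothetical helper, for illustration).
    """
    return b"".join(bz2.compress(part) for part in parts)


# Three small CSV fragments; only the first carries the header row.
parts = [b"a,b\n1,2\n", b"3,4\n", b"5,6\n"]
joined = make_joined_bz2(parts)

# The full payload a correct reader should recover from `joined`.
expected = b"".join(parts)
```

Writing `joined` to joined.bz2 and loading it with read_csv() and compression='bz2' should give three data rows if all blocks are read.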

read_csv() will uncompress and read only the first block.
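
The symptom can be illustrated with the standard bz2 module alone. The sketch below is not the pandas code path, and both function names are invented for the example. It contrasts a reader that stops at the end of the first compressed stream with one that keeps going until the input is exhausted.

```python
import bz2


def decompress_first_stream(data):
    """Decode only the first bz2 stream, ignoring anything after it."""
    return bz2.BZ2Decompressor().decompress(data)


def decompress_all_streams(data):
    """Decode every concatenated bz2 stream and join the output."""
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bz2 stream")
        # Bytes after the end-of-stream marker belong to the next stream.
        data = decomp.unused_data
    return b"".join(chunks)


first = b"a,b\n1,2\n"
second = b"3,4\n"
joined = bz2.compress(first) + bz2.compress(second)

only_first = decompress_first_stream(joined)
everything = decompress_all_streams(joined)
```

Here `only_first` holds just the first fragment, which matches the behaviour described above, while `everything` holds both fragments.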

Note that this is a severe bug, since parallel bzip2 is becoming
increasingly common on multi-core systems.

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: python-pandas 0.17.1-3ubuntu2
ProcVersionSignature: Ubuntu 4.8.0-42.45-generic 4.8.17
Uname: Linux 4.8.0-42-generic x86_64
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: amd64
CurrentDesktop: XFCE
Date: Mon Apr 17 18:42:52 2017
InstallationDate: Installed on 2014-10-21 (909 days ago)
InstallationMedia: Ubuntu 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.2)
PackageArchitecture: all
SourcePackage: pandas
UpgradeStatus: Upgraded to yakkety on 2016-10-20 (179 days ago)

** Affects: pandas (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug yakkety

** Summary changed:

- read_csv on bzip2 file unzips only the first bucket
+ read_csv on bzip2 file unzips only the first block

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1683428

Title:
  read_csv on bzip2 file unzips only the first block

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pandas/+bug/1683428/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
