Daniel Lescohier created BEAM-6952:
--------------------------------------

             Summary: concatenated compressed files bug with python sdk
                 Key: BEAM-6952
                 URL: https://issues.apache.org/jira/browse/BEAM-6952
             Project: Beam
          Issue Type: Bug
          Components: sdk-py-core
    Affects Versions: 2.11.0
            Reporter: Daniel Lescohier


The Python apache_beam.io.filesystem module has a bug handling concatenated 
compressed files.

The PR I will create has two commits:
 # a new unit test that shows the problem
 # a fix for the problem.

The unit test is added to the apache_beam.io.filesystem_test module. It was 
added there because the existing test, 
apache_beam.io.textio_test.test_read_gzip_concat, does not encounter the 
problem in the Beam 2.11 and earlier code base: its test data is smaller than 
read_size, so it takes a code path that avoids the bug. The new test therefore 
uses a smaller read_size and larger test data in order to trigger the problem. 
Testing this in the textio_test module would be difficult: the default 
read_size is 16MiB, so the test data would have to be very large, and the 
ReadFromText interface does not allow read_size to be modified.
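
To illustrate why the read size matters for concatenated gzip data, here is a 
minimal standalone sketch using only the Python standard library. It is not 
the Beam code and does not reproduce the exact bug in 
apache_beam.io.filesystem; the function names (make_concatenated_gzip, 
read_all_members, read_first_member_only) are hypothetical and exist only for 
this example. It shows the general pattern: when the input is read in chunks 
smaller than the data, a reader has to restart decompression at each member 
boundary and carry over the leftover bytes, otherwise later members are lost.

```python
import gzip
import io
import zlib

# 16 + MAX_WBITS tells zlib to expect gzip framing (header and trailer).
GZIP_WBITS = 16 + zlib.MAX_WBITS


def make_concatenated_gzip(parts):
    """Return bytes consisting of several gzip members written back to back."""
    return b"".join(gzip.compress(p) for p in parts)


def read_all_members(fileobj, read_size):
    """Decompress every gzip member, reading read_size bytes at a time."""
    out = []
    decomp = zlib.decompressobj(GZIP_WBITS)
    buf = fileobj.read(read_size)
    while buf:
        out.append(decomp.decompress(buf))
        if decomp.eof:
            # One member ended inside this chunk. The bytes after it belong
            # to the next member, so feed them to a fresh decompressor.
            buf = decomp.unused_data
            decomp = zlib.decompressobj(GZIP_WBITS)
            if not buf:
                buf = fileobj.read(read_size)
        else:
            buf = fileobj.read(read_size)
    return b"".join(out)


def read_first_member_only(fileobj, read_size):
    """Naive reader that stops at the first member boundary (loses data)."""
    out = []
    decomp = zlib.decompressobj(GZIP_WBITS)
    buf = fileobj.read(read_size)
    while buf and not decomp.eof:
        out.append(decomp.decompress(buf))
        buf = fileobj.read(read_size)
    return b"".join(out)


if __name__ == "__main__":
    parts = [b"first member\n" * 500, b"second member\n" * 500]
    data = make_concatenated_gzip(parts)
    # A read size much smaller than the data forces the multi-chunk path.
    full = read_all_members(io.BytesIO(data), read_size=16)
    naive = read_first_member_only(io.BytesIO(data), read_size=16)
    print(len(full), len(naive))
```

A test that only uses data smaller than the read size never exercises the 
member-boundary handling across chunks, which is the same reason given above 
for lowering read_size in the new unit test.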




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
