The best solution would be to find a compression format that is splittable, add support for it to Apache Beam, and use it. The issue with compressed files is that you can't read from an arbitrary offset. This Stack Overflow post [1] has some suggestions on seekable compression libraries.
A much easier solution would be to split up your data into 100s of gzip files. This would give you most of the compression benefit and would also give you a lot of parallelization benefit during reading.

1: https://stackoverflow.com/questions/2046559/any-seekable-compression-library

On Fri, May 10, 2019 at 1:25 PM Allie Chen <yifangc...@google.com> wrote:

> Yes, that is correct.
>
> *From: *Allie Chen <yifangc...@google.com>
> *Date: *Fri, May 10, 2019 at 4:21 PM
> *To: * <dev@beam.apache.org>
> *Cc: * <u...@beam.apache.org>
>
>> Yes.
>>
>> *From: *Lukasz Cwik <lc...@google.com>
>> *Date: *Fri, May 10, 2019 at 4:19 PM
>> *To: *dev
>> *Cc: * <u...@beam.apache.org>
>>
>>> When you had X gzip files and were not using Reshuffle, did you see X
>>> workers read and process the files?
>>>
>>> On Fri, May 10, 2019 at 1:17 PM Allie Chen <yifangc...@google.com>
>>> wrote:
>>>
>>>> Yes, I do see that the data after the reshuffle are processed in
>>>> parallel. But the Reshuffle transform itself takes hours or even days
>>>> to run, according to one test (24 gzip files, 17 million lines in
>>>> total) I did.
>>>>
>>>> The file format for our users is mostly gzip, since uncompressed
>>>> files would be too costly to store (they could be hundreds of GB).
>>>>
>>>> Thanks,
>>>>
>>>> Allie
>>>>
>>>> *From: *Lukasz Cwik <lc...@google.com>
>>>> *Date: *Fri, May 10, 2019 at 4:07 PM
>>>> *To: *dev, <u...@beam.apache.org>
>>>>
>>>>> +u...@beam.apache.org <u...@beam.apache.org>
>>>>>
>>>>> Reshuffle on Google Cloud Dataflow for a bounded pipeline waits until
>>>>> all the data has been read before the next transforms can run. After
>>>>> the reshuffle, the data should have been processed in parallel across
>>>>> the workers. Did you see this?
>>>>>
>>>>> Are you able to change the input of your pipeline to use an
>>>>> uncompressed file or many compressed files?
>>>>>
>>>>> On Fri, May 10, 2019 at 1:03 PM Allie Chen <yifangc...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to load a gzip file to BigQuery using Dataflow. Since the
>>>>>> compressed file is not splittable, only one worker is allocated to
>>>>>> read the file. The same worker then does all the other transforms,
>>>>>> since Dataflow fuses all transforms together. There is a large amount
>>>>>> of data in the file, and I expected to see more workers spin up after
>>>>>> the reading transform. I tried to use the Reshuffle transform
>>>>>> <https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516>
>>>>>> to prevent the fusion, but it is not scalable since it won't proceed
>>>>>> until all the data has arrived at that point.
>>>>>>
>>>>>> Is there any other way to allow more workers to work on all the other
>>>>>> transforms after reading?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Allie
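For the splitting step, here is a minimal sketch in plain Python of how one large gzip file could be rewritten as many smaller gzip files (the file names and shard count are illustrative, not from the thread above):

    # Split one large gzip file into NUM_SHARDS smaller gzip files,
    # distributing lines round-robin so the shards stay similar in size.
    import gzip
    import itertools

    NUM_SHARDS = 100

    with gzip.open("input.gz", "rt") as source:
        shards = [gzip.open("shard-%05d.gz" % i, "wt")
                  for i in range(NUM_SHARDS)]
        try:
            for shard, line in zip(itertools.cycle(shards), source):
                shard.write(line)
        finally:
            for shard in shards:
                shard.close()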
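Once the input is many files, a Beam Python pipeline can read them with a glob pattern; ReadFromText treats each .gz file as its own (unsplittable) source, so reading parallelizes across files and the expensive post-read Reshuffle discussed above should no longer be needed. A hedged sketch, where the bucket path, table spec, schema, and parse_line helper are placeholder assumptions:

    # Read many gzip shards in parallel and load them into BigQuery.
    import apache_beam as beam
    from apache_beam.io import ReadFromText
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Placeholder: turn one text line into a BigQuery row dict.
        return {"raw": line}

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "ReadShards" >> ReadFromText("gs://my-bucket/shards/shard-*.gz")
         | "Parse" >> beam.Map(parse_line)
         | "WriteToBQ" >> beam.io.WriteToBigQuery(
               "my_project:my_dataset.my_table",
               schema="raw:STRING"))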