It's unlikely that Reshuffle itself takes hours. It's more likely that simply reading and decompressing all that data was very slow when there was no parallelism.
*From:* Allie Chen <yifangc...@google.com>
*Date:* Fri, May 10, 2019 at 1:17 PM
*To:* <dev@beam.apache.org>
*Cc:* <u...@beam.apache.org>

Yes, I do see that the data after Reshuffle is processed in parallel. But the Reshuffle transform itself takes hours or even days to run, according to one test I did (24 gzip files, 17 million lines in total).

The file format for our users is mostly gzip, since uncompressed files would be too costly to store (they could be hundreds of GB).

Thanks,

Allie

> *From:* Lukasz Cwik <lc...@google.com>
> *Date:* Fri, May 10, 2019 at 4:07 PM
> *To:* dev, <u...@beam.apache.org>
>
> +u...@beam.apache.org <u...@beam.apache.org>
>
> Reshuffle on Google Cloud Dataflow for a bounded pipeline waits until all the data has been read before the next transforms can run. After the reshuffle, the data should be processed in parallel across the workers. Did you see this?
>
> Are you able to change the input of your pipeline to use an uncompressed file or many compressed files?
>
> On Fri, May 10, 2019 at 1:03 PM Allie Chen <yifangc...@google.com> wrote:
>
>> Hi,
>>
>> I am trying to load a gzip file into BigQuery using Dataflow. Since the compressed file is not splittable, a single worker is allocated to read it. That same worker also runs all the other transforms, since Dataflow fuses all the transforms together. There is a large amount of data in the file, and I expected to see more workers spin up after the read transform. I tried to use the Reshuffle transform
>> <https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516>
>> to prevent the fusion, but it is not scalable since it won't proceed until all the data has arrived at that point.
>>
>> Is there any other way to allow more workers to work on the other transforms after reading?
>>
>> Thanks,
>>
>> Allie
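Lukasz's suggestion above (use many compressed files instead of one) can be sketched outside of Beam with only the Python standard library. The helper below, `split_gzip`, is a hypothetical name, not part of any Beam or Dataflow API; it re-shards one large gzip file into many smaller gzip files so a runner can hand each shard to a different worker:

```python
import gzip
import itertools
import os


def split_gzip(src, out_dir, lines_per_shard):
    """Re-shard one large gzip file into many smaller gzip files.

    `split_gzip` and the shard-naming scheme are illustrative choices,
    not part of any Beam API.  Each output shard holds at most
    `lines_per_shard` lines from the source file.
    """
    os.makedirs(out_dir, exist_ok=True)
    shards = []
    with gzip.open(src, "rt") as f:
        for i in itertools.count():
            # Pull the next slice of lines; an empty slice means EOF.
            chunk = list(itertools.islice(f, lines_per_shard))
            if not chunk:
                break
            path = os.path.join(out_dir, "part-%05d.gz" % i)
            with gzip.open(path, "wt") as out:
                out.writelines(chunk)
            shards.append(path)
    return shards
```

After pre-splitting, the pipeline could read the shards with a glob pattern (for example `ReadFromText("out_dir/part-*.gz")`), giving the runner one read task per shard instead of a single unsplittable read. Note this only helps if the splitting step itself can be done cheaply, e.g. where the file is produced.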