Hi,
I am new to Apache Beam and have been having trouble with performance; despite searching around, I have not been able to identify a solution. My issue is that even though I am using parallelizable operations and converting everything to PCollections, Beam is incredibly slow. Furthermore, when running on larger amounts of data (more than a few hundred thousand rows), I run into data-size problems ("exceeds maximum protobuf size of 2GB" errors on individual operations).

I am just trying to reproduce code from another research group (copied from an online reproduction of the C4 dataset<https://github.com/tensorflow/datasets/blame/master/tensorflow_datasets/text/c4_utils.py>), so while the individual operations are likely fine, I am worried there is a simple setup issue that I can't figure out. I think the issue might be with how I start the pipeline, or with how I pass data into it. Help would be greatly appreciated!

Here is a skeleton of my code:

```
import apache_beam as beam

pipeline = beam.Pipeline()

line_to_selected_url = (
    pipeline
    | beam.Create(pages)  # pages is entirely Python objects: a list of (string, dict) tuples
    | beam.FlatMap(_emit_url_to_lines)  # this is copied so it should be ok
    | beam.combiners.Top.PerKey(1, key=_hash_text, reverse=True))  # this is copied so it should be ok

lines_to_keep = line_to_selected_url | beam.Map(lambda x: (x[1][0], x[0]))

final_docs = ({
    "features": pages,
    "lines": lines_to_keep
}
    | "group_features_and_lines_by_url" >> beam.CoGroupByKey()
    | beam.FlatMap(
        _remove_lines_from_text,
        min_num_sentences=min_num_sentences)
    | beam.combiners.ToList())

res = final_docs | ("Write to JSON" >> beam.Map(hacky_save))

result = res.pipeline.run()
result.wait_until_finish()
```

I think the issue could also lie with how I run the pipeline. To run the code I pass the parameters `--autoscalingAlgorithm=BASIC --maxWorkers=32` to the script.

Thank you so much for reading this far!

- L
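P.S. In case it helps, here is roughly what `pages` looks like before it goes into `beam.Create`. This is only an illustrative sketch: the URLs and the dict keys are made up, but the shape (an in-memory Python list of (string, dict) tuples, built before the pipeline starts) matches my real data:

```
# Illustrative sketch only: pages is a plain in-memory Python list of (url, features) tuples,
# built entirely before the pipeline is constructed. The dict keys here are placeholders.
pages = [
    (
        "http://example.com/some-article",
        {
            "text": "First line of the page.\nSecond line of the page.",
            "timestamp": "2020-01-01T00:00:00Z",
        },
    ),
    # ...several hundred thousand more tuples like this, all held in memory
]
```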
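And this is roughly how I invoke it (the script name is a placeholder): `python reproduce_c4.py --autoscalingAlgorithm=BASIC --maxWorkers=32`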