Hi! I have inherited code using Apache Beam and I'm trying to figure out the right way to use it.

This seemed like it would be a good page for me: https://beam.apache.org/get-started/wordcount-example/

I jumped in with the MinimalWordCount example <https://beam.apache.org/get-started/wordcount-example/#minimalwordcount-example>, which says it's explaining this python code <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py>. However, the explanation doesn't match the code.

The code from the text goes like this:


   |pipeline = beam.Pipeline(options=beam_options)|

   |pipeline | beam.io.ReadFromText(input_file)|

   || 'ExtractWords' >> beam.FlatMap(lambda x:
   re.findall(r'[A-Za-z\']+', x)) |

   || beam.combiners.Count.PerElement() |

   || beam.MapTuple(lambda word, count: '%s: %s' % (word, count)) |

   || beam.io.WriteToText(output_path)|

||

||

||

Fair enough, I'd say. However, the code in the example file is quite different:

   |with beam.Pipeline(options=pipeline_options) as p: # Read the text
   file[pattern] into a PCollection. lines = p |
   ReadFromText(known_args.input) # Count the occurrences of each word.
   counts = ( lines | 'Split' >> ( beam.FlatMap( lambda x:
   re.findall(r'[A-Za-z\']+', x)).with_output_types(str)) |
   'PairWithOne' >> beam.Map(lambda x: (x, 1)) | 'GroupAndSum' >>
   beam.CombinePerKey(sum)) # Format the counts into a PCollection of
   strings. def format_result(word_count): (word, count) = word_count
   return '%s: %s' % (word, count) output = counts | 'Format' >>
   beam.Map(format_result) # Write the output using a "Write" transform
   that has side effects. # pylint: disable=expression-not-assigned
   output | WriteToText(known_args.output)|


I assume this is also probably good? But there are a number of differences here in both structure and content. This confusion is exactly what I don't need in intro documentation, so I'd love it if somebody made this consistent.

Thanks,

William

Reply via email to