Hi! I have inherited code using Apache Beam and I'm trying to figure out
the right way to use it.
This seemed like it would be a good page for me:
https://beam.apache.org/get-started/wordcount-example/
I jumped in with the MinimalWordCount example
<https://beam.apache.org/get-started/wordcount-example/#minimalwordcount-example>,
which says it's explaining this python code
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py>.
However, the explanation doesn't match the code.
The code from the text goes like this:
|pipeline = beam.Pipeline(options=beam_options)|
|pipeline | beam.io.ReadFromText(input_file)|
|| 'ExtractWords' >> beam.FlatMap(lambda x:
re.findall(r'[A-Za-z\']+', x)) |
|| beam.combiners.Count.PerElement() |
|| beam.MapTuple(lambda word, count: '%s: %s' % (word, count)) |
|| beam.io.WriteToText(output_path)|
||
||
||
Fair enough, I'd say. However, the code in the example file is quite
different:
|with beam.Pipeline(options=pipeline_options) as p: # Read the text
file[pattern] into a PCollection. lines = p |
ReadFromText(known_args.input) # Count the occurrences of each word.
counts = ( lines | 'Split' >> ( beam.FlatMap( lambda x:
re.findall(r'[A-Za-z\']+', x)).with_output_types(str)) |
'PairWithOne' >> beam.Map(lambda x: (x, 1)) | 'GroupAndSum' >>
beam.CombinePerKey(sum)) # Format the counts into a PCollection of
strings. def format_result(word_count): (word, count) = word_count
return '%s: %s' % (word, count) output = counts | 'Format' >>
beam.Map(format_result) # Write the output using a "Write" transform
that has side effects. # pylint: disable=expression-not-assigned
output | WriteToText(known_args.output)|
I assume this is also probably good? But there are a number of
differences here in both structure and content. This confusion is
exactly what I don't need in intro documentation, so I'd love it if
somebody made this consistent.
Thanks,
William