Re: Beam Website Feedback: Apache Beam WordCount Examples

Ahmet Altay via dev Tue, 25 Oct 2022 19:31:58 -0700

Thank you for reporting this, William.

Adding @Valentyn Tymofieiev <valen...@google.com>. Perhaps we could start
by creating an issue?


On Mon, Oct 24, 2022 at 9:25 AM William Pietri <will...@williampietri.com>
wrote:

> Hi! I have inherited code using Apache Beam and I'm trying to figure out
> the right way to use it.
>
> This seemed like it would be a good page for me:
> https://beam.apache.org/get-started/wordcount-example/
>
> I jumped in with the MinimalWordCount example
> <https://beam.apache.org/get-started/wordcount-example/#minimalwordcount-example>,
> which says it's explaining this Python code
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py>.
> However, the explanation doesn't match the code.
>
> The code from the text goes like this:
>
>
> pipeline = beam.Pipeline(options=beam_options)
>
> pipeline | beam.io.ReadFromText(input_file)
>   | 'ExtractWords' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
>   | beam.combiners.Count.PerElement()
>   | beam.MapTuple(lambda word, count: '%s: %s' % (word, count))
>   | beam.io.WriteToText(output_path)
>
> Fair enough, I'd say. However, the code in the example file is quite
> different:
>
>   with beam.Pipeline(options=pipeline_options) as p:
>     # Read the text file[pattern] into a PCollection.
>     lines = p | ReadFromText(known_args.input)
>
>     # Count the occurrences of each word.
>     counts = (
>         lines
>         | 'Split' >> (
>             beam.FlatMap(
>                 lambda x: re.findall(r'[A-Za-z\']+', x)).with_output_types(str))
>         | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
>         | 'GroupAndSum' >> beam.CombinePerKey(sum))
>
>     # Format the counts into a PCollection of strings.
>     def format_result(word_count):
>       (word, count) = word_count
>       return '%s: %s' % (word, count)
>
>     output = counts | 'Format' >> beam.Map(format_result)
>
>     # Write the output using a "Write" transform that has side effects.
>     # pylint: disable=expression-not-assigned
>     output | WriteToText(known_args.output)
>
>
> I assume this is also probably good? But there are a number of differences
> here in both structure and content. This confusion is exactly what I don't
> need in intro documentation, so I'd love it if somebody made this
> consistent.
>
> Thanks,
>
> William
>
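
For anyone comparing the two quoted versions: they appear to describe the same
computation, with the website text counting each word directly and the example
file pairing each word with 1 and summing per key. The following is a minimal
plain-Python sketch of that equivalence. It is not Beam code; the helper names
and the sample input are hypothetical, and only the regular expression and the
'%s: %s' format string are taken from the quoted snippets.

```python
import re
from collections import Counter, defaultdict

# Same pattern as in both quoted snippets (r'[A-Za-z\']+').
WORD_RE = r"[A-Za-z']+"


def extract_words(lines):
    """Plain-Python stand-in for the FlatMap step: emit every match per line."""
    for line in lines:
        yield from re.findall(WORD_RE, line)


def count_per_element(words):
    """Rough analogue of counting each distinct element directly."""
    return dict(Counter(words))


def pair_and_sum(words):
    """Rough analogue of 'PairWithOne' followed by a per-key sum."""
    pairs = ((word, 1) for word in words)
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return dict(totals)


def format_counts(counts):
    """Format as 'word: count' strings, sorted so results are comparable."""
    return sorted('%s: %s' % (word, count) for word, count in counts.items())


# Hypothetical input, chosen only for illustration.
sample_lines = ["to be or not to be", "that's the question"]
words = list(extract_words(sample_lines))

direct = format_counts(count_per_element(words))
paired = format_counts(pair_and_sum(words))
assert direct == paired
```

Under these assumptions both routes produce the same formatted lines, which
suggests the differences between the page and the example file are mostly in
structure (named steps, a `with` block, a separate formatting function) rather
than in the result.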
