Thank you! On Mon, Feb 13, 2023 at 5:55 AM Dian Fu <dian0511...@gmail.com> wrote:
> Thanks Andrew, I think this is a valid advice. I will update the > documentation~ > > Regards, > Dian > > , > > On Fri, Feb 10, 2023 at 10:08 PM Andrew Otto <o...@wikimedia.org> wrote: > >> Question about side outputs and OutputTags in pyflink. The docs >> <https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/side_output/> >> say we are supposed to >> >> yield output_tag, value >> >> Docs then say: >> > For retrieving the side output stream you use getSideOutput(OutputTag) on >> the result of the DataStream operation. >> >> From this, I'd expect that calling datastream.get_side_output would be >> optional. However, it seems that if you do not call >> datastream.get_side_output, then the main datastream will have the record >> destined to the output tag still in it, as a Tuple(output_tag, value). >> This caused me great confusion for a while, as my downstream tasks would >> break because of the unexpected Tuple type of the record. >> >> Here's an example of the failure using side output and ProcessFunction >> in the word count example. >> <https://gist.github.com/ottomata/001df5df72eb1224c01c9827399fcbd7#file-pyflink_sideout_fail_word_count_example-py-L86-L100> >> >> I'd expect that just yielding an output_tag would make those records be >> in a different datastream, but apparently this is not the case unless you >> call get_side_output. >> >> If this is the expected behavior, perhaps the docs should be updated to >> say so? >> >> -Andrew Otto >> Wikimedia Foundation >> >> >> >> >>