Hi,
This sounds like exactly the problem I reported a few emails earlier:
https://lists.apache.org/thread/q929lbwp8ylchbn8ngypfqlbvrwpfzph
Does this mean that Parquet IO does not support partitioning and we need
a workaround, such as explicitly mapping each window to a separate
Parquet file? That could be a solution in your case, if it works (just an
idea worth trying - I did not test it and do not have enough experience
with Beam), but I am limited to pure SQL and am not sure how I could do
it there.
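If it helps, below is a rough, untested sketch of what I mean by mapping
each window to its own Parquet file. It assumes the records are dicts
matching your pyarrow_schema and uses apache_beam.io.fileio.WriteToFiles
(which, as far as I understand, supports windowed writes on unbounded
input) together with a hand-rolled Parquet sink - again, just an idea
from reading the docs, I have not run it:

    import pyarrow as pa
    import pyarrow.parquet as pq

    from apache_beam.io import fileio


    class ParquetSink(fileio.FileSink):
        # Buffers the records routed to one output file and writes them
        # as a single Parquet file when the sink is flushed.
        def __init__(self, schema):
            self._schema = schema
            self._buffer = []

        def open(self, fh):
            # fh is the file handle WriteToFiles opens per window/destination.
            self._fh = fh
            self._buffer = []

        def write(self, record):
            # Assumes each record is a dict matching the pyarrow schema.
            self._buffer.append(record)

        def flush(self):
            table = pa.Table.from_pylist(self._buffer, schema=self._schema)
            pq.write_table(table, self._fh)


    # In your pipeline, after beam.WindowInto(window.FixedWindows(60)),
    # instead of WriteToParquet:
    #   | "writing to parquet" >> fileio.WriteToFiles(
    #         path='s3://test-bucket/',
    #         sink=lambda dest: ParquetSink(pyarrow_schema))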
Hope this helps with your problem, and that Beam support can find a
solution for my case too.
Best
Wiśniowski Piotr
On 17.02.2023 02:00, Lydian wrote:
I want to make a simple Beam pipeline which will store the events from
Kafka to S3 in Parquet format every minute.
Here's a simplified version of my pipeline:
    def add_timestamp(event: Any) -> Any:
        from datetime import datetime
        from apache_beam import window
        return window.TimestampedValue(
            event, datetime.timestamp(event[1].timestamp))

    # Actual Pipeline
    (
        pipeline
        | "Read from Kafka" >> ReadFromKafka(consumer_config, topics, with_metadata=False)
        | "Transformed" >> beam.Map(my_transform)
        | "Add timestamp" >> beam.Map(add_timestamp)
        | "window" >> beam.WindowInto(window.FixedWindows(60))  # 1 min
        | "writing to parquet" >> beam.io.WriteToParquet('s3://test-bucket/', pyarrow_schema)
    )
However, the pipeline failed with:

    GroupByKey cannot be applied to an unbounded PCollection with global
    windowing and a default trigger
This seems to come from
https://github.com/apache/beam/blob/v2.41.0/sdks/python/apache_beam/io/iobase.py#L1145-L1146,
which always applies GlobalWindows and thus causes this error. I am
wondering what I should do to correctly back up the events from Kafka
(unbounded) to S3. Thanks!
BTW, I am running the portableRunner with Flink. The Beam version is
2.41.0 (the latest version seems to have the same code) and the Flink
version is 1.14.5.
Sincerely,
Lydian Lee