Hi Alexey,
I am just learning Beam and doing a POC that requires fetching streaming
data from Pub/Sub and partitioning it on GCS as Parquet files with a fixed
window. The catch is that I have an additional requirement to use ONLY SQL.
I did not manage to do it. My attempts either ran indefinitely or failed
with `GroupByKey cannot be applied to an unbounded PCollection with global
windowing and a default trigger`, despite having the window definition in
the exact same CTE. You can find exactly what I tried here:
https://lists.apache.org/thread/q929lbwp8ylchbn8ngypfqlbvrwpfzph
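
To illustrate, here is a minimal sketch of the shape of what I tried
(identifiers, schema and interval are placeholders, not my exact code):

```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

# The window is declared inside a CTE via TUMBLE, yet GroupByKey still
# complains about global windowing on the unbounded input.
QUERY = """
WITH windowed AS (
    SELECT
        payload,
        COUNT(*) AS cnt,
        TUMBLE_START(event_timestamp, INTERVAL '1' MINUTE) AS wstart
    FROM PCOLLECTION
    GROUP BY payload, TUMBLE(event_timestamp, INTERVAL '1' MINUTE)
)
SELECT payload, cnt, wstart FROM windowed
"""

def apply_sql(rows):
    # `rows` is assumed to be a schema'd (beam.Row) unbounded PCollection
    # read from Pub/Sub upstream.
    return rows | SqlTransform(QUERY)
```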
Here my knowledge of Beam ends. My hypothesis is that `WriteToParquet`
supports only bounded data and does not automatically partition data, so
the OP could use it by batching the data in memory (into individual
windows) and then applying `WriteToParquet` to each collected batch
individually. But this is more of a guess than knowledge. Please let me
know if this is not correct.
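
To make this hypothesis concrete, here is roughly what I mean, sketched in
Python since pure SQL is off the table for me (untested; the schema, bucket
and window size are made up):

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.transforms import window
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema -- adjust to the real payload.
SCHEMA = pa.schema([("payload", pa.string())])

class WriteWindowAsParquet(beam.DoFn):
    """Writes the batch collected for one window as a single Parquet file."""

    def __init__(self, prefix):
        self.prefix = prefix

    def process(self, keyed_batch, w=beam.DoFn.WindowParam):
        _, rows = keyed_batch  # the whole window's batch, materialized in memory
        table = pa.Table.from_pylist(list(rows), schema=SCHEMA)
        path = f"{self.prefix}-{w.start.to_utc_datetime():%Y%m%dT%H%M%S}.parquet"
        with FileSystems.create(path) as f:
            pq.write_table(table, f)

def write_windowed_parquet(messages):
    # `messages` is an unbounded PCollection of dicts parsed from Pub/Sub.
    return (
        messages
        | beam.WindowInto(window.FixedWindows(60))  # 1-minute fixed windows
        | beam.Map(lambda row: (None, row))         # one key => one file per window
        | beam.GroupByKey()                         # the in-memory batching step
        | beam.ParDo(WriteWindowAsParquet("gs://my-bucket/out/part")))
```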
As for my case, I cannot test this since I am limited to pure SQL, where I
can only play with the table definition. But I did not see any table
parameters that could be responsible for partitioning. If there are any,
please let me know.
If You remember that it is possible to read or write partitioned Parquet
files with just a `PTransform`, that's great! I probably made some minor
mistake in my trials, but I am eager to learn what that mistake was.
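
For the reading side, I would expect a plain glob over a Hive-style layout
to work (an untested sketch; the path is made up, and as far as I
understand the partition values encoded in the path are not added as
columns automatically):

```python
import apache_beam as beam

# Hypothetical Hive-style layout:
#   gs://my-bucket/table/date=2023-02-17/part-*.parquet
with beam.Pipeline() as p:
    rows = (
        p
        | beam.io.ReadFromParquet("gs://my-bucket/table/*/*.parquet")
        | beam.Map(print))
```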
Best regards
Wiśniowski Piotr
I did find a solution like this one:
https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java.
This could probably help the OP, since the OP tried to save the whole
`PCollection` as Parquet instead of saving each window's partition
separately, as done in `WriteOneFilePerWindow`.
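
For completeness, a rough Python analogue of that example's idea (untested;
the path and window size are placeholders): `fileio.WriteToFiles` respects
the windowing of its input, which gives one-file-per-window behavior.

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.transforms import window

def write_one_file_per_window(lines):
    # `lines` is an unbounded PCollection of strings; WriteToFiles emits
    # files per window/pane, mirroring the Java WriteOneFilePerWindow idea.
    return (
        lines
        | beam.WindowInto(window.FixedWindows(60))
        | fileio.WriteToFiles(path="gs://my-bucket/windowed/", shards=1))
```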
On 17.02.2023 18:38, Alexey Romanenko wrote:
Piotr,
On 17 Feb 2023, at 09:48, Wiśniowski Piotr
<contact.wisniowskipi...@gmail.com> wrote:
Does this mean that Parquet IO does not support partitioning, and we
need to do some workarounds? Like explicitly mapping each window to a
separate Parquet file?
Could you elaborate a bit more on this? IIRC, we used to read
partitioned Parquet files with ParquetIO while running the TPC-DS benchmark.
—
Alexey