Hi Alexey,
I am just learning Beam and doing a POC that requires fetching streaming
data from Pub/Sub and partitioning it on GCS as Parquet files with a fixed
window. The catch is that I have an additional requirement to use ONLY SQL.
I did not manage to do it. My attempts either ran indefinitely or failed
with `GroupByKey cannot be applied to an unbounded PCollection with global
windowing and a default trigger`, despite having the window definition in
the exact same CTE. You can find exactly what I tried here:
https://lists.apache.org/thread/q929lbwp8ylchbn8ngypfqlbvrwpfzph
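
To illustrate, here is a minimal sketch of the shape of what I tried
(identifiers, schema and interval are placeholders, not my exact code):

```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

# The window is declared inside a CTE via TUMBLE, yet GroupByKey still
# complains about global windowing on the unbounded input.
QUERY = """
WITH windowed AS (
    SELECT
        payload,
        COUNT(*) AS cnt,
        TUMBLE_START(event_timestamp, INTERVAL '1' MINUTE) AS wstart
    FROM PCOLLECTION
    GROUP BY payload, TUMBLE(event_timestamp, INTERVAL '1' MINUTE)
)
SELECT payload, cnt, wstart FROM windowed
"""

def apply_sql(rows):
    # `rows` is assumed to be a schema'd (beam.Row) unbounded PCollection
    # read from Pub/Sub upstream.
    return rows | SqlTransform(QUERY)
```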
Here my knowledge of Beam ends. My hypothesis is that `WriteToParquet`
supports only bounded data and does not automatically partition data, so
the OP could use it by batching the data in memory (into individual
windows) and then applying `WriteToParquet` to each collected batch
individually. But this is more of a guess than knowledge. Please let me
know if this is not correct.
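
To make this hypothesis concrete, here is roughly what I mean, sketched in
Python since pure SQL is off the table for me (untested; the schema, bucket
and window size are made up):

```python
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.transforms import window
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema -- adjust to the real payload.
SCHEMA = pa.schema([("payload", pa.string())])

class WriteWindowAsParquet(beam.DoFn):
    """Writes the batch collected for one window as a single Parquet file."""

    def __init__(self, prefix):
        self.prefix = prefix

    def process(self, keyed_batch, w=beam.DoFn.WindowParam):
        _, rows = keyed_batch  # the whole window's batch, materialized in memory
        table = pa.Table.from_pylist(list(rows), schema=SCHEMA)
        path = f"{self.prefix}-{w.start.to_utc_datetime():%Y%m%dT%H%M%S}.parquet"
        with FileSystems.create(path) as f:
            pq.write_table(table, f)

def write_windowed_parquet(messages):
    # `messages` is an unbounded PCollection of dicts parsed from Pub/Sub.
    return (
        messages
        | beam.WindowInto(window.FixedWindows(60))  # 1-minute fixed windows
        | beam.Map(lambda row: (None, row))         # one key => one file per window
        | beam.GroupByKey()                         # the in-memory batching step
        | beam.ParDo(WriteWindowAsParquet("gs://my-bucket/out/part")))
```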
As for my case, I cannot test this since I am limited to pure SQL, where I
can only play with the table definition. But I did not see any table
parameters that could be responsible for partitioning. If there are any,
please let me know.
If You remember that it is possible to read or write partitioned Parquet
files with just a `PTransform`, that's great! I probably made some minor
mistake in my trials, but I am eager to learn what that mistake was.
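
For the reading side, I would expect a plain glob over a Hive-style layout
to work (an untested sketch; the path is made up, and as far as I
understand the partition values encoded in the path are not added as
columns automatically):

```python
import apache_beam as beam

# Hypothetical Hive-style layout:
#   gs://my-bucket/table/date=2023-02-17/part-*.parquet
with beam.Pipeline() as p:
    rows = (
        p
        | beam.io.ReadFromParquet("gs://my-bucket/table/*/*.parquet")
        | beam.Map(print))
```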
Best regards
Wiśniowski Piotr
I did find a solution like this one:
https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java.
This could probably help the OP, since the OP tried to save the whole
`PCollection` as Parquet instead of saving each window's partition
separately, as done in `WriteOneFilePerWindow`.
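
For completeness, a rough Python analogue of that example's idea (untested;
the path and window size are placeholders): `fileio.WriteToFiles` respects
the windowing of its input, which gives one-file-per-window behavior.

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.transforms import window

def write_one_file_per_window(lines):
    # `lines` is an unbounded PCollection of strings; WriteToFiles emits
    # files per window/pane, mirroring the Java WriteOneFilePerWindow idea.
    return (
        lines
        | beam.WindowInto(window.FixedWindows(60))
        | fileio.WriteToFiles(path="gs://my-bucket/windowed/", shards=1))
```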
On 17.02.2023 18:38, Alexey Romanenko wrote:
Piotr,
On 17 Feb 2023, at 09:48, Wiśniowski Piotr
<contact.wisniowskipi...@gmail.com> wrote:
Does this mean that Parquet IO does not support partitioning, and we
need to do some workarounds? Like explicitly mapping each window to a
separate Parquet file?
Could you elaborate a bit more on this? IIRC, we used to read
partitioned Parquet files with ParquetIO while running the TPC-DS benchmark.
—
Alexey