PyFlink and parallelism

John Tipper Fri, 15 Jul 2022 08:44:49 -0700

Hi all,

I have a processing topology using PyFlink and SQL where there is data skew: 
I'm splitting a stream of heterogeneous data into separate streams based on the 
type of data in each event. Some of these substreams carry far more events than 
others, and this is causing checkpoints to time out. I'd like to increase 
parallelism for just these problematic streams, but I'm not sure how to do that 
for only those streams. Do I need to use the DataStream API here? What would 
that look like, please?
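
As a rough illustration of the skew described above, here is a minimal plain-Python sketch. It is not PyFlink and does not use any Flink API. The event counts, the extra type names `type_b` and `type_c`, and the helper `per_subtask_load` are all invented for illustration, and it assumes that events of one type can be spread evenly across that type's subtasks.

```python
from collections import Counter


def per_subtask_load(event_types, parallelism_by_type, default_parallelism=1):
    """Return, per event type, the number of events the busiest subtask handles.

    Assumes events of one type are spread evenly (round-robin) over the
    subtasks assigned to that type. Hypothetical helper, not a Flink API.
    """
    counts = Counter(event_types)
    load = {}
    for event_type, count in counts.items():
        p = parallelism_by_type.get(event_type, default_parallelism)
        # Ceiling division: the busiest subtask gets at most this many events.
        load[event_type] = -(-count // p)
    return load


# Invented workload: one hot event type and two cold ones.
events = ["event_of_interest"] * 9000 + ["type_b"] * 600 + ["type_c"] * 400

# Everything at the default parallelism of 1.
uniform = per_subtask_load(events, {})

# Only the hot substream gets a higher parallelism.
targeted = per_subtask_load(events, {"event_of_interest": 8})
```

With these made-up numbers, the hot substream goes from 9000 events on a single subtask to 1125 per subtask, while the small substreams are left at parallelism 1.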



I have a table defined and I duplicate a stream from that table, then filter so 
that my substream has only the events I'm interested in:


from pyflink.table.expressions import col

events_table = table_env.from_path(MY_SOURCE_TABLE)

filtered_table = events_table.filter(
    col("event_type") == "event_of_interest"
)

table_env.create_temporary_view(MY_FILTERED_VIEW, filtered_table)

# now execute SQL on MY_FILTERED_VIEW
table_env.execute_sql(...)


The default parallelism of the overall table environment is 1. Is there a way 
to increase the parallelism for just this substream?

Many thanks,

John
