Hi, 

As with most pipelines, our Flink pipeline starts by parsing the source Kafka 
events into POJOs. We perform this step within a KafkaDeserializationSchema so 
that we can properly extract the event time timestamp for the downstream 
TimestampAssigner. 
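
For reference, this is roughly what our deserialization schema looks like 
today (class and parser names like MyEvent and MyEventParser are simplified 
placeholders, not our real code): 

import org.apache.flink.api.common.typeinfo.TypeInformation; 
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema; 
import org.apache.kafka.clients.consumer.ConsumerRecord; 

// Simplified sketch: the full (CPU-heavy) parse happens here, so that the 
// resulting POJO already carries the event timestamp for the downstream 
// TimestampAssigner. 
public class MyEventDeserializationSchema implements KafkaDeserializationSchema<MyEvent> { 

    @Override 
    public MyEvent deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception { 
        // This is the expensive part that currently CPU-bounds our ingestion 
        return MyEventParser.parse(record.value()); 
    } 

    @Override 
    public boolean isEndOfStream(MyEvent nextElement) { 
        return false; 
    } 

    @Override 
    public TypeInformation<MyEvent> getProducedType() { 
        return TypeInformation.of(MyEvent.class); 
    } 
} 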

It turns out that parsing is currently the most CPU-intensive task in our 
pipeline and thus bounds the number of elements we can ingest per second. 
Further splitting up the partitions would be hard, as we need to maintain the 
exact order of events per partition, and it would also require quite some 
architectural changes to the producers and the Flink job. 

My idea was to move the parsing task into ordered Async I/O. But as far as I 
can see, Async I/O can only be plugged into an existing stream, not into the 
deserialization schema. So the best idea I currently have is to keep the 
parsing in the DeserializationSchema as minimal as possible, just enough to 
extract the event timestamp, and do the full parsing downstream in Async I/O 
(sketched below). This, however, seems a bit tedious, especially as we have to 
deal with multiple input formats and would need to develop two parsers for the 
heavy-load ones: a timestamp-only parser and a full parser. 
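
To make that concrete, here is a rough sketch of what I have in mind. It 
assumes a hypothetical RawEvent wrapper (raw bytes plus the pre-extracted 
timestamp) produced by a minimal timestamp-only schema, with the same 
MyEventParser as above doing the expensive work inside an ordered Async I/O 
operator: 

import java.util.Collections; 
import java.util.concurrent.CompletableFuture; 
import java.util.concurrent.ExecutorService; 
import java.util.concurrent.Executors; 
import java.util.concurrent.TimeUnit; 
import org.apache.flink.configuration.Configuration; 
import org.apache.flink.streaming.api.datastream.AsyncDataStream; 
import org.apache.flink.streaming.api.datastream.DataStream; 
import org.apache.flink.streaming.api.functions.async.ResultFuture; 
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction; 

public class ParseAsyncFunction extends RichAsyncFunction<RawEvent, MyEvent> { 

    private transient ExecutorService parserPool; 

    @Override 
    public void open(Configuration parameters) { 
        // Dedicated pool for the CPU-heavy parse; the pool size is just a guess 
        parserPool = Executors.newFixedThreadPool(4); 
    } 

    @Override 
    public void asyncInvoke(RawEvent input, ResultFuture<MyEvent> resultFuture) { 
        // Run the full parse off the task thread and emit exactly one result 
        CompletableFuture 
                .supplyAsync(() -> MyEventParser.parse(input.getBytes()), parserPool) 
                .thenAccept(event -> resultFuture.complete(Collections.singleton(event))); 
    } 

    @Override 
    public void close() { 
        parserPool.shutdown(); 
    } 
} 

// Usage: rawStream is the DataStream<RawEvent> from the minimal schema. 
// orderedWait emits results in input order, so per-partition ordering is kept; 
// the 30s timeout and capacity of 100 are arbitrary values for illustration. 
DataStream<MyEvent> parsed = AsyncDataStream.orderedWait( 
        rawStream, new ParseAsyncFunction(), 30, TimeUnit.SECONDS, 100); 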

Do you know whether it is somehow possible to parallelize or use Async I/O for 
the parsing within the KafkaDeserializationSchema? I don't have state access in 
there, and I don't have a "collector" object either, so one input element needs 
to produce exactly one output element. 

Another question: my parsing produces Java POJOs via "new", which are sent 
downstream (with the "reusePOJO" / object reuse setting enabled, see below) and 
are finally garbage collected once they have reached the sink. Is there some 
mechanism in Flink that would let me reuse those "old" POJOs from the sink back 
in the source? All tasks are chained, so theoretically that should be possible? 
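
To clarify what I mean by the "reusePOJO" setting: we enable object reuse on 
the ExecutionConfig, roughly like this: 

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; 

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); 
// With object reuse enabled, chained operators hand the same object instance 
// along instead of copying it between them. 
env.getConfig().enableObjectReuse(); 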

Best regards 
Theo 
