Hi, We have a Flink job that reads data from an input stream, then converts each event from JSON string Avro object, finally writes to parquet files using StreamingFileSink with OnCheckPointRollingPolicy of 5 mins. Basically a stateless job. Initially, we use one map operator to convert Json string to Avro object, Inside the map function, it goes form String -> JsonObject -> Avro object.
DataStream<AvroSchema> avroData = data.map(new JsonToAVRO()); When we try to break the map operator to two, one for String to JsonObject, another for JsonObject to Avro. DataStream<JsonObject> JsonData = data.map(new StringToJson()); DataStream<AvroSchema> avroData = rawDataAsJson.map(new JsonToAvroSchema()) The benchmark shows significant performance hit when breaking down to two operators. We try to understand the Flink internal on why such a big difference. The setup is using state backend = filesystem. Checkpoint = s3 bucket. Our event object has 300+ attributes. Thanks Ivan