Hello,

I have a simple streaming app that gets data from a source and stores it to HDFS using a sink similar to the bucketing file sink. Checkpointing mode is "exactly once". Everything is fine under normal conditions, since the sink is faster than the source. But when we stop the application for a while and then restart it, we get a catch-up burst to process all the messages emitted in the meantime. During this burst the source is faster than the sink, and every checkpoint fails (times out) until the source has fully caught up. This is a problem because the sink does not "commit" data before a successful checkpoint completes, so the app releases all the catch-up data as a single atomic block. That block can be huge if the app was stopped for a long time, which puts unwanted stress on the Hadoop cluster and on the downstream Hive jobs that expect the data to arrive in micro-batches.
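For reference, the checkpointing setup looks roughly like the sketch below. The real sink is a custom one, and the interval and timeout values here are just placeholders, but it shows the configuration knobs involved:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Exactly-once checkpoints, triggered every minute (placeholder interval)
    env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

    // A checkpoint that takes longer than this is considered failed;
    // during the catch-up burst, every checkpoint hits this timeout
    env.getCheckpointConfig().setCheckpointTimeout(600_000);

    // Breathing room between checkpoints (placeholder value)
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);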
How should I handle this situation? Is there anything special to do to get checkpoints to complete even under heavy load? The problem does not seem to be new, but I was unable to find any practical solution in the documentation.

Best regards,
Arnaud