Hi Aljoscha,
In my case the valid-length file that gets created contains a value which
says, for example, that 100 MB are valid to read for exactly-once, but the
file with the data is only 95 MB large. As I understand it, the
valid-length file contains the length of the file at the time of the
checkpoint. This also does not happen for all files (3 HDFS sinks, each
with a parallelism of 2); for some part files the file size and the value
in the valid-length file match exactly.
After looking over the checkpoint code in BucketingSink, I looked
into the hsync behavior again and found the following page:
http://stackoverflow.com/questions/32231105/why-is-hsync-not-flushing-my-hdfs-file
After that I downloaded the file with the hdfs dfs tool, and the file is
actually even larger than the value in the valid-length file. I checked
this against the things I did before (a SELECT COUNT query in Impala and
Hive, and downloading the files via Hue and running wc -l): these three
ways all give the same number of lines, but hdfs dfs -cat <filename> |
wc -l gives a much larger value.
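This is roughly how I compared the different views on a part file (a
minimal standalone sketch; the part-file path and the value from the
valid-length file are passed in as arguments, so nothing here depends on
the exact format of the valid-length file):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareLengths {

    public static void main(String[] args) throws Exception {
        // args[0]: HDFS path of the part file
        // args[1]: the value found in the corresponding valid-length file
        Path partFile = new Path(args[0]);
        long validLength = Long.parseLong(args[1]);

        FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());

        // Length as the NameNode currently reports it
        long nameNodeLength = fs.getFileStatus(partFile).getLen();

        // Bytes actually readable from the DataNodes, similar to
        // `hdfs dfs -cat <file> | wc -c`; for a file that was never closed
        // this can be more than the length reported by the NameNode
        long readableBytes = 0;
        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(partFile)) {
            int read;
            while ((read = in.read(buffer)) > 0) {
                readableBytes += read;
            }
        }

        System.out.println("valid-length file value:    " + validLength);
        System.out.println("NameNode reported length:   " + nameNodeLength);
        System.out.println("bytes readable from stream: " + readableBytes);
    }
}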
So my conclusion would be that the data is written and not actually lost
as I thought, but in my use case it is not visible because the files are
not properly closed during cancel and the NameNode is not aware of the
flushed data. I could imagine two ways out of this: 1. implement hsync as
described on the Stack Overflow page (see the sketch below), or 2. ensure
that files are properly closed during cancel.
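For option 1, the Stack Overflow answer boils down to calling hsync with
the UPDATE_LENGTH flag so that the NameNode also learns the new file
length. A minimal sketch with plain HDFS client code (not a patch against
BucketingSink; the path and data are placeholders, and it assumes
fs.defaultFS points at HDFS):

import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

public class HsyncExample {

    public static void main(String[] args) throws Exception {
        Path path = new Path("/tmp/hsync-example"); // placeholder path
        FileSystem fs = FileSystem.get(new Configuration());

        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("some data\n");

            if (out instanceof HdfsDataOutputStream) {
                // hsync with UPDATE_LENGTH also updates the length at the
                // NameNode, so the flushed data becomes visible to
                // getFileStatus() and tools that rely on it
                ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
            } else {
                // Plain hsync: data is persisted on the DataNodes, but the
                // NameNode may still report the old length until close()
                out.hsync();
            }
        }
    }
}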
Best,
Jürgen
On 28.04.2017 17:38, Aljoscha Krettek wrote:
Hi Jürgen,
Is the data missing with respect to what should have been written at the
time of the cancel, or with respect to the time of the last checkpoint
(or in this case, the savepoint)? I'm asking because the cancel command is
only sent out once the savepoint has been completed, as can be seen at
[1]. If the savepoint is complete, this also means that the snapshot
method of the BucketingSink must have done its work, i.e. that it also
flushed all files, which is done in [2]. There's always the possibility of
a bug, however, so we'll have to look into this together.
Best,
Aljoscha
[1]
https://github.com/apache/flink/blob/c22efce098c14e8f08bad1e0065dbd02df6e4dbb/flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala#L607-L607
[2]
https://github.com/apache/flink/blob/b4c60a942fe07e355dd49ed2aab3c0a7ae94285d/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java
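To illustrate what I mean by flushing on snapshot, here is a heavily
simplified sketch of a sink implementing CheckpointedFunction. This is not
the actual BucketingSink code from [2]: bucketing, the operator state, and
the valid-length bookkeeping are all left out, and the path handling is
just a placeholder.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushingHdfsSink extends RichSinkFunction<String>
        implements CheckpointedFunction {

    private final String basePath; // e.g. "hdfs:///tmp/flush-sketch"
    private transient FSDataOutputStream currentPartFile;

    public FlushingHdfsSink(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        Path partFile = new Path(basePath,
                "part-" + getRuntimeContext().getIndexOfThisSubtask());
        FileSystem fs = FileSystem.get(partFile.toUri(),
                new org.apache.hadoop.conf.Configuration());
        currentPartFile = fs.create(partFile, true);
    }

    @Override
    public void invoke(String value) throws Exception {
        currentPartFile.writeBytes(value + "\n");
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // On every checkpoint/savepoint, flush the open part file and remember
        // the flushed position as the "valid length" of that file
        currentPartFile.hflush();
        long validLength = currentPartFile.getPos();
        // ... a real sink would store validLength in operator state here ...
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // ... a real sink would restore state and record/truncate the valid
        // lengths of part files left over from the previous attempt ...
    }

    @Override
    public void close() throws Exception {
        if (currentPartFile != null) {
            currentPartFile.close();
        }
    }
}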
On 27. Apr 2017, at 10:27, Jürgen Thomann
<juergen.thom...@innogames.com> wrote:
Hi,
Some time ago I had problems with writing data to Hadoop with the
BucketingSink and losing data in the case of cancel with savepoint,
because the flush/sync command was interrupted. I tried changing Hadoop
settings as suggested but had no luck in the end, so I looked into the
Flink code. If I understand the code correctly, it behaves the following
way (a simplified sketch of this pattern follows the list):
1. Start a watchdog thread if a cancellation timeout is set
2. Invoke cancel on the sink/task, but do not wait for it to finish
3. Destroy the buffer pool and release resources
4. Send an initial interrupt to the sink/task
5. Call join on the sink/task and ignore InterruptedException
6. Let the watchdog send more interrupts if needed and throw a fatal
error if the timeout is reached
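Not the actual Flink code, but a minimal standalone sketch of the
interrupt-and-watchdog pattern I mean (the cancel action, timeouts and
error handling are placeholders):

public class CancelWithWatchdog {

    /** Cancels taskThread roughly following steps 1-6 above. */
    static void cancelTask(Runnable cancelAction, Thread taskThread,
                           long interruptIntervalMillis, long cancellationTimeoutMillis) {

        long deadline = System.currentTimeMillis() + cancellationTimeoutMillis;

        // 1. + 6. watchdog: keeps interrupting the task thread and gives up
        // fatally once the cancellation timeout is reached
        Thread watchdog = new Thread(() -> {
            while (taskThread.isAlive()) {
                try {
                    Thread.sleep(interruptIntervalMillis);
                } catch (InterruptedException e) {
                    return; // task finished, watchdog no longer needed
                }
                if (!taskThread.isAlive()) {
                    return;
                }
                if (System.currentTimeMillis() > deadline) {
                    // Flink escalates this to a fatal error instead
                    throw new RuntimeException(
                            "Task did not terminate within the cancellation timeout");
                }
                taskThread.interrupt();
            }
        }, "cancellation-watchdog");
        watchdog.start();

        cancelAction.run();        // 2. invoke cancel (just sets a flag), don't wait for the task
        // 3. (buffer pool / resource release would happen here)
        taskThread.interrupt();    // 4. initial interrupt

        // 5. join and ignore interrupts until the task thread is gone
        while (taskThread.isAlive()) {
            try {
                taskThread.join();
            } catch (InterruptedException ignored) {
                // keep waiting
            }
        }
        watchdog.interrupt();
    }
}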
In my case the BucketingSink does not have enough time to flush
everything before the initial interrupt is sent, and some files are not
closed properly, which in my understanding causes the missing data in
Hadoop.
Is my understanding correct, and if so, do you know a way to get around
this behavior so that the close function can finish the sync for all
files?
Best,
Jürgen