Hello,

I am working on a project integrating Samza and Hive. As part of this
project, we ran into an issue where sequence files written from Samza were
taking hours to fully sync to HDFS.

After some Googling and digging into the code, it appears that the issue is 
here:
https://github.com/apache/samza/blob/master/samza-hdfs/src/main/scala/org/apache/samza/system/hdfs/writer/SequenceFileHdfsWriter.scala#L111

Writer.stream(dfs.create(path)) means the caller of dfs.create(path) is 
responsible for explicitly closing the created stream. That never happens, 
so SequenceFileHdfsWriter's close() only flushes the writer and the 
underlying HDFS output stream is left open.

I believe the correct line should be:

Writer.file(path)

Alternatively, SequenceFileHdfsWriter should explicitly track and close the stream.
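To illustrate the first option, here is a minimal sketch of the proposed change, assuming Hadoop's SequenceFile.createWriter(conf, opts...) API; conf, dfs, path, and the key/value classes are placeholders for whatever SequenceFileHdfsWriter already has in scope:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, SequenceFile}
import org.apache.hadoop.io.SequenceFile.Writer

val conf = new Configuration()
val dfs  = FileSystem.get(conf)
val path = new Path("/tmp/example.seq")  // placeholder path

// With Writer.file(path), the writer owns the underlying output stream,
// so writer.close() both flushes and closes it, finalizing the file on
// HDFS. With Writer.stream(dfs.create(path)), close() only flushes and
// the stream stays open.
val writer = SequenceFile.createWriter(
  conf,
  Writer.file(path),  // instead of Writer.stream(dfs.create(path))
  Writer.keyClass(classOf[BytesWritable]),
  Writer.valueClass(classOf[BytesWritable])
)
try {
  writer.append(new BytesWritable(Array[Byte](1)), new BytesWritable(Array[Byte](2)))
} finally {
  writer.close() // also closes the underlying stream in the Writer.file case
}
```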

Thanks!

Ben

Reference material:
http://stackoverflow.com/questions/27916872/why-the-sequencefile-is-truncated
https://apache.googlesource.com/hadoop-common/+/HADOOP-6685/src/java/org/apache/hadoop/io/SequenceFile.java#1238
