This may have been discussed in the past, but I haven't been able to find
the discussion...

It seems as though much work has been done to make distcp from 1.0 to 2.0
work with checksum enabled (
https://issues.apache.org/jira/browse/HADOOP-8060). And I do see all the
work has been merged to the 2.0 releases. However, it seems that distcp
from 1.0 to 2.0 still doesn't work if the CRC check is enabled. Is that a
correct understanding?

I took a quick look at the distcp code (mostly around CopyMapper and
RetriableFileCopyCommand), and I don't see how the source checksum type is
passed along when the file is created through DFSClient. It also doesn't look
like dfs.checksum.type is set upon discovering the source checksum type
(which would have been another mechanism). This is consistent with my
testing: the copy fails by default, and I can confirm that it works if I pass
in the command line option "-Ddfs.checksum.type=CRC32".
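
To make the second mechanism concrete, here is a rough sketch of what I had in mind, not anything that exists in distcp today. The class and method names are hypothetical, and I am assuming the source file's checksum algorithm name looks like "MD5-of-0MD5-of-512CRC32" (or the same with a "CRC32C" suffix). It deliberately avoids Hadoop classes so it stands alone; in distcp the name would presumably come from the source file's checksum.

```java
// Hypothetical helper, not part of distcp: derives a value for
// dfs.checksum.type from a source file's checksum algorithm name.
final class ChecksumTypeSketch {

    // Assumption: names are shaped like "MD5-of-0MD5-of-512CRC32" or
    // "MD5-of-0MD5-of-512CRC32C". Returns null when the name is not recognized.
    static String crcTypeFromAlgorithmName(String algorithmName) {
        if (algorithmName == null) {
            return null;
        }
        String name = algorithmName.toUpperCase(java.util.Locale.ROOT);
        if (name.endsWith("CRC32C")) {
            return "CRC32C";
        }
        if (name.endsWith("CRC32")) {
            return "CRC32";
        }
        return null;
    }

    public static void main(String[] args) {
        // Default to an algorithm name of the assumed 1.0-style shape.
        String sourceAlgorithm =
            args.length > 0 ? args[0] : "MD5-of-0MD5-of-512CRC32";
        String crcType = crcTypeFromAlgorithmName(sourceAlgorithm);
        if (crcType == null) {
            System.out.println("unrecognized checksum algorithm; leaving dfs.checksum.type unset");
        } else {
            // Prints the option one would otherwise pass by hand.
            System.out.println("-Ddfs.checksum.type=" + crcType);
        }
    }
}
```

If something like this ran per source file, the copy could set dfs.checksum.type on the configuration used to create the target instead of relying on the command line option.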

Is this understanding accurate? If so, is there a reason this was not done
in distcp? Curious...

Thanks,
Sangjin
