Siyao Meng created HADOOP-18564:
-----------------------------------
Summary: Use file-level checksum by default when copying between
two different file systems
Key: HADOOP-18564
URL: https://issues.apache.org/jira/browse/HADOOP-18564
Project: Hadoop Common
Issue Type: Improvement
Reporter: Siyao Meng
h2. Goal
Reduce user friction when copying between two different file systems.
h2. Background
When distcp'ing between two different file systems, distcp still uses a
block-level checksum by default, even though the two file systems may manage
blocks very differently, so a block-level comparison no longer makes sense
between them.
e.g. distcp between HDFS and Ozone without overriding
{{dfs.checksum.combine.mode}} throws an IOException because the block layouts
of the same file on the two FSes differ (as expected):
{code}
$ hadoop distcp -i -pp /test o3fs://buck-test1.vol1.ozone1/
java.lang.Exception: java.io.IOException: File copy failed: hdfs://duong-1.duong.root.hwx.site:8020/test/test.bin --> o3fs://buck-test1.vol1.ozone1/test/test.bin
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
Caused by: java.io.IOException: File copy failed: hdfs://duong-1.duong.root.hwx.site:8020/test/test.bin --> o3fs://buck-test1.vol1.ozone1/test/test.bin
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:262)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:219)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://duong-1.duong.root.hwx.site:8020/test/test.bin to o3fs://buck-test1.vol1.ozone1/test/test.bin
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:258)
	... 11 more
Caused by: java.io.IOException: Checksum mismatch between hdfs://duong-1.duong.root.hwx.site:8020/test/test.bin and o3fs://buck-test1.vol1.ozone1/.distcp.tmp.attempt_local1346550241_0001_m_000000_0.
Source and destination filesystems are of different types
Their checksum algorithms may be incompatible
You can choose file-level checksum validation via -Ddfs.checksum.combine.mode=COMPOSITE_CRC when block-sizes or filesystems are different. Or you can skip checksum-checks altogether with -skipcrccheck.
{code}
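To illustrate why the default fails here, a minimal sketch in plain Java (not Hadoop code; an MD5-of-block-MD5s stands in for the shape of HDFS's MD5MD5CRC-style block checksum, and the class and method names are illustrative only). It shows that a checksum composed from per-block checksums depends on the block size, while a whole-file digest does not:

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class BlockChecksumDemo {
    // Digest the concatenation of per-block MD5s, mimicking the
    // block-composed structure of a block-level file checksum.
    static byte[] md5OfBlockMd5s(byte[] data, int blockSize) throws Exception {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += blockSize) {
            int end = Math.min(off + blockSize, data.length);
            byte[] blockDigest = MessageDigest.getInstance("MD5")
                    .digest(Arrays.copyOfRange(data, off, end));
            outer.update(blockDigest);
        }
        return outer.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] file = new byte[1024];
        Arrays.fill(file, (byte) 'x');

        // Identical bytes, different block sizes (as on two different
        // FSes) -> different block-level checksums.
        System.out.println("block-level match: " + Arrays.equals(
                md5OfBlockMd5s(file, 256), md5OfBlockMd5s(file, 512)));

        // A file-level digest ignores block boundaries entirely.
        System.out.println("file-level match: " + Arrays.equals(
                MessageDigest.getInstance("MD5").digest(file),
                MessageDigest.getInstance("MD5").digest(file)));
    }
}
```

This is the same reason a block-size-independent, file-level checksum such as {{COMPOSITE_CRC}} can validate a copy between two FSes that a block-level checksum cannot.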
And it works when we use a file-level checksum like {{COMPOSITE_CRC}}:
{code:title=With -Ddfs.checksum.combine.mode=COMPOSITE_CRC}
$ hadoop distcp -i -pp /test o3fs://buck-test2.vol1.ozone1/ -Ddfs.checksum.combine.mode=COMPOSITE_CRC
22/10/18 19:07:42 INFO mapreduce.Job: Job job_local386071499_0001 completed successfully
22/10/18 19:07:42 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=219900
		FILE: Number of bytes written=794129
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=0
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=13
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
		O3FS: Number of bytes read=0
		O3FS: Number of bytes written=0
		O3FS: Number of read operations=5
		O3FS: Number of large read operations=0
		O3FS: Number of write operations=0
..
{code}
h2. Alternative
(if changing the global default could break distcp'ing between HDFS/S3/etc.)
Don't touch the global default; make it a client-side config instead.
e.g. add a config that automatically uses {{COMPOSITE_CRC}}
({{dfs.checksum.combine.mode}}) when distcp'ing between HDFS and Ozone. This
would be the equivalent of specifying {{-Ddfs.checksum.combine.mode=COMPOSITE_CRC}}
on the distcp command line, but the end user wouldn't have to specify it every
single time.
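As a stopgap today, a user can set the existing property once in the client-side configuration instead of on every command line. A sketch of the hdfs-site.xml fragment (assuming this client-wide override is acceptable for all of that client's copies):

{code:xml}
<!-- Client-side override: use a file-level composite CRC for file
     checksums, so distcp validation works across FSes with different
     block layouts. -->
<property>
  <name>dfs.checksum.combine.mode</name>
  <value>COMPOSITE_CRC</value>
</property>
{code}

The proposal above would go one step further and pick this behavior automatically when the source and destination file systems differ.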
cc [~duongnguyen] [~weichiu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)