Have you tried org.apache.hadoop.util.DataChecksum and
org.apache.hadoop.util.PureJavaCrc32?
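
Since PureJavaCrc32 implements java.util.zip.Checksum, it should drop straight
into your test below. A rough, untested sketch (the class name is mine, and it
assumes the hadoop-core jar is on the classpath):

import java.util.zip.Checksum;

import org.apache.hadoop.util.PureJavaCrc32;

public class CrcCompare {
    public static void main(String[] args) {
        // Same loop as your Test1, but with Hadoop's pure-Java CRC32
        // used as a drop-in replacement for java.util.zip.CRC32.
        Checksum sum = new PureJavaCrc32();
        byte[] bs = new byte[512];
        final int totSize = 64 * 1024 * 1024;   // one 64MB block's worth of data
        long start = System.nanoTime();
        for (int k = 0; k < totSize / bs.length; k++) {
            for (int i = 0; i < bs.length; i++)
                bs[i] = (byte) i;
            sum.update(bs, 0, bs.length);
        }
        System.out.println("PureJavaCrc32 takes "
                + (System.nanoTime() - start) / 1000 / 1000 + " ms");
    }
}
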
- Milind
On Jan 5, 2011, at 3:42 PM, Da Zheng wrote:
> I'm not sure of that. I wrote a small checksum program for testing. Once the
> buffer size gets larger than 8192 bytes, I don't see much performance
> improvement; see the code below. So I don't think going to 64MB can bring us
> any benefit. I did change io.bytes.per.checksum to 131072 in Hadoop, and the
> job ran about 4 or 5 minutes faster (the reduce phase takes about 35 minutes
> in total).
>
> import java.util.zip.CRC32;
> import java.util.zip.Checksum;
>
> public class Test1 {
>     public static void main(String[] args) {
>         Checksum sum = new CRC32();
>         byte[] bs = new byte[512];              // checksum chunk size
>         final int tot_size = 64 * 1024 * 1024;  // one 64MB block's worth of data
>         long time = System.nanoTime();
>         for (int k = 0; k < tot_size / bs.length; k++) {
>             // fill the buffer, then fold it into the running CRC
>             for (int i = 0; i < bs.length; i++)
>                 bs[i] = (byte) i;
>             sum.update(bs, 0, bs.length);
>         }
>         // elapsed time in milliseconds
>         System.out.println("takes " + (System.nanoTime() - time) / 1000 / 1000);
>     }
> }
>
>
> On 01/05/2011 05:03 PM, Milind Bhandarkar wrote:
>> I agree with Jay B. Checksumming is usually the culprit for high CPU on
>> clients and datanodes. Plus, a 4-byte checksum for every 512 bytes means that
>> for a 64MB block the checksum data adds up to 64MB / 512 * 4 = 512KB, i.e.
>> 128 4KB ext3 blocks. Changing it to generate one ext3 block of checksums per
>> DFS block will speed up reads and writes without any loss of reliability.
>>
>> - milind
>>
>> ---
>> Milind Bhandarkar
>> ([email protected])
>> (650-776-3236)
>
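
To get the ratio I mentioned above (one 4KB ext3 block of checksum data per
64MB DFS block), the chunk size works out to 64MB / 1024 checksums = 64KB.
Something along these lines would set it for a quick test (just a sketch; the
class name is made up, and in practice the value would go into the site
configuration rather than code):

import org.apache.hadoop.conf.Configuration;

public class ChecksumChunkConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 64KB per checksum chunk => 1024 four-byte checksums per 64MB block,
        // i.e. 4KB of checksum data = one ext3 block per DFS block.
        conf.setInt("io.bytes.per.checksum", 64 * 1024);
        System.out.println("io.bytes.per.checksum = "
                + conf.getInt("io.bytes.per.checksum", 512));
    }
}
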
---
Milind Bhandarkar
([email protected])
(650-776-3236)