gaozhan ding created HDFS-15732:
-----------------------------------

             Summary: EC client will not retry get block token when block token expired in kerberized cluster
                 Key: HDFS-15732
                 URL: https://issues.apache.org/jira/browse/HDFS-15732
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsclient, ec, erasure-coding
    Affects Versions: 3.1.1
         Environment: hadoop 3.1.1
                      kerberos
                      ec RS-3-2-1024k
            Reporter: gaozhan ding


When an EC policy is enabled on HBase, we hit the following issue. Once a block token has expired on the datanode side, the client cannot identify the InvalidToken error because it is hidden behind the SASL negotiation. As a result, the EC client does not retry by refetching the token when it creates a block reader. The peer datanode is then added to DeadNodes, and every subsequent createBlockReader call against that datanode in the current DFSStripedInputStream considers it dead and returns false. The final result is a read failure. (A minimal, hypothetical sketch of the missing retry handling is appended after the logs below.)

Some logs:

hbase regionserver:

2020-12-17 10:00:24,291 WARN [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: Failed to connect to /10.65.19.41:9866 for block BP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
java.io.IOException: DIGEST-MD5: IO error acquiring password
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
        at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
        at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
        at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
        at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
        at org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
        at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
        at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2031)
        at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2022)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6408)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6388)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2926)

datanode log:

2020-12-17 10:00:24,290 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-btdn0005.eniot.io:9866:DataXceiver error processing unknown operation src: /10.65.19.42:53894 dst: /10.65.19.41:9866
javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289, userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026, blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, DS-a1ee0117-7430-4181-b279-3f1b16541567, DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.]
        at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
        at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1608192045256, keyId=1449537289, userId=hbase, blockPoolId=BP-1601568648-10.65.19.12-1550823043026, blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, DS-a1ee0117-7430-4181-b279-3f1b16541567, DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.
        at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:479)
        at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:81)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
        at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
        ... 7 more

Final result:

java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for reader reader=hdfs://azbeta/hbase/data/o15427722038191/RAW_569dfc35-10dc-4fe4-bb8c-5c403368d6c1/ccf146fb5c852d751b6b2f8307f81fa0/d/386bdb6978d348f4b4e168e728762842, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=7915, currentSize=489.13 MB, freeSize=2.71 GB, maxSize=3.19 GB, heapSize=489.13 MB, minSize=3.03 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, singleSize=775.81 MB, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=Optional[\x01\x00\x00\x13\xBD\x00\x00\x00T\x070\x97\xF7/d:arm_xingneng_test_model@AI1/1603445985858/Put/seqid=0], lastKey=Optional[\x01\x00\x00\x16\xF9\x00\x00\x00T\x073-8/d:arm_xingneng_test_model@AI99/1603446144699/Put/seqid=0], avgKeyLen=53, avgValueLen=4, entries=2016557, length=22957603, cur=\x01\x00\x00\x14#\x00\x00\x00T\x072\xCC\xF7/d:arm_xingneng_test_model@AI84/1603446131146/Put/vlen=4/seqid=17230816]
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:217)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
        at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
        at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
        at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
        at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
        at org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 3 missing blocks, the stripe is: AlignedStripe(Offset=1048576, length=1048576, fetchedChunksNum=0, missingChunksNum=3); locatedBlocks is: LocatedBlocks{; fileLength=22957603; underConstruction=false; blocks=[LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
getBlockSize()=22957603; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK], DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK], DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK], DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK], DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]]; indices=[0, 1, 2, 3, 4]}]; lastLocatedBlock=LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750; getBlockSize()=22957603; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK], DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK], DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK], DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK], DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]]; indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true; ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs, numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
        at org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:177)
        at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:372)
        at org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:324)
        at org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:397)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:765)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1562)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:931)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.isNextBlock(HFileReaderImpl.java:1064)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.positionForNextBlock(HFileReaderImpl.java:1058)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1076)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
        ... 13 more
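To make the missing behaviour concrete, here is a minimal, hypothetical sketch. It is not the actual DFSStripedInputStream code and not a proposed patch: BlockReaderSource, refetchBlockToken() and markDatanodeDead() are made-up placeholders, and the only real class it assumes is org.apache.hadoop.security.token.SecretManager.InvalidToken (the exception seen in the datanode log, requiring hadoop-common on the classpath). The intent is that a createBlockReader-style path would refetch the block token and retry once on InvalidToken instead of immediately marking the datanode dead; in the scenario reported here the InvalidToken cause never even reaches the client, because the SASL handshake only surfaces "DIGEST-MD5: IO error acquiring password".

import java.io.IOException;

import org.apache.hadoop.security.token.SecretManager.InvalidToken;

/**
 * Hypothetical sketch only: the refetch-and-retry handling that the striped
 * read path is missing. BlockReaderSource and its methods are placeholders,
 * not real HDFS APIs.
 */
public class StripedReadTokenRetrySketch {

  /** Placeholder for whatever builds a block reader against one datanode. */
  interface BlockReaderSource {
    AutoCloseable open() throws IOException;      // may fail with InvalidToken as the cause
    void refetchBlockToken() throws IOException;  // obtain a fresh block token (e.g. from the NameNode)
    void markDatanodeDead();                      // add the datanode to the DeadNodes set
  }

  /** Open a block reader, refreshing the block token once if it has expired. */
  static AutoCloseable openWithTokenRetry(BlockReaderSource source) throws IOException {
    try {
      return source.open();
    } catch (IOException e) {
      if (e.getCause() instanceof InvalidToken) {
        // Expired block token: refresh it and retry once instead of giving up.
        source.refetchBlockToken();
        return source.open();
      }
      // Any other failure: mark the datanode dead and rethrow, which is what the
      // striped path currently does for every IOException, including this one.
      source.markDatanodeDead();
      throw e;
    }
  }
}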