[ https://issues.apache.org/jira/browse/HADOOP-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen O'Donnell resolved HADOOP-17975.
----------------------------------------
    Resolution: Fixed

> Fallback to simple auth does not work for a secondary DistributedFileSystem instance
> -------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17975
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17975
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.4, 3.3.3
>
>          Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> The following code snippet demonstrates what is necessary to cause a failure when connecting from a secure cluster to a non-secure cluster with fallback to SIMPLE auth allowed.
> {code:java}
> Configuration conf = new Configuration();
> conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
> URI fsUri = new URI("hdfs://<nn_uri>");
> conf.setBoolean("fs.hdfs.impl.disable.cache", true);
> FileSystem fs = FileSystem.get(fsUri, conf);
> FSDataInputStream src = fs.open(new Path("/path/to/a/file"));
> FileOutputStream dst = new FileOutputStream(File.createTempFile("foo", "bar"));
> IOUtils.copyBytes(src, dst, 1024);
> // The issue happens even if we re-enable the cache at this point
> //conf.setBoolean("fs.hdfs.impl.disable.cache", false);
> // The issue does not happen when we close the first FileSystem object
> // before creating the second.
> //fs.close();
> FileSystem fs2 = FileSystem.get(fsUri, conf);
> FSDataInputStream src2 = fs2.open(new Path("/path/to/a/file"));
> FileOutputStream dst2 = new FileOutputStream(File.createTempFile("foo", "bar"));
> IOUtils.copyBytes(src2, dst2, 1024);
> {code}
> The problem is that when the DFSClient is created, it creates an AtomicBoolean instance that is propagated down into the IPC layer, where the Client.Connection instance sets its value in setupIOStreams. This Connection object is cached and re-used to multiplex requests against the same remote server.
> When a second DFSClient is created, the AtomicBoolean reference in that client is a new AtomicBoolean, but the Client.Connection instance is the same. As the connection already has an open socket, it returns immediately from setupIOStreams, leaving the new fallbackToSimpleAuth AtomicBoolean at the false value it was created with in the DFSClient.
> This AtomicBoolean, in turn, controls how the SaslDataTransferClient one level up handles the connection. With the value left at the default false, the SaslDataTransferClient of the second DFSClient does not fall back to SIMPLE authentication, but attempts a SASL handshake when connecting to the DataNode.
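> To make the interaction easier to follow, here is a minimal, self-contained sketch of the caching pattern described above. It is not the actual Hadoop code: FallbackCacheSketch, CachedConnection and ClientSketch are made-up stand-ins for the IPC connection cache, Client.Connection and the DFSClient. It only illustrates how a per-client AtomicBoolean is left untouched when a shared, cached connection skips its setup.
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class FallbackCacheSketch {
>
>   // One cached connection per remote address, shared by every client in the JVM
>   // (stand-in for the connection cache in the IPC layer).
>   static final Map<String, CachedConnection> CONNECTIONS = new ConcurrentHashMap<>();
>
>   // Stand-in for Client.Connection.
>   static class CachedConnection {
>     private boolean streamsSetUp = false;
>
>     // The caller's flag is only updated when the streams are set up for the first time.
>     synchronized void setupIOStreams(AtomicBoolean fallbackToSimpleAuth) {
>       if (streamsSetUp) {
>         return; // connection is re-used: the new caller's flag is never touched
>       }
>       streamsSetUp = true;
>       fallbackToSimpleAuth.set(true); // pretend the server asked us to fall back to SIMPLE auth
>     }
>   }
>
>   // Stand-in for a DFSClient: each instance owns its own AtomicBoolean.
>   static class ClientSketch {
>     final AtomicBoolean fallbackToSimpleAuth = new AtomicBoolean(false);
>
>     void connect(String address) {
>       CONNECTIONS.computeIfAbsent(address, a -> new CachedConnection())
>           .setupIOStreams(fallbackToSimpleAuth);
>     }
>   }
>
>   public static void main(String[] args) {
>     ClientSketch first = new ClientSketch();
>     first.connect("nn:8020");
>     ClientSketch second = new ClientSketch();
>     second.connect("nn:8020"); // re-uses the cached connection
>
>     System.out.println("first client:  " + first.fallbackToSimpleAuth.get());  // true
>     System.out.println("second client: " + second.fallbackToSimpleAuth.get()); // false, so a SASL handshake is attempted
>   }
> }
> {code}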
> The access to the FileSystem via the second DFSClient then fails with exceptions like the one below, and the read ultimately fails with a BlockMissingException:
> {code}
> WARN hdfs.DFSClient: Failed to connect to /<dn_ip>:<dn_port> for file <file> for block BP-531773307-<nn_ip>-1634685133591:blk_1073741826_1002, add to deadNodes and continue.
> java.io.EOFException: Unexpected EOF while trying to read response from server
>         at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:215)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:455)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:648)
>         at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2980)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
>         at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
>         at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:658)
>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:589)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:771)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
>         at java.io.DataInputStream.read(DataInputStream.java:100)
>         at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
>         at DfsClientTest3.main(DfsClientTest3.java:30)
> {code}
> {code}
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-813026743-<nn_ip>-1495248833293:blk_1139767762_66027405 file=/path/to/file
> {code}
> The DataNode in the meantime logs the following:
> {code}
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: <dn_host>:<dn_port>:DataXceiver error processing unknown operation src: /<client_ip>:<client_port> dst: /<dn_ip>:<dn_port>
> java.io.IOException: Version Mismatch (Expected: 28, Received: -8531 )
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:70)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:222)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> This happens only if the second client connects to the same DataNode as the first one did, so it may look intermittent when the clients read different files, but it always happens when the two clients read the same file with replication factor 1.
> We ran into this issue while running the HBase ExportSnapshot tool to move a snapshot from a non-secure to a secure cluster. The issue is loosely related to HBASE-12819, HBASE-20433 and similar problems; I am linking these so that the HBase team can see how this is relevant for them.
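> Since the fix is only in 3.4.0, 3.2.4 and 3.3.3 (see the Fix For field above), deployments on earlier releases can avoid the problem in client code along the lines sketched below, following the comments in the reproduction snippet. This is an illustrative sketch only: FallbackWorkaroundSketch and its method names are made up for this example, and <nn_uri> is the same placeholder as in the reproduction.
> {code:java}
> import java.net.URI;
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public class FallbackWorkaroundSketch {
>
>   // Option 1: leave the default FileSystem cache enabled, so the second get()
>   // returns the same FileSystem (and the same underlying DFSClient) as the first.
>   static void reuseCachedFileSystem(Configuration conf, URI fsUri) throws Exception {
>     FileSystem fs = FileSystem.get(fsUri, conf);
>     FileSystem fs2 = FileSystem.get(fsUri, conf); // same cached instance, no second DFSClient
>     // ... use fs / fs2 ...
>   }
>
>   // Option 2: if the cache has to stay disabled, close the first FileSystem before
>   // creating the second one, as noted in the reproduction comments above.
>   static void closeBeforeSecondInstance(Configuration conf, URI fsUri) throws Exception {
>     conf.setBoolean("fs.hdfs.impl.disable.cache", true);
>     FileSystem fs = FileSystem.get(fsUri, conf);
>     // ... use fs ...
>     fs.close(); // per the reproduction, the issue does not happen once the first instance is closed
>     FileSystem fs2 = FileSystem.get(fsUri, conf);
>     // ... use fs2 ...
>     fs2.close();
>   }
>
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
>     URI fsUri = new URI("hdfs://<nn_uri>"); // placeholder NameNode URI, as in the reproduction
>     reuseCachedFileSystem(conf, fsUri);
>     // or: closeBeforeSecondInstance(conf, fsUri);
>   }
> }
> {code}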