[jira] [Created] (HDFS-15485) Fix outdated properties of JournalNode when performing rollback
Deegue created HDFS-15485:
Summary: Fix outdated properties of JournalNode when performing rollback
Key: HDFS-15485
URL: https://issues.apache.org/jira/browse/HDFS-15485
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Deegue

When rolling back an HDFS cluster, the properties in JNStorage are not refreshed after the storage directory changes. This leads to exceptions when starting the NameNode, for example:

{code:java}
2020-07-09 19:04:12,810 FATAL [IPC Server handler 105 on 8022] org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.118.217:8485, 10.0.117.208:8485, 10.0.118.179:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
10.0.118.217:8485: Incompatible namespaceID for journal Storage Directory /mnt/vdc-11176G-0/dfs/jn/nameservicetest1: NameNode has nsId 647617129 but storage has nsId 0
        at org.apache.hadoop.hdfs.qjournal.server.JNStorage.checkConsistentNamespace(JNStorage.java:236)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.newEpoch(Journal.java:300)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.newEpoch(JournalNodeRpcServer.java:136)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.newEpoch(QJournalProtocolServerSideTranslatorPB.java:133)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25417)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274)
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
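The failure mode above can be illustrated with a simplified, hypothetical sketch (this is not the actual JNStorage code; the class and method names only mirror the stack trace): the journal's in-memory namespace ID goes stale after the rollback changes the storage directory, so every `newEpoch` consistency check against the NameNode's nsId fails until the properties are re-read.

```java
import java.io.IOException;

// Hypothetical, simplified sketch of the check that fails in the trace above.
// The journal keeps the namespace ID read at startup; after a rollback changes
// the storage directory, the stale value (nsId 0 here) rejects every newEpoch.
public class JournalStorageSketch {
    private long storedNamespaceId; // stale until re-read from the VERSION file

    public JournalStorageSketch(long nsIdAtStartup) {
        this.storedNamespaceId = nsIdAtStartup;
    }

    // Mirrors JNStorage.checkConsistentNamespace from the trace (simplified).
    public void checkConsistentNamespace(long nameNodeNsId) throws IOException {
        if (storedNamespaceId != nameNodeNsId) {
            throw new IOException("Incompatible namespaceID: NameNode has nsId "
                + nameNodeNsId + " but storage has nsId " + storedNamespaceId);
        }
    }

    // The gist of the proposed fix: refresh the in-memory properties after the
    // storage directory changes, instead of keeping the pre-rollback value.
    public void refreshAfterRollback(long nsIdFromVersionFile) {
        this.storedNamespaceId = nsIdFromVersionFile;
    }

    public static void main(String[] args) throws IOException {
        JournalStorageSketch storage = new JournalStorageSketch(0); // stale state
        storage.refreshAfterRollback(647617129L);
        storage.checkConsistentNamespace(647617129L); // passes after refresh
        System.out.println("namespace check passed");
    }
}
```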
[jira] [Created] (HDFS-15486) Costly sendResponse operation slows down async editlog handling
Yiqun Lin created HDFS-15486:
Summary: Costly sendResponse operation slows down async editlog handling
Key: HDFS-15486
URL: https://issues.apache.org/jira/browse/HDFS-15486
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Yiqun Lin
Attachments: Async-profile-(2).jpg, async-profile-(1).jpg

When our cluster's NameNode is under very high load, we find it often stuck in async editlog handling. We used the async-profiler tool to get a flame graph.

!Async-profile-(2).jpg!

This happens when the async editlog thread consumes an Edit from the queue and triggers the sendResponse call. Here the sendResponse call is somewhat expensive because our cluster has the security environment enabled, so encoding operations are performed when returning the response. We often catch moments of costly sendResponse operations when the RPC call queue is full.

!async-profile-(1).jpg!

Slow consumption of Edits in the async editlog keeps the Edit pending queue full, which then blocks the enqueue operations invoked from write-lock methods in the FSNamesystem class. The proposed enhancement is to execute sendResponse calls in parallel on multiple threads; sendResponse does not need write-lock protection, so this change is safe.
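As a rough illustration of the proposed enhancement (a sketch only; the queue and class names are stand-ins, not Hadoop's actual implementation), the single consumer thread can hand each response off to a small pool, so an expensive sendResponse no longer stalls edit consumption and lets the pending queue back up against the write lock:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in names, not the Hadoop classes: the async-editlog consumer submits
// each sendResponse to a pool instead of running it inline.
public class AsyncEditLogSketch {
    final BlockingQueue<Runnable> pendingEdits = new LinkedBlockingQueue<>();
    final ExecutorService responderPool = Executors.newFixedThreadPool(4);

    void enqueue(Runnable sendResponse) {
        pendingEdits.offer(sendResponse);
    }

    // One iteration of the consumer loop. Syncing the edit would happen here
    // (the part the FSNamesystem lock protects); only the response encoding is
    // offloaded, which is why the change does not need the write lock.
    void consumeOne() {
        Runnable sendResponse = pendingEdits.poll();
        if (sendResponse != null) {
            responderPool.submit(sendResponse);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncEditLogSketch log = new AsyncEditLogSketch();
        AtomicInteger sent = new AtomicInteger();
        for (int i = 0; i < 8; i++) log.enqueue(sent::incrementAndGet);
        for (int i = 0; i < 8; i++) log.consumeOne();
        log.responderPool.shutdown();
        log.responderPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("responses sent: " + sent.get());
    }
}
```

The design point is that only the lock-free tail of the work (the response) is parallelized; edit ordering on the consumer thread is untouched.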
[jira] [Created] (HDFS-15487) ScriptBasedMapping leads to 100% CPU utilization
Ryan Wu created HDFS-15487:
Summary: ScriptBasedMapping leads to 100% CPU utilization
Key: HDFS-15487
URL: https://issues.apache.org/jira/browse/HDFS-15487
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Ryan Wu

We found that NameNode CPU utilization sometimes reaches 90%+, causing the NameNode to hang. The reason is that Flink apps on Kubernetes access HDFS at the same time, but their IPs and hostnames are not fixed, so the topology script runs for each of them simultaneously. From the jstack output, we also found that several hundred Python processes had been started.

{code:java}
"process reaper" #36159 daemon prio=10 os_prio=0 tid=0x7fa7a33fa7a0 nid=0xa3cb waiting on condition [0x7fa7a61dc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x7fb4094a0398> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
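The failure mode and a common mitigation can be sketched as follows (the class and method names are illustrative, not Hadoop's implementation): an uncached script-based mapping forks the topology script for every unresolved address, so churning pod IPs spawn hundreds of processes, while caching resolved results bounds the fork rate to one per distinct host.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: cache topology-script results so repeated lookups for
// the same host do not fork a new process each time.
public class CachedTopologySketch {
    final Map<String, String> cache = new ConcurrentHashMap<>();
    int scriptInvocations = 0;

    // Stand-in for forking the admin-supplied topology script.
    private String runTopologyScript(String host) {
        scriptInvocations++;
        return "/default-rack";
    }

    String resolve(String host) {
        // Only the first lookup for a host runs the script; later lookups hit
        // the cache, so a burst of requests cannot start hundreds of processes.
        return cache.computeIfAbsent(host, this::runTopologyScript);
    }

    public static void main(String[] args) {
        CachedTopologySketch mapping = new CachedTopologySketch();
        for (int i = 0; i < 1000; i++) mapping.resolve("10.0.118.217");
        System.out.println("script runs: " + mapping.scriptInvocations); // prints "script runs: 1"
    }
}
```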
Re: [VOTE] Release Apache Hadoop 3.1.4 (RC3)
Thank you all for the suggestions and testing. As there's a data loss issue in the release, I've created a new RC with the patch included. I'll send the update soon.

Regards,
Gabor Bota

On Thu, Jul 16, 2020 at 1:29 PM Stephen O'Donnell wrote:
>
> Hi Gabor,
>
> We recently discovered an HDFS data loss issue in any build which uses
> snapshots containing HDFS-13101 but not including HDFS-15313. Unfortunately
> 3.1.4 falls into this category:
>
> git log origin/branch-3.1.4 | egrep "HDFS-(15313|13101)"
> HDFS-15012. NN fails to parse Edit logs after applying HDFS-13101. Contributed by Shashikant Banerjee.
> HDFS-13101. Yet another fsimage corruption related to snapshot. Contributed by Shashikant Banerjee.
>
> See this comment for more information on the bug:
>
> https://issues.apache.org/jira/browse/HDFS-15313?focusedCommentId=17158140&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17158140
>
> I think we should not make a release when we have a known data loss bug in it. What do you think?
>
> I am going to commit HDFS-15313 onto branch-3.1 shortly, so maybe we should cut a new RC after including that?
>
> Thanks,
>
> Stephen.
>
> On Mon, Jul 13, 2020 at 11:36 AM Gabor Bota wrote:
> >
> > Hi folks,
> >
> > I have put together a release candidate (RC3) for Hadoop 3.1.4.
> >
> > The RC includes, in addition to the previous ones:
> > * fix of YARN-10347. Fix double locking in CapacityScheduler#reinitialize in branch-3.1 (https://issues.apache.org/jira/browse/YARN-10347)
> > * the revert of HDFS-14941, as it caused HDFS-15421. IBR leak causes standby NN to be stuck in safe mode. (https://issues.apache.org/jira/browse/HDFS-15421)
> > * HDFS-15323, as requested. (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC3/
> > The RC tag in git is here: https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC3
> > The maven artifacts are staged at https://repository.apache.org/content/repositories/orgapachehadoop-1274/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 7 weekdays, until July 22, 2020, 23:00 CET.
> >
> > Thanks,
> > Gabor
[VOTE] Release Apache Hadoop 3.1.4 (RC4)
Hi folks,

I have put together a release candidate (RC4) for Hadoop 3.1.4.

The RC includes, in addition to the previous ones:
* fix for HDFS-15313. Ensure inodes in active filesystem are not deleted during snapshot delete
* fix for YARN-10347. Fix double locking in CapacityScheduler#reinitialize in branch-3.1 (https://issues.apache.org/jira/browse/YARN-10347)
* the revert of HDFS-14941, as it caused HDFS-15421. IBR leak causes standby NN to be stuck in safe mode. (https://issues.apache.org/jira/browse/HDFS-15421)
* HDFS-15323, as requested. (https://issues.apache.org/jira/browse/HDFS-15323)

The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC4/
The RC tag in git is here: https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC4
The maven artifacts are staged at https://repository.apache.org/content/repositories/orgapachehadoop-1275/

You can find my public key at:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C

Please try the release and vote. The vote will run for 8 weekdays, until July 31, 2020, 23:00 CET.

Thanks,
Gabor
Apache Hadoop qbt Report: branch2.10+JDK7 on Linux/x86
For more details, see https://builds.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86/754/

[Jul 20, 2020 8:39:00 PM] (vagarychen) HADOOP-16753. Refactor HAAdmin. Contributed by Xieming Li.
[Jul 20, 2020 10:26:44 PM] (vagarychen) HDFS-15404. ShellCommandFencer should expose info about source.

-1 overall

The following subsystems voted -1: asflicense findbugs hadolint jshint pathlen unit xml

The following subsystems voted -1 but were configured to be filtered/ignored: cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace

The following subsystems are considered long running (runtime bigger than 1h 0m 0s): unit

Specific tests:

XML: Parsing Error(s):
    hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/empty-configuration.xml
    hadoop-tools/hadoop-azure/src/config/checkstyle-suppressions.xml
    hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/public/crossdomain.xml
    hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/public/crossdomain.xml

findbugs: module:hadoop-yarn-project/hadoop-yarn
    Useless object stored in variable removedNullContainers of method org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(List) At NodeStatusUpdaterImpl.java:[line 664]
    org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeVeryOldStoppedContainersFromCache() makes inefficient use of keySet iterator instead of entrySet iterator At NodeStatusUpdaterImpl.java:[line 741]
    org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.createStatus() makes inefficient use of keySet iterator instead of entrySet iterator At ContainerLocalizer.java:[line 359]
    org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.usageMetrics is a mutable collection which should be package protected At ContainerMetrics.java:[line 134]
    Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.readResultsWithTimestamps(Result, byte[], byte[], KeyConverter, ValueConverter, boolean) At ColumnRWHelper.java:[line 335]
    org.apache.hadoop.yarn.state.StateMachineFactory.generateStateGraph(String) makes inefficient use of keySet iterator instead of entrySet iterator At StateMachineFactory.java:[line 505]

findbugs: module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common
    org.apache.hadoop.yarn.state.StateMachineFactory.generateStateGraph(String) makes inefficient use of keySet iterator instead of entrySet iterator At StateMachineFactory.java:[line 505]

findbugs: module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server
    Useless object stored in variable removedNullContainers of method org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(List) At NodeStatusUpdaterImpl.java:[line 664]
    org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeVeryOldStoppedContainersFromCache() makes inefficient use of keySet iterator instead of entrySet iterator At NodeStatusUpdaterImpl.java:[line 741]
    org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.createStatus() makes inefficient use of keySet iterator instead of entrySet iterator At ContainerLocalizer.java:[line 359]
    org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.usageMetrics is a mutable collection which should be package protected At ContainerMetrics.java:[line 134]
    Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHel
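The findbugs warning repeated throughout this report ("makes inefficient use of keySet iterator instead of entrySet iterator") always has the same shape. A generic illustration (not the flagged Hadoop code): iterating `keySet()` and calling `get(key)` performs a second hash lookup per entry, while `entrySet()` yields key and value in one pass.

```java
import java.util.HashMap;
import java.util.Map;

public class KeySetVsEntrySet {
    // The pattern findbugs flags: one extra map lookup per iteration.
    static long sumViaKeySet(Map<String, Long> counts) {
        long total = 0;
        for (String key : counts.keySet()) {
            total += counts.get(key); // second lookup findbugs complains about
        }
        return total;
    }

    // The suggested fix: iterate entries directly, no per-key lookup.
    static long sumViaEntrySet(Map<String, Long> counts) {
        long total = 0;
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("a", 1L);
        counts.put("b", 2L);
        // Both return the same result; only the lookup cost differs.
        System.out.println(sumViaKeySet(counts) + " == " + sumViaEntrySet(counts));
    }
}
```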
Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64
For more details, see https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/210/

[Jul 20, 2020 6:14:30 AM] (noreply) HDFS-15463. Add a tool to validate FsImage. (#2140)
[Jul 20, 2020 9:51:26 AM] (noreply) HADOOP-17107. hadoop-azure parallel tests not working on recent JDKs (#2118)
[Jul 20, 2020 3:58:50 PM] (noreply) HADOOP-17136. ITestS3ADirectoryPerformance.testListOperations failing (#2153)
[Jul 20, 2020 4:19:05 PM] (Ayush Saxena) HDFS-15381. Fix typos corrputBlocksFiles to corruptBlocksFiles. Contributed by bianqi.
[Jul 20, 2020 4:43:48 PM] (Ayush Saxena) HADOOP-17119. Jetty upgrade to 9.4.x causes MR app fail with IOException. Contributed by Bilwa S T.
[Jul 20, 2020 6:08:27 PM] (Eric Badger) [YARN-10353] Log vcores used and cumulative cpu in containers monitor.
[Jul 20, 2020 7:49:58 PM] (Chen Liang) HDFS-15404. ShellCommandFencer should expose info about source. Contributed by Chen Liang.
[jira] [Created] (HDFS-15488) Add a command to list all snapshots for a snapshottable root with snapshot IDs
Shashikant Banerjee created HDFS-15488:
Summary: Add a command to list all snapshots for a snapshottable root with snapshot IDs
Key: HDFS-15488
URL: https://issues.apache.org/jira/browse/HDFS-15488
Project: Hadoop HDFS
Issue Type: Sub-task
Components: snapshots
Reporter: Shashikant Banerjee
Assignee: Shashikant Banerjee

Currently, the way to list snapshots is to do an ls on the /.snapshot directory. Since creation time is not recorded, there is no way to figure out the chronological order of snapshots. The idea here is to add a command that lists the snapshots of a snapshottable directory along with their snapshot IDs, which grow monotonically as snapshots are created in the system. With the snapshot ID, it becomes possible to figure out the chronology of snapshots in the system.
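The idea behind the proposal can be sketched as follows (the types and field names are illustrative, not the actual Hadoop API): because snapshot IDs grow monotonically, sorting by ID recovers creation order even though creation time itself is not recorded.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SnapshotListSketch {
    // Illustrative record for one snapshot of a snapshottable directory.
    static final class SnapshotInfo {
        final int snapshotId;
        final String path;

        SnapshotInfo(int snapshotId, String path) {
            this.snapshotId = snapshotId;
            this.path = path;
        }
    }

    // Chronological order == ascending snapshot ID, since IDs grow monotonically.
    static List<SnapshotInfo> chronological(List<SnapshotInfo> snapshots) {
        List<SnapshotInfo> sorted = new ArrayList<>(snapshots);
        sorted.sort(Comparator.comparingInt((SnapshotInfo s) -> s.snapshotId));
        return sorted;
    }

    public static void main(String[] args) {
        List<SnapshotInfo> snaps = new ArrayList<>();
        snaps.add(new SnapshotInfo(3, "/data/.snapshot/s3"));
        snaps.add(new SnapshotInfo(1, "/data/.snapshot/s1"));
        snaps.add(new SnapshotInfo(2, "/data/.snapshot/s2"));
        for (SnapshotInfo s : chronological(snaps)) {
            System.out.println(s.snapshotId + "\t" + s.path);
        }
    }
}
```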