[jira] [Created] (HDFS-15485) Fix outdated properties of JournalNode when performing rollback

2020-07-21 Thread Deegue (Jira)
Deegue created HDFS-15485:
-

 Summary: Fix outdated properties of JournalNode when performing 
rollback
 Key: HDFS-15485
 URL: https://issues.apache.org/jira/browse/HDFS-15485
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Deegue


When rolling back an HDFS cluster, the properties in JNStorage are not 
refreshed after the storage directory changes. This leads to exceptions when 
starting the NameNode.

The exception looks like this:
{code:java}
2020-07-09 19:04:12,810 FATAL [IPC Server handler 105 on 8022] org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.0.118.217:8485, 10.0.117.208:8485, 10.0.118.179:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
10.0.118.217:8485: Incompatible namespaceID for journal Storage Directory /mnt/vdc-11176G-0/dfs/jn/nameservicetest1: NameNode has nsId 647617129 but storage has nsId 0
    at org.apache.hadoop.hdfs.qjournal.server.JNStorage.checkConsistentNamespace(JNStorage.java:236)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.newEpoch(Journal.java:300)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.newEpoch(JournalNodeRpcServer.java:136)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.newEpoch(QJournalProtocolServerSideTranslatorPB.java:133)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25417)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274)
{code}
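
The failure mode described above (cached properties going stale once the 
storage directory changes) can be illustrated with a minimal sketch. 
CachedStorageInfo, its reader callback and the refresh() method are all 
hypothetical names for illustration; this is not the real JNStorage code and 
not necessarily how the fix is implemented.

```java
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical stand-in for a storage object that caches on-disk properties.
// It is NOT the real JNStorage; names and structure are assumptions.
final class CachedStorageInfo {
    // Assumed callback that reads the properties of the current directory.
    private final Supplier<Properties> versionFileReader;
    // Cached value; stays 0 until the properties are loaded.
    private int namespaceId;

    CachedStorageInfo(Supplier<Properties> versionFileReader) {
        this.versionFileReader = versionFileReader;
    }

    // Re-reads the properties. Intended to be called whenever the storage
    // directory contents change, for example after a rollback.
    void refresh() {
        Properties props = versionFileReader.get();
        namespaceId = Integer.parseInt(props.getProperty("namespaceID", "0"));
    }

    // Returns true when the cached namespace ID matches the caller's.
    boolean isConsistentWith(int expectedNamespaceId) {
        return namespaceId == expectedNamespaceId;
    }

    public static void main(String[] args) {
        Properties onDisk = new Properties(); // simulated directory contents
        CachedStorageInfo storage = new CachedStorageInfo(() -> onDisk);

        // Simulate the rollback putting a different directory in place.
        onDisk.setProperty("namespaceID", "647617129");

        // Stale cache: the cached ID is still 0, so the check fails.
        System.out.println(storage.isConsistentWith(647617129)); // prints false

        // After an explicit refresh the cached ID matches again.
        storage.refresh();
        System.out.println(storage.isConsistentWith(647617129)); // prints true
    }
}
```

In this sketch the mismatch corresponds to the "storage has nsId 0" part of 
the exception: the check runs against a cached value that was never reloaded.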



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-15486) Costly sendResponse operation slows down async editlog handling

2020-07-21 Thread Yiqun Lin (Jira)
Yiqun Lin created HDFS-15486:


 Summary: Costly sendResponse operation slows down async editlog 
handling
 Key: HDFS-15486
 URL: https://issues.apache.org/jira/browse/HDFS-15486
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Yiqun Lin
 Attachments: Async-profile-(2).jpg, async-profile-(1).jpg

When the NameNode in our cluster is under very high load, we often find it 
stuck in async editlog handling.

We used the async-profiler tool to capture a flame graph.

!Async-profile-(2).jpg!

This happens because the async editlog thread consumes each Edit from the 
queue and then triggers the sendResponse call.

However, the sendResponse call is somewhat expensive here: our cluster has 
security enabled, so it performs encoding operations when returning the 
response.

We often observe costly sendResponse operations when the RPC call queue is 
full.

!async-profile-(1).jpg!

Slow consumption of Edits in the async editlog keeps the pending Edit queue 
full, which in turn blocks the enqueue operation invoked from the methods in 
the FSNamesystem class that hold the write lock.

The proposed enhancement is to use multiple threads to execute the 
sendResponse calls in parallel. sendResponse does not need the protection of 
the write lock, so this change is safe.
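
A minimal sketch of the idea, assuming a single consumer thread that keeps the 
ordered work to itself and hands each reply to a small thread pool. 
PendingEdit, persist() and ParallelResponder are hypothetical names for 
illustration; they are not the actual NameNode classes or the actual patch.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a queued edit; not the real NameNode class.
interface PendingEdit {
    void persist();       // ordered work that stays on the consumer thread
    void sendResponse();  // possibly slow reply that needs no write lock
}

final class ParallelResponder {
    // Drains the queue on the calling thread and hands each reply to a pool.
    // Returns the number of edits consumed.
    static int drain(Queue<PendingEdit> queue, int responderThreads) {
        ExecutorService responders = Executors.newFixedThreadPool(responderThreads);
        int consumed = 0;
        try {
            PendingEdit edit;
            while ((edit = queue.poll()) != null) {
                final PendingEdit current = edit;
                current.persist();                         // keep edit order
                responders.execute(current::sendResponse); // reply off-thread
                consumed++;
            }
        } finally {
            responders.shutdown(); // accept no new replies
        }
        try {
            // Wait for in-flight replies so the caller sees them complete.
            responders.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed;
    }

    public static void main(String[] args) {
        Queue<PendingEdit> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            queue.add(new PendingEdit() {
                @Override public void persist() {
                    System.out.println("persisted edit " + id);
                }
                @Override public void sendResponse() {
                    System.out.println("responded to edit " + id);
                }
            });
        }
        System.out.println("consumed " + drain(queue, 2) + " edits");
    }
}
```

With this split the consumer thread no longer waits for each reply, so the 
queue drains at the speed of persist() alone. Replies may complete in a 
different order than the edits were consumed; the sketch assumes that is 
acceptable.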






[jira] [Created] (HDFS-15487) ScriptBasedMapping lead 100% cpu utilization

2020-07-21 Thread Ryan Wu (Jira)
Ryan Wu created HDFS-15487:
--

 Summary: ScriptBasedMapping lead 100% cpu utilization
 Key: HDFS-15487
 URL: https://issues.apache.org/jira/browse/HDFS-15487
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ryan Wu


We found that NameNode CPU utilization sometimes reaches 90%, causing the 
NameNode to hang. The reason is that Flink apps on k8s access HDFS at the same 
time, but their IPs and host names are not fixed, so the topology script is run 
many times concurrently. From the jstack output, we also found that it had 
started several hundred Python processes.
{code:java}
// "process reaper" #36159 daemon prio=10 os_prio=0 tid=0x7fa7a33fa7a0 nid=0xa3cb waiting on condition [0x7fa7a61dc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x7fb4094a0398> (a java.util.concurrent.SynchronousQueue$TransferStack)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}






Re: [VOTE] Release Apache Hadoop 3.1.4 (RC3)

2020-07-21 Thread Gabor Bota
Thank you all for the suggestions and testing.
As there's a data loss issue in the release, I've created a new RC
with the patch included. I'll send the update soon.

Regards,
Gabor Bota

On Thu, Jul 16, 2020 at 1:29 PM Stephen O'Donnell
 wrote:
>
> Hi Gabor,
>
> We recently discovered a HDFS data loss issue in any build which uses
> snapshots containing HDFS-13101 but not including HDFS-15313. Unfortunately
> 3.1.4 falls into this category:
>
>  git log origin/branch-3.1.4 | egrep "HDFS-(15313|13101)"
> HDFS-15012. NN fails to parse Edit logs after applying HDFS-13101.
> Contributed by Shashikant Banerjee.
> HDFS-13101. Yet another fsimage corruption related to snapshot.
> Contributed by Shashikant Banerjee.
>
> See this comment for more information on the bug:
>
> https://issues.apache.org/jira/browse/HDFS-15313?focusedCommentId=17158140&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17158140
>
> I think we should not make a release when we have a known data loss bug in
> it. What do you think?
>
> I am going to commit HDFS-15313 onto branch-3.1 shortly, so maybe we should
> cut a new RC after including that?
>
> Thanks,
>
> Stephen.
>
> On Mon, Jul 13, 2020 at 11:36 AM Gabor Bota 
> wrote:
>
> > Hi folks,
> >
> > I have put together a release candidate (RC3) for Hadoop 3.1.4.
> >
> > *
> > The RC includes in addition to the previous ones:
> > * fix of YARN-10347. Fix double locking in
> > CapacityScheduler#reinitialize in branch-3.1
> > (https://issues.apache.org/jira/browse/YARN-10347)
> > * the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > * HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> > *
> >
> > The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC3/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC3
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1274/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 7 weekdays,
> > until July 22. 2020. 23:00 CET.
> >
> >
> > Thanks,
> > Gabor
> >
> >
> >




[VOTE] Release Apache Hadoop 3.1.4 (RC4)

2020-07-21 Thread Gabor Bota
Hi folks,

I have put together a release candidate (RC4) for Hadoop 3.1.4.

*
The RC includes in addition to the previous ones:
* fix for HDFS-15313. Ensure inodes in active filesystem are not
deleted during snapshot delete
* fix for YARN-10347. Fix double locking in
CapacityScheduler#reinitialize in branch-3.1
(https://issues.apache.org/jira/browse/YARN-10347)
* the revert of HDFS-14941, as it caused
HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
(https://issues.apache.org/jira/browse/HDFS-15421)
* HDFS-15323, as requested.
(https://issues.apache.org/jira/browse/HDFS-15323)
*

The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC4/
The RC tag in git is here:
https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC4
The maven artifacts are staged at
https://repository.apache.org/content/repositories/orgapachehadoop-1275/

You can find my public key at:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C

Please try the release and vote. The vote will run for 8 weekdays,
until July 31. 2020. 23:00 CET.


Thanks,
Gabor




Apache Hadoop qbt Report: branch2.10+JDK7 on Linux/x86

2020-07-21 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86/754/

[Jul 20, 2020 8:39:00 PM] (vagarychen) HADOOP-16753. Refactor HAAdmin. Contributed by Xieming Li.
[Jul 20, 2020 10:26:44 PM] (vagarychen) HDFS-15404. ShellCommandFencer should expose info about source.




-1 overall


The following subsystems voted -1:
asflicense findbugs hadolint jshint pathlen unit xml


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

XML :

   Parsing Error(s):
   hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/empty-configuration.xml
   hadoop-tools/hadoop-azure/src/config/checkstyle-suppressions.xml
   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/public/crossdomain.xml
   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/public/crossdomain.xml

findbugs :

   module:hadoop-yarn-project/hadoop-yarn 
   Useless object stored in variable removedNullContainers of method org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(List) At NodeStatusUpdaterImpl.java:removedNullContainers of method org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(List) At NodeStatusUpdaterImpl.java:[line 664]
   org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeVeryOldStoppedContainersFromCache() makes inefficient use of keySet iterator instead of entrySet iterator At NodeStatusUpdaterImpl.java:keySet iterator instead of entrySet iterator At NodeStatusUpdaterImpl.java:[line 741]
   org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.createStatus() makes inefficient use of keySet iterator instead of entrySet iterator At ContainerLocalizer.java:keySet iterator instead of entrySet iterator At ContainerLocalizer.java:[line 359]
   org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.usageMetrics is a mutable collection which should be package protected At ContainerMetrics.java:which should be package protected At ContainerMetrics.java:[line 134]
   Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.readResultsWithTimestamps(Result, byte[], byte[], KeyConverter, ValueConverter, boolean) At ColumnRWHelper.java:then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.readResultsWithTimestamps(Result, byte[], byte[], KeyConverter, ValueConverter, boolean) At ColumnRWHelper.java:[line 335]
   org.apache.hadoop.yarn.state.StateMachineFactory.generateStateGraph(String) makes inefficient use of keySet iterator instead of entrySet iterator At StateMachineFactory.java:keySet iterator instead of entrySet iterator At StateMachineFactory.java:[line 505]

findbugs :

   module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
   
   org.apache.hadoop.yarn.state.StateMachineFactory.generateStateGraph(String) makes inefficient use of keySet iterator instead of entrySet iterator At StateMachineFactory.java:keySet iterator instead of entrySet iterator At StateMachineFactory.java:[line 505]

findbugs :

   module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server 
   Useless object stored in variable removedNullContainers of method org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(List) At NodeStatusUpdaterImpl.java:removedNullContainers of method org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeOrTrackCompletedContainersFromContext(List) At NodeStatusUpdaterImpl.java:[line 664]
   org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeVeryOldStoppedContainersFromCache() makes inefficient use of keySet iterator instead of entrySet iterator At NodeStatusUpdaterImpl.java:keySet iterator instead of entrySet iterator At NodeStatusUpdaterImpl.java:[line 741]
   org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.createStatus() makes inefficient use of keySet iterator instead of entrySet iterator At ContainerLocalizer.java:keySet iterator instead of entrySet iterator At ContainerLocalizer.java:[line 359]
   org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics.usageMetrics is a mutable collection which should be package protected At ContainerMetrics.java:which should be package protected At ContainerMetrics.java:[line 134]
   Boxed value is unboxed and then immediately reboxed in org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHel

Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64

2020-07-21 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/210/

[Jul 20, 2020 6:14:30 AM] (noreply) HDFS-15463. Add a tool to validate FsImage. (#2140)
[Jul 20, 2020 9:51:26 AM] (noreply) HADOOP-17107. hadoop-azure parallel tests not working on recent JDKs (#2118)
[Jul 20, 2020 3:58:50 PM] (noreply) HADOOP-17136. ITestS3ADirectoryPerformance.testListOperations failing (#2153)
[Jul 20, 2020 4:19:05 PM] (Ayush Saxena) HDFS-15381. Fix typos corrputBlocksFiles to corruptBlocksFiles. Contributed by bianqi.
[Jul 20, 2020 4:43:48 PM] (Ayush Saxena) HADOOP-17119. Jetty upgrade to 9.4.x causes MR app fail with IOException. Contributed by Bilwa S T.
[Jul 20, 2020 6:08:27 PM] (Eric Badger) [YARN-10353] Log vcores used and cumulative cpu in containers monitor.
[Jul 20, 2020 7:49:58 PM] (Chen Liang) HDFS-15404. ShellCommandFencer should expose info about source. Contributed by Chen Liang.


[jira] [Created] (HDFS-15488) Add. a command to list all snapshots for a snaphottable root with snap Ids

2020-07-21 Thread Shashikant Banerjee (Jira)
Shashikant Banerjee created HDFS-15488:
--

 Summary: Add. a command to list all snapshots for a snaphottable 
root with snap Ids
 Key: HDFS-15488
 URL: https://issues.apache.org/jira/browse/HDFS-15488
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: snapshots
Reporter: Shashikant Banerjee
Assignee: Shashikant Banerjee


Currently, the way to list snapshots is to run ls on the /.snapshot directory 
under the snapshottable root. Since creation time is not recorded, there is no 
way to figure out the chronological order of snapshots. The idea here is to add 
a command that lists the snapshots of a snapshottable directory along with 
their snapshot IDs, which grow monotonically as snapshots are created in the 
system. With the snapshot IDs, it becomes possible to work out the chronology 
of snapshots in the system.
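
Because the IDs grow monotonically, sorting a listing by snapshot ID recovers 
creation order. The sketch below shows only that ordering step. SnapshotEntry 
and SnapshotListing are hypothetical types for illustration, not the proposed 
command or any existing HDFS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical listing entry; field names are illustrative only.
final class SnapshotEntry {
    final int snapshotId;
    final String name;

    SnapshotEntry(int snapshotId, String name) {
        this.snapshotId = snapshotId;
        this.name = name;
    }
}

final class SnapshotListing {
    // Returns a copy of the entries ordered by snapshot ID, which matches
    // creation order as long as IDs are assigned monotonically.
    static List<SnapshotEntry> inCreationOrder(List<SnapshotEntry> entries) {
        List<SnapshotEntry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt((SnapshotEntry e) -> e.snapshotId));
        return sorted;
    }

    public static void main(String[] args) {
        // Names sort alphabetically in a plain ls, which says nothing about age.
        List<SnapshotEntry> byName = Arrays.asList(
            new SnapshotEntry(7, "alpha"),
            new SnapshotEntry(2, "beta"),
            new SnapshotEntry(5, "gamma"));
        for (SnapshotEntry e : inCreationOrder(byName)) {
            System.out.println(e.snapshotId + " " + e.name);
        }
    }
}
```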


