Re: Both NN crashes due to JN timeout (Hadoop 3)

Wei-Chiu Chuang Tue, 08 Dec 2020 19:15:12 -0800

Thanks, and yes, you're right: it is a 200-second timeout.
According to this Stack Overflow article, the Jetty 9 default is 30 seconds. I
think it makes sense to override Jetty's timeout and set it to 200
seconds. I'll file a JIRA.
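
For illustration only (this is not the patch): a minimal Java sketch that
puts the three idle timeouts quoted in this thread side by side and checks
which of them would outlast a stalled transfer. The class name, the
survives() helper and the 45-second stall are hypothetical; only the timeout
values come from the thread. The Jetty call appears in a comment so the
sketch needs nothing beyond the JDK.

```java
import java.time.Duration;

/** Illustrative only: compares the idle timeouts quoted in this thread. */
final class IdleTimeoutCheck {
    // Values as reported in this thread, not read from a live configuration.
    static final Duration HADOOP3_OBSERVED = Duration.ofSeconds(10);
    static final Duration JETTY9_DEFAULT = Duration.ofSeconds(30);
    static final Duration JETTY6_DEFAULT = Duration.ofMillis(200_000); // _maxIdleTime

    /** True if a connection that stays idle for 'stall' outlasts 'idleTimeout'. */
    static boolean survives(Duration stall, Duration idleTimeout) {
        return stall.compareTo(idleTimeout) < 0;
    }

    public static void main(String[] args) {
        Duration stall = Duration.ofSeconds(45); // hypothetical pause while fetching edits
        Duration[] timeouts = {HADOOP3_OBSERVED, JETTY9_DEFAULT, JETTY6_DEFAULT};
        for (Duration t : timeouts) {
            System.out.println(t.getSeconds() + "s idle timeout, "
                + stall.getSeconds() + "s stall: "
                + (survives(stall, t) ? "connection kept" : "connection closed"));
        }
        // With Jetty 9 on the classpath, the override discussed above would be a
        // call along the lines of
        //     connector.setIdleTimeout(JETTY6_DEFAULT.toMillis());
        // on the ServerConnector (assumption: 'connector' is the server's connector).
    }
}
```

Under these assumptions, only the 200-second value keeps a connection open
across a 45-second stall, which is the argument for restoring it.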


Other things to consider:
1. fsck servlet? (I suspect this is related to the socket timeout
reported in HDFS-7175 <https://issues.apache.org/jira/browse/HDFS-7175>)
2. webhdfs, httpfs? We've also received reports that webhdfs can
time out, so a longer timeout makes sense here.
3. kms? Will the longer timeout cause more lingering sockets?



On Mon, Dec 7, 2020 at 10:23 PM Jason Wen <zhenshan....@workday.com.invalid>
wrote:

> Hi Wei Chiu,
>
> We also observed the same issue when the NN replays large edit logs from the JN.
> It looks like in Jetty 6 the default max idle timeout is 200 seconds.
>
> public abstract class AbstractConnector extends AbstractBuffers implements
> Connector
> {
>     ....
>     protected int _maxIdleTime=200000;
>     ....
> }
>
> Thanks,
> Jason
>
> On 12/7/20, 9:51 PM, "Wei-Chiu Chuang" <weic...@apache.org> wrote:
>
>     Hi community,
>
>     I want to share with you this observation.
>
>     We received several case reports that users sometimes experience a
>     JournalNode timeout when the NN requests edits from the JN. The end
>     result is that both (!) NNs crash after the timeout (10 seconds).
>
>     It seems to happen only to Hadoop 3 users (CDH6 and HDP3). While
>     HADOOP-15696 <https://issues.apache.org/jira/browse/HADOOP-15696>
>     offered a configurable switch to increase hadoop.http.idle_timeout.ms,
>     this looks like a regression in Hadoop 3, and the NN shouldn't simply
>     crash because the JN is slightly slow. It looks to me like a 10-second
>     timeout for fetching edits from the JN is simply too low.
>
>     I believe this is a regression caused when we updated Jetty from 6 to 9
>     in Hadoop 3 (HADOOP-10075 <https://issues.apache.org/jira/browse/HADOOP-10075>).
>     We replaced SelectChannelConnector.setLowResourceMaxIdleTime()
>     with ServerConnector.setIdleTimeout() but they aren't the same.
>
>
>     http://archive.eclipse.org/jetty/7.0.0.RC0/apidocs/org/eclipse/jetty/server/nio/SelectChannelConnector.html#getLowResourcesMaxIdleTime()
>
>     https://www.eclipse.org/jetty/javadoc/9.4.26.v20200117/org/eclipse/jetty/server/AbstractConnector.html#setIdleTimeout(long)
>
>     Does anyone know the behavior back in Hadoop 2/Jetty 6? Does it use
>     Jetty's default idle time, which is 300 seconds?
>
>
