May I also ask what version of flink-hadoop you’re using and the
number of jobs you’re storing the history for? As of writing we have
roughly 101,000 application history files. I’m curious to know if
we’re encountering some kind of resource problem.
*// *ah**
*From:*Hailu, Andreas [Engineering]
*Sent:* Thursday, May 28, 2020 12:18 PM
*To:* 'Chesnay Schepler' <ches...@apache.org>; user@flink.apache.org
*Subject:* RE: History Server Not Showing Any Jobs - File Not Found?
Okay, I will look further to see if we’re mistakenly using a version
that’s pre-2.6.0. However, I don’t see flink-shaded-hadoop in my /lib
directory for flink-1.9.1.
flink-dist_2.11-1.9.1.jar
flink-table-blink_2.11-1.9.1.jar
flink-table_2.11-1.9.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar
These are the files within /lib.
*// *ah**
*From:*Chesnay Schepler <ches...@apache.org <mailto:ches...@apache.org>>
*Sent:* Thursday, May 28, 2020 11:00 AM
*To:* Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com
<mailto:andreas.ha...@ny.email.gs.com>>; user@flink.apache.org
<mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any Jobs - File Not Found?
Looks like it is indeed stuck on downloading the archive.
I searched a bit in the Hadoop JIRA and found several similar instances:
https://issues.apache.org/jira/browse/HDFS-6999
https://issues.apache.org/jira/browse/HDFS-7005
https://issues.apache.org/jira/browse/HDFS-7145
It is supposed to be fixed in 2.6.0 though :/
If hadoop is available from the HADOOP_CLASSPATH and
flink-shaded-hadoop in /lib then you basically don't know what Hadoop
version is actually being used,
which could lead to incompatibilities and dependency clashes.
If flink-shaded-hadoop 2.4/2.5 is on the classpath, maybe that is
being used and runs into HDFS-7005.
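A rough way to check both locations described above is to list the Hadoop-related jars in Flink's lib directory and on HADOOP_CLASSPATH side by side. The sketch below is only an illustration: the helper names (expand_classpath, find_hadoop_jars) are hypothetical, and the jar-name pattern (flink-shaded-hadoop*, hadoop-common*, hadoop-hdfs*) is an assumption about how the jars are named in a given installation.

```python
import os
import re
import sys
from pathlib import Path

# Assumed naming pattern for Hadoop-related jars; adjust for your distribution.
HADOOP_JAR = re.compile(r"^(flink-shaded-hadoop|hadoop-(common|hdfs))[\w.\-]*\.jar$")


def expand_classpath(classpath):
    """Yield jar paths from a classpath string, expanding 'dir/*' wildcard entries."""
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as the one produced by '::'
        if entry.endswith("*"):
            directory = Path(entry[:-1])
            if directory.is_dir():
                yield from sorted(directory.glob("*.jar"))
        elif entry.endswith(".jar"):
            yield Path(entry)


def find_hadoop_jars(flink_lib, hadoop_classpath):
    """Return Hadoop-related jar names found in the Flink lib dir and on the classpath."""
    lib_dir = Path(flink_lib)
    in_lib = []
    if lib_dir.is_dir():
        in_lib = [p.name for p in sorted(lib_dir.glob("*.jar")) if HADOOP_JAR.match(p.name)]
    on_cp = [p.name for p in expand_classpath(hadoop_classpath) if HADOOP_JAR.match(p.name)]
    return {"lib": in_lib, "classpath": on_cp}


if __name__ == "__main__":
    # First argument: path to Flink's lib directory (defaults to ./lib).
    flink_lib = sys.argv[1] if len(sys.argv) > 1 else "lib"
    found = find_hadoop_jars(flink_lib, os.environ.get("HADOOP_CLASSPATH", ""))
    for location, jars in found.items():
        print(location, jars)
```

If both lists come back non-empty, or the version numbers in the jar names differ, that would be consistent with the ambiguity described above. It does not prove which classes are actually loaded at runtime.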
On 28/05/2020 16:27, Hailu, Andreas wrote:
Just created a dump, here’s what I see:
"Flink-HistoryServer-ArchiveFetcher-thread-1" #19 daemon prio=5 os_prio=0 tid=0x00007f93a5a2c000 nid=0x5692 runnable [0x00007f934a0d3000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000005df986960> (a sun.nio.ch.Util$2)
- locked <0x00000005df986948> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000005df928390> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:201)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
- locked <0x00000005ceade5e0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:781)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:837)
- eliminated <0x00000005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
- locked <0x00000005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:945)
- locked <0x00000005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.flink.util.IOUtils.copyBytes(IOUtils.java:69)
at org.apache.flink.util.IOUtils.copyBytes(IOUtils.java:91)
at org.apache.flink.runtime.history.FsJobArchivist.getArchivedJsons(FsJobArchivist.java:110)
at org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher$JobArchiveFetcherTask.run(HistoryServerArchiveFetcher.java:169)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
What problems could the flink-shaded-hadoop jar being included
introduce?
*// *ah
*From:*Chesnay Schepler <ches...@apache.org>
<mailto:ches...@apache.org>
*Sent:* Thursday, May 28, 2020 9:26 AM
*To:* Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
<mailto:andreas.ha...@ny.email.gs.com>; user@flink.apache.org
<mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any Jobs - File Not Found?
If it were a class-loading issue I would think that we'd see an
exception of some kind. Maybe double-check that
flink-shaded-hadoop is not in the lib directory. (usually I would
ask for the full classpath that the HS is started with, but as it
turns out this isn't getting logged :( (FLINK-18008))
The fact that overview.json and jobs/overview.json are missing
indicates that something goes wrong directly on startup. What is
supposed to happen is that the HS starts, fetches all currently
available archives and then creates these files.
So it seems like the download gets stuck for some reason.
Can you use jstack to create a thread dump, and see what the
Flink-HistoryServer-ArchiveFetcher is doing?
I will also file a JIRA for adding more logging statements, like
when fetching starts/stops.
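A full jstack dump lists every JVM thread, so it can help to pull out only the fetcher thread. The sketch below is a minimal illustration, assuming the usual jstack text layout in which each thread entry starts with a quoted thread name at the beginning of a line and ends at a blank line. The function name extract_threads and the script name in the usage note are hypothetical.

```python
import sys


def extract_threads(dump_text, name_fragment):
    """Return the dump entries of all threads whose header line contains name_fragment."""
    blocks, current = [], None
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A quoted name at column 0 starts a new thread entry.
            if current:
                blocks.append("\n".join(current))
            current = [line] if name_fragment in line else None
        elif current is not None:
            if line.strip():
                current.append(line)
            else:
                # A blank line ends the current entry.
                blocks.append("\n".join(current))
                current = None
    if current:
        blocks.append("\n".join(current))
    return blocks


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract_thread.py dump.txt [name-fragment]
    fragment = sys.argv[2] if len(sys.argv) > 2 else "Flink-HistoryServer-ArchiveFetcher"
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for block in extract_threads(handle.read(), fragment):
            print(block)
            print()
```

One possible workflow is to save the dump with `jstack <pid> > dump.txt` and then run the script on dump.txt. Taking two dumps a few seconds apart and comparing the extracted entries is a simple way to see whether the thread is making progress or sitting in the same read call.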
On 27/05/2020 20:57, Hailu, Andreas wrote:
Hi Chesnay, apologies for not getting back to you sooner here.
So I did what you suggested - I downloaded a few files from my
jobmanager.archive.fs.dir HDFS directory to a locally
available directory named
/local/scratch/hailua_p2epdlsuat/historyserver/archived/. I
then changed my historyserver.archive.fs.dir to
file:///local/scratch/hailua_p2epdlsuat/historyserver/archived/
and that seemed to work. I’m able to see the history of the
applications I downloaded. So this points to a problem with
sourcing the history from HDFS.
Do you think this could be classpath related? This is what we
use for our HADOOP_CLASSPATH var:
/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop/lib/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-hdfs/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-hdfs/lib/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-mapreduce/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-mapreduce/lib/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-yarn/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-yarn/lib/*:/gns/software/ep/da/dataproc/dataproc-prod/lakeRmProxy.jar:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop/bin::/gns/mw/dbclient/postgres/jdbc/pg-jdbc-9.3.v01/postgresql-9.3-1100-jdbc4.jar
You can see we have references to Hadoop mapred/yarn/hdfs libs
in there.
*// *ah
*From:*Chesnay Schepler <ches...@apache.org>
<mailto:ches...@apache.org>
*Sent:* Sunday, May 3, 2020 6:00 PM
*To:* Hailu, Andreas [Engineering]
<andreas.ha...@ny.email.gs.com>
<mailto:andreas.ha...@ny.email.gs.com>; user@flink.apache.org
<mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any Jobs - File Not
Found?
yes, exactly; I want to rule out that (somehow) HDFS is the
problem.
I couldn't reproduce the issue locally myself so far.
On 01/05/2020 22:31, Hailu, Andreas wrote:
Hi Chesnay, yes – they were created using Flink 1.9.1 as
we’ve only just started to archive them in the past couple
weeks. Could you clarify on how you want to try local
filesystem archives? As in changing
jobmanager.archive.fs.dir and historyserver.web.tmpdir to
the same local directory?
*// *ah
*From:*Chesnay Schepler <ches...@apache.org>
<mailto:ches...@apache.org>
*Sent:* Wednesday, April 29, 2020 8:26 AM
*To:* Hailu, Andreas [Engineering]
<andreas.ha...@ny.email.gs.com>
<mailto:andreas.ha...@ny.email.gs.com>;
user@flink.apache.org <mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any Jobs - File
Not Found?
hmm...let's see if I can reproduce the issue locally.
Are the archives from the same version the history server
runs on? (Which I suppose would be 1.9.1?)
Just for the sake of narrowing things down, it would also
be interesting to check if it works with the archives
residing in the local filesystem.
On 27/04/2020 18:35, Hailu, Andreas wrote:
bash-4.1$ ls -l /local/scratch/flink_historyserver_tmpdir/
total 8
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:43 flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:22 flink-web-history-95b3f928-c60f-4351-9926-766c6ad3ee76
There are just two directories in here. I don’t see
cache directories from my attempts today, which is
interesting. Looking a little deeper into them:
bash-4.1$ ls -lr /local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
total 1756
drwxrwxr-x 2 p2epdlsuat p2epdlsuat 1789952 Apr 21 10:44 jobs
bash-4.1$ ls -lr /local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9/jobs
total 0
-rw-rw-r-- 1 p2epdlsuat p2epdlsuat 0 Apr 21 10:43 overview.json
There are indeed archives already in HDFS – I’ve
included some in my initial mail, but here they are
again just for reference:
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r----- 3 delp datalake_admin_dev 50569 2020-03-21 23:17 /user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r----- 3 delp datalake_admin_dev 49578 2020-03-03 08:45 /user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r----- 3 delp datalake_admin_dev 50842 2020-03-24 15:19 /user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...
*// *ah
*From:*Chesnay Schepler <ches...@apache.org>
<mailto:ches...@apache.org>
*Sent:* Monday, April 27, 2020 10:28 AM
*To:* Hailu, Andreas [Engineering]
<andreas.ha...@ny.email.gs.com>
<mailto:andreas.ha...@ny.email.gs.com>;
user@flink.apache.org <mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any Jobs -
File Not Found?
If historyserver.web.tmpdir is not set then
java.io.tmpdir is used, so that should be fine.
What are the contents of
/local/scratch/flink_historyserver_tmpdir?
I assume there are already archives in HDFS?
On 27/04/2020 16:02, Hailu, Andreas wrote:
My machine’s /tmp directory is not large enough to
support the archived files, so I changed my
java.io.tmpdir to be in some other location which
is significantly larger. I hadn’t set anything for
historyserver.web.tmpdir, so I suspect it was
still pointing at /tmp. I just tried setting
historyserver.web.tmpdir to the same location as
my java.io.tmpdir location, but I’m afraid I’m
still seeing the following issue:
2020-04-27 09:37:42,904 [nioEventLoopGroup-3-4] DEBUG HistoryServerStaticFileServerHandler - Unable to load requested file /overview.json from classloader
2020-04-27 09:37:42,906 [nioEventLoopGroup-3-6] DEBUG HistoryServerStaticFileServerHandler - Unable to load requested file /jobs/overview.json from classloader
flink-conf.yaml for reference:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.web.tmpdir: /local/scratch/flink_historyserver_tmpdir/
Did you have anything else in mind when you said
pointing somewhere funny?
*// *ah
*From:*Chesnay Schepler <ches...@apache.org>
<mailto:ches...@apache.org>
*Sent:* Monday, April 27, 2020 5:56 AM
*To:* Hailu, Andreas [Engineering]
<andreas.ha...@ny.email.gs.com>
<mailto:andreas.ha...@ny.email.gs.com>;
user@flink.apache.org <mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any Jobs
- File Not Found?
overview.json is a generated file that is placed
in the local directory controlled by
historyserver.web.tmpdir.
Have you configured this option to point to some
non-local filesystem? (Or if not, is the
java.io.tmpdir property pointing somewhere funny?)
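Since the overview files are generated locally, one way to see how far the History Server got is to inspect its cache directories under the configured tmpdir. The sketch below is only an illustration: the function name overview_status is hypothetical, and it assumes the cache directories are named flink-web-history-<id> and hold overview.json and jobs/overview.json, as the listings and log messages elsewhere in this thread suggest.

```python
import sys
from pathlib import Path

# Generated files the web UI requests, relative to each cache directory.
OVERVIEW_FILES = ("overview.json", "jobs/overview.json")


def overview_status(web_tmpdir):
    """Map each flink-web-history-* directory to its overview file sizes (None if missing)."""
    status = {}
    for cache in sorted(Path(web_tmpdir).glob("flink-web-history-*")):
        if not cache.is_dir():
            continue
        sizes = {}
        for rel in OVERVIEW_FILES:
            candidate = cache / rel
            sizes[rel] = candidate.stat().st_size if candidate.is_file() else None
        status[cache.name] = sizes
    return status


if __name__ == "__main__":
    # First argument: the directory configured as historyserver.web.tmpdir.
    tmpdir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, sizes in overview_status(tmpdir).items():
        print(name, sizes)
```

A missing file or a size of 0 would suggest that the initial fetch never completed, though it does not say why.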
On 24/04/2020 18:24, Hailu, Andreas wrote:
I’m having a further look at the code in
HistoryServerStaticFileServerHandler - is
there an assumption about where overview.json
is supposed to be located?
*// *ah
*From:*Hailu, Andreas [Engineering]
*Sent:* Wednesday, April 22, 2020 1:32 PM
*To:* 'Chesnay Schepler' <ches...@apache.org>
<mailto:ches...@apache.org>; Hailu, Andreas
[Engineering] <andreas.ha...@ny.email.gs.com>
<mailto:andreas.ha...@ny.email.gs.com>;
user@flink.apache.org
<mailto:user@flink.apache.org>
*Subject:* RE: History Server Not Showing Any
Jobs - File Not Found?
Hi Chesnay, thanks for responding. We’re using
Flink 1.9.1. I enabled DEBUG level logging and
this is something relevant I see:
2020-04-22 13:25:52,566 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG DFSInputStream - Connecting to datanode 10.79.252.101:1019
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG SaslDataTransferClient - SASL client skipping handshake in secured configuration with privileged port for addr = /10.79.252.101, datanodeId = DatanodeInfoWithStorage[10.79.252.101:1019,DS-7f4ec55d-7c5f-4a0e-b817-d9e635480b21,DISK]
*2020-04-22 13:25:52,571 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG DFSInputStream - DFSInputStream has been closed already*
*2020-04-22 13:25:52,573 [nioEventLoopGroup-3-6] DEBUG HistoryServerStaticFileServerHandler - Unable to load requested file /jobs/overview.json from classloader*
2020-04-22 13:25:52,576 [IPC Parameter Sending Thread #0] DEBUG Client$Connection$3 - IPC Client (1578587450) connection to d279536-002.dc.gs.com/10.59.61.87:8020 from d...@gs.com sending #1391
Aside from that, it looks like a lot of
logging around datanodes and block location
metadata. Did I miss something in my
classpath, perhaps? If so, do you have a
suggestion on what I could try?
*// *ah
*From:*Chesnay Schepler <ches...@apache.org
<mailto:ches...@apache.org>>
*Sent:* Wednesday, April 22, 2020 2:16 AM
*To:* Hailu, Andreas [Engineering]
<andreas.ha...@ny.email.gs.com
<mailto:andreas.ha...@ny.email.gs.com>>;
user@flink.apache.org
<mailto:user@flink.apache.org>
*Subject:* Re: History Server Not Showing Any
Jobs - File Not Found?
Which Flink version are you using?
Have you checked the history server logs after
enabling debug logging?
On 21/04/2020 17:16, Hailu, Andreas
[Engineering] wrote:
Hi,
I’m trying to set up the History Server,
but none of my applications are showing up
in the Web UI. Looking at the console, I
see that all of the calls to /overview
return the following 404 response:
{"errors":["File not found."]}.
I’ve set up my configuration as follows:
JobManager Archive directory:
*jobmanager.archive.fs.dir*: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r----- 3 delp datalake_admin_dev 50569 2020-03-21 23:17 /user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r----- 3 delp datalake_admin_dev 49578 2020-03-03 08:45 /user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r----- 3 delp datalake_admin_dev 50842 2020-03-24 15:19 /user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...
...
History Server will fetch the archived
jobs from the same location:
*historyserver.archive.fs.dir*: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
So I’m able to confirm that there are
indeed archived applications that I should
be able to view in the History Server. I’m not
able to find out what file the overview
service is looking for from the repository
– any suggestions as to what I could look
into next?
Best,
Andreas
------------------------------------------------------------------------
Your Personal Data: We may collect and
process information about you that may be
subject to data protection laws. For more
information about how we use and disclose
your personal data, how we protect your
information, our legal basis to use your
information, your rights and who you can
contact, please refer to:
www.gs.com/privacy-notices
<http://www.gs.com/privacy-notices>