Hello Dev! A Spark structured streaming job with a simple window aggregation is leaking file descriptors when running with Kubernetes as the cluster manager. It looks like a bug. I am using HDFS as the file system for checkpointing. Has anyone observed the same? Thanks for any help.
Please find more details in the trailing email.

For the full error log, please follow the GitHub gist below:
https://gist.github.com/abhisheyke/1ecd952f2ae6af20cf737308a156f566

Some details about the file descriptors (lsof):
https://gist.github.com/abhisheyke/27b073e7ac805ce9e6bb33c2b011bb5a

Code snippet:
https://gist.github.com/abhisheyke/6f838adf6651491bd4f263956f403c74

Thanks.
Best Regards,
*Abhishek Tripathi*

On Thu, Jul 19, 2018 at 10:02 AM Abhishek Tripathi <aktripathi1...@gmail.com> wrote:
> Hello All!
> I am using Spark 2.3.1 on Kubernetes to run a structured streaming job
> which reads a stream from Kafka, performs a window aggregation, and
> writes the output sink back to Kafka.
> After the job has been running for a few hours (5-6 hours), the executor
> pods crash with "Too many open files in system".
> Digging in further with the "lsof" command, I can see that a lot of UNIX
> pipes are being opened.
>
> # lsof -p 14 | tail
> java 14 root *112u a_inode 0,10 0 8838 [eventpoll]
> java 14 root *113r FIFO    0,9  0t0 252556158 pipe
> java 14 root *114w FIFO    0,9  0t0 252556158 pipe
> java 14 root *115u a_inode 0,10 0 8838 [eventpoll]
> java 14 root *119r FIFO    0,9  0t0 252552868 pipe
> java 14 root *120w FIFO    0,9  0t0 252552868 pipe
> java 14 root *121u a_inode 0,10 0 8838 [eventpoll]
> java 14 root *131r FIFO    0,9  0t0 252561014 pipe
> java 14 root *132w FIFO    0,9  0t0 252561014 pipe
> java 14 root *133u a_inode 0,10 0 8838 [eventpoll]
>
> The total count of open fds grows to 85K (I increased the hard ulimit) for
> each pod, and once it hits the hard limit the executor pod crashes.
> I can see that shuffling needs more fds, but in my case the open fd count
> keeps growing forever. I am not sure how to estimate how many fds would be
> adequate, or whether this is a bug.
> With that uncertainty, I increased the hard ulimit to a large number (85k),
> but no luck. It looks like there is a file descriptor leak.
>
> This Spark job runs with the native Kubernetes support as the Spark
> cluster manager. It currently uses only two executors with 20 cores
> (request) and 10GB (+6GB as memoryOverhead) of physical memory each.
>
> Has anyone else seen a similar problem?
> Thanks for any suggestion.
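>
> For reference, this is roughly the shape of the job. It is only a minimal
> sketch: the topic names, window/watermark durations, Kafka brokers, and the
> HDFS checkpoint path below are placeholders, not the real values; the
> actual code is in the "Code snippet" gist.
>
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.functions._
>
>   val spark = SparkSession.builder().appName("windowed-agg").getOrCreate()
>   import spark.implicits._
>
>   // Read a stream from Kafka ("input-topic" and the broker are placeholders).
>   val input = spark.readStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "kafka:9092")
>     .option("subscribe", "input-topic")
>     .load()
>     .selectExpr("CAST(value AS STRING) AS value", "timestamp")
>
>   // Simple event-time window aggregation (durations are illustrative).
>   val counts = input
>     .withWatermark("timestamp", "10 minutes")
>     .groupBy(window($"timestamp", "5 minutes"))
>     .count()
>
>   // Write the aggregated stream back to Kafka, checkpointing to HDFS.
>   val query = counts
>     .selectExpr("CAST(window.start AS STRING) AS key",
>                 "CAST(`count` AS STRING) AS value")
>     .writeStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "kafka:9092")
>     .option("topic", "output-topic")
>     .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/windowed-agg")
>     .outputMode("update")
>     .start()
>
>   query.awaitTermination()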
> Error details:
> Caused by: java.io.FileNotFoundException:
> /tmp/blockmgr-b530983c-39be-4c6d-95aa-3ad12a507843/24/temp_shuffle_bf774cf5-fadb-4ca7-a27b-a5be7b835eb6
> (Too many open files in system)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>   at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
>   at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
>   at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
>
> For the full error log, please follow the GitHub gist below:
> https://gist.github.com/abhisheyke/1ecd952f2ae6af20cf737308a156f566
>
> Some details about the file descriptors (lsof):
> https://gist.github.com/abhisheyke/27b073e7ac805ce9e6bb33c2b011bb5a
>
> Code snippet:
> https://gist.github.com/abhisheyke/6f838adf6651491bd4f263956f403c74
>
> Platform details:
> Kubernetes version: 1.9.2
> Docker: 17.3.2
> Spark version: 2.3.1
> Kafka version: 2.11-0.10.2.1 (both topics have 20 partitions and receive
> almost 5k records/s)
> Hadoop version (using HDFS for checkpointing): 2.7.2
>
> Thank you for any help.
>
> Best Regards,
> *Abhishek Tripathi*
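
One more data point I am looking at: the stack trace above is from
BypassMergeSortShuffleWriter, which opens roughly one temporary shuffle file
per shuffle partition for each running map task, so the number of files open
at once per micro-batch scales with spark.sql.shuffle.partitions (default 200)
times the executor's task slots. A rough sketch of the settings involved is
below; the executor sizing mirrors the numbers above, and the lowered
partition count is just an experiment I am trying, not a confirmed fix:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("windowed-agg")
    // Executor sizing as described above: 2 executors, 20 cores, 10g + 6g overhead.
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "20")
    .config("spark.executor.memory", "10g")
    .config("spark.executor.memoryOverhead", "6g")
    // Default is 200; each micro-batch's shuffle can open ~one temp file per
    // partition per running map task, so lowering this reduces concurrently
    // open files (experimental value, not a recommendation).
    .config("spark.sql.shuffle.partitions", "40")
    .getOrCreate()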