Konstantinos,
Sure, if you have a resource leak then the collector can't free up memory and the process will use more memory. Time to break out the profiler and see where the memory is going.

The usual suspects are handles to resources (open file streams, sockets, etc.) kept in containers (arrays, lists, etc.). If they're in a container, they can't be collected. Another one is keeping handlers in a container, where the handler may hold an internal handle to an open resource. If the handler (aka listener, aka observer) refers to an open resource and is still sitting in a container, then the underlying resource can't be collected either. A profiler will show you where the memory is going.
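
To make that concrete, here's a contrived sketch of the second case (the class and field names are made up, this isn't from your code): a listener kept in a collection pins an open file handle for as long as nobody removes it.

import java.io.FileInputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical listener that holds a handle to an open stream.
class FileListener(path: String) {
  val in = new FileInputStream(path)   // open file handle kept by the listener
  def close(): Unit = in.close()
}

object Registry {
  // While a listener sits in this buffer it is strongly reachable,
  // so neither it nor its FileInputStream can be collected.
  val listeners = ArrayBuffer.empty[FileListener]

  def register(path: String): Unit =
    listeners += new FileListener(path)

  // Forgetting to call this is the leak: memory and file
  // descriptors grow with every register().
  def deregister(l: FileListener): Unit = {
    l.close()
    listeners -= l
  }
}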

FWIW, touching 1 million or 5 million inodes is itself a likely bottleneck (profile to check). If the file system does turn out to be the bottleneck, consider bundling the files up into archives that you access together.

HDFS, for example, was designed for larger files. Even if you're not using HDFS, millions of small files are kryptonite for parallel file systems (Panasas, Lustre, GPFS, etc).

Old Cloudera blog post, but may be relevant here:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
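
If it helps, a rough sketch of the bundling idea in Spark (the paths are made up, and I'm assuming an existing SparkContext called sc): read the small files once, pack them into SequenceFiles keyed by filename, and have later jobs read a handful of large files instead of millions of small ones.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// (filename, contents) pairs for every small file under the directory
val smallFiles = sc.wholeTextFiles("hdfs:///data/xml-small/*")

// pack them into SequenceFiles; String keys/values are written as Text
smallFiles.saveAsSequenceFile("hdfs:///data/xml-packed")

// later jobs read the packed form instead of millions of inodes
val packed = sc.sequenceFile[String, String]("hdfs:///data/xml-packed")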


-Ewan

On 13/07/15 10:19, Konstantinos Kougios wrote:
I do have other non-xml tasks and I was getting the same SIGTERM on all of them. I think the issue might be due to me processing small files via binaryFiles or wholeTextFiles. Initially I had issues with Xmx memory because I have more than 1 million files (and in one case 5 million files). I sorted that out by processing them in batches of 32k, but then this started happening. I've set the memoryOverhead to 4g for most of the tasks and it is OK now, but 4g is too much for tasks that process small files. I do have 32 threads per executor on some tasks, but 32 MB for stack & thread overhead should do. Maybe the issue is sockets or some memory leak in the network communication.
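
(For anyone else hitting this, the batching looks roughly like the sketch below; listInputPaths and processBatch are made-up placeholders, and the overhead was raised with --conf spark.yarn.executor.memoryOverhead=4096 on spark-submit.)

// split the input listing into batches of ~32k paths and run
// binaryFiles once per batch, so no single job has to deal with
// millions of file statuses at once
val allPaths: Seq[String] = listInputPaths()      // placeholder
allPaths.grouped(32000).foreach { batch =>
  // binaryFiles accepts a comma-separated list of input paths
  val rdd = sc.binaryFiles(batch.mkString(","))
  processBatch(rdd)                               // placeholder
}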

On 13/07/15 09:15, Ewan Higgs wrote:
It depends on how large the xml files are and how you're processing them.

If you're using !ENTITY tags then you don't need a very large piece of xml to consume a lot of memory, e.g. the billion laughs xml:
https://en.wikipedia.org/wiki/Billion_laughs
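
If your files might contain DOCTYPE declarations, one way to rule entity expansion out (just a sketch, assuming a plain JAXP DocumentBuilder rather than whatever parser you're actually using) is to refuse DTDs and external entities up front:

import javax.xml.parsers.DocumentBuilderFactory

val dbf = DocumentBuilderFactory.newInstance()
// refuse DOCTYPE declarations entirely, so entity expansion
// (billion laughs) can't blow up the heap
dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
// and don't resolve external entities either
dbf.setFeature("http://xml.org/sax/features/external-general-entities", false)
dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false)

val doc = dbf.newDocumentBuilder().parse(new java.io.File("input.xml"))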

-Ewan

On 13/07/15 10:11, Konstantinos Kougios wrote:
It was the memoryOverhead. It runs OK with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some xml files. The tasks themselves require less Xmx.

Cheers

On 13/07/15 06:29, Jong Wook Kim wrote:
Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than is allowed by the spark.yarn.executor.memoryOverhead configuration. This might be due to too many classes being loaded (less than MaxPermGen but more than memoryOverhead), or some other off-heap memory allocated by a networking library, etc.
- it opens too many file descriptors, which you can check in /proc/<executor jvm's pid>/fd/ on the executor node (a quick check is sketched below)

Does any of these apply to your situation?
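
(For the last point, an illustrative way to count them from inside the executor JVM itself; /proc is Linux-specific.)

import java.io.File

// number of file descriptors currently held by this JVM
val fds = Option(new File("/proc/self/fd").listFiles).map(_.length).getOrElse(-1)
println(s"open file descriptors: $fds")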

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios <[email protected]> wrote:

I am still receiving these weird SIGTERMs on the executors. The driver claims
it lost the executor, and the executor receives a SIGTERM (from whom???).

It doesn't seem to be a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on
either the driver or the executor, and nothing in the logs indicating so.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:32976] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
lost)

GC log for the driver; it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
1782011K->1392596K(3401216K), 0.0167572 secs]

executor:

15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
(TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL 15: SIGTERM
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

executor GC log (no out-of-memory, as far as I can see):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
24709182K->23726510K(33518592K), 0.0390784 secs]





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






