Konstantinos,
Sure, if you have a resource leak then the collector can't free up memory and the process will use more memory. Time to break out the profiler and see where the memory is going.

The usual suspects are handles to resources (open file streams, sockets, etc.) kept in containers (arrays, lists, etc.). If they're in a container, they can't be collected. Another one is keeping handlers in a container, where the handler may hold an internal handle to an open resource. If the handler (aka listener, aka observer) refers to an open resource and is still sitting in a container, then the underlying resource can't be collected either. A profiler will show you where the memory is going.
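
To make that concrete, here's a contrived sketch of the second case (the class and field names are made up, this isn't from your code): a listener kept in a collection pins an open file handle for as long as nobody removes it.

import java.io.FileInputStream
import scala.collection.mutable.ArrayBuffer

// Hypothetical listener that holds a handle to an open stream.
class FileListener(path: String) {
  val in = new FileInputStream(path)   // open file handle kept by the listener
  def close(): Unit = in.close()
}

object Registry {
  // While a listener sits in this buffer it is strongly reachable,
  // so neither it nor its FileInputStream can be collected.
  val listeners = ArrayBuffer.empty[FileListener]

  def register(path: String): Unit =
    listeners += new FileListener(path)

  // Forgetting to call this is the leak: memory and file
  // descriptors grow with every register().
  def deregister(l: FileListener): Unit = {
    l.close()
    listeners -= l
  }
}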

FWIW, touching 1 million or 5 million inodes is itself a likely bottleneck (profile to check). If the file system does turn out to be the bottleneck, consider bundling the files up into archives that you access together.

HDFS, for example, was designed for larger files. Even if you're not using HDFS, millions of small files are kryptonite for parallel file systems (Panasas, Lustre, GPFS, etc).

Old Cloudera blog post, but may be relevant here:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
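
If it helps, a rough sketch of the bundling idea in Spark (the paths are made up, and I'm assuming an existing SparkContext called sc): read the small files once, pack them into SequenceFiles keyed by filename, and have later jobs read a handful of large files instead of millions of small ones.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// (filename, contents) pairs for every small file under the directory
val smallFiles = sc.wholeTextFiles("hdfs:///data/xml-small/*")

// pack them into SequenceFiles; String keys/values are written as Text
smallFiles.saveAsSequenceFile("hdfs:///data/xml-packed")

// later jobs read the packed form instead of millions of inodes
val packed = sc.sequenceFile[String, String]("hdfs:///data/xml-packed")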


-Ewan

On 13/07/15 10:19, Konstantinos Kougios wrote:
I do have other non-xml tasks and I was getting the same SIGTERM on all of them. I think the issue might be due to me processing small files via binaryFiles or wholeTextFiles. Initially I had issues with Xmx memory because I have more than 1 million files (and in one case 5 million files). I sorted that out by processing them in batches of 32k, but then this started happening. I've set the memoryOverhead to 4g for most of the tasks and it is OK now, but 4g is too much for tasks that process small files. I do have 32 threads per executor on some tasks, but 32 MB for stack & thread overhead should do. Maybe the issue is sockets or some memory leak in the network communication.
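
(For anyone else hitting this, the batching looks roughly like the sketch below; listInputPaths and processBatch are made-up placeholders, and the overhead was raised with --conf spark.yarn.executor.memoryOverhead=4096 on spark-submit.)

// split the input listing into batches of ~32k paths and run
// binaryFiles once per batch, so no single job has to deal with
// millions of file statuses at once
val allPaths: Seq[String] = listInputPaths()      // placeholder
allPaths.grouped(32000).foreach { batch =>
  // binaryFiles accepts a comma-separated list of input paths
  val rdd = sc.binaryFiles(batch.mkString(","))
  processBatch(rdd)                               // placeholder
}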

On 13/07/15 09:15, Ewan Higgs wrote:
It depends on how large the xml files are and how you're processing them.

If you're using !ENTITY tags then you don't need a very large piece of xml to consume a lot of memory, e.g. the billion laughs xml:
https://en.wikipedia.org/wiki/Billion_laughs
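
If your files might contain DOCTYPE declarations, one way to rule entity expansion out (just a sketch, assuming a plain JAXP DocumentBuilder rather than whatever parser you're actually using) is to refuse DTDs and external entities up front:

import javax.xml.parsers.DocumentBuilderFactory

val dbf = DocumentBuilderFactory.newInstance()
// refuse DOCTYPE declarations entirely, so entity expansion
// (billion laughs) can't blow up the heap
dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)
// and don't resolve external entities either
dbf.setFeature("http://xml.org/sax/features/external-general-entities", false)
dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false)

val doc = dbf.newDocumentBuilder().parse(new java.io.File("input.xml"))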

-Ewan

On 13/07/15 10:11, Konstantinos Kougios wrote:
It was the memoryOverhead. It runs OK with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some xml files. The tasks themselves require less Xmx.

Cheers

On 13/07/15 06:29, Jong Wook Kim wrote:
Based on my experience, YARN containers can get SIGTERM when

- it produces too many logs and uses up the hard drive
- it uses more off-heap memory than is allowed by the spark.yarn.executor.memoryOverhead configuration. This might be due to too many classes being loaded (less than MaxPermGen but more than memoryOverhead), or some other off-heap memory allocated by a networking library, etc.
- it opens too many file descriptors, which you can check in /proc/<executor jvm's pid>/fd/ on the executor node (a quick check is sketched below)

Does any of these apply to your situation?
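
(For the last point, an illustrative way to count them from inside the executor JVM itself; /proc is Linux-specific.)

import java.io.File

// number of file descriptors currently held by this JVM
val fds = Option(new File("/proc/self/fd").listFiles).map(_.length).getOrElse(-1)
println(s"open file descriptors: $fds")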

Jong Wook

On Jul 7, 2015, at 19:16, Kostas Kougios <[email protected]> wrote:

I am still receiving these weird SIGTERMs on the executors. The driver claims
it lost the executor, and the executor receives a SIGTERM (from whom???).

It doesn't seem to be a memory-related issue, though increasing memory takes the job a bit further or completes it. But why? There is no memory pressure on
either the driver or the executor, and nothing in the logs indicating so.

driver:

15/07/07 10:47:04 INFO scheduler.TaskSetManager: Starting task 14762.0 in stage 0.0 (TID 14762, cruncher03.stratified, PROCESS_LOCAL, 13069 bytes)
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Finished task 14517.0 in stage 0.0 (TID 14517) in 15950 ms on cruncher03.stratified (14507/42240)
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 ERROR cluster.YarnClusterScheduler: Lost executor 1 on
cruncher05.stratified: remote Rpc client disassociated
15/07/07 10:47:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
from TaskSet 0.0
15/07/07 10:47:04 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated
or disconnected! Shutting down. cruncher05.stratified:32976
15/07/07 10:47:04 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[email protected]:32976] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

15/07/07 10:47:04 WARN scheduler.TaskSetManager: Lost task 14591.0 in stage 0.0 (TID 14591, cruncher05.stratified): ExecutorLostFailure (executor 1
lost)

GC log for the driver; it doesn't look like it ran out of memory:

2015-07-07T10:45:19.887+0100: [GC (Allocation Failure)
1764131K->1391211K(3393024K), 0.0102839 secs]
2015-07-07T10:46:00.934+0100: [GC (Allocation Failure)
1764971K->1391867K(3405312K), 0.0099062 secs]
2015-07-07T10:46:45.252+0100: [GC (Allocation Failure)
1782011K->1392596K(3401216K), 0.0167572 secs]

executor:

15/07/07 10:47:03 INFO executor.Executor: Running task 14750.0 in stage 0.0
(TID 14750)
15/07/07 10:47:03 INFO spark.CacheManager: Partition rdd_493_14750 not
found, computing it
15/07/07 10:47:03 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
SIGNAL 15: SIGTERM
15/07/07 10:47:03 INFO storage.DiskBlockManager: Shutdown hook called

executor GC log (no out-of-memory, as far as I can see):
2015-07-07T10:47:02.332+0100: [GC (GCLocker Initiated GC)
24696750K->23712939K(33523712K), 0.0416640 secs]
2015-07-07T10:47:02.598+0100: [GC (GCLocker Initiated GC)
24700520K->23722043K(33523712K), 0.0391156 secs]
2015-07-07T10:47:02.862+0100: [GC (Allocation Failure)
24709182K->23726510K(33518592K), 0.0390784 secs]





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-tp23668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






