I should have mentioned: yes, I am using Kryo and have registered KeyClass and 
ValueClass.
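
For reference, the setup looks roughly like this (a minimal sketch - the app 
name is made up, but KeyClass/ValueClass are the real input types):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("distinct-days")  // hypothetical name
      // Use Kryo instead of the default Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids shipping full class names with each record
      .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))

    val spark = new SparkContext(conf)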
It's still not clear to me what is actually taking up space on the driver 
heap - given the code above, I can't see how it could be the data itself.

On 27/08/2015 12:09, "Ewan Leith" <ewan.le...@realitymine.com> wrote:

>Are you using the Kryo serializer? If not, have a look at it - it can save a 
>lot of memory during shuffles:
>
>https://spark.apache.org/docs/latest/tuning.html
>
>I did a similar task and had various issues with the volume of data being 
>parsed in one go, but Kryo helped a lot. The main difference I can see between 
>your job and mine is that my input classes were just a string and a byte 
>array, which I processed only after reading them into the RDD - maybe your 
>classes are memory-heavy?
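>
>As a rough illustration (a sketch, not my exact code - sc and inputDir are 
>placeholders), I copied the raw Writables into plain Scala types straight 
>away, since Hadoop reuses Writable instances between records:
>
>    import org.apache.hadoop.io.{BytesWritable, Text}
>
>    val raw = sc.sequenceFile(inputDir, classOf[Text], classOf[BytesWritable])
>      // Copy out of the reused Writables into compact immutable values
>      .map { case (k, v) => (k.toString, v.copyBytes()) }  // (String, Array[Byte])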
>
>
>Thanks,
>Ewan
>
>-----Original Message-----
>From: andrew.row...@thomsonreuters.com 
>[mailto:andrew.row...@thomsonreuters.com] 
>Sent: 27 August 2015 11:53
>To: user@spark.apache.org
>Subject: Driver running out of memory - caused by many tasks?
>
>I have a Spark 1.4.1 on YARN job where the first stage has ~149,000 tasks 
>(it's reading a few TB of data). The job itself is fairly simple - it's just 
>getting a list of distinct values:
>
>    val days = spark
>      .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
>      .sample(withReplacement = false, fraction = 0.01)  // keep ~1% of records
>      .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))  // key -> day string
>      .distinct()
>      .collect()  // only the handful of distinct days comes back to the driver
>
>The cardinality of 'day' is quite small - there are only a handful of distinct 
>values. However, I'm frequently running into OutOfMemory errors on the driver. 
>I've had it fail with 24 GB of RAM, and am currently nudging it upwards to find 
>out where it works. The ratio of input to shuffle write in the distinct stage 
>is about 3 TB : 7 MB. On a smaller dataset, the job works without issue on a 
>smaller (4 GB) heap. In YARN cluster mode, I get a failure message similar to:
>
>    Container [pid=36844,containerID=container_e15_1438040390147_4982_01_000001]
>    is running beyond physical memory limits. Current usage: 27.6 GB of 27 GB
>    physical memory used; 29.5 GB of 56.7 GB virtual memory used. Killing container.
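>
>For what it's worth, this is roughly how I'm submitting the job while nudging 
>the memory upwards (a sketch - the class and jar names are illustrative, and 
>the overhead value is just my current guess):
>
>    # The container limit above is driver heap plus YARN memory overhead,
>    # so both settings matter.
>    spark-submit \
>      --master yarn-cluster \
>      --driver-memory 24g \
>      --conf spark.yarn.driver.memoryOverhead=3072 \
>      --class com.example.DistinctDays \
>      job.jar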
>
>
>Is the driver running out of memory simply due to the number of tasks, or is 
>there something about the job that's causing it to pull a lot of data into the 
>driver heap and go OOM? If the former, is there any general guidance on how 
>much memory to give the driver as a function of the number of tasks?
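>
>One experiment I'm considering (a sketch, untested - the partition count is 
>arbitrary) is coalescing the input to cut the task count, to see whether the 
>driver pressure drops:
>
>    val days = spark
>      .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
>      // Without shuffle=true, coalesce merges input splits, so the read
>      // stage runs ~2,000 tasks instead of ~149,000
>      .coalesce(2000)
>      .sample(withReplacement = false, fraction = 0.01)
>      .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
>      .distinct()
>      .collect()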
>
>Andrew
