I should have mentioned: yes I am using Kryo and have registered KeyClass and ValueClass.
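For reference, the setup is roughly the following (a from-memory sketch - the app name is invented, but the serializer property and registration call match what I have):

  import org.apache.spark.{SparkConf, SparkContext}

  // Kryo setup - KeyClass and ValueClass are the sequence file input
  // types used in the job quoted below.
  val conf = new SparkConf()
    .setAppName("distinct-days") // hypothetical name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))

  val spark = new SparkContext(conf)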
I guess it’s not clear to me what is actually taking up space on the driver heap - I can’t see how it can be data, given the code that I have.

On 27/08/2015 12:09, "Ewan Leith" <ewan.le...@realitymine.com> wrote:

>Are you using the Kryo serializer? If not, have a look at it - it can save a
>lot of memory during shuffles:
>
>https://spark.apache.org/docs/latest/tuning.html
>
>I did a similar task and had various issues with the volume of data being
>parsed in one go, but Kryo helped a lot. The main difference I can see
>between your job and mine is that my input classes were just a string and a
>byte array, which I processed once they were read into the RDD - maybe your
>classes are memory-heavy?
>
>Thanks,
>Ewan
>
>-----Original Message-----
>From: andrew.row...@thomsonreuters.com
>[mailto:andrew.row...@thomsonreuters.com]
>Sent: 27 August 2015 11:53
>To: user@spark.apache.org
>Subject: Driver running out of memory - caused by many tasks?
>
>I have a Spark v1.4.1 on YARN job whose first stage has ~149,000 tasks (it’s
>reading a few TB of data). The job itself is fairly simple - it just collects
>a list of distinct values:
>
>  val days = spark
>    .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
>    .sample(withReplacement = false, fraction = 0.01)
>    .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
>    .distinct()
>    .collect()
>
>The cardinality of ‘day’ is quite small - there’s only a handful of distinct
>values. However, I’m frequently running into OutOfMemory errors on the
>driver. I’ve had it fail with 24GB of RAM, and am currently nudging the heap
>upwards to find where it works. The ratio between input and shuffle write in
>the distinct stage is about 3TB:7MB. On a smaller dataset, the job runs
>without issue on a smaller (4GB) heap. In YARN cluster mode, I get a failure
>message similar to:
>
>  Container
>  [pid=36844,containerID=container_e15_1438040390147_4982_01_000001] is
>  running beyond physical memory limits. Current usage: 27.6 GB of 27 GB
>  physical memory used; 29.5 GB of 56.7 GB virtual memory used. Killing
>  container.
>
>Is the driver running out of memory simply because of the number of tasks, or
>is there something about the job itself that pulls a lot of data into the
>driver heap? If it’s the former, is there any general guidance on how much
>memory to give the driver as a function of the task count?
>
>Andrew
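PS: in case the sheer number of tasks turns out to be the problem, one thing I may try is coalescing the input down before the shuffle, so the first stage has far fewer (but larger) tasks. An untested sketch - the partition count here is plucked from the air:

  val days = spark
    .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
    .coalesce(10000) // merge ~149k input splits into ~10k tasks, no shuffle
    .sample(withReplacement = false, fraction = 0.01)
    .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
    .distinct()
    .collect()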