Hi everyone,
When I run a Spark job that contains quite a lot of tasks (in my case
200,000 map tasks * 200,000 reduce tasks), the driver hits an OOM mainly
caused by MapStatus objects. As shown in the picture below, the
RoaringBitmap used to mark which blocks are empty seems to use too much memory.
Are there any suggestions for reducing this memory usage?
In our case, we are dealing with 20 TB of text data, split into about
200k map tasks and 200k reduce tasks, and our driver has 15 GB of memory.
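
For reference, here is a rough, self-contained Scala sketch (not the actual
Spark code) of the pattern HighlyCompressedMapStatus follows as I understand
it; the block sizes and the 50% empty ratio are made-up values just to show
where the bitmap memory goes:

import org.roaringbitmap.RoaringBitmap
import scala.util.Random

object EmptyBlockSketch {
  def main(args: Array[String]): Unit = {
    val numReducers = 200000
    // Fake per-reducer shuffle block sizes: roughly half empty, for illustration.
    val blockSizes = Array.fill(numReducers)(if (Random.nextBoolean()) 0L else 1024L)

    // One bitmap per map task, marking which reduce-side blocks are empty.
    val emptyBlocks = new RoaringBitmap()
    for (reduceId <- 0 until numReducers if blockSizes(reduceId) == 0L) {
      emptyBlocks.add(reduceId)
    }
    emptyBlocks.runOptimize() // switch to run-length containers where cheaper

    println(s"empty blocks: ${emptyBlocks.getCardinality}")
    println(s"serialized size: ${emptyBlocks.serializedSizeInBytes} bytes")
    // The driver keeps one such status per map task, so with 200k map tasks
    // even a few tens of KB per bitmap adds up to gigabytes.
  }
}

With roughly half of the blocks empty and no long runs, the bitmap degenerates
into dense containers of ~31 KB per status, which matches the numbers I
measured below.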
I tried using org.apache.spark.util.collection.BitSet instead of
RoaringBitmap; it saves about 20% of the memory but runs much slower.
For the 200K-task job:
RoaringBitmap uses 3 Long[1024] and 1 Short[3392]:
  3*64*1024 + 16*3392 = 250880 bits
BitSet uses 1 Long[3125]:
  3125*64 = 200000 bits
Memory saving is therefore about 20%.
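
To double-check the arithmetic, here is a small Scala sketch (the container
layout is taken from the measurements above; the driver-side total assumes
one such bitmap per map task and ignores object headers and other per-status
fields):

object MapStatusMemoryEstimate {
  def main(args: Array[String]): Unit = {
    // RoaringBitmap degenerates into 3 full bitmap containers (Long[1024] each)
    // plus one array container (Short[3392]) for this distribution of empty blocks.
    val roaringBits = 3L * 64 * 1024 + 16L * 3392 // = 250880 bits
    // BitSet is a flat Long[3125]: one bit per reduce task, 3125 * 64 >= 200000.
    val bitSetBits = 3125L * 64                    // = 200000 bits

    val saving = 1.0 - bitSetBits.toDouble / roaringBits
    println(f"saving: ${saving * 100}%.1f%%")      // ~20.3%, matching the ~20% above

    // Driver-side total, assuming one bitmap per map task (200k map tasks):
    val totalGb = 200000L * (roaringBits / 8) / math.pow(1024, 3)
    println(f"bitmaps alone on the driver: $totalGb%.1f GB") // ~5.8 GB
  }
}

So on a 15 GB driver, the bitmaps alone account for several GB before counting
object headers, the map of statuses itself, and everything else the driver holds.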