I have an RDD[(String, MyObj)] that is the result of a join followed by a
map. It has no partitioner info. I run reduceByKey on it without passing a
Partitioner or a partition count. I have observed that the aggregated
result for a given key is sometimes incorrect, roughly 1 run out of 5. It
looks like the reduce operation is combining values from two different
keys. There is no configuration change between runs, and I am scratching
my head over this. I verified the results by printing the RDD before and
after the reduce operation, collecting a subset at the driver.
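
For reference, the check I describe looks roughly like the minimal sketch below. It is not the actual job: the MyObj case class is a hypothetical stand-in for my real class, and merge, sampleKeys, the sample data, the app name, and the local master are all illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for MyObj; the real class is not shown in this post.
case class MyObj(count: Long)

object ReduceCheck {
  // Hypothetical reduce function, assumed here to simply add counts.
  def merge(a: MyObj, b: MyObj): MyObj = MyObj(a.count + b.count)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch only.
    val conf = new SparkConf().setAppName("reduce-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the join + map output.
      val pairs: RDD[(String, MyObj)] = sc.parallelize(Seq(
        "a" -> MyObj(1L), "b" -> MyObj(2L), "a" -> MyObj(3L)))

      // Keys to inspect at the driver (illustrative).
      val sampleKeys = Set("a")

      // Collect the subset before the reduce.
      val before = pairs.filter { case (k, _) => sampleKeys(k) }.collect()

      // Reduce without passing a Partitioner or a partition count.
      val reduced = pairs.reduceByKey(merge)

      // Collect the same subset after the reduce.
      val after = reduced.filter { case (k, _) => sampleKeys(k) }.collect()

      // Recompute the aggregate on the driver from the "before" rows.
      val expected = before.groupBy(_._1).map { case (k, vs) =>
        k -> vs.map(_._2).reduce(merge)
      }

      // Print both values side by side for comparison.
      after.foreach { case (k, v) =>
        println(s"$k: reduced=$v expected=${expected(k)}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the two values per key for the sampled keys is how I spot the runs where the reduced value does not match.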
Besides the shuffle and storage memory fractions, I use the following options:
sparkConf.set("spark.driver.userClassPathFirst","true")
sparkConf.set("spark.unsafe.offHeap","true")
sparkConf.set("spark.reducer.maxSizeInFlight","128m")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")