I got some time to look in to it. It appears as that Spark (latest git)
is doing this operation much more often compare to Aug 1 version. Here
is the log from operation I am referring to
14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO r
Hi,
I am running spark from the git directly. I recently compiled the newer
version Aug 13 version and it has performance drop of 2-3x in read from
HDFS compare to git version of Aug 1. So I am wondering which commit
would have cause such an issue in read performance. The performance is
almost sam