Go through this once, if you haven't read it already. https://spark.apache.org/docs/latest/tuning.html
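
A couple of things from that guide tend to matter most for a setup like yours: Kryo serialization with your key class registered, a serialized storage level for the cached RDD, and reduceByKey/aggregateByKey with an explicit partition count instead of groupBy for the roll-ups. Here's a rough sketch of what I mean (the CustomPOJO stub, the toy data and the partition count of 48 are placeholders; substitute your real classes and size the parallelism for your cluster):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Arrays;

public class TuningSketch {

    // stand-in for your CustomPOJO key class; register the real one instead
    public static class CustomPOJO implements Serializable {
        public final String dimension;
        public CustomPOJO(String dimension) { this.dimension = dimension; }
        @Override public int hashCode() { return dimension.hashCode(); }
        @Override public boolean equals(Object o) {
            return o instanceof CustomPOJO && ((CustomPOJO) o).dimension.equals(dimension);
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("analytics-tuning-sketch")
            .setMaster("local[*]")  // local test only; drop this on the cluster
            // Kryo is usually far more compact than default Java serialization
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .registerKryoClasses(new Class<?>[]{ CustomPOJO.class })
            // default number of partitions produced by shuffles (groupBy, join, ...)
            .set("spark.default.parallelism", "48");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // toy data standing in for the JdbcRDD load at startup
        JavaPairRDD<CustomPOJO, Long> data = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>(new CustomPOJO("north"), 10L),
            new Tuple2<>(new CustomPOJO("north"), 5L),
            new Tuple2<>(new CustomPOJO("south"), 7L)));

        // serialized storage keeps the cached RDD much smaller than plain MEMORY_ONLY
        data.persist(StorageLevel.MEMORY_ONLY_SER());

        // reduceByKey combines values map-side before the shuffle, unlike groupBy,
        // and the explicit partition count keeps the post-shuffle partitioning predictable
        JavaPairRDD<CustomPOJO, Long> rolledUp = data.reduceByKey((a, b) -> a + b, 48);

        System.out.println(rolledUp.count() + " keys after roll-up");
        sc.stop();
    }
}

Kryo plus MEMORY_ONLY_SER usually shrinks the cached footprint of HashMap-heavy records quite a bit, and the map-side combine in reduceByKey is what keeps the shuffle write down compared to a plain groupBy.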
Thanks
Best Regards

On Sat, Mar 28, 2015 at 7:33 PM, nsareen <nsar...@gmail.com> wrote:

> Hi All,
>
> I'm facing performance issues with our Spark implementation, and while briefly
> investigating the WebUI logs, I noticed that my RDD size is 55 GB, the Shuffle
> Write is 10 GB and the Input Size is 200 GB. The application is a web
> application which does predictive analytics, so we keep most of our data in
> memory. This observation was for only 30 minutes of usage of the application by
> a single user. We anticipate at least 10-15 users of the application sending
> requests in parallel, which makes me a bit nervous.
>
> One constraint we have is that we do not have too many nodes in the cluster;
> we may end up with 3-4 machines at best, but they can be scaled up vertically,
> each having 24 cores / 512 GB RAM etc., which can allow us to make a virtual
> 10-15 node cluster.
>
> Even then, the input size and shuffle write are too high for my liking. Any
> suggestions in this regard will be greatly appreciated, as there aren't many
> resources on the net for handling performance issues such as these.
>
> Some pointers on my application's data structures and design:
>
> 1) The RDD is a JavaPairRDD, with the Key being a CustomPOJO containing 3-4
> HashMaps and the Value containing 1 HashMap.
> 2) Data is loaded via JdbcRDD during application startup, which also tends to
> take a lot of time, since we massage the data once it is fetched from the DB
> and then save it as a JavaPairRDD.
> 3) Most of the data is structured, but we are still using JavaPairRDD and have
> not explored the option of Spark SQL yet.
> 4) We have only one SparkContext, which caters to all the requests coming into
> the application from various users.
> 5) During a single user session, the user can send 3-4 parallel stages
> consisting of Map / Group By / Join / Reduce etc.
> 6) We have to change the RDD structure using different types of group-by
> operations, since the user can drill down / drill up through the data
> (aggregation at a higher / lower level). This is where we make use of
> groupBy's, but there is a cost associated with them.
> 7) We have observed that the initial RDDs we create have 40-odd partitions,
> but after some stage executions, such as groupBy's, the partitions increase to
> 200 or so. This was odd, and we haven't figured out why it happens.
>
> In summary, we want to use Spark to give us the capability to process our
> in-memory data structure very fast, as well as to scale to a larger volume
> when required in the future.
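
On (3) and (6) above: since the data is mostly structured and the drill-down / drill-up operations are essentially aggregations, the DataFrame API in Spark 1.3 might be worth a look; its aggregations do partial (map-side) aggregation before the shuffle, which a plain groupBy over POJOs does not. On (7): when no partition count is passed to groupBy, the post-shuffle partition count comes from spark.default.parallelism if it is set (otherwise the largest parent RDD's partition count), so it's worth checking what that is in your config. A rough sketch of the DataFrame route, with a made-up flat bean (Measure) standing in for one row of your data:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.sum;

import java.io.Serializable;
import java.util.Arrays;

public class DataFrameSketch {

    // hypothetical flat bean standing in for one "row" of your structured data
    public static class Measure implements Serializable {
        private String region;
        private String product;
        private double value;
        public Measure() {}
        public Measure(String region, String product, double value) {
            this.region = region; this.product = product; this.value = value;
        }
        public String getRegion() { return region; }
        public void setRegion(String region) { this.region = region; }
        public String getProduct() { return product; }
        public void setProduct(String product) { this.product = product; }
        public double getValue() { return value; }
        public void setValue(double value) { this.value = value; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "dataframe-sketch");
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Measure> rows = sc.parallelize(Arrays.asList(
            new Measure("north", "A", 10.0),
            new Measure("north", "B", 5.0),
            new Measure("south", "A", 7.0)));

        // schema is inferred from the bean's getters
        DataFrame df = sqlContext.createDataFrame(rows, Measure.class);
        df.cache();

        // drill-up to region level; drill-down is just a different groupBy column list
        DataFrame byRegion = df.groupBy("region").agg(sum("value"));
        byRegion.show();

        sc.stop();
    }
}

Whether this beats your current JavaPairRDD layout depends on how flat you can make the rows, so treat it as an experiment rather than a drop-in fix.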