Go through this once, if you haven't read it already.
https://spark.apache.org/docs/latest/tuning.html
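
In particular, the "Data Serialization" and "Memory Tuning" sections are relevant
for an RDD that large: Kryo plus a serialized storage level usually shrinks the
in-memory size you see on the Web UI considerably. A minimal sketch of what that
looks like with the Java API (the app name, the local master and the registered
classes are just placeholders for your own setup):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.StorageLevels;

    public class KryoSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("kryo-sketch")
            .setMaster("local[2]")  // only for trying this out locally
            // Kryo is generally much more compact than default Java serialization
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // Register the classes you actually cache/shuffle (your CustomPOJO, the
        // HashMaps inside it, etc.) so Kryo does not have to write full class names
        conf.registerKryoClasses(new Class<?>[]{ java.util.HashMap.class });

        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));

        // Cache serialized bytes instead of deserialized Java objects; this trades
        // some CPU on access for a much smaller memory footprint
        rdd.persist(StorageLevels.MEMORY_ONLY_SER);
        System.out.println(rdd.count());

        sc.stop();
      }
    }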

Thanks
Best Regards

On Sat, Mar 28, 2015 at 7:33 PM, nsareen <nsar...@gmail.com> wrote:

> Hi All,
>
> I'm facing performance issues with my Spark implementation. While briefly
> investigating the Web UI, I noticed that my RDD size is 55 GB, the
> Shuffle Write is 10 GB and the Input Size is 200 GB. The application is a
> web application which does predictive analytics, so we keep most of our
> data in memory. This observation was for only 30 minutes of usage by a
> single user. We anticipate at least 10-15 users of the application
> sending requests in parallel, which makes me a bit nervous.
>
> One constraint we have is that we do not have many nodes in the cluster;
> we may end up with 3-4 machines at best, but they can be scaled up
> vertically, each having 24 cores / 512 GB RAM, which would allow us to
> run a virtual 10-15 node cluster.
>
> Even then, the input size and shuffle write are too high for my liking.
> Any suggestions in this regard will be greatly appreciated, as there
> aren't many resources on the net for handling performance issues such as
> these.
>
> Some pointers on my application's data structures & design
>
> 1) The RDD is a JavaPairRDD whose key is a CustomPOJO containing 3-4
> HashMaps and whose value contains 1 HashMap.
> 2) Data is loaded via JDBCRDD during application startup, which also
> tends to take a lot of time, since we massage the data once it is fetched
> from the DB and then save it as a JavaPairRDD.
> 3) Most of the data is structured, but we are still using JavaPairRDD and
> have not yet explored the option of Spark SQL.
> 4) We have only one SparkContext which caters to all the requests coming
> into the application from various users.
> 5) During a single user session, a user can send 3-4 parallel stages
> consisting of map / group by / join / reduce operations, etc.
> 6) We have to change the RDD structure using different types of group-by
> operations, since the user can drill down / drill up through the data
> (aggregation at a higher / lower level). This is where we make use of
> groupBys, but there is a cost associated with this.
> 7) We have observed that the initial RDDs we create have 40-odd
> partitions, but after some stage executions such as groupBys the
> partition count increases to 200 or so. This seemed odd, and we haven't
> figured out why it happens.
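
On points 6 and 7: groupByKey ships every value across the network, which is
probably a big part of that 10 GB shuffle write, and when no partition count is
given the shuffle picks one for you (typically from spark.default.parallelism
when it is set, otherwise the largest parent RDD's partition count), which would
explain the jump to 200. If the drill-up is really an aggregation, reduceByKey
(or aggregateByKey) combines values map-side before shuffling and lets you pass
the number of partitions explicitly. A rough sketch, with a made-up
(region, measure) pair standing in for your real keys and values:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class DrillUpSketch {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("drill-up-sketch").setMaster("local[2]"));

        // Stand-in for your (dimension key, measure) pairs
        JavaPairRDD<String, Double> measures = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("regionA", 10.0),
            new Tuple2<>("regionA", 5.0),
            new Tuple2<>("regionB", 7.0)));

        int numPartitions = 40;  // pin the shuffle width instead of inheriting 200

        // reduceByKey combines values on the map side before shuffling, so far
        // less data crosses the network than with groupByKey
        JavaPairRDD<String, Double> rolledUp =
            measures.reduceByKey((a, b) -> a + b, numPartitions);

        System.out.println(rolledUp.collectAsMap());
        sc.stop();
      }
    }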
>
> In summary, we want to use Spark to give us the capability to process our
> in-memory data structures very fast, as well as to scale to a larger
> volume when required in the future.
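
Also, on point 3: since most of the data is structured, it is worth trying
Spark SQL / DataFrames (available as of 1.3). You get compact columnar
in-memory caching, and drill-down / drill-up becomes a declarative groupBy +
aggregate instead of reshaping the RDD by hand. A small sketch, assuming a
simple hypothetical bean (Measure, with region/value fields) in place of your
HashMap-based CustomPOJO:

    import java.io.Serializable;
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class DataFrameSketch {
      // Hypothetical bean standing in for your CustomPOJO / HashMap structure
      public static class Measure implements Serializable {
        private String region;
        private double value;
        public Measure() {}
        public Measure(String region, double value) { this.region = region; this.value = value; }
        public String getRegion() { return region; }
        public void setRegion(String region) { this.region = region; }
        public double getValue() { return value; }
        public void setValue(double value) { this.value = value; }
      }

      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("df-sketch").setMaster("local[2]"));
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Measure> rows = sc.parallelize(Arrays.asList(
            new Measure("regionA", 10.0),
            new Measure("regionA", 5.0),
            new Measure("regionB", 7.0)));

        // Infer the schema from the bean and cache it in the columnar format
        DataFrame df = sqlContext.createDataFrame(rows, Measure.class);
        df.cache();

        // Drill-up becomes a declarative aggregation; the planner handles the shuffle
        df.groupBy("region").sum("value").show();

        sc.stop();
      }
    }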
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
