Re: Largest input data set observed for Spark.

2014-03-20 Thread Andrew Ash
Understood, of course. Did the data fit comfortably in memory, or did you experience memory pressure? I've had to do a fair amount of tuning under memory pressure in the past (0.7.x) and was hoping that the handling of this scenario has improved in later Spark versions.
On Thu, Mar 20, 2014 a

Re: Largest input data set observed for Spark.

2014-03-20 Thread Henry Saputra
Reynold, just curious: did you guys run it in AWS?
- Henry
On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin wrote:
> Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
> I didn't count the size of the uncompressed data, but I am guessing it is
> somewhere between 200TB and 700

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
I'm not really at liberty to discuss details of the job. It involves some expensive aggregated statistics, and took 10 hours to complete (mostly bottlenecked by network & I/O).
On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> Reynold,
>
> How complex w
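For a rough sense of scale, the figures quoted in this thread (70TB+ compressed, 28 worker nodes, 10 hours, a guessed 200TB to 700TB uncompressed) can be turned into per-node averages. The sketch below is only back-of-the-envelope arithmetic, not anything the posters ran: the function names are made up for illustration, decimal units (1 TB = 10^12 bytes) are an assumption, and 70TB is treated as an exact value even though the thread says "70TB+".

```python
# Back-of-the-envelope arithmetic on the figures quoted in the thread.
# Assumptions (not from the thread): decimal units, 1 TB = 1,000,000 MB,
# and "70TB+" treated as exactly 70 TB, so results are lower bounds.

COMPRESSED_TB = 70                   # "70TB+ compressed data"
WORKER_NODES = 28                    # "28 worker nodes"
RUNTIME_HOURS = 10                   # "took 10 hours to complete"
UNCOMPRESSED_TB_RANGE = (200, 700)   # the poster's own guess


def per_node_tb(total_tb, nodes):
    """Average amount of compressed input handled by each node, in TB."""
    return total_tb / nodes


def per_node_mb_per_s(total_tb, nodes, hours):
    """Average compressed input rate per node, in MB/s, over the whole run."""
    total_mb = total_tb * 1_000_000
    return total_mb / nodes / (hours * 3600)


def compression_ratio_range(compressed_tb, uncompressed_range):
    """Implied (low, high) uncompressed-to-compressed ratio."""
    low, high = uncompressed_range
    return low / compressed_tb, high / compressed_tb


if __name__ == "__main__":
    print("TB per node:", per_node_tb(COMPRESSED_TB, WORKER_NODES))
    print("MB/s per node:", round(per_node_mb_per_s(COMPRESSED_TB, WORKER_NODES, RUNTIME_HOURS), 1))
    print("Compression ratio range:", compression_ratio_range(COMPRESSED_TB, UNCOMPRESSED_TB_RANGE))
```

Under those assumptions this works out to about 2.5 TB of compressed input per node, read at an average of roughly 69 MB/s per node, with an implied compression ratio between about 2.9x and 10x. The rate is an average over the full 10 hours including the aggregation work, so it says little about peak disk or network bandwidth.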

Re: Largest input data set observed for Spark.

2014-03-20 Thread Surendranauth Hiraman
Reynold,
How complex was that job (I guess in terms of number of transforms and actions) and how long did that take to process?
-Suren
On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin wrote:
> Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
> I didn't count the size of

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB.
On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani wrote:
> All,
> What is the largest input data set y'all have com

Largest input data set observed for Spark.

2014-03-20 Thread Usman Ghani
All,
What is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?