Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread arthur.hk.c...@gmail.com
Wonderful !!

On 11 Oct, 2014, at 12:00 am, Nan Zhu  wrote:

> Great! Congratulations!
> 
> -- 
> Nan Zhu
> On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
> 
>> Brilliant stuff ! Congrats all :-)
>> This is indeed really heartening news !
>> 
>> Regards,
>> Mridul
>> 
>> 
>> On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia  
>> wrote:
>>> Hi folks,
>>> 
>>> I interrupt your regularly scheduled user / dev list to bring you some 
>>> pretty cool news for the project, which is that we've been able to use 
>>> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
>>> faster on 10x fewer nodes. There's a detailed writeup at 
>>> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
>>>  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
>>> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
>>> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
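>>> 
>>> For anyone curious what such a job looks like at the API level, the core
>>> of it is just sortByKey over key-value records. A minimal sketch (not the
>>> actual benchmark code; the HDFS paths and record layout shown are just
>>> illustrative):
>>> 
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>   import org.apache.spark.SparkContext._
>>> 
>>>   object SortSketch {
>>>     def main(args: Array[String]): Unit = {
>>>       val sc = new SparkContext(new SparkConf().setAppName("sort-sketch"))
>>>       // Split each line into a (key, value) pair; the GraySort format
>>>       // uses a 10-byte key followed by a 90-byte payload.
>>>       val records = sc.textFile("hdfs:///input/records")
>>>         .map(line => (line.substring(0, 10), line.substring(10)))
>>>       // sortByKey range-partitions and shuffles the data, then sorts
>>>       // within each partition, giving a totally ordered result.
>>>       records.sortByKey()
>>>         .map { case (k, v) => k + v }
>>>         .saveAsTextFile("hdfs:///output/sorted")
>>>       sc.stop()
>>>     }
>>>   }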
>>> 
>>> I want to thank Reynold Xin for leading this effort over the past few 
>>> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
>>> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
>>> providing the machines to make this possible. Finally, this result would of
>>> course not be possible without the many, many other contributions, testing,
>>> and feature requests from throughout the community.
>>> 
>>> For an engine to scale from these multi-hour petabyte batch jobs down to 
>>> 100-millisecond streaming and interactive queries is quite uncommon, and 
>>> it's thanks to all of you folks that we are able to make this happen.
>>> 
>>> Matei
> 



Re: Surprising Spark SQL benchmark

2014-11-01 Thread arthur.hk.c...@gmail.com
Hi Kay,

Thank you so much for your update!!
I look forward to the shared code from AMPLab. As a member of the Spark
community, I really hope I can help run TPC-DS on Spark SQL. At the
moment, I am running the 22 TPC-H queries on Spark SQL 1.1.0 with Hive
0.12 and with Hive 0.13.1 respectively (while waiting for Spark 1.2).
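
In case it is useful to anyone trying the same thing, here is a minimal
sketch of running one (abbreviated) TPC-H query through HiveContext on
Spark SQL 1.1; it assumes the TPC-H tables (e.g. lineitem) have already
been loaded into the Hive metastore:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object TpchQ1Sketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("tpch-q1"))
      val hc = new HiveContext(sc)
      // A trimmed version of TPC-H Q1 (pricing summary report).
      val q1 = """
        SELECT l_returnflag, l_linestatus,
               SUM(l_quantity) AS sum_qty,
               SUM(l_extendedprice) AS sum_base_price
        FROM lineitem
        WHERE l_shipdate <= '1998-09-02'
        GROUP BY l_returnflag, l_linestatus"""
      hc.sql(q1).collect().foreach(println)
      sc.stop()
    }
  }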

Arthur  

On 1 Nov, 2014, at 3:51 am, Kay Ousterhout  wrote:

> There's been an effort in the AMPLab at Berkeley to set up a shared
> codebase that makes it easy to run TPC-DS on Spark SQL, since it's something
> we do frequently in the lab to evaluate new research. Based on this
> thread, it sounds like making this more widely available is something that
> would be useful to folks for reproducing the results published by
> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
> list as soon as we're done.
> 
> -Kay
> 
> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
> 
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>> 
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this is
>> a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not laid
>> out as long as there is independent certification from a third party.
>> 
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> https://amplab.cs.berkeley.edu/benchmark/
>> 
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>> 
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>> 
>> Nick
>> 
>> 
>> On Friday, October 31, 2014, Steve Nunez wrote:
>> 
>>> To be fair, we (the Spark community) haven't been any better; for
>>> example, this benchmark:
>>> 
>>>https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>> 
>>> 
>>> For which no details or code have been released to allow others to
>>> reproduce it. I would encourage anyone doing a Spark benchmark in the
>>> future to avoid the stigma of vendor-reported benchmarks and publish enough
>>> information and code to let others repeat the exercise easily.
>>> 
>>>- Steve
>>> 
>>> 
>>> 
>>> On 10/31/14, 11:30, "Nicholas Chammas" wrote:
>>> 
 Thanks for the response, Patrick.
 
 I guess the key takeaways are 1) the tuning/config details are everything
 (they're not laid out here), 2) the benchmark should be reproducible (it's
 not), and 3) reach out to the relevant devs before publishing (didn't
 happen).
 
 Probably key takeaways for any kind of benchmark, really...
 
 Nick
 
 
 On Friday, October 31, 2014, Patrick Wendell wrote:
 
> Hey Nick,
> 
> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
> developers when running this. It is really easy to make one system
> look better than others when you are running a benchmark yourself
> because tuning and sizing can lead to a 10X performance improvement.
> This benchmark doesn't share the mechanism in a reproducible way.
> 
> There are a bunch of things that aren't clear here:
> 
> 1. Spark SQL has optimized Parquet features; were these turned on?
> 2. It doesn't mention computing statistics in Spark SQL, but it does
> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
> small tables, which can make a 10X difference in TPC-H.
> 3. For data larger than memory, Spark SQL often performs better if you
> don't call "cache"; did they try this?
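> 
> To make those knobs concrete, here is a minimal sketch of the relevant
> settings (this assumes a HiveContext built on an existing SparkContext
> sc, with "lineitem" as an example table; the threshold value shown is
> just the default):
> 
>   import org.apache.spark.sql.hive.HiveContext
> 
>   val sqlContext = new HiveContext(sc)
>   // (1) Parquet tuning, e.g. the compression codec; Spark 1.2 also adds
>   // predicate pushdown via spark.sql.parquet.filterPushdown.
>   sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")
>   // (2) Compute table statistics so that small tables qualify for
>   // broadcast joins; the cutoff is spark.sql.autoBroadcastJoinThreshold
>   // (in bytes).
>   sqlContext.sql("ANALYZE TABLE lineitem COMPUTE STATISTICS noscan")
>   sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")
>   // (3) For data larger than memory, skip .cache()/.cacheTable(...) and
>   // let queries scan from disk instead of thrashing the block cache.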
> 
> Basically, a self-reported marketing benchmark like this that
> *shocker* concludes this vendor's solution is the best, is not
> particularly useful.
> 
> If Citus Data wants to run a credible benchmark, I'd invite them to
> directly involve Spark SQL developers in the future. Until then, I
> wouldn't give much credence to this or any other similar vendor
> benchmark.
> 
> - Patrick
> 
> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas wrote:
>> I know we don't want to be jumping at every benchmark someone posts out
>> there, but this one surprised me:
>> 
>> 
>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>> 
>> This benchmark has Spark SQL failing to complete several queries in the
>> TPC-H benchmark. I don't understand much about the details of performing
>> benchmarks, but this was surprising.