Hi Kay,

Thank you so much for your update! I look forward to the shared code from the AMPLab. As a member of the Spark community, I really hope I can help run TPC-DS on Spark SQL. At the moment, I am running the 22 TPC-H queries on Spark SQL 1.1.0 with Hive 0.12 and with Hive 0.13.1, respectively (while waiting for Spark 1.2).
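In case it helps anyone comparing notes, the queries can be driven through a HiveContext along these lines. This is only a minimal sketch: it assumes the dbgen tables (e.g. lineitem) are already registered in the Hive metastore, and it shows TPC-H Q1 with the standard date substitution.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TpchQ1 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tpch-q1"))
    val hiveContext = new HiveContext(sc)

    // TPC-H Q1: pricing summary report over the lineitem table.
    val q1 = hiveContext.sql("""
      SELECT l_returnflag, l_linestatus,
             SUM(l_quantity) AS sum_qty,
             SUM(l_extendedprice) AS sum_base_price,
             SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
             SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
             AVG(l_quantity) AS avg_qty,
             AVG(l_extendedprice) AS avg_price,
             AVG(l_discount) AS avg_disc,
             COUNT(*) AS count_order
      FROM lineitem
      WHERE l_shipdate <= '1998-09-02'
      GROUP BY l_returnflag, l_linestatus
      ORDER BY l_returnflag, l_linestatus""")

    q1.collect().foreach(println)
    sc.stop()
  }
}

Going through HiveContext rather than the plain SQLContext is what gives access to the Hive metastore tables and HiveQL syntax.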
Arthur

On 1 Nov, 2014, at 3:51 am, Kay Ousterhout <k...@eecs.berkeley.edu> wrote:

> There's been an effort in the AMPLab at Berkeley to set up a shared
> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
> we do frequently in the lab to evaluate new research. Based on this
> thread, it sounds like making this more widely available is something that
> would be useful to folks for reproducing the results published by
> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
> list as soon as we're done.
>
> -Kay
>
> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> I believe that benchmark has a pending certification on it. See
>> http://sortbenchmark.org under "Process".
>>
>> It's true they did not share enough details on the blog for readers to
>> reproduce the benchmark, but they will have to share enough with the
>> committee behind the benchmark in order to be certified. Given that this
>> is a benchmark not many people will be able to reproduce due to size and
>> complexity, I don't see it as a big negative that the details are not
>> laid out, as long as there is independent certification from a third party.
>>
>> From what I've seen so far, the best big data benchmark anywhere is this:
>> https://amplab.cs.berkeley.edu/benchmark/
>>
>> It has all the details you'd expect, including hosted datasets, to allow
>> anyone to reproduce the full benchmark, covering a number of systems. I
>> look forward to the next update to that benchmark (a lot has changed since
>> Feb). And from what I can tell, it's produced by the same people behind
>> Spark (Patrick being among them).
>>
>> So I disagree that the Spark community "hasn't been any better" in this
>> regard.
>>
>> Nick
>>
>>
>> On Friday, October 31, 2014, Steve Nunez <snu...@hortonworks.com> wrote:
>>
>>> To be fair, we (the Spark community) haven't been any better, for example
>>> this benchmark:
>>>
>>> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>>
>>> For which no details or code have been released to allow others to
>>> reproduce it. I would encourage anyone doing a Spark benchmark in future
>>> to avoid the stigma of vendor-reported benchmarks and publish enough
>>> information and code to let others repeat the exercise easily.
>>>
>>> - Steve
>>>
>>>
>>> On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Thanks for the response, Patrick.
>>>>
>>>> I guess the key takeaways are 1) the tuning/config details are
>>>> everything (they're not laid out here), 2) the benchmark should be
>>>> reproducible (it's not), and 3) reach out to the relevant devs before
>>>> publishing (didn't happen).
>>>>
>>>> Probably key takeaways for any kind of benchmark, really...
>>>>
>>>> Nick
>>>>
>>>>
>>>> On Friday, October 31, 2014, Patrick Wendell <pwend...@gmail.com> wrote:
>>>>
>>>>> Hey Nick,
>>>>>
>>>>> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>>>>> developers when running this. It is really easy to make one system
>>>>> look better than others when you are running a benchmark yourself,
>>>>> because tuning and sizing can lead to a 10X performance improvement.
>>>>> This benchmark doesn't share the mechanism in a reproducible way.
>>>>>
>>>>> There are a bunch of things that aren't clear here:
>>>>>
>>>>> 1. Spark SQL has optimized Parquet features; were these turned on?
>>>>> 2. It doesn't mention computing statistics in Spark SQL, but it does
>>>>> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>>>>> small tables, which can make a 10X difference in TPC-H.
>>>>> 3. For data larger than memory, Spark SQL often performs better if you
>>>>> don't call "cache"; did they try this?
>>>>>
>>>>> Basically, a self-reported marketing benchmark like this that,
>>>>> *shocker*, concludes this vendor's solution is the best is not
>>>>> particularly useful.
>>>>>
>>>>> If Citus Data wants to run a credible benchmark, I'd invite them to
>>>>> directly involve Spark SQL developers in the future. Until then, I
>>>>> wouldn't give much credence to this or any other similar vendor
>>>>> benchmark.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>>> I know we don't want to be jumping at every benchmark someone posts
>>>>>> out there, but this one surprised me:
>>>>>>
>>>>>> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>>>>>
>>>>>> This benchmark has Spark SQL failing to complete several queries in
>>>>>> the TPC-H benchmark. I don't understand much about the details of
>>>>>> performing benchmarks, but this was surprising.
>>>>>>
>>>>>> Are these results expected?
>>>>>>
>>>>>> Related HN discussion here: https://news.ycombinator.com/item?id=8539678
>>>>>>
>>>>>> Nick
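For anyone who wants to try the tuning points Patrick lists above (statistics so that small tables get broadcast, and skipping cache for data larger than memory), here is a minimal sketch against Spark SQL 1.1 with a HiveContext. The table names come from the TPC-H schema, the query is illustrative rather than a verbatim TPC-H query, and the 50 MB broadcast threshold is only an example value; option names may differ in other releases.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TpchTuningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tpch-tuning-sketch"))
    val hiveContext = new HiveContext(sc)

    // (2) Compute table-level statistics so Spark SQL can choose broadcast
    // joins for small dimension tables.
    hiveContext.sql("ANALYZE TABLE supplier COMPUTE STATISTICS noscan")
    hiveContext.sql("ANALYZE TABLE nation COMPUTE STATISTICS noscan")

    // Tables smaller than this size (in bytes) are broadcast to the workers
    // when joined. The 50 MB value is purely illustrative.
    hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    // (3) For data much larger than cluster memory, skipping cacheTable and
    // scanning the on-disk data directly often performs better:
    // hiveContext.cacheTable("lineitem")  // deliberately not called here

    // Illustrative join over the TPC-H schema (not a verbatim TPC-H query).
    val result = hiveContext.sql(
      "SELECT n_name, SUM(l_extendedprice * (1 - l_discount)) AS revenue " +
      "FROM lineitem JOIN supplier ON l_suppkey = s_suppkey " +
      "JOIN nation ON s_nationkey = n_nationkey " +
      "GROUP BY n_name ORDER BY revenue DESC")
    result.collect().foreach(println)

    sc.stop()
  }
}

The ANALYZE TABLE ... COMPUTE STATISTICS noscan step matters because, as of Spark SQL 1.1, broadcast-join decisions for Hive tables are based on the statistics recorded in the metastore.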