Re: Parquet Migrations

2014-10-31 Thread Michael Armbrust
You can't change a Parquet schema without re-encoding the data, because the footer index data has to be recalculated. However, you can manually do today what SPARK-3851 is going to do. Consider two schemas: Old Schema: (a: Int, b: String) New Schema, whe
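
The snippet above is cut off before the new schema is shown, so the following is only a minimal sketch of one way a manual reconciliation could work, not necessarily the approach the message goes on to describe or what SPARK-3851 implements. It assumes the old and new data are read separately, old rows are projected onto the new schema with missing columns filled with None, and the two sets are then unioned. The extra column `c` and the helper names `project_to_schema` and `union_with_schema` are hypothetical, and plain Python dictionaries stand in for rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical schemas: "a" and "b" come from the message above,
# while the added column "c" is an assumption made for illustration.
OLD_SCHEMA = ["a", "b"]
NEW_SCHEMA = ["a", "b", "c"]


def project_to_schema(row: Row, schema: List[str]) -> Row:
    """Return a copy of the row with exactly the columns in `schema`.

    Columns missing from the row are filled with None.
    """
    return {col: row.get(col) for col in schema}


def union_with_schema(old_rows: List[Row], new_rows: List[Row],
                      schema: List[str]) -> List[Row]:
    """Project both row sets onto `schema` and concatenate them."""
    return [project_to_schema(r, schema) for r in old_rows + new_rows]


# Rows written under the old schema, and rows written under the new one.
old_rows = [{"a": 1, "b": "x"}]
new_rows = [{"a": 2, "b": "y", "c": 3.5}]

merged = union_with_schema(old_rows, new_rows, NEW_SCHEMA)
```

In this sketch the old data is never rewritten; only the view over both data sets shares a single schema, which is the point of reconciling at read time.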

Re: Spark consulting

2014-10-31 Thread Stephen Boesch
Hi Alessandro, It is important to me, and probably to others as well, to be able to focus on the technical issues and not be distracted that way. Thanks, stephenb 2014-10-31 13:48 GMT-07:00 Alessandro Baretta : > Stephen, > > Sorry for being OT. On the other hand, there is no j...@spark.apach

Re: Spark consulting

2014-10-31 Thread Alessandro Baretta
Stephen, Sorry for being OT. On the other hand, there is no j...@spark.apache.org, and the LinkedIn Spark group is a desert. Alex On Fri, Oct 31, 2014 at 1:44 PM, Stephen Boesch wrote: > May we please refrain from using the Spark mailing list for job inquiries. > Thanks. > > 2014-10-31 13:35 GMT-0

Parquet Migrations

2014-10-31 Thread Gary Malouf
Outside of what is discussed here as a future solution, is there any path for being able to modify a Parquet schema once some data has been written? This seems like the kind of thing that should make people pause when considering whether or not to

Re: Spark consulting

2014-10-31 Thread Stephen Boesch
May we please refrain from using the Spark mailing list for job inquiries. Thanks. 2014-10-31 13:35 GMT-07:00 Alessandro Baretta : > Hello, > > Is anyone open to do some consulting work on Spark in San Mateo? > > Thanks. > > Alex >

Spark consulting

2014-10-31 Thread Alessandro Baretta
Hello, Is anyone open to do some consulting work on Spark in San Mateo? Thanks. Alex

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Kay Ousterhout
There's been an effort in the AMPLab at Berkeley to set up a shared codebase that makes it easy to run TPC-DS on Spark SQL, since it's something we do frequently in the lab to evaluate new research. Based on this thread, it sounds like making this more widely available is something that would be us

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
I believe that benchmark has a pending certification on it. See http://sortbenchmark.org under "Process". It's true they did not share enough details on the blog for readers to reproduce the benchmark, but they will have to share enough with the committee behind the benchmark in order to be certif

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Steve Nunez
To be fair, we (the Spark community) haven’t been any better. Take, for example, this benchmark: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html for which no details or code have been released to allow others to reproduce it. I would encourage anyone doing a Spark benchmark in futur

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
Thanks for the response, Patrick. I guess the key takeaways are 1) the tuning/config details are everything (they're not laid out here), 2) the benchmark should be reproducible (it's not), and 3) reach out to the relevant devs before publishing (didn't happen). Probably key takeaways for any kind

Re: Surprising Spark SQL benchmark

2014-10-31 Thread Patrick Wendell
Hey Nick, Unfortunately Citus Data didn't contact any of the Spark or Spark SQL developers when running this. It is really easy to make one system look better than others when you are running a benchmark yourself because tuning and sizing can lead to a 10X performance improvement. This benchmark d

Surprising Spark SQL benchmark

2014-10-31 Thread Nicholas Chammas
I know we don't want to be jumping at every benchmark someone posts out there, but this one surprised me: http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style This benchmark has Spark SQL failing to complete several queries in the TPC-H benchmark. I don't understand much about th

Re: matrix factorization cross validation

2014-10-31 Thread Sean Owen
No, excepting approximate methods like LSH to figure out the relatively small set of candidates for the users in the partition, and broadcast or join those. On Fri, Oct 31, 2014 at 5:45 AM, Nick Pentreath wrote: > Sean, re my point earlier do you know a more efficient way to compute top k > for e