Re: SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1

2015-03-12 Thread Michael Armbrust
We are looking at the issue and will likely fix it for Spark 1.3.1. On Thu, Mar 12, 2015 at 8:25 PM, giive chen wrote: > Hi all > > My team has the same issue. It looks like Spark 1.3's sparkSQL cannot read > parquet files generated by Spark 1.1. It will cost a lot of migration work > when we wan…

Re: Using CUDA within Spark / boosting linear algebra

2015-03-12 Thread Reynold Xin
Thanks for chiming in, John. I missed your meetup last night - do you have any writeups or slides about roofline design? In particular, I'm curious about what optimizations are available for power-law dense * sparse? (I don't have any background in optimizations.) On Thu, Mar 12, 2015 at 8:50 PM,…

Re: Using CUDA within Spark / boosting linear algebra

2015-03-12 Thread jfcanny
If you're contemplating GPU acceleration in Spark, it's important to look beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the datasets we've tested in BIDMach, and we've tried to make them representative of industry machine learning workloads. Unless you're crunching images or…

Re: SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1

2015-03-12 Thread giive chen
Hi all, My team has the same issue. It looks like Spark 1.3's sparkSQL cannot read parquet files generated by Spark 1.1. It will cost a lot of migration work when we want to upgrade to Spark 1.3. Can anyone help me? Thanks, Wisely Chen. On Tue, Mar 10, 2015 at 5:06 PM, Pei-Lun Lee wrote…

Re: Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Joseph Bradley
The checks against maxCategories are not for statistical purposes; they are there to make sure communication does not blow up. There are currently no checks that there are enough entries for statistically significant results; that is up to the user. I do like the idea of adding a warning…
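The communication concern is easy to see with a back-of-the-envelope estimate: the chi-squared test has to collect one count per (feature, feature value, label) cell, and that map is shipped back to the driver. A plain-Python sketch (the function name and the numbers are illustrative assumptions, not MLlib code):

```python
# Upper bound on the number of contingency-table cells the driver must
# collect for chi-squared feature testing: one count per
# (feature, featureValue, label) triple. This is what a maxCategories
# cap keeps bounded. Numbers below are illustrative.

def contingency_entries(num_features, num_categories, num_labels):
    """Worst-case count of distinct (feature, value, label) cells."""
    return num_features * num_categories * num_labels

# A modest problem: 1,000 features, 10 categories each, binary label.
small = contingency_entries(1_000, 10, 2)            # 20,000 cells

# A near-continuous feature treated as categorical: 1M distinct values.
blown_up = contingency_entries(1_000, 1_000_000, 2)  # 2,000,000,000 cells

print(small, blown_up)
```

With uncapped categories the count map grows with the number of distinct values, which is why the check guards communication rather than statistical validity.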

Profiling Spark: MemoryStore

2015-03-12 Thread Ulanov, Alexander
Hi, I am working on artificial neural networks for Spark. The model is trained with gradient descent, so at each step the data is read, the sum of gradients is computed for each data partition (on each worker), aggregated (on the driver), and broadcast back. I noticed that the gradient computation time is…
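The per-step pattern Alexander describes (local gradient sums per partition, driver-side aggregation, then rebroadcast of the updated weights) can be imitated in plain Python. This is a sketch of the pattern only; there is no Spark here, and the partition layout and learning rate are illustrative assumptions:

```python
# Plain-Python sketch of the distributed gradient-descent step described
# above: each "partition" computes a local gradient sum, the "driver"
# adds them up, updates the weights, and hands the new weights back for
# the next step. Single scalar weight, squared-error loss, y = w*x.

def local_gradient(partition, w):
    """Sum of squared-error gradients over one partition of (x, y) pairs."""
    g = 0.0
    for x, y in partition:
        g += (w * x - y) * x          # d/dw of 0.5 * (w*x - y)^2
    return g

def gd_step(partitions, w, lr):
    total = sum(local_gradient(p, w) for p in partitions)  # driver-side aggregate
    n = sum(len(p) for p in partitions)
    return w - lr * total / n         # new weight, "broadcast" for next step

# y = 2x, split across two "workers"
parts = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(200):
    w = gd_step(parts, w, lr=0.1)
print(round(w, 3))  # converges to 2.0
```

In real Spark code the aggregation step would typically be an `aggregate` or `treeAggregate` over the RDD rather than a Python `sum`.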

Spilling when not expected

2015-03-12 Thread Tom Hubregtsen
Hi all, I'm running the teraSort benchmark with a relatively small input set: 5GB. During profiling, I can see I am using a total of 68GB. I've got a terabyte of memory in my system, and set spark.executor.memory 900g and spark.driver.memory 900g. I use the defaults for spark.shuffle.memoryFraction and spar…
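One thing worth checking here: in Spark 1.x the shuffle does not get the whole executor heap. With the default `spark.shuffle.memoryFraction` (0.2) and `spark.shuffle.safetyFraction` (0.8), only a slice of the heap backs shuffle buffers before tasks start spilling. A rough arithmetic sketch (the task count is an illustrative assumption):

```python
# Back-of-the-envelope check of how much heap the shuffle can use in
# Spark 1.x before spilling. 0.2 and 0.8 are the Spark 1.x defaults for
# spark.shuffle.memoryFraction and spark.shuffle.safetyFraction; treat
# this as a sketch, not a tuning guide.

def shuffle_budget_gb(executor_mem_gb,
                      memory_fraction=0.2,
                      safety_fraction=0.8):
    return executor_mem_gb * memory_fraction * safety_fraction

# Even with a 900g executor, only a fraction of it backs shuffle buffers,
# and the per-task budget shrinks further with many concurrent tasks.
total = shuffle_budget_gb(900)   # ~144 GB for all tasks combined
per_task = total / 32            # e.g. 32 concurrent tasks -> ~4.5 GB each
print(total, per_task)
```

So spilling can kick in well before the nominal executor memory is exhausted, which may explain spills with a 900g heap.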

Re: adding some temporary jenkins worker nodes...

2015-03-12 Thread shane knapp
the big 1.3 push is over, so i'll be reclaiming these three extra workers. :) On Mon, Feb 9, 2015 at 5:18 PM, shane knapp wrote: > ...to help w/the build backlog. let's all welcome > amp-jenkins-slave-{01..03} back to the fray! >

Re: Using CUDA within Spark / boosting linear algebra

2015-03-12 Thread Shivaram Venkataraman
I have run some BLAS comparison benchmarks on different EC2 instance sizes and also on NERSC supercomputers. I can put together a github-backed website where we can host the latest benchmark results and update them over time. Sam - does that sound like what you had in mind? Thanks, Shivaram. On Tue…

Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Chunnan Yao
Hi everyone! I am currently digging into MLlib in Spark 1.2.1. While reading the code of MLlib.stat.test, in the file ChiSqTest.scala under /spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused by the usage of the mapPartitions API in the function def chiSquaredFeatures(data: RDD[La…
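The point of `mapPartitions` in this kind of counting code is that each partition can build a single count map in one pass and emit it once, instead of emitting one record per data point. A plain-Python imitation of the pattern (illustrative data and names, not MLlib's actual implementation):

```python
# Plain-Python imitation of the mapPartitions counting pattern: each
# partition folds its rows into one Counter keyed by
# (featureIndex, featureValue, label), and the partial maps are merged
# afterwards, so only one small map per partition crosses the network.

from collections import Counter

def count_partition(partition):
    """One pass over a partition of (features, label) rows."""
    counts = Counter()
    for features, label in partition:
        for j, v in enumerate(features):
            counts[(j, v, label)] += 1
    return counts

def merge_counts(partitions):
    total = Counter()
    for part in partitions:
        total.update(count_partition(part))  # driver-side reduce
    return total

parts = [
    [([0, 1], 1.0), ([0, 0], 0.0)],
    [([1, 1], 1.0)],
]
counts = merge_counts(parts)
print(counts[(0, 0, 1.0)], counts[(1, 1, 1.0)])  # 1 2
```

In Spark this would be `data.mapPartitions(...)` followed by a reduce/aggregate; a per-element `map` followed by `reduceByKey` would shuffle far more records for the same result.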