Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Xiangrui Meng
Was "user" present in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das wrote: > Hi, > > I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but > the code fails on userFeatures.l
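
The check suggested in this reply can be sketched in plain Python. This is only an illustration, not the MLlib implementation: it assumes the model's factors are held in ordinary dictionaries, and the names `predict_or_nan`, `user_features`, and `product_features` are hypothetical. Applying the same check to the product ID is also an assumption; the reply mentions only the user.

```python
import math


def predict_or_nan(user_features, product_features, user, product):
    """Return the dot product of the two factor vectors, or NaN when
    either ID was not seen in training (hypothetical helper)."""
    u = user_features.get(user)
    p = product_features.get(product)
    if u is None or p is None:
        # Unknown ID: signal "no prediction" instead of raising.
        return math.nan
    return sum(a * b for a, b in zip(u, p))


# Toy factors standing in for a trained model (made-up values).
user_features = {1: [0.5, 1.0]}
product_features = {10: [2.0, 3.0]}

print(predict_or_nan(user_features, product_features, 1, 10))  # 4.0
print(predict_or_nan(user_features, product_features, 2, 10))  # nan
```

Returning NaN lets a caller filter out unscorable pairs with `math.isnan` rather than catching an exception from a failed lookup.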

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Zach Fry
Hey Andrew, Matei, Thanks for responding. For some more context, we were running into "Too many open files" issues where we were seeing this happen immediately after the Collect phase (about 30 seconds into a run) on a decently sized dataset (14 MM rows). The ulimit set in the spark-env was 256,0

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW this had a bug with negative hash codes in 1.1.0 so you should try branch-1.1 for it). Matei > On Nov 3, 2014, at 6:28 PM, Matei Zaharia wrote: > > In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have > better performance while creating fewer files. So I'd suggest

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance while creating fewer files. So I'd suggest trying that too. Matei > On Nov 3, 2014, at 6:12 PM, Andrew Or wrote: > > Hey Matt, > > There's some prior work that compares consolidation performance on

Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Andrew Or
Hey Matt, There's some prior work that compares consolidation performance on some medium-scale workload: http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf There we noticed about 2x performance degradation in the reduce phase on ext3. I am not aware of a

Spark shuffle consolidateFiles performance degradation quantification

2014-11-03 Thread Matt Cheah
Hi everyone, I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering if this is a common scenario among the rest of the community, and if so, if it is worth considering the setting to be turned on by default. From the

Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
Hi everyone, I'm running into more and more cases where too many files are opened when spark.shuffle.consolidateFiles is turned off. I was wondering if this is a common scenario among the rest of the community, and if so, if it is worth considering the setting to be turned on by default. From the

MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head. In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called, and in all the test-cases that API has been used. I can perhaps refactor my code to d

Re: matrix factorization cross validation

2014-11-03 Thread Debasish Das
I added the precisionAt(k: Int) drivers for the movielens test-cases, although I am a bit confused by the precisionAt(k: Int) code from RankingMetrics.scala. While cross validating, I am really not sure how to set k: if (labSet.nonEmpty) { val n = math.min(pred.length, k) ... } If I
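
To make the quoted fragment concrete, here is a minimal plain-Python sketch of one common definition of precision at k. It mirrors the `math.min(pred.length, k)` step from the message, but everything else is an assumption: the function name `precision_at_k` is hypothetical, and dividing the hit count by k (rather than by the number of predictions actually inspected) is one convention, not necessarily what RankingMetrics.scala does.

```python
def precision_at_k(pred, labels, k):
    """Fraction of the top-k predictions that are relevant.

    pred   -- predicted items, best first
    labels -- ground-truth relevant items
    k      -- cutoff, must be positive
    """
    if k <= 0:
        raise ValueError("k must be positive")
    lab_set = set(labels)
    if not lab_set:
        # No ground truth for this user: score it as 0.0.
        return 0.0
    # Inspect at most k predictions, as in the quoted fragment.
    n = min(len(pred), k)
    hits = sum(1 for item in pred[:n] if item in lab_set)
    # Assumed convention: divide by k even when fewer than k
    # predictions are available.
    return hits / k


print(precision_at_k([1, 2, 3, 4], [2, 4, 9], 2))  # 0.5
print(precision_at_k([1], [1], 5))                 # 0.2
```

Under this convention a user with fewer than k predictions is penalized, which is one reason the choice of k matters during cross validation.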

Re: Surprising Spark SQL benchmark

2014-11-03 Thread ozgun
Hey Patrick, It's Ozgun from Citus Data. We'd like to make these benchmark results fair, and have tried different config settings for SparkSQL over the past month. We picked the best config settings we could find, and also contacted the Spark users list about running TPC-H numbers. http://goo.gl/

Re: branch-1.2 has been cut

2014-11-03 Thread Nicholas Chammas
Minor question, but when would be the right time to update the default Spark version in the EC2 script? On Mon, Nov 3, 2014 at 3:55 AM, Patrick Wendell wrote: > Hi All, > > I've just cut the rele

Re: sbt scala compiler crashes on spark-sql

2014-11-03 Thread Imran Rashid
Thanks everyone, that worked. I had been cleaning just the "sql" project, which wasn't enough, but after a full clean of everything it's happy now. Just in case this helps anybody else come up with steps to reproduce, for me the error was always in DataTypeConversions.scala, and I think it *might* h

branch-1.2 has been cut

2014-11-03 Thread Patrick Wendell
Hi All, I've just cut the release branch for Spark 1.2, consistent with the end of the scheduled feature window for the release. New commits to master will need to be explicitly merged into branch-1.2 in order to be in the release. This begins the transition into a QA period for Spark 1.2, with