Re: Unit test logs in Jenkins?

2015-04-01 Thread Patrick Wendell
Hey Marcelo, Great question. Right now, some of the more active developers have an account that allows them to log into this cluster to inspect logs (we copy the logs from each run to a node on that cluster). The infrastructure is maintained by the AMPLab. I will put you in touch with someone there

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; dev@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark/b

Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792 There is also an attempt at adding stddev here already: https://github.com/apache/spark/pull/5228 On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang wrote: > Specifically
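A minimal sketch of the kind of call that link points at, using made-up data and the 1.3-era DataFrame API (built-in aggregates only; stddev was still just the pull request above):

    import org.apache.spark.sql.functions._

    // Hypothetical data: (department, salary) pairs turned into a DataFrame.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val df = Seq(("eng", 100.0), ("eng", 120.0), ("sales", 90.0))
      .toDF("department", "salary")

    // Aggregates as Column expressions...
    df.groupBy("department").agg(avg("salary"), max("salary")).show()

    // ...or by name, as (column -> aggregate-function) pairs.
    df.groupBy("department").agg("salary" -> "avg").show()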

Volunteers for Spark MOOCs

2015-04-01 Thread Ameet Talwalkar
Dear Spark Devs, Anthony Joseph and I are teaching two large MOOCs this summer on Apache Spark and we are looking for participants from the community who would like to help us administer the course. Anthony is a Professor in Computer Scie

Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Burak Yavuz
This is awesome! I can write the apps for it, to make the Web UI more functional! On Wed, Apr 1, 2015 at 12:37 AM, Tathagata Das wrote: > This is a significant effort that Reynold has undertaken, and I am super > glad to see that it's finally taking a concrete form. Would love to see > what the

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Jeremy, thanks for the explanation! What if you instead used the Parquet file format? You can still write a number of small files as you do, but you don't have to implement a writer/reader, because they are already available for Parquet in various languages. From: Jeremy Freeman [mailto:freeman.jer...@gmail.
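A minimal sketch of the Parquet route being suggested, assuming a hypothetical record layout and paths, with the 1.3-era SQLContext API:

    import org.apache.spark.sql.SQLContext

    // Hypothetical record layout; the case class supplies the Parquet schema.
    case class Sample(label: Double, features: Array[Double])

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val samples = sc.parallelize(Seq(
      Sample(1.0, Array(0.1, 0.2, 0.3)),
      Sample(0.0, Array(0.4, 0.5, 0.6))))

    // Writes a directory of part files; no hand-rolled reader/writer needed,
    // because Parquet libraries already exist for many languages.
    samples.toDF().saveAsParquetFile("hdfs:///data/mllib-train.parquet")

    // Read it back from Spark (or from Impala, Drill, parquet-mr, pandas, ...).
    val reloaded = sqlContext.parquetFile("hdfs:///data/mllib-train.parquet")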

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Jeremy Freeman
@Alexander, re: using flat binary and metadata, you raise excellent points! At least in our case, we decided on a specific endianness, but do end up storing some extremely minimal specification in a JSON file, and have written importers and exporters within our library to parse it. While it does
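A minimal sketch of the flat-binary side of this, assuming a made-up layout (64 little-endian doubles per record) of the kind such a JSON spec might describe:

    import java.nio.{ByteBuffer, ByteOrder}

    // The JSON sidecar might say something like
    //   {"dtype": "float64", "valuesPerRecord": 64, "byteOrder": "little"}
    // Here those values are simply hardcoded.
    val valuesPerRecord = 64
    val recordBytes = 8 * valuesPerRecord

    // sc.binaryRecords splits flat binary files into fixed-length records.
    val vectors = sc.binaryRecords("hdfs:///data/train.bin", recordBytes).map { bytes =>
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asDoubleBuffer()
      val arr = new Array[Double](buf.remaining())
      buf.get(arr)
      arr
    }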

Unit test logs in Jenkins?

2015-04-01 Thread Marcelo Vanzin
Hey all, Is there a way to access unit test logs in Jenkins builds? e.g., core/target/unit-tests.log That would be really helpful to debug build failures. The ScalaTest output isn't all that helpful. If that's currently not available, would it be possible to add those logs as build artifacts? -

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
Just using sc.textFile then a .map(decode). Yes, by default it is multiple files... our training data is 1TB gzipped into 5000 shards. On Wed, Apr 1, 2015 at 12:32 PM, Ulanov, Alexander wrote: > Thanks, sounds interesting! How do you load files to Spark? Did you > consider having multiple files i
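A minimal sketch of that load path, with a hypothetical Thrift-generated class standing in for the real record type (each .gz file is not splittable, which is why sharding into thousands of files keeps the job parallel):

    import org.apache.commons.codec.binary.Base64
    import org.apache.thrift.TDeserializer
    import org.apache.thrift.protocol.TBinaryProtocol

    // TrainingExample is a placeholder for your own Thrift-generated struct.
    def decode(line: String): TrainingExample = {
      val example = new TrainingExample()
      val deserializer = new TDeserializer(new TBinaryProtocol.Factory())
      deserializer.deserialize(example, Base64.decodeBase64(line))
      example
    }

    val examples = sc.textFile("s3n://bucket/training/*.gz").map(decode)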

RE: Using CUDA within Spark / boosting linear algebra

2015-04-01 Thread Ulanov, Alexander
FYI, I've added instructions to the Netlib-java wiki, and Sam added the link to them from the project's readme.md: https://github.com/fommil/netlib-java/wiki/NVBLAS Best regards, Alexander -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, March 30, 2015 2:43 PM To: Se
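A quick sanity check that goes well with those instructions: ask netlib-java which BLAS it actually loaded and push a small dgemm through it (the class names in the comment are the stock netlib-java implementations; the matrices are arbitrary):

    import com.github.fommil.netlib.BLAS

    // Prints NativeSystemBLAS, NativeRefBLAS or F2jBLAS depending on what
    // netlib-java managed to load; with NVBLAS set up as the system BLAS,
    // the native system implementation should appear here.
    println(BLAS.getInstance().getClass.getName)

    // C = alpha * A * B + beta * C on column-major 2x2 matrices.
    val a = Array(1.0, 2.0, 3.0, 4.0)
    val b = Array(5.0, 6.0, 7.0, 8.0)
    val c = new Array[Double](4)
    BLAS.getInstance().dgemm("N", "N", 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
    println(c.mkString(", "))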

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Thanks, sounds interesting! How do you load files to Spark? Did you consider having multiple files instead of file lines? From: Hector Yee [mailto:hector@gmail.com] Sent: Wednesday, April 01, 2015 11:36 AM To: Ulanov, Alexander Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org Subject:

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
I use Thrift and then base64 encode the binary and save it as text file lines that are snappy or gzip encoded. It makes it very easy to copy small chunks locally and play with subsets of the data, without taking a dependency on HDFS / Hadoop server infrastructure, for example. On Thu, Mar 26, 2015 at 2:5
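A minimal sketch of the write side of that pipeline, mirroring the load sketch earlier in this digest; the TrainingExample struct and the output path are hypothetical:

    import org.apache.commons.codec.binary.Base64
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.thrift.TSerializer
    import org.apache.thrift.protocol.TBinaryProtocol

    // examples: RDD[TrainingExample], a hypothetical Thrift-generated type.
    // Each record becomes one base64 text line; gzip keeps the shards small
    // and easy to pull down and inspect without any HDFS tooling.
    val lines = examples.mapPartitions { iter =>
      val serializer = new TSerializer(new TBinaryProtocol.Factory())
      iter.map(example => Base64.encodeBase64String(serializer.serialize(example)))
    }
    lines.saveAsTextFile("s3n://bucket/training-base64", classOf[GzipCodec])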

RE: Stochastic gradient descent performance

2015-04-01 Thread Ulanov, Alexander
Sorry for bothering you again, but I think this is an important issue for the applicability of SGD in Spark MLlib. Could the Spark developers please comment on it? -Original Message- From: Ulanov, Alexander Sent: Monday, March 30, 2015 5:00 PM To: dev@spark.apache.org Subject: Stochastic gra

Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Ted Yu
bq. writing the output (to Amazon S3) failed What's the value of "fs.s3.maxRetries"? Increasing the value should help. Cheers On Wed, Apr 1, 2015 at 8:34 AM, Romi Kuntsman wrote: > What about communication errors and not corrupted files? > Both when reading input and when writing output. > We
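A minimal sketch of where those retry settings can be applied from the driver, assuming the classic s3/s3n Hadoop filesystem that reads them (the values here are arbitrary):

    // The same properties can also live in core-site.xml or be passed as
    // --conf spark.hadoop.fs.s3.maxRetries=10 on spark-submit.
    sc.hadoopConfiguration.set("fs.s3.maxRetries", "10")
    sc.hadoopConfiguration.set("fs.s3.sleepTimeSeconds", "15")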

Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Romi Kuntsman
What about communication errors and not corrupted files? Both when reading input and when writing output. We currently experience a failure of the entire process if the last stage of writing the output (to Amazon S3) fails because of a very temporary DNS resolution issue (easily resolved by retry

Re: One corrupt gzip in a directory of 100s

2015-04-01 Thread Gil Vernik
I actually saw the same issue: we analyzed a container with a few hundred GBs of zip files - one was corrupted and Spark exited with an exception on the entire job. I like SPARK-6593, since it can also cover additional cases, not just corrupted zip files. From: Dale Richardso

Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Tathagata Das
This is a significant effort that Reynold has undertaken, and I am super glad to see that it's finally taking a concrete form. Would love to see what the community thinks about the idea. TD On Wed, Apr 1, 2015 at 3:11 AM, Reynold Xin wrote: > Hi Spark devs, > > I've spent the last few months in

Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Kushal Datta
Reynold, what's the idea behind using LLVM? On Wed, Apr 1, 2015 at 12:31 AM, Akhil Das wrote: > Nice try :) > > Thanks > Best Regards > > On Wed, Apr 1, 2015 at 12:41 PM, Reynold Xin wrote: > > > Hi Spark devs, > > > > I've spent the last few months investigating the feasibility of > > re-archi

Re: Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Akhil Das
Nice try :) Thanks Best Regards On Wed, Apr 1, 2015 at 12:41 PM, Reynold Xin wrote: > Hi Spark devs, > > I've spent the last few months investigating the feasibility of > re-architecting Spark for mobile platforms, considering the growing > population of Android/iOS users. I'm happy to share wi

Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Reynold Xin
Hi Spark devs, I've spent the last few months investigating the feasibility of re-architecting Spark for mobile platforms, considering the growing population of Android/iOS users. I'm happy to share with you my findings at https://issues.apache.org/jira/browse/SPARK-6646 The tl;dr is that we shou