Re: Should spark-ec2 get its own repo?

2015-08-10 Thread Jeremy Freeman
Hi all, definitely a +1 to this plan. Wanted to also share this library for Spark + GCE by a collaborator of mine, Michael Broxton, which seems to expand and improve on the earlier one Nick pointed us to. It’s pip installable, not yet on spark-packages, but I’m sure he’d be game to add it. htt

Re: PySpark on PyPi

2015-07-24 Thread Jeremy Freeman
Hey all, great discussion, just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary python library. You might want to check out this project (https://github.com/minrk/findspark), started by Jupyter project devs, that off
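The core idea behind a tool like findspark is simply locating a Spark installation and putting its Python bindings on sys.path; here is a minimal sketch of that idea (not findspark's actual code, and the install path and py4j zip name are hypothetical placeholders):

```python
import os
import sys

def init_spark(spark_home=None):
    """Add Spark's bundled Python bindings to sys.path.

    Illustrative sketch only: real tools also discover the actual
    py4j-*.zip name and validate that the directories exist.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/opt/spark")
    # PySpark itself, plus the bundled py4j bridge it depends on
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-src.zip"))
    return spark_home

home = init_spark("/opt/spark")  # hypothetical install location
```

After a call like this, `import pyspark` would resolve against the given installation rather than requiring PySpark on the interpreter's default path.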

Re: Which method do you think is better for making MIN_REMEMBER_DURATION configurable?

2015-04-08 Thread Jeremy Freeman
+1 for this feature. In our use case, we probably wouldn’t use this feature in production, but it can be useful during prototyping and algorithm development to repeatedly perform the same streaming operation on a fixed, already existing set of files. - jeremyfreeman.net @

Re: Storing large data for MLlib machine learning

2015-04-01 Thread Jeremy Freeman
@Alexander, re: using flat binary and metadata, you raise excellent points! At least in our case, we decided on a specific endianness, but do end up storing some extremely minimal specification in a JSON file, and have written importers and exporters within our library to parse it. While it does
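To illustrate the kind of minimal JSON sidecar described above (the field names here are invented for illustration, not our actual spec), an exporter/importer pair can be just a few lines of stdlib Python:

```python
import json
import struct

# Hypothetical minimal sidecar spec for a flat binary file of
# fixed-length float64 records with an explicit endianness.
meta = {"dtype": "float64", "endianness": "little",
        "record_length": 3, "nrecords": 2}

fmt = ("<" if meta["endianness"] == "little" else ">") + "%dd" % meta["record_length"]
records = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

# Export: flat binary alongside a JSON spec
with open("data.bin", "wb") as f:
    for rec in records:
        f.write(struct.pack(fmt, *rec))
with open("data.json", "w") as f:
    json.dump(meta, f)

# Import: read the spec, then parse fixed-length records
with open("data.json") as f:
    spec = json.load(f)
size = struct.calcsize(fmt)
with open("data.bin", "rb") as f:
    loaded = [struct.unpack(fmt, f.read(size)) for _ in range(spec["nrecords"])]
```

The spec carries just enough (dtype, endianness, record length, count) for any reader to reconstruct the array without guessing.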

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Jeremy Freeman
Hi Ulanov, great question, we’ve encountered it frequently with scientific data (e.g. time series). Agreed, text is inefficient for dense arrays, and we also found HDF5+Spark to be a pain. Our strategy has been flat binary files with fixed length records. Loading these is now supported in Spar
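The fixed-length-record idea is easy to see in miniature: a flat byte buffer splits cleanly into records once you know the record size, which is the same idea behind Spark's binary-record loading. A stdlib-only sketch (helper name and sizes are made up for illustration):

```python
import struct

def split_records(data, record_length, dtype_size=8):
    """Split a flat byte buffer into fixed-length records of float64s.

    record_length: number of values per record
    dtype_size: bytes per value (8 for float64)
    """
    rec_bytes = record_length * dtype_size
    assert len(data) % rec_bytes == 0, "buffer is not a whole number of records"
    fmt = "<%dd" % record_length  # little-endian float64s
    return [struct.unpack(fmt, data[i:i + rec_bytes])
            for i in range(0, len(data), rec_bytes)]

buf = struct.pack("<6d", *range(6))  # two records of three float64s
recs = split_records(buf, record_length=3)
```

Because every record has the same byte length, the buffer can also be split at arbitrary record boundaries, which is what makes this layout friendly to parallel loading.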

Re: Google Summer of Code - ideas

2015-02-26 Thread Jeremy Freeman
For topic #4 (streaming ML in Python), there’s an existing JIRA, but progress seems to have stalled. I’d be happy to help if you want to pick it up! https://issues.apache.org/jira/browse/SPARK-4127 - jeremyfreeman.net @thefreemanlab On Feb 26, 2015, at 4:20 PM, Xiangrui

Re: Adding third party jars to classpath used by pyspark

2014-12-29 Thread Jeremy Freeman
Hi Stephen, it should be enough to include > --jars /path/to/file.jar in the command line call to either pyspark or spark-submit, as in > spark-submit --master local --jars /path/to/file.jar myfile.py and you can check the bottom of the Web UI’s “Environment” tab to make sure the jar gets on

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-02 Thread Jeremy Freeman
+1 (non-binding). Installed version pre-built for Hadoop on a private HPC; ran PySpark shell w/ IPython; loaded data using custom Hadoop input formats; ran MLlib routines in PySpark; ran custom workflows in PySpark; browsed the web UI. Noticeable improvements in stability and performance during large sh

Re: Python3 and spark 1.1.0

2014-11-06 Thread Jeremy Freeman
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There does seem to be interest, see also this post (http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html). I believe Ariel Rokem (cced) has been trying to get it to work and might be working o

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Jeremy Freeman
Great idea! +1 — Jeremy - jeremyfreeman.net @thefreemanlab On Nov 5, 2014, at 11:48 PM, Timothy Chen wrote: > Matei that makes sense, +1 (non-binding) > > Tim > > On Wed, Nov 5, 2014 at 8:46 PM, Cheng Lian wrote: >> +1 since this is already the de facto model we are

Re: Building and Running Spark on OS X

2014-10-20 Thread Jeremy Freeman
I also prefer sbt on Mac. You might want to add checking for / getting Python 2.6+ (though most modern Macs should have it), and maybe numpy as an optional dependency. I often just point people to Anaconda. — Jeremy - jeremyfreeman.net @thefreemanlab On Oct 20, 2014, a

sampling broken in PySpark with recent NumPy

2014-10-17 Thread Jeremy Freeman
Hi all, I found a significant bug in PySpark's sampling methods, due to a recent NumPy change (as of v1.9). I created a JIRA (https://issues.apache.org/jira/browse/SPARK-3995), but wanted to share here as well in case anyone hits it. Steps to reproduce are: > foo = sc.parallelize(range(1000),

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Jeremy Freeman
from the AMPLab. -- Jeremy --------- jeremy freeman, phd neuroscientist @thefreemanlab On Sep 5, 2014, at 12:23 PM, Patrick Wendell wrote: > Hey There, > > I believe this is on the roadmap for the 1.2 next release. But > Xiangrui can comment on this. > > - Patr

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Jeremy Freeman
+1 -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC4-tp8219p8254.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

RE: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Jeremy Freeman
+1 -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8211.html

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Jeremy Freeman
+1. Validated several custom analysis pipelines on a private cluster in standalone mode. Tested new PySpark support for arbitrary Hadoop input formats, works great! -- Jeremy -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread Jeremy Freeman
Hey RJ, Sorry for the delay, I'd be happy to take a look at this if you can post the code! I think splitting the largest cluster in each round is fairly common, but ideally it would be an option to do it one way or the other. -- Jeremy - jeremy freeman, phd neuroscie
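To make the "split one way or the other" option concrete, here is a hypothetical sketch of just the selection step in a divisive algorithm, with a flag toggling between splitting the largest cluster and splitting the most dispersed one (function and flag names are invented, and the toy data is 1-D):

```python
def pick_cluster_to_split(clusters, by="size"):
    """Choose which cluster a divisive algorithm splits next.

    clusters: dict mapping cluster id -> list of 1-D points (toy example)
    by: "size" picks the largest cluster; "sse" picks the most
        dispersed one (sum of squared distances to the mean).
    """
    def sse(points):
        mu = sum(points) / len(points)
        return sum((p - mu) ** 2 for p in points)

    if by == "size":
        return max(clusters, key=lambda k: len(clusters[k]))
    elif by == "sse":
        return max(clusters, key=lambda k: sse(clusters[k]))
    raise ValueError("unknown strategy: %s" % by)

# Cluster 0 is bigger; cluster 1 is far more spread out.
clusters = {0: [0.0, 0.1, 0.2], 1: [0.0, 100.0]}
```

The two strategies can disagree, as in this toy data, which is why exposing the choice as an option is worth it.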

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
@Ignacio, happy to share, here's a link to a library we've been developing (https://github.com/freeman-lab/thunder). As just a couple examples, we have pipelines that use Fourier transforms and other signal processing from scipy, and others that do massively parallel model fitting via Scikit lea
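The pattern behind that kind of massively parallel fitting is just mapping an independent fit over many records; a toy stdlib-only sketch of the shape of it (ordinary least squares per time series, with plain `map` standing in for an RDD map; no Spark or scikit-learn):

```python
def fit_line(series):
    """Ordinary least-squares slope and intercept for one time series,
    treating the sample index as the x coordinate."""
    n = len(series)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(series) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, series))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# In Spark this would be rdd.map(fit_line); the fits are independent,
# which is exactly what makes the problem embarrassingly parallel.
data = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 5.0, 5.0]]
fits = list(map(fit_line, data))
```

Each record's fit touches only that record, so scaling out is just a matter of partitioning the records.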

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Jeremy Freeman
Our experience matches Reynold's comments; pure-Python implementations of anything are generally sub-optimal compared to pure Scala implementations, or Scala versions exposed to Python (which are faster, but still slower than pure Scala). It also seems on first glance that some of the implementatio

Re: Re:How to run specific sparkSQL test with maven

2014-08-01 Thread Jeremy Freeman
With Maven you can run a particular test suite like this: mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test see the note here (under "Spark Tests in Maven"): http://spark.apache.org/docs/latest/building-with-maven.html -- View this message in context: http://apache-spark-developer

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread Jeremy Freeman
Hi RJ, that sounds like a great idea. I'd be happy to look over what you put together. -- Jeremy -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7418.html

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, Cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361) For divisive, hier