Hi Wush,
I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing
u...@spark.incubator.apache.org.
In Spark 1.3, SchemaRDD is in fact being renamed to DataFrame (see:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
)
As for a "
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
On Thu, Jan 8, 2015 at 10:53 AM, gen tang wrote:
> Thanks a lo
Chris,
Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive spark, that might look like this:
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos") // pos needs tokenize/ssplit upstream
val proc = new StanfordCoreNLP(props)
val data =
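(The snippet is cut off above; presumably data is the RDD[String] corpus, and
the tagging step maps the pipeline over it - roughly like the sketch below,
where the proc.process call is just a stand-in.)

val tagged = data.map { doc =>
  // naive version: this closure captures proc from the driver, and
  // StanfordCoreNLP isn't serializable - which is the pain point
  proc.process(doc)
}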
You can try recompiling spark with that option, and doing an sbt/sbt
publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of spark that you
just bu
Neat hack! This is cute and actually seems to work. The fact that it works
is a little surprising and somewhat unintuitive.
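Roughly, usage from a closure would look like this (a sketch - the data RDD
and the process call are stand-ins):

val tagged = data.map { doc =>
  // referencing the singleton means nothing is serialized with the closure;
  // the lazy val is built once per executor JVM on first use
  MyCoreNLP.coreNLP.process(doc)
}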
On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell wrote:
>
> object MyCoreNLP {
> @transient lazy val coreNLP = new coreNLP()
> }
>
> and then refer to it from your map/redu
Additionally - I strongly recommend using OpenBLAS over the Atlas build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
OpenBLA
This is probably not the right venue for general questions on CoreNLP - the
project website (http://nlp.stanford.edu/software/corenlp.shtml) provides
documentation and links to mailing lists/stack overflow topics.
On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wr
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/
.
-
On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal wrote:
> I believe the S
ly in their own
> clusters (load, train, save). and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
>
> much appreciated.
>
>
>
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks
>
to add?
On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh wrote:
> that works. is there a better way in spark? this seems like the most
> common feature for any machine learning work - to be able to save your
> model after training it and load it later.
>
> On Fri, Nov 7, 2014 at 2:
Plain old java serialization is one straightforward approach if you're in
java/scala.
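For the word2vec case, that might look like the following (a sketch - the
path is made up, model is your trained model, and it assumes the model class
is java-serializable, which MLlib models generally are):

import java.io._
import org.apache.spark.mllib.feature.Word2VecModel

// save
val out = new ObjectOutputStream(new FileOutputStream("/tmp/word2vec.model"))
out.writeObject(model)
out.close()

// load it back later
val in = new ObjectInputStream(new FileInputStream("/tmp/word2vec.model"))
val restored = in.readObject().asInstanceOf[Word2VecModel]
in.close()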
On Thu, Nov 6, 2014 at 11:26 PM, ll wrote:
> what is the best way to save an mllib model that you just trained and
> reload
> it in the future? specifically, i'm using the mllib word2vec model...
> thanks.
>
>
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative, but only into n buckets rather than
two... you ca
In cluster settings if you don't explicitly call sc.stop() your application
may hang. Like closing files, network connections, etc, when you're done
with them, it's a good idea to call sc.stop(), which lets the spark master
know that your application is finished consuming resources.
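A sketch of the usual pattern (conf is whatever SparkConf you already build):

val sc = new SparkContext(conf)
try {
  // ... the actual work ...
} finally {
  sc.stop()   // releases executors and tells the master the app is done
}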
On Fri, Oct 31
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows() to force
In its current implementation, the principal components are computed in
MLlib in two steps:
1) In a distributed fashion, compute the covariance matrix - the result is
a local matrix.
2) On this local matrix, compute the SVD.
The sorting comes from the SVD. If you want to get the eigenvalues out, y
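For reference, the user-facing calls that map onto those two steps (a sketch;
rows is assumed to be an RDD[Vector]):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                 // rows: RDD[Vector], assumed
val cov = mat.computeCovariance()             // step 1: distributed pass, returns a local Matrix
val pc  = mat.computePrincipalComponents(10)  // steps 1 + 2: columns come back sorted by the SVD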
I spoke with SK offline about this, it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
wor
Asynchrony is not supported directly - spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
akka on worker nodes to enable message passing, or even used spark's own
ActorSystem to do this. But, I do not recommend this, since you lose a
bunch of be
Hmm... something is fishy here.
That's a *really* small dataset for a spark job, so almost all your time
will be spent in these overheads, but still you should be able to train a
logistic regression model with the default options and 100 iterations in
<1s on a single machine.
Are you caching your
Also - what hardware are you running the cluster on? And what is the local
machine hardware?
On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks
wrote:
> How many iterations are you running? Can you provide the exact details
> about the size of the dataset? (how many data points, how many fe
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
For very small datasets the scheduling overh
The loss functions are reflected in the names of the model families: SVM is
hinge loss, LogisticRegression is logistic loss, and LinearRegression is
squared (least-squares) loss. These are used internally as arguments to
the SGD and L-BFGS optimizers.
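So picking a loss is really just picking the trainer (a sketch; training is
assumed to be an RDD[LabeledPoint]):

import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 100
val hingeModel    = SVMWithSGD.train(training, numIterations)                // hinge loss
val logisticModel = LogisticRegressionWithSGD.train(training, numIterations) // logistic loss
val squaredModel  = LinearRegressionWithSGD.train(training, numIterations)   // squared loss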
On Thu, Aug 7, 2014 at 6:31 PM, SK wrote:
> Hi,
>
> Ac
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a "combiner" in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
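A toy example of what that buys you (the pairs here are made up):

val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))
// partial sums are computed map-side within each partition, then only the
// small per-key partials are shuffled and merged - i.e. combiner behavior
val totals = pairs.reduceByKey(_ + _)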
On Thu
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.
Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation and spark will do everything
cluster-parallel, and then shuffle the (small) result set and merge
appro
Computing the variance is similar to this example, you just need to keep
around the sum of squares as well.
The formula for variance is (sumsq/n) - (sum/n)^2
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
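Ignoring that overflow caveat, a naive single-pass version over an
RDD[Double] looks roughly like:

val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val (sum, sumsq, n) = xs.map(x => (x, x * x, 1L))
  .reduce { case ((s1, q1, c1), (s2, q2, c2)) => (s1 + s2, q1 + q2, c1 + c2) }
val mean = sum / n
val variance = sumsq / n - mean * mean   // (sumsq/n) - (sum/n)^2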
Can you share the dataset via a gist or something and we can take a look at
what's going on?
On Fri, Jul 25, 2014 at 10:51 AM, SK wrote:
> yes, the output is continuous. So I used a threshold to get binary labels.
> If prediction < threshold, then class is 0 else 1. I use this binary label
> t
Try sc.getExecutorStorageStatus().length
SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.
https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.Sp
To be clear - each of the RDDs is still a distributed dataset and each of
the individual SVM models will be trained in parallel across the cluster.
Sean's suggestion effectively has you submitting multiple spark jobs
simultaneously, which, depending on your cluster configuration and the size
of you
There is a method in org.apache.spark.mllib.util.MLUtils called "kFold"
which will automatically partition your dataset for you into k train/test
splits at which point you can build k different models and aggregate the
results.
For example (a very rough sketch - assuming I want to do 10-fold cross
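(Filling in that sketch: data is assumed to be an RDD[LabeledPoint], and the
SVM learner plus the accuracy metric are just placeholders.)

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.SVMWithSGD

val folds = MLUtils.kFold(data, 10, 42)   // 10 folds, seed 42 -> Array of (train, test) RDD pairs
val accuracies = folds.map { case (train, test) =>
  val model = SVMWithSGD.train(train, 100)
  test.filter(p => model.predict(p.features) == p.label).count().toDouble / test.count()
}
val cvAccuracy = accuracies.sum / accuracies.length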
Also - you could consider caching your data after the first split (before
the first filter), this will prevent you from retrieving the data from s3
twice.
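Roughly (the S3 path and the two filters are made up, just to show where the
cache() goes):

val parsed = sc.textFile("s3n://bucket/path")   // hypothetical
  .map(_.split("\t"))                           // the first split
  .cache()                                      // materialized once; both filters reuse it
val first  = parsed.filter(_(0) == "a")         // hypothetical first filter
val second = parsed.filter(_(0) == "b")         // hypothetical second filter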
On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng wrote:
> Your data source is S3 and data is used twice. m1.large does not have very
> good ne
I use SBT, create an assembly, and then add the assembly jars when I create
my spark context. The main executor I run with something like "java -cp ...
MyDriver".
That said - as of spark 1.0 the preferred way to run spark applications is
via spark-submit -
http://spark.apache.org/docs/latest/submi
Larry,
I don't see any reference to Spark in particular there.
Additionally, the benchmark only scales up to datasets that are roughly
10gb (though I realize they've picked some fairly computationally intensive
tasks), and they don't present their results on more than 4 nodes. This can
hide thing
This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
MyRecord("USA", "Bob", 55, 108), MyRecord
I wouldn't be surprised if the default BLAS that ships with jblas is not
optimized for your target platform. Breeze (which we call into) uses jblas
and falls back to netlib if jblas can't be loaded. I'd recommend using
jblas if you can.
You probably want to compile a native BLAS library sp
I should point out that if you don't want to take a polyglot approach to
languages and reside solely in the JVM, then you can just use plain old
java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala and load them up in another process and call the relevant
.predic
rker is about what
we expect our typical customers would tolerate and I don't think that it's
unreasonable for shallow trees.
On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:
> What kind of data are you training on? These effects are *highly* data
> dependent, and
intuitive sense IMO
> because a decision tree is a non-parametric model, and the expressibility
> of a tree depends on the number of nodes.
>
> With a huge amount of data (millions or even billions of rows), we found
> that the depth of 10 is simply not adequate to build high-
rests.
>
> There are some papers that mix boosting-like technique with bootstrap
> averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf) where you could
> potentially use shallow trees to build boosted learners, but then average
> the results of many boosted learners.
>
>
> On
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that pe
A bandaid might be to set up ssh tunneling between slaves and master - has
anyone tried deploying this way? I would expect it to pretty negatively
impact performance on communication-heavy jobs.
On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black wrote:
> Only if you trust the provider networks and
> Hi, Evan,
>
> Just noticed this thread, do you mind sharing more details regarding
> algorithms targetted at hyperparameter tuning/model selection? or a link
> to dev git repo for that work.
>
> thanks,
> yi
>
>
> On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks w
at the Github
> code is linked to Spark 0.8; will it not work with 0.9 (which is what I
> have set up) or higher versions?
>
>
> On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User
> List] <[hidden email]
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!
- Evan
On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote:
>
Hi hyqgod,
This is probably a better question for the spark user's list than the dev
list (cc'ing user and bcc'ing dev on this reply).
To answer your question, though:
Amazon's Public Datasets Page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with spark because
the