Yep, done. https://issues.apache.org/jira/browse/SPARK-17508
On Mon, Sep 12, 2016 at 9:06 AM Nick Pentreath
wrote:
> Could you create a JIRA ticket for it?
>
> https://issues.apache.org/jira/browse/SPARK
>
> On Thu, 8 Sep 2016 at 07:50 evanzamir wrote:
>
>> When I am trying to use LinearRegress
Yes, it's on a hold out segment from the data set being fitted.
On Wed, Sep 7, 2016 at 1:02 AM Sean Owen wrote:
> Yes, should be.
> It's also not necessarily nonnegative if you evaluate R^2 on a
> different data set than you fit it to. Is that the case?
>
> On Tue, Sep
I am using the default setting for *fitIntercept*, which *should*
be TRUE, right?
On Tue, Sep 6, 2016 at 1:38 PM Sean Owen wrote:
> Are you not fitting an intercept / regressing through the origin? with
> that constraint it's no longer true that R^2 is necessarily
> nonnegative. It basica
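For reference, a minimal spark.ml sketch that makes the intercept setting explicit rather than relying on the default (the DataFrame and column names are illustrative, not from this thread):

import org.apache.spark.ml.regression.LinearRegression

// training is assumed to be a DataFrame with "features" and "label" columns.
val lr = new LinearRegression()
  .setFitIntercept(true)  // the default, but stating it removes any doubt
  .setMaxIter(100)
val model = lr.fit(training)
println(s"intercept = ${model.intercept}")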
Hi folks,
Just a friendly message that we have added Python support to the REST
Spark Job Server project. If you are a Python user looking for a
RESTful way to manage your Spark jobs, please come have a look at our
project!
https://github.com/spark-jobserver/spark-jobserver
-Evan
Thanks, but I should have been more clear that I'm trying to do this in
PySpark, not Scala. Using an example I found on SO, I was able to implement
a Pipeline step in Python, but it seems it is more difficult (perhaps
currently impossible) to make it persist to disk (I tried implementing
_to_java m
at Mark is running a slightly-modified version of stock Spark.
>>> (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly-modified versions of Spark being deployed to production to
>
's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> work around outstanding PRs and JIRAs.
>>
>> this may not be what peop
simultaneous Tasks, but that doesn't really tell you anything about how
>> many Jobs are or can be concurrently tracked by the DAGScheduler, which will
>> be apportioning the Tasks from those concurrent Jobs across the available
>> Executor cores.
>>
>> On Thu, M
achieving 700 queries per second in Spark:
http://velvia.github.io/Spark-Concurrent-Fast-Queries/
Would love your feedback.
thanks,
Evan
the
categorical variables in a DataFrame might be a welcome addition.
- Evan
On Thu, Mar 5, 2015 at 8:43 PM, Wush Wu wrote:
> Dear all,
>
> I am a new spark user from R.
>
> After exploring the schemaRDD, I notice that it is similar to data.frame.
> Is there a feature like
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
On Thu, Jan 8, 2015 at 10:53 AM, gen tang wrote:
> Thanks a lo
va library which has
non-serializable objects will face this issue.
HTH,
Evan
On Tue, Nov 25, 2014 at 8:05 AM, Christopher Manning
wrote:
> I’m not (yet!) an active Spark user, but saw this thread on twitter … and
> am involved with Stanford CoreNLP.
>
> Could someone explain how t
If you only mark it as transient, then the object won't be serialized, and on
the worker the field will be null. When the worker goes to use it, you get an
NPE.
Marking it lazy defers initialization to first use. If that use happens to be
after serialization time (e.g. on the worker), then the
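A minimal sketch of that pattern, using a made-up Parser class in place of the real non-serializable library object:

// Hypothetical expensive, non-serializable resource.
class Parser { def parse(s: String): String = s.toLowerCase }

class Pipeline extends Serializable {
  // transient: not shipped with the closure; lazy: rebuilt on first use in each JVM.
  @transient lazy val parser = new Parser
  def run(line: String): String = parser.parse(line)
}

// val pipe = new Pipeline
// rdd.map(line => pipe.run(line))  // each worker JVM builds its own Parser lazily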
You can try recompiling spark with that option, and doing an sbt/sbt
publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of spark that you
just bu
fer to it from your map/reduce/map partitions or that it should
> be fine (presuming its thread safe), it will only be initialized once per
> classloader per jvm
>
> On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks
> wrote:
>
>> We have gotten this to work, but it requires instant
Additionally - I strongly recommend using OpenBLAS over the Atlas build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
OpenBLA
ct MyCoreNLP {
>> @transient lazy val coreNLP = new coreNLP()
>> }
>>
>> and then refer to it from your map/reduce/map partitions or that it
>> should be fine (presuming its thread safe), it will only be initialized
>> once per classloader per jvm
>>
>>
We have gotten this to work, but it requires instantiating the CoreNLP object
on the worker side. Because of the initialization time it makes a lot of sense
to do this inside of a .mapPartitions instead of a .map, for example.
As an aside, if you're using it from Scala, have a look at sistanlp,
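A rough sketch of the mapPartitions version, with a stand-in class instead of the actual CoreNLP setup:

// Stand-in for an expensive-to-construct NLP pipeline (hypothetical).
class ExpensiveNlpPipeline {
  def annotate(doc: String): Int = doc.split("\\s+").length
}

// docs: RDD[String]. Pay the construction cost once per partition, not once per record.
val annotated = docs.mapPartitions { iter =>
  val pipeline = new ExpensiveNlpPipeline()
  iter.map(pipeline.annotate)
}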
I would expect an SQL query on c would fail because c would not be known in
the schema of the older Parquet file.
What I'd be very interested in is how to add a new column as an incremental
new parquet file, and be able to somehow join the existing and new file, in
an efficient way. IE, somehow
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/.
On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal wrote:
> I believe the S
ly in their own
> clusters (load, train, save). and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
>
> much appreciated.
>
>
>
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks
>
to add?
On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh wrote:
> that works. is there a better way in spark? this seems like the most
> common feature for any machine learning work - to be able to save your
> model after training it and load it later.
>
> On Fri, Nov 7, 2014 at 2:
Plain old java serialization is one straightforward approach if you're in
java/scala.
On Thu, Nov 6, 2014 at 11:26 PM, ll wrote:
> what is the best way to save an mllib model that you just trained and
> reload
> it in the future? specifically, i'm using the mllib word2vec model...
> thanks.
>
>
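A minimal sketch of that approach; it works for any model object whose class is java-serializable (check this for the specific MLlib model you're using):

import java.io._

def saveModel(model: AnyRef, path: String): Unit = {
  val oos = new ObjectOutputStream(new FileOutputStream(path))
  try oos.writeObject(model) finally oos.close()
}

def loadModel[T](path: String): T = {
  val ois = new ObjectInputStream(new FileInputStream(path))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}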
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative but only into n buckets, rather than
two... you ca
In cluster settings if you don't explicitly call sc.stop() your application
may hang. Just as you close files, network connections, etc. when you're done
with them, it's a good idea to call sc.stop(), which lets the spark master
know that your application is finished consuming resources.
On Fri, Oct 31
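A common shape for this (sketch):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
try {
  // ... job logic ...
} finally {
  sc.stop()  // releases executors so the master can reclaim the resources
}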
/ rebuild the RDD (it tries to only
rebuild the missing part, but sometimes it must rebuild everything).
Job server can help with 1 or 2, 2 in particular. If you have any
questions about job server, feel free to ask at the spark-jobserver
google group. I am the maintainer.
-Evan
On Thu, Oct 23
speed up your program.
- Evan
> On Oct 20, 2014, at 3:54 AM, npomfret wrote:
>
> I'm getting the same warning on my mac. Accompanied by what appears to be
> pretty low CPU usage
> (http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777.
How many files do you have and how big is each JSON object?
Spark works better with a few big files vs many smaller ones. So you could try
cat'ing your files together and rerunning the same experiment.
- Evan
> On Oct 18, 2014, at 12:07 PM,
> wrote:
>
> Hi,
>
>
ggestion would be to backport 'spark.localExecution.enabled' to
the 1.0 line. Thanks for all your help!
Evan
On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu wrote:
> This is some kind of implementation details, so not documented :-(
>
> If you think this is a blocker for you, you
Thank you! I was looking for a config variable to that end, but I was
looking in Spark 1.0.2 documentation, since that was the version I had
the problem with. Is this behavior documented in 1.0.2's documentation?
Evan
On 10/09/2014 04:12 PM, Davies Liu wrote:
When you call rdd.take
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows() to force
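A sketch of that isolation step (the multiply is a hypothetical placeholder; names are illustrative):

// Force the multiplied rows to materialize before timing KMeans, so the multiply
// cost and the clustering cost can be measured separately.
val multiplied = rows.map(v => multiplyStep(v)).cache()  // multiplyStep: hypothetical, returns a Vector
multiplied.count()  // triggers the multiply and populates the cache

val t0 = System.nanoTime()
val model = org.apache.spark.mllib.clustering.KMeans.train(multiplied, 10, 20)
println(s"KMeans alone: ${(System.nanoTime() - t0) / 1e9} s")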
, you
can simply run step 1 yourself on your RowMatrix via the (experimental)
computeCovariance() method, and then run SVD on the result using a library
like breeze.
- Evan
On Tue, Sep 23, 2014 at 12:49 PM, st553 wrote:
> sowen wrote
> > it seems that the singular values from the S
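A rough sketch of those two steps (assuming `mat` is a RowMatrix; the breeze conversion works because both sides store values column-major):

import breeze.linalg.{svd, DenseMatrix => BDM}

// Step 1: the (experimental) covariance computation on the distributed RowMatrix.
val cov = mat.computeCovariance()  // a small, d x d local Matrix

// Step 2: local SVD of the covariance matrix with breeze.
val covBreeze = new BDM(cov.numRows, cov.numCols, cov.toArray)
val svd.SVD(u, s, vt) = svd(covBreeze)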
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really really slow if you're gonna move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen wrote:
> Base 64 i
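A minimal sketch of turning Kryo on via SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Optionally register the classes you shuffle most (via "spark.kryo.registrator")
// so Kryo doesn't have to write full class names with every object.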
Hi Abel,
Pretty interesting. May I ask how big is your point CSV dataset?
It seems you are relying on searching through the FeatureCollection of
polygons for which one intersects your point. This is going to be
extremely slow. I highly recommend using a SpatialIndex, such as the
many that exis
Sweet, that's probably it. Too bad it didn't seem to make 1.1?
On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust
wrote:
> The unknown slowdown might be addressed by
> https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
>
> On Sun, Sep 14, 2014
SPARK-1671 looks really promising.
Note that even right now, you don't need to un-cache the existing
table. You can do something like this:
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
I spoke with SK offline about this; it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
wor
Asynchrony is not supported directly - spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
akka on worker nodes to enable message passing, or even used spark's own
ActorSystem to do this. But, I do not recommend this, since you lose a
bunch of be
Hmm... something is fishy here.
That's a *really* small dataset for a spark job, so almost all your time
will be spent in these overheads, but still you should be able to train a
logistic regression model with the default options and 100 iterations in
<1s on a single machine.
Are you caching your
Also - what hardware are you running the cluster on? And what is the local
machine hardware?
On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks
wrote:
> How many iterations are you running? Can you provide the exact details
> about the size of the dataset? (how many data points, how many fe
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
For very small datasets the scheduling overh
Filed SPARK-3295.
On Mon, Aug 25, 2014 at 12:49 PM, Michael Armbrust
wrote:
>> SO I tried the above (why doesn't union or ++ have the same behavior
>> btw?)
>
>
> I don't think there is a good reason for this. I'd open a JIRA.
>
>>
>> and it works, but is slow because the original Rdds are not
>
There's no way to avoid a shuffle due to the first and last elements
of each partition needing to be computed with the others, but I wonder
if there is a way to do a minimal shuffle.
On Thu, Aug 21, 2014 at 6:13 PM, cjwang wrote:
> One way is to do zipWithIndex on the RDD. Then use the index as
-jobserver
The git commit history is still there, but unfortunately the pull
requests don't migrate over. I'll be contacting each of you with
open PRs to move them over to the new location.
Happy Hacking!
Evan (@velvia)
Kelvin (@kelvinchu)
Daniel
And it worked earlier with a non-Parquet directory.
On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
> The underFS is HDFS btw.
>
> On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
>> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
>>
The underFS is HDFS btw.
On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
>
> scala> val gdeltT =
> sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
> 14/08/21 19:07:14
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
scala> val gdeltT =
sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
14/08/21 19:07:14 INFO :
initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005,
Configuration: core-default.xml, core-site.xml
I just put up a repo with a write-up on how to import the GDELT public
dataset into Spark SQL and play around. Has a lot of notes on
different import methods and observations about Spark SQL. Feel free
to have a look and comment.
http://www.github.com/velvia/spark-sql-gdelt
014 at 12:17 AM, Michael Armbrust
wrote:
> I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must
> have the same schema.
>
>
> On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan wrote:
>>
>> Is it possible to merge two cached Spark SQL tables into a sing
cached too.
thanks,
Evan
That might not be enough. Reflection is used to determine what the
fields are, thus your class might actually need to have members
corresponding to the fields in the table.
I heard that a more generic method of inputting stuff is coming.
On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer wrote:
>
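A sketch of what that looks like with the Spark 1.x SQL API, where the case class members become the table's columns (file path and fields are illustrative):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion for RDDs of case classes

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")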
edious, or to construct Parquet files manually (also
tedious).
thanks,
Evan
The loss functions are represented in the various names of the model
families. SVM is hinge loss, LogisticRegression is logistic loss,
LinearRegression is squared (least-squares) loss. These are used internally as arguments to
the SGD and L-BFGS optimizers.
On Thu, Aug 7, 2014 at 6:31 PM, SK wrote:
> Hi,
>
> Ac
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a "combiner" in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
On Thu
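A tiny sketch of that behavior (word-count style; `words` is an RDD[String]):

// reduceByKey needs an associative, commutative function; Spark applies it map-side
// first (like a MapReduce combiner), so only partial sums get shuffled.
val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)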
Try s3n://
> On Aug 6, 2014, at 12:22 AM, sparkuser2345 wrote:
>
> I'm getting the same "Input path does not exist" error also after setting the
> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
> the format "s3:///test_data.txt" for the input file.
>
>
>
> --
>
ficient way to do it, but I
think it's a nice example of how to think about using spark at a higher
level of abstraction.
- Evan
On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen wrote:
> Here's the more functional programming-friendly take on the
> computation (but
Computing the variance is similar to this example, you just need to keep
around the sum of squares as well.
The formula for variance is (sumsq/n) - (sum/n)^2
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
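A sketch of the plain single-pass version (ignoring the numerical-stability tricks mentioned above; `data` is an RDD[Double]):

// Keep n, sum, and sum of squares in one pass; then var = E[x^2] - E[x]^2.
val (n, sum, sumSq) = data.aggregate((0L, 0.0, 0.0))(
  (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

val mean = sum / n
val variance = sumSq / n - mean * mean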
Can you share the dataset via a gist or something and we can take a look at
what's going on?
On Fri, Jul 25, 2014 at 10:51 AM, SK wrote:
> yes, the output is continuous. So I used a threshold to get binary labels.
> If prediction < threshold, then class is 0 else 1. I use this binary label
> t
Try sc.getExecutorStorageStatus().length
SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.
https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.Sp
ze
of your dataset, may or may not be a good idea.
There are some tricks you can do to make training multiple models on the
same dataset faster, which we're hoping to expose to users in an upcoming
release.
- Evan
On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen wrote:
> If you call .par on da
There is a method in org.apache.spark.mllib.util.MLUtils called "kFold"
which will automatically partition your dataset for you into k train/test
splits at which point you can build k different models and aggregate the
results.
For example (a very rough sketch - assuming I want to do 10-fold cross
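A rough sketch along those lines (hypothetical train/evaluate helpers; assumes the kFold(rdd, numFolds, seed) signature in MLUtils):

import org.apache.spark.mllib.util.MLUtils

val folds = MLUtils.kFold(data, 10, 42)  // Array of (training, test) RDD pairs
val errors = folds.map { case (training, test) =>
  val model = trainMyModel(training)  // hypothetical training function
  evaluate(model, test)               // hypothetical evaluation returning a Double
}
println(s"mean CV error = ${errors.sum / errors.length}")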
Oh, and the MovieLens one is userid::movieid::rating
- Evan
> On Jun 22, 2014, at 3:35 PM, Justin Yip wrote:
>
> Hello,
>
> I am looking into a couple of MLLib data files in
> https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any
> explanation
but can double (or more)
storage requirements for dense data.
- Evan
> On Jun 22, 2014, at 3:35 PM, Justin Yip wrote:
>
> Hello,
>
> I am looking into a couple of MLLib data files in
> https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any
>
Also - you could consider caching your data after the first split (before
the first filter), this will prevent you from retrieving the data from s3
twice.
On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng wrote:
> Your data source is S3 and data is used twice. m1.large does not have very
> good ne
I use SBT, create an assembly, and then add the assembly jars when I create
my spark context. I run the main program with something like "java -cp ...
MyDriver".
That said - as of spark 1.0 the preferred way to run spark applications is
via spark-submit -
http://spark.apache.org/docs/latest/submi
r algorithm fits in a BSP
programming model, with Spark you can achieve performance that is
comparable to a tuned C++/MPI codebase by leveraging the right libraries
locally and thinking carefully about what and when you have to communicate.
- Evan
On Thu, Jun 19, 2014 at 8:48 AM, ldmtwo wrot
This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
MyRecord("USA", "Bob", 55, 108), MyRecord
I wouldn't be surprised if the default BLAS that ships with jblas is not
optimized for your target platform. Breeze (which we call into) uses
jblas and falls back to netlib if jblas can't be loaded. I'd recommend
using jblas if you can.
You probably want to compile a native BLAS library sp
I should point out that if you don't want to take a polyglot approach to
languages and reside solely in the JVM, then you can just use plain old
java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala and load them up in another process and call the relevant
.predic
ou
wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
necessary for heap space for the worker to compute/store the histograms,
and I guess 2x that on the master to do the reduce.
Again 2GB per worker is pretty tight, because there are overheads of just
starting the jvm, launc
intuitive sense IMO
> because a decision tree is a non-parametric model, and the expressibility
> of a tree depends on the number of nodes.
>
> With a huge amount of data (millions or even billions of rows), we found
> that the depth of 10 is simply not adequate to build high-
reduction technique, but I'd be a little surprised if a bunch of hugely
deep trees don't overfit to training data. I guess I view limiting tree
depth as an analogue to regularization in linear models.
On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung
wrote:
> Evan,
>
> I actual
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that pe
A bandaid might be to set up ssh tunneling between slaves and master - has
anyone tried deploying this way? I would expect it to pretty negatively
impact performance on communication-heavy jobs.
On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black wrote:
> Only if you trust the provider networks and
ate branch with TPE (as well as random and grid
search) integrated with MLI, but the code is research quality right now and
not extremely general.
We're actively working on bringing these things up to snuff for a proper
open source release.
On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou wrote:
Have a look at MultipleOutputs in the hadoop API. Spark can read and write to
arbitrary hadoop formats.
> On Apr 4, 2014, at 6:01 AM, dmpour23 wrote:
>
> Hi all,
> Say I have an input file which I would like to partition using
> HashPartitioner k times.
>
> Calling rdd.saveAsTextFile(""hdfs:
Targeting 0.9.0 should work out of the box (just a change to the build.sbt)
- I'll push some changes I've been sitting on to the public repo in the
next couple of days.
On Wed, Apr 2, 2014 at 4:05 AM, Krakna H wrote:
> Thanks for the update Evan! In terms of using MLI, I see th
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!
- Evan
On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote
http://svonava.com/post/62186512058/datasets-released-by-google
- Evan
On Tue, Feb 25, 2014 at 6:33 PM, 黄远强 wrote:
> Hi all:
> I am a freshman in the Spark community. I dream of being an expert in the field
> of big data. But I have no idea where to start after I have gone through
> the publis