Hi Wush,
I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing
u...@spark.incubator.apache.org.
In Spark 1.3, SchemaRDD is in fact being renamed to DataFrame (see:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
)
As for a "
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
On Thu, Jan 8, 2015 at 10:53 AM, gen tang wrote:
> Thanks a lo
Chris,
Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive spark, that might look like this:
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos") // pos needs tokenize/ssplit upstream
val proc = new StanfordCoreNLP(props)
val data =
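(The snippet is cut off above; presumably data is the RDD[String] corpus, and
the tagging step maps the pipeline over it - roughly like the sketch below,
where the proc.process call is just a stand-in.)

val tagged = data.map { doc =>
  // naive version: this closure captures proc from the driver, and
  // StanfordCoreNLP isn't serializable - which is the pain point
  proc.process(doc)
}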
You can try recompiling spark with that option, and doing an sbt/sbt
publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of spark that you
just bu
Neat hack! This is cute and actually seems to work. The fact that it works
is a little surprising and somewhat unintuitive.
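Roughly, usage from a closure would look like this (a sketch - the data RDD
and the process call are stand-ins):

val tagged = data.map { doc =>
  // referencing the singleton means nothing is serialized with the closure;
  // the lazy val is built once per executor JVM on first use
  MyCoreNLP.coreNLP.process(doc)
}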
On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell wrote:
>
> object MyCoreNLP {
> @transient lazy val coreNLP = new coreNLP()
> }
>
> and then refer to it from your map/redu
Additionally - I strongly recommend using OpenBLAS over the Atlas build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
OpenBLA
This is probably not the right venue for general questions on CoreNLP - the
project website (http://nlp.stanford.edu/software/corenlp.shtml) provides
documentation and links to mailing lists/stack overflow topics.
On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wr
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/
.
-
On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal wrote:
> I believe the S
ly in their own
> clusters (load, train, save). and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
>
> much appreciated.
>
>
>
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks
>
to add?
On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh wrote:
> that works. is there a better way in spark? this seems like the most
> common feature for any machine learning work - to be able to save your
> model after training it and load it later.
>
> On Fri, Nov 7, 2014 at 2:
Plain old java serialization is one straightforward approach if you're in
java/scala.
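For the word2vec case, that might look like the following (a sketch - the
path is made up, model is your trained model, and it assumes the model class
is java-serializable, which MLlib models generally are):

import java.io._
import org.apache.spark.mllib.feature.Word2VecModel

// save
val out = new ObjectOutputStream(new FileOutputStream("/tmp/word2vec.model"))
out.writeObject(model)
out.close()

// load it back later
val in = new ObjectInputStream(new FileInputStream("/tmp/word2vec.model"))
val restored = in.readObject().asInstanceOf[Word2VecModel]
in.close()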
On Thu, Nov 6, 2014 at 11:26 PM, ll wrote:
> what is the best way to save an mllib model that you just trained and
> reload
> it in the future? specifically, i'm using the mllib word2vec model...
> thanks.
>
>
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative, but only into n buckets rather than
two... you ca
In cluster settings if you don't explicitly call sc.stop() your application
may hang. Like closing files, network connections, etc, when you're done
with them, it's a good idea to call sc.stop(), which lets the spark master
know that your application is finished consuming resources.
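A sketch of the usual pattern (conf is whatever SparkConf you already build):

val sc = new SparkContext(conf)
try {
  // ... the actual work ...
} finally {
  sc.stop()   // releases executors and tells the master the app is done
}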
On Fri, Oct 31
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows() to force
In its current implementation, the principal components are computed in
MLlib in two steps:
1) In a distributed fashion, compute the covariance matrix - the result is
a local matrix.
2) On this local matrix, compute the SVD.
The sorting comes from the SVD. If you want to get the eigenvalues out, y
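For reference, the user-facing calls that map onto those two steps (a sketch;
rows is assumed to be an RDD[Vector]):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                 // rows: RDD[Vector], assumed
val cov = mat.computeCovariance()             // step 1: distributed pass, returns a local Matrix
val pc  = mat.computePrincipalComponents(10)  // steps 1 + 2: columns come back sorted by the SVD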
I spoke with SK offline about this, it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
wor
Asynchrony is not supported directly - spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
akka on worker nodes to enable message passing, or even used spark's own
ActorSystem to do this. But, I do not recommend this, since you lose a
bunch of be
Hmm... something is fishy here.
That's a *really* small dataset for a spark job, so almost all your time
will be spent in these overheads, but still you should be able to train a
logistic regression model with the default options and 100 iterations in
<1s on a single machine.
Are you caching your
Also - what hardware are you running the cluster on? And what is the local
machine hardware?
On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks
wrote:
> How many iterations are you running? Can you provide the exact details
> about the size of the dataset? (how many data points, how many fe
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
For very small datasets the scheduling overh
The loss functions are reflected in the names of the model families: SVM is
hinge loss, LogisticRegression is logistic loss, and LinearRegression is
squared (least-squares) loss. These are used internally as arguments to
the SGD and L-BFGS optimizers.
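So picking a loss is really just picking the trainer (a sketch; training is
assumed to be an RDD[LabeledPoint]):

import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 100
val hingeModel    = SVMWithSGD.train(training, numIterations)                // hinge loss
val logisticModel = LogisticRegressionWithSGD.train(training, numIterations) // logistic loss
val squaredModel  = LinearRegressionWithSGD.train(training, numIterations)   // squared loss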
On Thu, Aug 7, 2014 at 6:31 PM, SK wrote:
> Hi,
>
> Ac
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a "combiner" in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
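A toy example of what that buys you (the pairs here are made up):

val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))
// partial sums are computed map-side within each partition, then only the
// small per-key partials are shuffled and merged - i.e. combiner behavior
val totals = pairs.reduceByKey(_ + _)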
On Thu
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.
Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation and spark will do everything
cluster-parallel, and then shuffle the (small) result set and merge
appro
Computing the variance is similar to this example, you just need to keep
around the sum of squares as well.
The formula for variance is (sumsq/n) - (sum/n)^2
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
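Ignoring that overflow caveat, a naive single-pass version over an
RDD[Double] looks roughly like:

val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val (sum, sumsq, n) = xs.map(x => (x, x * x, 1L))
  .reduce { case ((s1, q1, c1), (s2, q2, c2)) => (s1 + s2, q1 + q2, c1 + c2) }
val mean = sum / n
val variance = sumsq / n - mean * mean   // (sumsq/n) - (sum/n)^2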
Can you share the dataset via a gist or something and we can take a look at
what's going on?
On Fri, Jul 25, 2014 at 10:51 AM, SK wrote:
> yes, the output is continuous. So I used a threshold to get binary labels.
> If prediction < threshold, then class is 0 else 1. I use this binary label
> t
Try sc.getExecutorStorageStatus().length
SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.
https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.Sp
To be clear - each of the RDDs is still a distributed dataset and each of
the individual SVM models will be trained in parallel across the cluster.
Sean's suggestion effectively has you submitting multiple spark jobs
simultaneously, which, depending on your cluster configuration and the size
of you
There is a method in org.apache.spark.mllib.util.MLUtils called "kFold"
which will automatically partition your dataset for you into k train/test
splits at which point you can build k different models and aggregate the
results.
For example (a very rough sketch - assuming I want to do 10-fold cross
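(Filling in that sketch: data is assumed to be an RDD[LabeledPoint], and the
SVM learner plus the accuracy metric are just placeholders.)

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.SVMWithSGD

val folds = MLUtils.kFold(data, 10, 42)   // 10 folds, seed 42 -> Array of (train, test) RDD pairs
val accuracies = folds.map { case (train, test) =>
  val model = SVMWithSGD.train(train, 100)
  test.filter(p => model.predict(p.features) == p.label).count().toDouble / test.count()
}
val cvAccuracy = accuracies.sum / accuracies.length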
Also - you could consider caching your data after the first split (before
the first filter), this will prevent you from retrieving the data from s3
twice.
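Roughly (the S3 path and the two filters are made up, just to show where the
cache() goes):

val parsed = sc.textFile("s3n://bucket/path")   // hypothetical
  .map(_.split("\t"))                           // the first split
  .cache()                                      // materialized once; both filters reuse it
val first  = parsed.filter(_(0) == "a")         // hypothetical first filter
val second = parsed.filter(_(0) == "b")         // hypothetical second filter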
On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng wrote:
> Your data source is S3 and data is used twice. m1.large does not have very
> good ne
I use SBT, create an assembly, and then add the assembly jars when I create
my spark context. The main executor I run with something like "java -cp ...
MyDriver".
That said - as of spark 1.0 the preferred way to run spark applications is
via spark-submit -
http://spark.apache.org/docs/latest/submi
Larry,
I don't see any reference to Spark in particular there.
Additionally, the benchmark only scales up to datasets that are roughly
10gb (though I realize they've picked some fairly computationally intensive
tasks), and they don't present their results on more than 4 nodes. This can
hide thing
This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
MyRecord("USA", "Bob", 55, 108), MyRecord
I wouldn't be surprised if the default BLAS that ships with jblas is not
optimized for your target platform. Breeze (which we call into) uses jblas
and falls back to netlib if jblas can't be loaded. I'd recommend using
jblas if you can.
You probably want to compile a native BLAS library sp
I should point out that if you don't want to take a polyglot approach to
languages and reside solely in the JVM, then you can just use plain old
java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala and load them up in another process and call the relevant
.predic
rker is about what
we expect our typical customers would tolerate and I don't think that it's
unreasonable for shallow trees.
On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:
> What kind of data are you training on? These effects are *highly* data
> dependent, and
intuitive sense IMO
> because a decision tree is a non-parametric model, and the expressibility
> of a tree depends on the number of nodes.
>
> With a huge amount of data (millions or even billions of rows), we found
> that the depth of 10 is simply not adequate to build high-
rests.
>
> There are some papers that mix boosting-like technique with bootstrap
> averaging (e.g. http://arxiv.org/pdf/1103.2068.pdf) where you could
> potentially use shallow trees to build boosted learners, but then average
> the results of many boosted learners.
>
>
> On
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that pe
A bandaid might be to set up ssh tunneling between slaves and master - has
anyone tried deploying this way? I would expect it to pretty negatively
impact performance on communication-heavy jobs.
On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black wrote:
> Only if you trust the provider networks and
> Hi, Evan,
>
> Just noticed this thread, do you mind sharing more details regarding
> algorithms targetted at hyperparameter tuning/model selection? or a link
> to dev git repo for that work.
>
> thanks,
> yi
>
>
> On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks w
at the Github
> code is linked to Spark 0.8; will it not work with 0.9 (which is what I
> have set up) or higher versions?
>
>
> On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User
> List] <[hidden email]
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!
- Evan
On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote:
>
Hi hyqgod,
This is probably a better question for the spark user's list than the dev
list (cc'ing user and bcc'ing dev on this reply).
To answer your question, though:
Amazon's Public Datasets Page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with spark because
the