Yep, done. https://issues.apache.org/jira/browse/SPARK-17508
On Mon, Sep 12, 2016 at 9:06 AM Nick Pentreath
wrote:
> Could you create a JIRA ticket for it?
>
> https://issues.apache.org/jira/browse/SPARK
>
> On Thu, 8 Sep 2016 at 07:50 evanzamir wrote:
>
>> When I am trying to use LinearRegress
Yes, it's on a hold out segment from the data set being fitted.
On Wed, Sep 7, 2016 at 1:02 AM Sean Owen wrote:
> Yes, should be.
> It's also not necessarily nonnegative if you evaluate R^2 on a
> different data set than you fit it to. Is that the case?
>
> On Tue, Sep
I am using the default setting for *fitIntercept*, which *should*
be TRUE, right?
On Tue, Sep 6, 2016 at 1:38 PM Sean Owen wrote:
> Are you not fitting an intercept / regressing through the origin? with
> that constraint it's no longer true that R^2 is necessarily
> nonnegative. It basica
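For reference, a minimal spark.ml sketch that makes the intercept setting explicit rather than relying on the default (the DataFrame and column names are illustrative, not from this thread):

import org.apache.spark.ml.regression.LinearRegression

// training is assumed to be a DataFrame with "features" and "label" columns.
val lr = new LinearRegression()
  .setFitIntercept(true)  // the default, but stating it removes any doubt
  .setMaxIter(100)
val model = lr.fit(training)
println(s"intercept = ${model.intercept}")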
Hi folks,
Just a friendly message that we have added Python support to the REST
Spark Job Server project. If you are a Python user looking for a
RESTful way to manage your Spark jobs, please come have a look at our
project!
https://github.com/spark-jobserver/spark-jobserver
-Evan
Thanks, but I should have been more clear that I'm trying to do this in
PySpark, not Scala. Using an example I found on SO, I was able to implement
a Pipeline step in Python, but it seems it is more difficult (perhaps
currently impossible) to make it persist to disk (I tried implementing
_to_java m
at Mark is running a slightly-modified version of stock Spark.
>>> (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly-modified versions of Spark being deployed to production to
>
's mentioned this in prior posts, as well.)
>>
>> And I have to say that I'm, personally, seeing more and more
>> slightly-modified versions of Spark being deployed to production to
>> work around outstanding PRs and JIRAs.
>>
>> this may not be what peop
simultaneous Tasks, but that doesn't really tell you anything about how
>> many Jobs are or can be concurrently tracked by the DAGScheduler, which will
>> be apportioning the Tasks from those concurrent Jobs across the available
>> Executor cores.
>>
>> On Thu, M
achieving 700 queries per second in Spark:
http://velvia.github.io/Spark-Concurrent-Fast-Queries/
Would love your feedback.
thanks,
Evan
the
categorical variables in a DataFrame might be a welcome addition.
- Evan
On Thu, Mar 5, 2015 at 8:43 PM, Wush Wu wrote:
> Dear all,
>
> I am a new spark user from R.
>
> After exploring the schemaRDD, I notice that it is similar to data.frame.
> Is there a feature like
Have you taken a look at the TeradataDBInputFormat? Spark is compatible
with arbitrary hadoop input formats - so this might work for you:
http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw
On Thu, Jan 8, 2015 at 10:53 AM, gen tang wrote:
> Thanks a lo
va library which has
non-serializable objects will face this issue.
HTH,
Evan
On Tue, Nov 25, 2014 at 8:05 AM, Christopher Manning
wrote:
> I’m not (yet!) an active Spark user, but saw this thread on twitter … and
> am involved with Stanford CoreNLP.
>
> Could someone explain how t
If you only mark it as transient, then the object won't be serialized, and on
the worker the field will be null. When the worker goes to use it, you get an
NPE.
Marking it lazy defers initialization to first use. If that use happens to be
after serialization time (e.g. on the worker), then the
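A minimal sketch of that pattern, using a made-up Parser class in place of the real non-serializable library object:

// Hypothetical expensive, non-serializable resource.
class Parser { def parse(s: String): String = s.toLowerCase }

class Pipeline extends Serializable {
  // transient: not shipped with the closure; lazy: rebuilt on first use in each JVM.
  @transient lazy val parser = new Parser
  def run(line: String): String = parser.parse(line)
}

// val pipe = new Pipeline
// rdd.map(line => pipe.run(line))  // each worker JVM builds its own Parser lazily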
You can try recompiling spark with that option, and doing an sbt/sbt
publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT
(assuming you're building from the 1.1 branch) - sbt or maven (whichever
you're compiling your app with) will pick up the version of spark that you
just bu
fer to it from your map/reduce/map partitions or that it should
> be fine (presuming its thread safe), it will only be initialized once per
> classloader per jvm
>
> On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks
> wrote:
>
>> We have gotten this to work, but it requires instant
Additionally - I strongly recommend using OpenBLAS over the Atlas build
from the default Ubuntu repositories. Alternatively, you can build ATLAS on
the hardware you're actually going to be running the matrix ops on (the
master/workers), but we've seen modest performance gains doing this vs.
OpenBLA
ct MyCoreNLP {
>> @transient lazy val coreNLP = new coreNLP()
>> }
>>
>> and then refer to it from your map/reduce/map partitions or that it
>> should be fine (presuming its thread safe), it will only be initialized
>> once per classloader per jvm
>>
>>
We have gotten this to work, but it requires instantiating the CoreNLP object
on the worker side. Because of the initialization time it makes a lot of sense
to do this inside of a .mapPartitions instead of a .map, for example.
As an aside, if you're using it from Scala, have a look at sistanlp,
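A rough sketch of the mapPartitions version, with a stand-in class instead of the actual CoreNLP setup:

// Stand-in for an expensive-to-construct NLP pipeline (hypothetical).
class ExpensiveNlpPipeline {
  def annotate(doc: String): Int = doc.split("\\s+").length
}

// docs: RDD[String]. Pay the construction cost once per partition, not once per record.
val annotated = docs.mapPartitions { iter =>
  val pipeline = new ExpensiveNlpPipeline()
  iter.map(pipeline.annotate)
}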
I would expect an SQL query on c would fail because c would not be known in
the schema of the older Parquet file.
What I'd be very interested in is how to add a new column as an incremental
new parquet file, and be able to somehow join the existing and new file, in
an efficient way. IE, somehow
For sharing RDDs across multiple jobs - you could also have a look at
Tachyon. It provides an HDFS compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks - http://tachyon-project.org/.
On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal wrote:
> I believe the S
ly in their own
> clusters (load, train, save). and at some point during run time these
> sub-models merge into the master model, which also loads, trains, and saves
> at the master level.
>
> much appreciated.
>
>
>
> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks
>
to add?
On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh wrote:
> that works. is there a better way in spark? this seems like the most
> common feature for any machine learning work - to be able to save your
> model after training it and load it later.
>
> On Fri, Nov 7, 2014 at 2:
Plain old java serialization is one straightforward approach if you're in
java/scala.
On Thu, Nov 6, 2014 at 11:26 PM, ll wrote:
> what is the best way to save an mllib model that you just trained and
> reload
> it in the future? specifically, i'm using the mllib word2vec model...
> thanks.
>
>
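A minimal sketch of that approach; it works for any model object whose class is java-serializable (check this for the specific MLlib model you're using):

import java.io._

def saveModel(model: AnyRef, path: String): Unit = {
  val oos = new ObjectOutputStream(new FileOutputStream(path))
  try oos.writeObject(model) finally oos.close()
}

def loadModel[T](path: String): T = {
  val ois = new ObjectInputStream(new FileInputStream(path))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}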
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative but only into n buckets, rather than
two... you ca
In cluster settings if you don't explicitly call sc.stop() your application
may hang. Just as you close files, network connections, etc. when you're done
with them, it's a good idea to call sc.stop(), which lets the spark master
know that your application is finished consuming resources.
On Fri, Oct 31
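A common shape for this (sketch):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
try {
  // ... job logic ...
} finally {
  sc.stop()  // releases executors so the master can reclaim the resources
}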
/ rebuild the RDD (it tries to only
rebuild the missing part, but sometimes it must rebuild everything).
Job server can help with 1 or 2, 2 in particular. If you have any
questions about job server, feel free to ask at the spark-jobserver
google group. I am the maintainer.
-Evan
On Thu, Oct 23
speed up your program.
- Evan
> On Oct 20, 2014, at 3:54 AM, npomfret wrote:
>
> I'm getting the same warning on my mac. Accompanied by what appears to be
> pretty low CPU usage
> (http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777.
How many files do you have and how big is each JSON object?
Spark works better with a few big files vs many smaller ones. So you could try
cat'ing your files together and rerunning the same experiment.
- Evan
> On Oct 18, 2014, at 12:07 PM,
> wrote:
>
> Hi,
>
>
ggestion would be to backport 'spark.localExecution.enabled' to
the 1.0 line. Thanks for all your help!
Evan
On Fri, Oct 10, 2014 at 10:40 PM, Davies Liu wrote:
> This is some kind of implementation details, so not documented :-(
>
> If you think this is a blocker for you, you
Thank you! I was looking for a config variable to that end, but I was
looking in Spark 1.0.2 documentation, since that was the version I had
the problem with. Is this behavior documented in 1.0.2's documentation?
Evan
On 10/09/2014 04:12 PM, Davies Liu wrote:
When you call rdd.take
Caching after doing the multiply is a good idea. Keep in mind that during
the first iteration of KMeans, the cached rows haven't yet been
materialized - so it is both doing the multiply and the first pass of
KMeans all at once. To isolate which part is slow you can run
cachedRows.numRows() to force
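A sketch of that isolation step (the multiply is a hypothetical placeholder; names are illustrative):

// Force the multiplied rows to materialize before timing KMeans, so the multiply
// cost and the clustering cost can be measured separately.
val multiplied = rows.map(v => multiplyStep(v)).cache()  // multiplyStep: hypothetical, returns a Vector
multiplied.count()  // triggers the multiply and populates the cache

val t0 = System.nanoTime()
val model = org.apache.spark.mllib.clustering.KMeans.train(multiplied, 10, 20)
println(s"KMeans alone: ${(System.nanoTime() - t0) / 1e9} s")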
, you
can simply run step 1 yourself on your RowMatrix via the (experimental)
computeCovariance() method, and then run SVD on the result using a library
like breeze.
- Evan
On Tue, Sep 23, 2014 at 12:49 PM, st553 wrote:
> sowen wrote
> > it seems that the singular values from the S
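A rough sketch of those two steps (assuming `mat` is a RowMatrix; the breeze conversion works because both sides store values column-major):

import breeze.linalg.{svd, DenseMatrix => BDM}

// Step 1: the (experimental) covariance computation on the distributed RowMatrix.
val cov = mat.computeCovariance()  // a small, d x d local Matrix

// Step 2: local SVD of the covariance matrix with breeze.
val covBreeze = new BDM(cov.numRows, cov.numCols, cov.toArray)
val svd.SVD(u, s, vt) = svd(covBreeze)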
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really really slow if you're gonna move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen wrote:
> Base 64 i
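A minimal sketch of turning Kryo on via SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Optionally register the classes you shuffle most (via "spark.kryo.registrator")
// so Kryo doesn't have to write full class names with every object.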
Hi Abel,
Pretty interesting. May I ask how big is your point CSV dataset?
It seems you are relying on searching through the FeatureCollection of
polygons for which one intersects your point. This is going to be
extremely slow. I highly recommend using a SpatialIndex, such as the
many that exis
Sweet, that's probably it. Too bad it didn't seem to make 1.1?
On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust
wrote:
> The unknown slowdown might be addressed by
> https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd
>
> On Sun, Sep 14, 2014
SPARK-1671 looks really promising.
Note that even right now, you don't need to un-cache the existing
table. You can do something like this:
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
I spoke with SK offline about this; it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
wor
Asynchrony is not supported directly - spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
akka on worker nodes to enable message passing, or even used spark's own
ActorSystem to do this. But, I do not recommend this, since you lose a
bunch of be
Hmm... something is fishy here.
That's a *really* small dataset for a spark job, so almost all your time
will be spent in these overheads, but still you should be able to train a
logistic regression model with the default options and 100 iterations in
<1s on a single machine.
Are you caching your
Also - what hardware are you running the cluster on? And what is the local
machine hardware?
On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks
wrote:
> How many iterations are you running? Can you provide the exact details
> about the size of the dataset? (how many data points, how many fe
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
For very small datasets the scheduling overh
Filed SPARK-3295.
On Mon, Aug 25, 2014 at 12:49 PM, Michael Armbrust
wrote:
>> SO I tried the above (why doesn't union or ++ have the same behavior
>> btw?)
>
>
> I don't think there is a good reason for this. I'd open a JIRA.
>
>>
>> and it works, but is slow because the original Rdds are not
>
There's no way to avoid a shuffle due to the first and last elements
of each partition needing to be computed with the others, but I wonder
if there is a way to do a minimal shuffle.
On Thu, Aug 21, 2014 at 6:13 PM, cjwang wrote:
> One way is to do zipWithIndex on the RDD. Then use the index as
-jobserver
The git commit history is still there, but unfortunately the pull
requests don't migrate over. I'll be contacting each of you with
open PRs to move them over to the new location.
Happy Hacking!
Evan (@velvia)
Kelvin (@kelvinchu)
Daniel
And it worked earlier with a non-Parquet directory.
On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
> The underFS is HDFS btw.
>
> On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
>> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
>>
The underFS is HDFS btw.
On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan wrote:
> Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
>
> scala> val gdeltT =
> sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
> 14/08/21 19:07:14
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
scala> val gdeltT =
sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
14/08/21 19:07:14 INFO :
initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005,
Configuration: core-default.xml, core-site.xml
I just put up a repo with a write-up on how to import the GDELT public
dataset into Spark SQL and play around. Has a lot of notes on
different import methods and observations about Spark SQL. Feel free
to have a look and comment.
http://www.github.com/velvia/spark-sql-gdelt
014 at 12:17 AM, Michael Armbrust
wrote:
> I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must
> have the same schema.
>
>
> On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan wrote:
>>
>> Is it possible to merge two cached Spark SQL tables into a sing
cached too.
thanks,
Evan
That might not be enough. Reflection is used to determine what the
fields are, thus your class might actually need to have members
corresponding to the fields in the table.
I heard that a more generic method of inputting stuff is coming.
On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer wrote:
>
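A sketch of what that looks like with the Spark 1.x SQL API, where the case class members become the table's columns (file path and fields are illustrative):

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion for RDDs of case classes

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")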
edious, or to construct Parquet files manually (also
tedious).
thanks,
Evan
The loss functions are represented in the various names of the model
families. SVM is hinge loss, LogisticRegression is logistic loss,
LinearRegression is squared (least-squares) loss. These are used internally as arguments to
the SGD and L-BFGS optimizers.
On Thu, Aug 7, 2014 at 6:31 PM, SK wrote:
> Hi,
>
> Ac
Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
which is in MLlib (Spark 1.0) and a distributed sparse SVD coming in Spark
1.1. (https://issues.apache.org/jira/browse/SPARK-1782). If your data
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a "combiner" in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
On Thu
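A tiny sketch of that behavior (word-count style; `words` is an RDD[String]):

// reduceByKey needs an associative, commutative function; Spark applies it map-side
// first (like a MapReduce combiner), so only partial sums get shuffled.
val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)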
Try s3n://
> On Aug 6, 2014, at 12:22 AM, sparkuser2345 wrote:
>
> I'm getting the same "Input path does not exist" error also after setting the
> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
> the format "s3:///test_data.txt" for the input file.
>
>
>
> --
>
ficient way to do it, but I
think it's a nice example of how to think about using spark at a higher
level of abstraction.
- Evan
On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen wrote:
> Here's the more functional programming-friendly take on the
> computation (but
Computing the variance is similar to this example, you just need to keep
around the sum of squares as well.
The formula for variance is (sumsq/n) - (sum/n)^2
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
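A sketch of the plain single-pass version (ignoring the numerical-stability tricks mentioned above; `data` is an RDD[Double]):

// Keep n, sum, and sum of squares in one pass; then var = E[x^2] - E[x]^2.
val (n, sum, sumSq) = data.aggregate((0L, 0.0, 0.0))(
  (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

val mean = sum / n
val variance = sumSq / n - mean * mean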
Can you share the dataset via a gist or something and we can take a look at
what's going on?
On Fri, Jul 25, 2014 at 10:51 AM, SK wrote:
> yes, the output is continuous. So I used a threshold to get binary labels.
> If prediction < threshold, then class is 0 else 1. I use this binary label
> t
Try sc.getExecutorStorageStatus().length
SparkContext's getExecutorMemoryStatus or getExecutorStorageStatus will
give you back an object per executor - the StorageStatus objects are what
drives a lot of the Spark Web UI.
https://spark.apache.org/docs/1.0.1/api/scala/index.html#org.apache.spark.Sp
ze
of your dataset, may or may not be a good idea.
There are some tricks you can do to make training multiple models on the
same dataset faster, which we're hoping to expose to users in an upcoming
release.
- Evan
On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen wrote:
> If you call .par on da
There is a method in org.apache.spark.mllib.util.MLUtils called "kFold"
which will automatically partition your dataset for you into k train/test
splits at which point you can build k different models and aggregate the
results.
For example (a very rough sketch - assuming I want to do 10-fold cross
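A rough sketch along those lines (hypothetical train/evaluate helpers; assumes the kFold(rdd, numFolds, seed) signature in MLUtils):

import org.apache.spark.mllib.util.MLUtils

val folds = MLUtils.kFold(data, 10, 42)  // Array of (training, test) RDD pairs
val errors = folds.map { case (training, test) =>
  val model = trainMyModel(training)  // hypothetical training function
  evaluate(model, test)               // hypothetical evaluation returning a Double
}
println(s"mean CV error = ${errors.sum / errors.length}")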
Oh, and the MovieLens one is userid::movieid::rating
- Evan
> On Jun 22, 2014, at 3:35 PM, Justin Yip wrote:
>
> Hello,
>
> I am looking into a couple of MLLib data files in
> https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any
> explanation
but can double (or more)
storage requirements for dense data.
- Evan
> On Jun 22, 2014, at 3:35 PM, Justin Yip wrote:
>
> Hello,
>
> I am looking into a couple of MLLib data files in
> https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any
>
Also - you could consider caching your data after the first split (before
the first filter), this will prevent you from retrieving the data from s3
twice.
On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng wrote:
> Your data source is S3 and data is used twice. m1.large does not have very
> good ne
I use SBT, create an assembly, and then add the assembly jars when I create
my spark context. I run the main program with something like "java -cp ...
MyDriver".
That said - as of spark 1.0 the preferred way to run spark applications is
via spark-submit -
http://spark.apache.org/docs/latest/submi
r algorithm fits in a BSP
programming model, with Spark you can achieve performance that is
comparable to a tuned C++/MPI codebase by leveraging the right libraries
locally and thinking carefully about what and when you have to communicate.
- Evan
On Thu, Jun 19, 2014 at 8:48 AM, ldmtwo wrot
This looks like a job for SparkSQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
MyRecord("USA", "Bob", 55, 108), MyRecord
I wouldn't be surprised if the default BLAS that ships with jblas is not
optimized for your target platform. Breeze (which we call into) uses
jblas and falls back to netlib if jblas can't be loaded. I'd recommend
using jblas if you can.
You probably want to compile a native BLAS library sp
I should point out that if you don't want to take a polyglot approach to
languages and reside solely in the JVM, then you can just use plain old
java serialization on the Model objects that come out of MLlib's APIs from
Java or Scala and load them up in another process and call the relevant
.predic
ou
wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
necessary for heap space for the worker to compute/store the histograms,
and I guess 2x that on the master to do the reduce.
Again 2GB per worker is pretty tight, because there are overheads of just
starting the jvm, launc
intuitive sense IMO
> because a decision tree is a non-parametric model, and the expressibility
> of a tree depends on the number of nodes.
>
> With a huge amount of data (millions or even billions of rows), we found
> that the depth of 10 is simply not adequate to build high-
reduction technique, but I'd be a little surprised if a bunch of hugely
deep trees don't overfit to training data. I guess I view limiting tree
depth as an analogue to regularization in linear models.
On Thu, Apr 17, 2014 at 12:19 PM, Sung Hwan Chung
wrote:
> Evan,
>
> I actual
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient
Multiclass classification, Gradient Boosting, and Random Forest support for
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that pe
A bandaid might be to set up ssh tunneling between slaves and master - has
anyone tried deploying this way? I would expect it to pretty negatively
impact performance on communication-heavy jobs.
On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black wrote:
> Only if you trust the provider networks and
ate branch with TPE (as well as random and grid
search) integrated with MLI, but the code is research quality right now and
not extremely general.
We're actively working on bringing these things up to snuff for a proper
open source release.
On Fri, Apr 4, 2014 at 11:28 AM, Yi Zou wrote:
Have a look at MultipleOutputs in the hadoop API. Spark can read and write to
arbitrary hadoop formats.
> On Apr 4, 2014, at 6:01 AM, dmpour23 wrote:
>
> Hi all,
> Say I have an input file which I would like to partition using
> HashPartitioner k times.
>
> Calling rdd.saveAsTextFile(""hdfs:
Targeting 0.9.0 should work out of the box (just a change to the build.sbt)
- I'll push some changes I've been sitting on to the public repo in the
next couple of days.
On Wed, Apr 2, 2014 at 4:05 AM, Krakna H wrote:
> Thanks for the update Evan! In terms of using MLI, I see th
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the
stack are still being developed. Look for updates in terms of our progress
on the hyperparameter tuning/model selection problem in the next month or
so!
- Evan
On Tue, Apr 1, 2014 at 8:05 PM, Krakna H wrote
http://svonava.com/post/62186512058/datasets-released-by-google
- Evan
On Tue, Feb 25, 2014 at 6:33 PM, 黄远强 wrote:
> Hi all:
> I am a freshman in the Spark community. I dream of being an expert in the field
> of big data. But I have no idea where to start after I have gone through
> the publis