Hi Parviz,
Yes certainly - by all means! Thanks for all your hard work making it easy
to run Spark on EMR...
Cheers,
Alex
On Fri, Apr 18, 2014 at 5:27 AM, Pdeyhim wrote:
> Awesome! OK if we mention this on the AWS blog?
>
> Sent from my iPad
>
> On Apr 17, 2014, at 7:10 AM, Alex Dean wrote:
>
>
It happens at a normal data rate, i.e., let's say 20 records per second.
Apart from that, I am also getting some more strange behavior. Let me
explain.
I establish two StreamingContexts (SSCs) and start them one after another.
In the SSCs I get the streams from Kafka sources and do some manipulations,
like adding some
"Re
I am trying to implement a join with co-partitioned inputs. As described in
the documentation, we can avoid shuffling by partitioning elements with the
same hash code onto the same machine.
>>> links = sc.parallelize([('a','b'),('a','c'),('b','c'),('c','a')]).groupByKey(3)
>>> links.glom().co
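Here is a minimal Scala sketch of the same idea (illustrative data, assuming a
SparkContext named sc): give both inputs the same partitioner up front, and the
subsequent join needs no shuffle.

import org.apache.spark.HashPartitioner

// One shared partitioner: keys with equal hash codes land in the same
// partition, and therefore on the same machine.
val partitioner = new HashPartitioner(3)

val links = sc.parallelize(Seq(("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")))
  .groupByKey(partitioner)

val ranks = sc.parallelize(Seq(("a", 1.0), ("b", 1.0), ("c", 1.0)))
  .partitionBy(partitioner)

// links and ranks share the same partitioner, so this join does not
// re-shuffle either input.
val joined = links.join(ranks)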
Hi,
Good to know. Can the Ooyala Spark job server run on YARN? Is
there a job scheduler?
On Fri, Apr 18, 2014 at 12:12 PM, All In A Days Work <
allinadays...@gmail.com> wrote:
> Hi,
>
> At the 2013 Spark Summit, Ooyala presented their Spark job server and
> indicated that they wanted
Awesome! OK if we mention this on the AWS blog?
Sent from my iPad
> On Apr 17, 2014, at 7:10 AM, Alex Dean wrote:
>
> Hi all,
>
> Just a quick email to share a new GitHub project we've just released at
> Snowplow:
>
> https://github.com/snowplow/spark-example-project
>
> It's an example Scala &
Hi,
At the 2013 Spark Summit, Ooyala presented their Spark job server and
indicated that they wanted to open source the work.
Is there any plan to merge this functionality into Spark itself, rather
than offering it only as Ooyala's open-sourced version?
Thanks,
Got it, thank you.
On Fri, Apr 18, 2014 at 9:55 AM, Cheng Lian wrote:
> Ah, I’m not saying println is bad, it’s just that you need to go to the
> right place to locate the output, e.g. you can check stdout of any executor
> from the Web UI.
>
>
> On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 wrote:
>
>>
Preferably, increase the ulimit on your machines. Spark needs to access a lot of
small files, which makes the number of open file handles hard to control.
—
Sent from Mailbox
On Fri, Apr 18, 2014 at 3:59 AM, Ryan Compton
wrote:
> Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
> "true") and use ex
Ah, I’m not saying println is bad, it’s just that you need to go to the
right place to locate the output, e.g. you can check stdout of any executor
from the Web UI.
On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 wrote:
> Hi Cheng,
>
> thank you for letting me know this. So what do you think is a better way to
Hi Cheng,
thank you for letting me know this. So what do you think is a better way to
debug?
On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian wrote:
> A tip: using println is only convenient when you are working in local
> mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output
>
A tip: using println is only convenient when you are working in local
mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output
of println goes to the executors' stdout.
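For example, with an existing RDD named rdd:

// Runs on the executors: each println goes to that executor's stdout,
// which you can open via the executor's "stdout" link in the web UI.
rdd.foreach(x => println("executor saw: " + x))

// Runs on the driver: bring a small sample back first, then print locally.
rdd.take(10).foreach(x => println("driver saw: " + x))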
On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 wrote:
> Yeah, I got it!
> Using println to debug is great for me to explo
Yes, it should be data-specific, and perhaps we're biased toward the data
sets that we are playing with. To put things in perspective, we're highly
interested in (and I believe our customers are):
1. large (hundreds of millions of rows)
2. multi-class classification - nowadays, dozens of target ca
This is a good use case for DStream.updateStateByKey! This allows you
to maintain arbitrary per-key state. Check out this example:
https://github.com/tdas/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
Also take a look at the docume
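For reference, a minimal sketch in that spirit (a running count per key;
`words` stands in for a DStream[String] from your source, and the streaming
context must have a checkpoint directory set):

// Carry a running count per key across batches.
val updateCount: (Seq[Int], Option[Int]) => Option[Int] =
  (newValues, runningCount) => Some(newValues.sum + runningCount.getOrElse(0))

// updateStateByKey requires checkpointing, e.g. ssc.checkpoint("/tmp/checkpoints")
val wordCounts = words.map(w => (w, 1)).updateStateByKey[Int](updateCount)
wordCounts.print()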
What kind of data are you training on? These effects are *highly* data
dependent, and while saying "the depth of 10 is simply not adequate to
build high-accuracy models" may be accurate for the particular problem
you're modeling, it is not true in general. From a statistical perspective,
I consider
Yeah, I got it! Using println to debug is great for me to explore Spark.
Thank you very much for your kind help.
On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Here's a way to debug something like this:
>
> scala> d5.keyBy(_.split(" ")(0)).reduc
Btw, I've got System.setProperty("spark.shuffle.consolidate.files",
"true") and use ext3 (CentOS...)
On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton wrote:
> Does this continue in newer versions? (I'm on 0.8.0 now)
>
> When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
> I'm gue
Does this continue in newer versions? (I'm on 0.8.0 now)
When I use .distinct() on moderately large datasets (224GB, 8.5B rows,
I'm guessing about 500M are distinct) my jobs fail with:
14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to
java.io.FileNotFoundException
java.io.File
I believe they show one example comparing a depth-1 ensemble vs. a depth-3
ensemble, but it is based on boosting, not bagging.
On Thu, Apr 17, 2014 at 2:21 PM, Debasish Das wrote:
> Evan,
>
> Wasn't the MLlib decision tree implemented using ideas from Google's PLANET
> paper? Does the paper also prop
Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo
serialization for closures? Is there any problem with that?
On Apr 17, 2014 11:10 PM, "Eugen Cepoi" wrote:
> You have two kinds of serialization: data and closures. They both use Java
> serialization by default. This means that if in your function you reference an objec
Evan,
Wasn't the MLlib decision tree implemented using ideas from Google's PLANET
paper? Does the paper also propose growing a shallow tree?
Thanks.
Deb
On Thu, Apr 17, 2014 at 1:52 PM, Sung Hwan Chung
wrote:
> Additionally, the 'random features per node' (or mtry in R) is a very
> important feat
You have two kinds of serialization: data and closures. They both use Java
serialization by default. This means that if in your function you reference an
object outside of it, it gets serialized with your task. To enable Kryo
serialization for closures, set the spark.closure.serializer property. But
usually I don't, as it allows me to detect such
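For concreteness, a sketch of the two settings being discussed (property names
as in Spark of this era; whether to enable the closure one is exactly the
trade-off above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Kryo for data (RDD elements, shuffle, broadcast) - generally a good idea.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Kryo for closures - optional; Eugen's point above is that leaving this
  // unset (Java) makes accidentally captured non-serializable objects fail fast.
  .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")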
Additionally, the 'random features per node' (or mtry in R) is a very
important feature for Random Forest. The variance reduction comes when the
trees are decorrelated from each other, and often random features per
node do more than bootstrap samples. And this is something that would
have to be
Well, if you read the original paper,
http://oz.berkeley.edu/~breiman/randomforest2001.pdf
"Grow the tree using CART methodology to maximum size and do not prune."
Now, The Elements of Statistical Learning (page 598) says that you
could potentially overfit fully-grown regression random fores
Hi, I'm completely new to Spark streaming (and Spark) and have been reading
up on it and trying out various examples the past few days. I have a
particular use case which I think it would work well for, but I wanted to
put it out there and get some feedback on whether or not it actually would.
The
Now I have another problem... I have to pass one of these non-serializable
objects to a PairFunction and I received another non-serializable
exception. It seems that Kryo doesn't work within Functions. Am I wrong, or
is this a limitation of Spark?
On Apr 15, 2014 1:36 PM, "Flavio Pompermaier" wrote:
> Ok th
Has anyone managed to write Booleans to Cassandra from an RDD with Calliope?
My Booleans give compile-time errors: expression of type List[Any] does not
conform to expected type Types.CQLRowValues.
CQLColumnValue is defined as ByteBuffer: type CQLColumnValue = ByteBuffer
For now I convert them to
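For what it's worth, here is a small sketch of that kind of conversion; treat
the single-byte 0x01/0x00 encoding as an assumption to verify against the
Calliope/Cassandra docs.

import java.nio.ByteBuffer

// Encode a Boolean as one byte (0x01 for true, 0x00 for false).
def booleanToByteBuffer(b: Boolean): ByteBuffer =
  ByteBuffer.wrap(Array[Byte](if (b) 1.toByte else 0.toByte))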
Oh dear, I read this as a build problem. I can build with the latest
Java 7, including those versions of Spark and Mesos, no problem. I did
not deploy them.
Mesos does have some native libraries, so it might well be some kind
of compatibility issue at that level. Anything more in the error log
that
Sure. Here it is. Pretty sure it's something else. Any suggestions on other
avenues to investigate from folks who've seen this?
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7f543716cce9, pid=8260, tid=13226316544
#
# JRE version: J
If you can test it quickly, an option would be to try the exact same
version that Sean used (1.7.0_51)?
Maybe it was a bug fixed in _51 and a regression was introduced in _55
:-D
Andy
On Thu, Apr 17, 2014 at 9:36 PM, Steven Cox wrote:
> FYI, I've tried older versions (jdk6.x), openjdk
FYI, I've tried older versions (jdk6.x), openjdk. Also here's a fresh core dump
on jdk7u55-b13:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7f7c6b718d39, pid=7708, tid=140171900581632
#
# JRE version: Java(TM) SE Runtime Environment (7.0_
I'm quite new myself (just subscribed to the mailing list today :)), but
this happens to be something we've had success with. So let me know if you
hit any problems with this sort of usage.
On Thu, Apr 17, 2014 at 9:11 PM, Jim Carroll wrote:
> Daniel,
>
> I'm new to Spark but I thought that thr
Hmm... can you provide some pointers to examples where deep trees are
helpful?
Typically with Decision Trees you limit depth (either directly or
indirectly with minimum node size and minimum improvement criteria) to
avoid overfitting. I agree with the assessment that forests are a variance
reducti
No, of course not, but I was guessing that some native libs imported into the
project (to communicate with Mesos) could... miserably crash the JVM.
Anyway, so you're telling us that with this Oracle version you don't have any
issues using Spark on Mesos 0.18.0; that's interesting, because AFAIR,
my last t
Evan,
I actually haven't heard of 'shallow' random forests. I think the only
scenarios where shallow trees are useful are boosting scenarios.
AFAIK, Random Forest is a variance-reducing technique and doesn't do much
about bias (although some people claim that it does have some bias-reducing
e
Daniel,
I'm new to Spark but I thought that thread hinted at the right answer.
Thanks,
Jim
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Continuously-running-non-streaming-jobs-tp4391p4397.html
Sent from the Apache Spark User List mailing list archive
Sorry - I meant to say that "Multiclass classification, Gradient Boosting,
and Random Forest support based on the recent Decision Tree implementation
in MLlib is planned and coming soon."
On Thu, Apr 17, 2014 at 12:07 PM, Evan R. Sparks wrote:
> Multiclass classification, Gradient Boosting, and
I don't know if it's anything you or the project is missing... that's
just a JDK bug.
FWIW I am on 1.7.0_51 and have not seen anything like that.
I don't think it's a protobuf issue -- you don't crash the JVM with
simple version incompatibilities :)
--
Sean Owen | Director, Data Science | London
Heya,
I still have to try it myself (I'm trying to create GCE images with Spark
on Mesos 0.18.0), but I think your change is one of the required ones;
however, my gut feeling is that others will be required to get this working.
Actually, in my understanding, this core dump is due to protobuf
incom
Multiclass classification, Gradient Boosting, and Random Forest support
based on the recent Decision Tree implementation in MLlib.
Sung - I'd be curious to hear about your use of decision trees (and
forests) where you want to go to 100+ depth. My experience with random
forests has been that pe
The linked thread does a good job answering your question. You should
create a SparkContext at startup and re-use it for all of your queries. For
example we create a SparkContext in a web server at startup, and are then
able to use the Spark cluster for serving Ajax queries with latency of a
second
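A minimal sketch of that pattern (names, master URL, and the query itself are
illustrative, not from any particular web framework):

import org.apache.spark.{SparkConf, SparkContext}

// Created once at server startup and shared by all request handlers.
object SharedSpark {
  lazy val sc: SparkContext = {
    val conf = new SparkConf()
      .setAppName("query-server")
      .setMaster("spark://master:7077")
    new SparkContext(conf)
  }
}

// Example handler: reuses the already-running context, so each query only
// pays job-scheduling latency, not application-startup latency.
def handleQuery(path: String): Long =
  SharedSpark.sc.textFile(path).count()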
Debasish, we've tested the MLlib decision tree a bit and it eats up too
much memory for RF purposes.
Once the tree got to depth 8~9, it was easy to get a heap exception, even
with 2~4 GB of memory per worker.
With RF, it's very easy to reach 100+ depth with even only 100,000+
rows (because trees
So I tried a fix found on the list...
"The issue was due to meos version mismatch as I am using latest mesos
0.17.0, but spark uses 0.13.0.
Fixed by updating the SparkBuild.scala to latest version."
I changed this line in SparkBuild.scala
"org.apache.mesos" % "mesos"
Is there a way to create continuously-running, or at least
continuously-loaded, jobs that can be 'invoked' rather than 'sent', to
avoid the job-creation overhead of a couple of seconds?
I read through the following:
http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-
MLlib has a decision tree... there is an RF PR which is not active now... take
that and swap the tree builder with the fast tree builder that's in
MLlib... search for the Spark JIRA... the code is based on the Google PLANET
paper...
I am sure people on the dev list are already working on it... send an email to
Here's a way to debug something like this:
scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1, v2) => {
  println("v1: " + v1)
  println("v2: " + v2)
  (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
}).collect
You get:
v1: 1 2 3 4 5
v2: 1 2 3 4 5
v1: 4
v
Hi,
I am new to Spark. When trying to write some simple tests in the Spark shell, I
ran into the following problem.
I created a very small text file, named 5.txt:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
and experimented in the Spark shell:
scala> val d5 = sc.textFile("5.txt").cache()
d5: org.apache.spark.rdd.RDD[String] = Ma
How many tasks are there in your job?
Sent from my iPhone
On Apr 17, 2014, at 16:24, Qin Wei wrote:
> Hi Andre, thanks a lot for your reply, but I still get the same exception;
> the complete exception message is as below:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task
> 1.0:9
Hi all,
Just a quick email to share a new GitHub project we've just released at
Snowplow:
https://github.com/snowplow/spark-example-project
It's an example Scala & SBT project which can assemble a fat jar ready for
running on Amazon Elastic MapReduce. It includes Specs2 tests too.
The blog post
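For anyone wondering what the fat-jar side of such a project typically looks
like, here is a hedged build.sbt sketch using the sbt-assembly plugin; the
Snowplow project's actual settings and version numbers may differ, so check the
repository itself. Spark is marked "provided" because EMR supplies it at
runtime, and the jar is produced with `sbt assembly`.

name := "spark-example-project"

version := "0.1.0"

scalaVersion := "2.10.4"

// Spark itself is provided by the cluster at runtime, so it stays out of the fat jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"

// Specs2 for the test suite.
libraryDependencies += "org.specs2" %% "specs2" % "2.3.11" % "test"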
Hi Arpit,
I didn't build it; I am using the prebuilt version described here:
http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html
including adding e.g. the mentioned jar
br...Gerd...
On 17 April 2014 15:49, Arpit Tak wrote:
> Just for curiosity , as you are using Cloudera-Mana
Just out of curiosity, as you are using Cloudera Manager Hadoop and Spark:
how did you build Shark for it?
Are you able to read any file from HDFS? Did you try that out?
Regards,
Arpit Tak
On Thu, Apr 17, 2014 at 7:07 PM, ge ko wrote:
> Hi,
>
> the error java.lang.ClassNotFoundE
Hi,
the error java.lang.ClassNotFoundException:
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat has been
resolved by adding
parquet-hive-bundle-1.4.1.jar to Shark's lib folder.
Now the Hive metastore can be read successfully (including the Parquet-based
table).
But if I want to select fr
Hi Wei,
Take a look at this post...
http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-td2016.html
Regards,
Arpit Tak
On Thu, Apr 17, 2014 at 3:42 PM, Wei Wang wrote:
> Hi, there
>
> I would like to know is there any differences
Hi there,
I would like to know whether there are any differences between Spark on YARN
and Spark on Mesos. Is there any comparison between them? What are the
advantages and disadvantages of each? Is there any criterion for
choosing between YARN and Mesos?
BTW, we need MPI in our framework, and I
Hi,
I want to select from a Parquet-based table in Shark, but receive the error:
shark> select * from wl_parquet;
14/04/17 11:33:49 INFO shark.SharkCliDriver: Execution Mode: shark
14/04/17 11:33:49 INFO ql.Driver:
14/04/17 11:33:49 INFO ql.Driver:
14/04/17 11:33:49 INFO ql.Driver:
14/04/17 11
what do you mean by "element"?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-difference-between-element-and-partition-tp4317p4378.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
thanks Arpit, gotcha ;)
On 16 April 2014 20:08, Arpit Tak wrote:
> just set your java class path properly
>
> export JAVA_HOME=/usr/lib/jvm/java-7-. (somewhat like this...whatever
> version you having)
>
> it will work
>
> Regards,
> Arpit
>
>
> On Wed, Apr 16, 2014 at 1:24 AM, ge ko w
No, the partition number is determined by the parameter you set in groupByKey; see
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions
for details. I suggest reading some docs before asking questions.
Joe L wrote
> I was wonder if groupByKey returns 2 partition
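A quick illustration of that point in the Scala shell (assuming a SparkContext
named sc):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// The numPartitions argument controls how many partitions the result has.
val grouped = pairs.groupByKey(3)
println(grouped.partitions.size)  // prints 3

// Without the argument, Spark falls back to the default partitioner
// (based on the parent's partitioning / spark.default.parallelism).
val groupedDefault = pairs.groupByKey()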
Hi,
For one of my applications, I want to use Random Forests (RF) on top of Spark. I
see that currently MLlib does not have an implementation of RF. What other
open-source RF implementations would be good to use with Spark in terms of speed?
Regards,
Laeeq Ahmed,
KTH, Sweden.
According to the Spark SQL documentation, indeed, this project allows
Python to be used while reading/writing tables, i.e., data which is not
necessarily in text format.
Thanks a lot!
Bertrand Dechoux
On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux wrote:
> Thanks for the JIRA reference. I really ne
ES formats are pretty easy to use:
Reading:
val conf = new Configuration()
conf.set("es.resource", "index/type")
conf.set("es.query", "?q=*")
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[EsInputFormat[NullWritable, LinkedMapWritable]],
  classOf[NullWritable],
  classOf[LinkedMapWritable]
)
The only g
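And a small follow-on sketch for working with what comes back (the field name
is illustrative; each value is a LinkedMapWritable of field name -> Writable):

import org.apache.hadoop.io.Text

// Pull one field out of each returned document, guarding against missing fields.
val titles = rdd.map { case (_, doc) =>
  Option(doc.get(new Text("title"))).map(_.toString).getOrElse("")
}
titles.take(5).foreach(println)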
Hello!
On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia wrote:
> Hi,
>
> Google has published a new connector for Hadoop: Google Cloud Storage,
> which is an equivalent of Amazon S3:
>
>
> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>
This
Hi Andre, thanks a lot for your reply, but I still get the same exception; the
complete exception message is as below:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task
1.0:9 failed 4 times (most recent failure: Exception failure:
java.lang.OutOfMemoryError: Jav
It could be a GC issue: the first time, it triggers a full GC that takes too
much time?
Make sure you have Xms and Xmx set to the same value, and try
-XX:+UseConcMarkSweepGC.
And analyse the GC logs.
André Bois-Crettez
On 2014-04-16 16:44, Arpit Tak wrote:
I am loading some data(25GB) in shark from hdfs : sp
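For reference, a hedged sketch of passing such flags to executors;
spark.executor.extraJavaOptions is the Spark 1.0+ way, while older deployments
set the same flags through the SPARK_JAVA_OPTS environment variable, so adapt
to your version:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("gc-tuned-job")
  // Heap size comes from executor memory rather than raw -Xmx flags.
  .set("spark.executor.memory", "4g")
  // CMS plus GC logging so long pauses show up in the executor logs.
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")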
Thanks for the JIRA reference. I really need to look at Spark SQL.
Am I right to understand that, thanks to Spark SQL, Hive data can be read (and
it does not need to be a text format) and then 'classical' Spark can work
on this extraction?
It seems logical, but I haven't worked with Spark SQL as of now