Thanks Tobias,
I also found this: https://issues.apache.org/jira/browse/SPARK-3299
Looks like it's being worked on.
Jianshi
On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer wrote:
> Hi,
>
> On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang
> wrote:
>
>> Err... there's no such feature?
>>
>
> The
Thank you Aaron for pointing out the problem. This only happens when I run
this code in spark-shell but not when I submit the job.
There is an undocumented configuration to put user jars in front of the
Spark jar. But I'm not very certain that it works as expected (and
this is why it is undocumented). Please try turning on
spark.yarn.user.classpath.first . -Xiangrui
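For illustration, a minimal sketch (assuming a Spark 1.x application on YARN; the app name is hypothetical) of setting that flag programmatically; the same property could also go in conf/spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

// the only relevant line is the classpath flag
val conf = new SparkConf()
  .setAppName("UserClasspathFirstExample")
  .set("spark.yarn.user.classpath.first", "true")  // prefer user jars over the Spark assembly
val sc = new SparkContext(conf)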
On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen wrote:
> I
You can try LinearRegression with sparse input. It converges to the least
squares solution if the linear system is over-determined, while the
convergence rate depends on the condition number. Applying standard
scaling is a popular heuristic to reduce the condition number.
If you are interested in spars
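For illustration, a minimal sketch (assuming a spark-shell session where sc is available, and MLlib's RDD-based API of Spark 1.x, where linear regression is exposed as LinearRegressionWithSGD) of training on sparse input:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// toy data: sparse vectors are built from (size, indices, values)
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0))),
  LabeledPoint(0.0, Vectors.sparse(4, Array(1, 2), Array(3.0, 4.0)))
))

val model = LinearRegressionWithSGD.train(training, 100)  // 100 iterations of SGD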
Doing a quick Google search, it appears to me that there are a number of people
who have implemented algorithms for solving systems of (sparse) linear
equations on Hadoop MapReduce.
However, I can find no such thing for Spark.
Does anyone have information on whether there are attempts at creating such an
I would say that the first three are all used pretty heavily. Mesos
was the first one supported (long ago), the standalone is the
simplest and most popular today, and YARN is newer but growing a lot
in activity.
SIMR is not used as much... it was designed mostly for environments
where users had a
Hi,
I'm trying to determine which Spark deployment models are the most popular
- Standalone, YARN, Mesos, or SIMR. Does anyone know?
I thought I'd use search-hadoop.com to help me figure this out and this is
what I found:
1) Standalone
http://search-hadoop.com/?q=standalone&fc_project=Spark&fc_typ
Hi,
On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang
wrote:
> Err... there's no such feature?
>
The problem is that the SQLContext's `catalog` member is protected, so you
can't access it from outside. If you subclass SQLContext, and make sure
that `catalog` is always a `SimpleCatalog`, you can che
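For illustration, a minimal sketch of that approach; it assumes the Spark SQL 1.0/1.1 internals, where the protected catalog's lookupRelation(databaseName, tableName) throws when a table is not registered, so a failed Try can be read as "not found". The helper name is hypothetical:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import scala.util.Try

class InspectableSQLContext(sc: SparkContext) extends SQLContext(sc) {
  // hypothetical helper: probe the protected catalog for a registered table
  def tableExists(name: String): Boolean =
    Try(catalog.lookupRelation(None, name)).isSuccess
}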
Hi,
On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan
wrote:
>
> Does Spark support recursive calls?
>
Can you give an example of which kind of recursion you would like to use?
Tobias
When a MappedRDD is handled by the groupByKey transformation, tuples distributed
across different worker nodes with the same key will be collected onto one
worker node, say,
(K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)).
I want to know whether the value /Seq(V1, V2, ..., Vn)/ of a tuple i
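For illustration, a minimal sketch of the transformation described above (assuming a spark-shell session; note that in Spark 1.x groupByKey returns an Iterable per key rather than a Seq):

val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3), ("j", 4)))
val grouped = pairs.groupByKey()
// e.g. ("k", Iterable(1, 2, 3)) and ("j", Iterable(4)), each key's values gathered on one node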
I have the following code:
stream foreachRDD { rdd =>
  if (rdd.take(1).size == 1) {
    rdd foreachPartition { iterator =>
      initDbConnection()
      iterator foreach {
        // write to db
Sometimes the underlying Hive code will also print exceptions during
successful execution (for example CREATE TABLE IF NOT EXISTS). If there is
actually a problem Spark SQL should throw an exception.
What is the command you are running and what is the error you are seeing?
On Sat, Sep 6, 2014 a
... I'd call out that last bit as actually tricky: "close off the driver"
See this message for the right-est way to do that, along with the
right way to open DB connections remotely instead of trying to
serialize them:
http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQ
Also keep in mind there is a non-trivial amount of traffic between the
driver and cluster. It's not something I would do by default, running
the driver so remotely. With enough ports open it should work though.
On Sun, Sep 7, 2014 at 7:05 PM, Ognen Duzlevski
wrote:
> Horacio,
>
> Thanks, I have n
Hi Tathagata,
I have managed to implement the logic in the Kafka-Spark consumer to
recover from driver failure. This is just an interim fix until the actual fix
is done on the Spark side.
The logic is something like this:
1. When the individual receiver starts for every topic partition, it
writes the
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh.
On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini wrote:
> Hi,
>
> I would like to copy log files from s3 to the cluster's
> ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
> run
I think you need to run start-all.sh or something similar on the EC2
cluster. MR is installed but is not running by default on EC2 clusters spun
up by spark-ec2.
On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini
wrote:
> I've installed a spark standalone cluster on ec2 as defined here -
> https
Horacio,
Thanks, I have not tried that, however, I am not after security right
now - I am just wondering why something so obvious won't work ;)
Ognen
On 9/7/2014 12:38 PM, Horacio G. de Oro wrote:
Have you tried with ssh? It will be much more secure (only 1 port open),
and you'll be able to run
Have you tried with ssh? It will be much more secure (only 1 port open),
and you'll be able to run spark-shell over the network. I'm using it that
way in my project (https://github.com/data-tsunami/smoke) with good
results.
I can't give it a try right now, but something like this should work:
ssh -tt ec2-user@YOU
Have you actually tested this?
I have two instances, one is standalone master and the other one just
has spark installed, same versions of spark (1.0.0).
The security group on the master allows all (0-65535) TCP and UDP
traffic from the other machine and the other machine allows all TCP/UDP
I've installed a spark standalone cluster on ec2 as defined here -
https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
mr1/2 is part of this installation.
On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin wrote:
> Distcp requires a mr1(or mr2) cluster to start. Do you have a mapreduc
Distcp requires an mr1 (or mr2) cluster to start. Do you have a mapreduce cluster
on your hdfs?
And from the error message, it seems that you didn't specify your jobtracker
address.
--
Ye Xianjin
On Sunday, September 7, 2014 at 9:42 PM, T
Spark will simply have a backlog of tasks; it'll manage to process them
nonetheless, though if it keeps falling behind, you may run out of memory
or have unreasonable latency. For momentary spikes, Spark Streaming will
manage.
Mostly, if you are looking to do 100% processing, you'll have to go with
Your question is a bit confusing.
I assume you have an RDD containing nodes & some metadata (child nodes
maybe) & you are trying to attach another piece of metadata to it (a byte array). If
it's just the same byte array for all nodes, you can generate an RDD with the count
of nodes & zip the two RDDs together; you can
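For illustration, a minimal sketch of that idea; note that RDD.zip requires both RDDs to have identical partitioning, so if the byte array really is the same for every node, the simpler variant sketched here just maps it on directly (the node RDD is hypothetical):

val nodes = sc.parallelize(Seq("n1", "n2", "n3"))      // hypothetical node RDD
val meta: Array[Byte] = Array(1, 2, 3)
val nodesWithMeta = nodes.map(node => (node, meta))    // RDD[(String, Array[Byte])]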
Statements are executed only when you try to cause some effect on the
server (produce data, collect data on the driver). At the time of execution Spark
does all the dependency resolution & truncates paths that don't go anywhere,
as well as optimizing execution pipelines. So you really don't have to worry
about t
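For illustration, a minimal sketch of that lazy-evaluation behaviour: nothing runs until an action is called (assuming a spark-shell session):

val data = sc.parallelize(1 to 1000)
val mapped = data.map(_ * 2)          // no job yet, only the lineage is recorded
val filtered = mapped.filter(_ > 10)  // still no job
val result = filtered.count()         // the action triggers execution of the whole pipeline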
The standard pattern is to initialize the mysql jdbc driver in your
mapPartitions call, update the database & then close off the driver
(sketched below).
A couple of gotchas:
1. A new driver is initiated for each of your partitions
2. If the effect (inserts & updates) is not idempotent, then if your server
crashes, Spark will replay upda
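For illustration, a minimal sketch of that per-partition pattern, with hypothetical JDBC URL, credentials and table, and rdd standing for whatever RDD of records is being written; one connection is opened per partition, its records are written, then the connection is closed on the executor:

import java.sql.DriverManager

rdd.foreachPartition { records =>
  // hypothetical connection details
  val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "password")
  try {
    val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")  // hypothetical table
    records.foreach { r =>
      stmt.setString(1, r.toString)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()  // close on the executor, not on the driver
  }
}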
Hi,
I would like to copy log files from s3 to the cluster's
ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
running on the cluster - I'm getting the exception below.
Is there a way to activate it, or is there a spark alternative to distcp?
Thanks,
Tomer
mapreduce.Cluster (Clust
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/.
It shows 1 node hdfs though, although I have 4 slaves on my cluster.
Any idea why?
On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski
wrote:
>
> On 9/7/2014 7:27 AM, Tomer Benyamini wrote:
>>
>> 2. What should I do to increase th
I keep getting the reply below every time I send a message to the Spark user
list. Can this person be taken off the list by the powers that be?
Thanks!
Ognen
Forwarded Message
Subject: DELIVERY FAILURE: Error transferring to
QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. M
On 9/7/2014 7:27 AM, Tomer Benyamini wrote:
2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.
Take a l
Hi,
I would like to make sure I'm not exceeding the quota on the local
cluster's hdfs. I have a couple of questions:
1. How do I know the quota? Here's the output of hadoop fs -count -q
which essentially does not tell me a lot
[root@ip-172-31-7-49 ~]$ hadoop fs -count -q /
2147483647 21474
After looking at the source code of SparkConf.scala, I found the following
solution.
Just set the following Java system property:
-Dspark.master=local
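For illustration, a minimal sketch of the same idea done programmatically; SparkConf (with its default constructor) picks up any JVM system property prefixed with "spark.", so this mirrors passing -Dspark.master=local on the command line:

System.setProperty("spark.master", "local")
val conf = new org.apache.spark.SparkConf()   // loads spark.* system properties by default
val sc = new org.apache.spark.SparkContext(conf)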
Shing
On Monday, 1 September 2014, 22:09, Shing Hing Man
wrote:
Hi,
I have noticed that the GroupByTest example in
https://github.c
Hi all,
I am implementing a crawler/scraper. It should be able to process requests
for crawling & scraping within a few seconds of submitting the
job (around 1 mil/sec); for the rest, I can take some time (scheduled evenly all
over the day). What is the best way to implement this?
Thanks.
--
Vi