Re: How to list all registered tables in a sql context?

2014-09-07 Thread Jianshi Huang
Thanks Tobias, I also found this: https://issues.apache.org/jira/browse/SPARK-3299 Looks like it's being worked on. Jianshi On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer wrote: > Hi, > > On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang > wrote: > >> Err... there's no such feature? >> > > The

Re: error: type mismatch while Union

2014-09-07 Thread Dhimant
Thank you Aaron for pointing out the problem. This only happens when I run this code in spark-shell, but not when I submit the job. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13677.html Sent from the Apache Spark User

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-07 Thread Xiangrui Meng
There is an undocumented configuration to put the user's jars in front of the Spark jar. But I'm not very certain that it works as expected (and this is why it is undocumented). Please try turning on spark.yarn.user.classpath.first. -Xiangrui On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen wrote: > I
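
A minimal sketch of trying that flag, assuming a Spark 1.x SparkConf; whether the setting actually behaves as expected on YARN is exactly the open question above, and the app name here is illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Turn on the undocumented flag Xiangrui mentions.
    val conf = new SparkConf()
      .setAppName("UserJarsFirst")
      .set("spark.yarn.user.classpath.first", "true")
    val sc = new SparkContext(conf)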

Re: Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread Xiangrui Meng
You can try LinearRegression with sparse input. It converges to the least-squares solution if the linear system is over-determined, while the convergence rate depends on the condition number. Applying standard scaling is a popular heuristic to reduce the condition number. If you are interested in spars
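
A rough sketch of the suggestion, assuming MLlib's LinearRegressionWithSGD (the concrete linear-regression class in the 1.x API); the tiny system below is hypothetical, with each LabeledPoint holding one sparse row of A as features and the matching entry of b as the label:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Hypothetical over-determined system Ax = b, one LabeledPoint per row:
    // label = b_i, features = the (sparse) i-th row of A.
    val rows = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.sparse(2, Array(0), Array(2.0))),
      LabeledPoint(2.0, Vectors.sparse(2, Array(1), Array(4.0))),
      LabeledPoint(3.0, Vectors.sparse(2, Array(0, 1), Array(2.0, 4.0)))
    ))
    // SGD iterates toward the least-squares solution; the iteration count
    // (and step size) may need tuning when the condition number is poor.
    val model = LinearRegressionWithSGD.train(rows, 100)
    println(model.weights) // approximate solution vector x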

Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread durin
Doing a quick Google search, it appears to me that there are a number of people who have implemented algorithms for solving systems of (sparse) linear equations on Hadoop MapReduce. However, I can find no such thing for Spark. Does anyone have information on whether there are attempts of creating such an

Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos was the first one supported (long ago), standalone is the simplest and most popular today, and YARN is newer but growing a lot in activity. SIMR is not used as much... it was designed mostly for environments where users had a

Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Otis Gospodnetic
Hi, I'm trying to determine which Spark deployment models are the most popular - Standalone, YARN, Mesos, or SIMR. Anyone know? I thought I'd use search-hadoop.com to help me figure this out, and this is what I found: 1) Standalone http://search-hadoop.com/?q=standalone&fc_project=Spark&fc_typ

Re: How to list all registered tables in a sql context?

2014-09-07 Thread Tobias Pfeiffer
Hi, On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang wrote: > Err... there's no such feature? > The problem is that the SQLContext's `catalog` member is protected, so you can't access it from outside. If you subclass SQLContext, and make sure that `catalog` is always a `SimpleCatalog`, you can che
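
A sketch of that idea, assuming a Spark 1.0/1.1-era SQLContext whose default `catalog` is a `SimpleCatalog` holding its registrations in a `tables` map; this is internal Catalyst API, so the field may differ in other releases:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.catalyst.analysis.SimpleCatalog

    // Subclass so we can reach the protected `catalog` member; we assume it
    // is the default SimpleCatalog, whose `tables` map holds registrations.
    class InspectableSQLContext(sc: SparkContext) extends SQLContext(sc) {
      def tableNames: Seq[String] =
        catalog.asInstanceOf[SimpleCatalog].tables.keys.toSeq
    }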

Re: Recursion

2014-09-07 Thread Tobias Pfeiffer
Hi, On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan wrote: > > Does Spark support recursive calls? > Can you give an example of which kind of recursion you would like to use? Tobias

Spark groupByKey partition out of memory

2014-09-07 Thread julyfire
When a MappedRDD is handled by a groupByKey transformation, tuples with the same key distributed across different worker nodes will be collected onto one worker node, say, (K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)). I want to know whether the value Seq(V1, V2, ..., Vn) of a tuple i
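
A minimal illustration of the shuffle being described; the values for each key are materialized as one in-memory collection on a single worker (an Iterable in the 1.x API rather than a literal Seq):

    // All values for a given key land in one collection on one executor.
    val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3), ("j", 4)))
    val grouped = pairs.groupByKey() // RDD[(String, Iterable[Int])]
    grouped.collect().foreach { case (k, vs) => println(k + " -> " + vs.toList) }
    // A key with very many values can exhaust a worker's memory here;
    // reduceByKey/aggregateByKey avoid materializing the whole group.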

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Soumitra Kumar
I have the following code:

    stream foreachRDD { rdd =>
      if (rdd.take(1).size == 1) {
        rdd foreachPartition { iterator =>
          val db = initDbConnection() // open one connection per partition
          iterator foreach { record =>
            // write record to db
          }
          db.close() // close it off once the partition is written
        }
      }
    }

Re: Spark SQL check if query is completed (pyspark)

2014-09-07 Thread Michael Armbrust
Sometimes the underlying Hive code will also print exceptions during successful execution (for example CREATE TABLE IF NOT EXISTS). If there is actually a problem, Spark SQL should throw an exception. What is the command you are running and what is the error you are seeing? On Sat, Sep 6, 2014 a

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Sean Owen
... I'd call out that last bit as actually tricky: "close off the driver". See this message for the right-est way to do that, along with the right way to open DB connections remotely instead of trying to serialize them: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQ

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Sean Owen
Also keep in mind there is a non-trivial amount of traffic between the driver and the cluster. Running the driver this remotely is not something I would do by default, but with enough ports open it should work. On Sun, Sep 7, 2014 at 7:05 PM, Ognen Duzlevski wrote: > Horacio, > > Thanks, I have n

Re: Low Level Kafka Consumer for Spark

2014-09-07 Thread Dibyendu Bhattacharya
Hi Tathagata, I have managed to implement the logic in the Kafka-Spark consumer to recover from driver failure. This is just an interim fix till the actual fix is done on the Spark side. The logic is something like this. 1. When the individual receivers start for every topic partition, each writes the

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh. On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini wrote: > Hi, > > I would like to copy log files from s3 to the cluster's > ephemeral-hdfs. I tried to use distcp, but I guess mapred is not > run

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Nicholas Chammas
I think you need to run start-all.sh or something similar on the EC2 cluster. MR is installed but is not running by default on EC2 clusters spun up by spark-ec2. On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini wrote: > I've installed a spark standalone cluster on ec2 as defined here - > https

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Ognen Duzlevski
Horacio, Thanks, I have not tried that; however, I am not after security right now - I am just wondering why something so obvious won't work ;) Ognen On 9/7/2014 12:38 PM, Horacio G. de Oro wrote: Have you tried with ssh? It will be much more secure (only 1 port open), and you'll be able to run

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Horacio G. de Oro
Have you tried with ssh? It will be much more secure (only 1 port open), and you'll be able to run spark-shell over the network. I'm using that approach in my project (https://github.com/data-tsunami/smoke) with good results. I can't try it now, but something like this should work: ssh -tt ec2-user@YOU

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Ognen Duzlevski
Have you actually tested this? I have two instances, one is standalone master and the other one just has spark installed, same versions of spark (1.0.0). The security group on the master allows all (0-65535) TCP and UDP traffic from the other machine and the other machine allows all TCP/UDP

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
I've installed a spark standalone cluster on ec2 as defined here - https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if mr1/2 is part of this installation. On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin wrote: > Distcp requires a mr1(or mr2) cluster to start. Do you have a mapreduc

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Ye Xianjin
DistCp requires an MR1 (or MR2) cluster to run. Do you have a MapReduce cluster on your HDFS? And from the error message, it seems that you didn't specify your jobtracker address. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Sunday, September 7, 2014 at 9:42 PM, T

Re: how to choose right DStream batch interval

2014-09-07 Thread Mayur Rustagi
Spark will simply have a backlog of tasks; it'll manage to process them nonetheless, though if it keeps falling behind you may run out of memory or see unreasonable latency. For momentary spikes, Spark Streaming will manage. Mostly, if you are looking to do 100% processing, you'll have to go with

Re: Array and RDDs

2014-09-07 Thread Mayur Rustagi
Your question is a bit confusing... I assume you have an RDD containing nodes & some metadata (child nodes maybe) & you are trying to attach another piece of metadata to it (byte array). If it's just the same byte array for all nodes, you can generate an RDD with the count of nodes & zip the two RDDs together (see the sketch below); you can
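
A sketch of the zip idea, assuming the metadata really is the same byte array for every node; note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which parallelizing two equal-length sequences with the same slice count provides:

    val nodes = sc.parallelize(Seq("n1", "n2", "n3", "n4"), 2)
    val meta = Array[Byte](1, 2, 3) // hypothetical shared metadata
    val metas = sc.parallelize(Seq.fill(4)(meta), 2) // same length & slice count
    val tagged = nodes.zip(metas) // RDD[(String, Array[Byte])]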

Re: Q: About scenarios where driver execution flow may block...

2014-09-07 Thread Mayur Rustagi
Statements are executed only when you try to cause some effect on the server (produce data, collect data on the driver). At the time of execution, Spark does all the dependency resolution & truncates paths that don't go anywhere, as well as optimizing execution pipelines. So you really don't have to worry about t
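
A small sketch of that laziness; the path is hypothetical, and nothing touches the cluster until the action on the last line:

    val lines = sc.textFile("input.txt") // transformation: recorded, not run
    val lengths = lines.map(_.length)    // still only building the lineage
    val total = lengths.reduce(_ + _)    // action: the whole pipeline runs here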

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Mayur Rustagi
The standard pattern is to initialize the MySQL JDBC driver in your mapPartitions call, update the database & then close off the driver. A couple of gotchas: 1. A new driver is initialized for each of your partitions. 2. If the effect (inserts & updates) is not idempotent and your server crashes, Spark will replay upda

distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer mapreduce.Cluster (Clust

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Thanks! I found the HDFS UI via this port - http://[master-ip]:50070/. It shows a 1-node HDFS though, although I have 4 slaves in my cluster. Any idea why? On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski wrote: > > On 9/7/2014 7:27 AM, Tomer Benyamini wrote: >> >> 2. What should I do to increase th

Fwd: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-07 Thread Ognen Duzlevski
I keep getting the reply below every time I send a message to the Spark user list. Can this person be taken off the list by the powers that be? Thanks! Ognen Forwarded Message Subject: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. M

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Ognen Duzlevski
On 9/7/2014 7:27 AM, Tomer Benyamini wrote: 2. What should I do to increase the quota? Should I bring down the existing slaves and upgrade to ones with more storage? Is there a way to add disks to existing slaves? I'm using the default m1.large slaves set up using the spark-ec2 script. Take a l

Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Hi, I would like to make sure I'm not exceeding the quota on the local cluster's HDFS. I have a couple of questions: 1. How do I know the quota? Here's the output of hadoop fs -count -q, which essentially does not tell me a lot: root@ip-172-31-7-49 ~]$ hadoop fs -count -q / 2147483647 21474

Re: Spark 1.0.2 Can GroupByTest example be run in Eclipse without change

2014-09-07 Thread Shing Hing Man
After looking at the source code of SparkConf.scala, I found the following solution. Just set the following Java system property: -Dspark.master=local Shing On Monday, 1 September 2014, 22:09, Shing Hing Man wrote: Hi, I have noticed that the GroupByTest example in https://github.c
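
The same property can also be set programmatically before the context is created, since a 1.x SparkConf picks up spark.* system properties by default; a minimal sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent in effect to -Dspark.master=local in the Eclipse run config.
    System.setProperty("spark.master", "local")
    val sc = new SparkContext(new SparkConf().setAppName("GroupByTest"))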

Crawler and Scraper with different priorities

2014-09-07 Thread Sandeep Singh
Hi all, I am implementing a crawler and scraper. It should be able to process requests for crawling & scraping within a few seconds of submitting the job (around 1mil/sec); the rest can take some time (scheduled evenly over the day). What is the best way to implement this? Thanks. -- Vi