Re: SparkSQL HiveContext No Suitable Driver / Cannot Find Driver

2014-08-29 Thread Denny Lee
Oh, I forgot to add the managed libraries and the Hive libraries to the CLASSPATH.  As soon as I did that, we’re good to go now. On August 29, 2014 at 22:55:47, Denny Lee (denny.g@gmail.com) wrote: My issue is similar to the issue as noted  http://mail-archives.apache.org/mod_mbox/incuba
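
For anyone hitting the same "no suitable driver" error, a hedged sketch of the kind of classpath fix described above — the jar paths are placeholders, not taken from the thread; the config keys are the standard Spark 1.x extraClassPath settings for making the metastore JDBC driver and Hive jars visible:

    # spark-defaults.conf (paths are hypothetical; append the Hive lib jars the same way)
    spark.driver.extraClassPath    /opt/jars/mysql-connector-java.jar
    spark.executor.extraClassPath  /opt/jars/mysql-connector-java.jar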

SparkSQL HiveContext No Suitable Driver / Cannot Find Driver

2014-08-29 Thread Denny Lee
My issue is similar to the issue as noted  http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccadoad2ks9_qgeign5-w7xogmrotrlbchvfukctgstj5qp9q...@mail.gmail.com%3E. Currently using Spark-1.1 (grabbed from git two days ago) and using Hive 0.12 with my metastore in MySQL.

Re: Too many open files

2014-08-29 Thread Ye Xianjin
Oops, the last reply didn't go to the user list. Mail app's fault. Shuffling happens in the cluster, so you need to change all the nodes in the cluster. Sent from my iPhone > On Aug 30, 2014, at 3:10, Sudha Krishna wrote: > > Hi, > > Thanks for your response. Do you know if I need to change this
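
Not from the thread itself, but a related knob often suggested for this error in Spark 1.x, alongside raising the OS open-files limit on every node — a hedged sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-heavy-job")
      .set("spark.shuffle.consolidateFiles", "true") // fewer shuffle files held open per executor
    val sc = new SparkContext(conf)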

What does "appMasterRpcPort: -1" indicate ?

2014-08-29 Thread Tao Xiao
I'm using CDH 5.1.0, which bundles Spark 1.0.0 with it. Following "How-to: Run a Simple Apache Spark App in CDH 5", I tried to submit my job in local mode, Spark Standalone mode and YARN mode. I successfully subm

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-29 Thread Tathagata Das
Can you try adding the JAR to the class path of the executors directly, by setting the config "spark.executor.extraClassPath" in the SparkConf. See Configuration page - http://spark.apache.org/docs/latest/configuration.html#runtime-environment I think what you guessed is correct. The Akka actor sy
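
A minimal sketch of the suggestion above; the application name and jar path are illustrative only:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("akka-receiver-app")
      .set("spark.executor.extraClassPath", "/path/to/jar-with-message-classes.jar") // jar holding the Akka message classes
    val ssc = new StreamingContext(conf, Seconds(10))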

RE: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-29 Thread Anton Brazhnyk
Just checked it with 1.0.2 Still same exception. From: Anton Brazhnyk [mailto:anton.brazh...@genesys.com] Sent: Wednesday, August 27, 2014 6:46 PM To: Tathagata Das Cc: user@spark.apache.org Subject: RE: [Streaming] Akka-based receiver with messages defined in uploaded jar Sorry for the delay wi

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
Ok, so I did this: val kInStreams = (1 to 10).map{_ => KafkaUtils.createStream(ssc,"zkhost1:2181/zk_kafka","testApp",Map("rawunstruct" -> 1)) } val kInMsg = ssc.union(kInStreams) val outdata = kInMsg.map(x=>normalizeLog(x._2,configMap)) This has improved parallelism. Earlier I would only get a "St

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
Any more thoughts on this? I'm not sure how to do this yet. On Fri, Aug 29, 2014 at 12:10 PM, Victor Tso-Guillen wrote: > Standalone. I'd love to tell it that my one executor can simultaneously > serve, say, 16 tasks at once for an arbitrary number of distinct jobs. > > > On Fri, Aug 29, 2014 a

Re: Spark Streaming reset state

2014-08-29 Thread Christophe Sebastien
You can use a tuple associating a timestamp with your running sum, and have COMPUTE_RUNNING_SUM reset the running sum to zero when the timestamp is more than 5 minutes old. You'll still have a leak doing so if your keys keep changing, though. --Christophe 2014-08-29 9:00 GMT-07:00 Eko Susilo :
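
A hedged Scala sketch of the idea (names and types are illustrative, not from the thread): keep (sum, lastReset) as the state and zero the sum once the timestamp is more than five minutes old.

    val resetIntervalMs = 5 * 60 * 1000L

    def computeRunningSum(values: Seq[Long], state: Option[(Long, Long)]): Option[(Long, Long)] = {
      val now = System.currentTimeMillis()
      val (prevSum, lastReset) = state.getOrElse((0L, now))
      val expired = now - lastReset > resetIntervalMs
      val baseSum = if (expired) 0L else prevSum        // drop the old sum once the window has passed
      Some((baseSum + values.sum, if (expired) now else lastReset))
    }

    // counts is assumed to be a DStream[(String, Long)] with checkpointing enabled
    val runningSums = counts.updateStateByKey(computeRunningSum _)

As noted above, this sketch never drops state for keys that stop arriving; returning None for long-expired entries would be needed to avoid that leak.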

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
I create my DStream very simply as: val kInMsg = KafkaUtils.createStream(ssc,"zkhost1:2181/zk_kafka","testApp",Map("rawunstruct" -> 8)) . . eventually, before I operate on the DStream, I repartition it: kInMsg.repartition(512) Are you saying that ^^ repartition doesn't split my dstream into multip
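
One thing worth double-checking here — a guess, not a confirmed diagnosis from the thread: DStream.repartition returns a new DStream rather than changing kInMsg in place, so the returned stream is the one that has to be transformed (normalizeLog and configMap are from the earlier message):

    val repartitioned = kInMsg.repartition(512)
    val outdata = repartitioned.map(x => normalizeLog(x._2, configMap))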

What is the better data structure in an RDD

2014-08-29 Thread cjwang
I need some advice regarding how data are stored in an RDD. I have millions of records, called "Measures". They are bucketed with keys of String type. I wonder if I need to store them as RDD[(String, Measure)] or RDD[(String, Iterable[Measure])], and why? Data in each bucket are not related mo

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
This feature was not part of that version. It will be in 1.1. On Fri, Aug 29, 2014 at 12:33 PM, Jaonary Rabarisoa wrote: > > 1.0.2 > > > On Friday, August 29, 2014, Michael Armbrust > wrote: > >> What version are you using? >> >> >> >> On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa >> wr

Re: [Spark Streaming] kafka consumer announce

2014-08-29 Thread Evgeniy Shishkin
TD, can you please comment on this code? I am really interested in including this code in Spark. But I have some concerns about persistence: 1. When we extend Receiver and call store, is it a blocking call? Does it return only when Spark stores the RDD as requested (i.e. replicated or on

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
Good to see I am not the only one who cannot get incoming Dstreams to repartition. I tried repartition(512) but still no luck - the app stubbornly runs only on two nodes. Now this is 1.0.0 but looking at release notes for 1.0.1 and 1.0.2, I don't see anything that says this was an issue and has bee

[PySpark] large # of partitions causes OOM

2014-08-29 Thread Nick Chammas
Here’s a repro for PySpark: a = sc.parallelize(["Nick", "John", "Bob"]) a = a.repartition(24000) a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is what I get: >>> a = sc.parallelize(["Nick", "John", "Bob"])>>

Re: Anyone know hot to submit spark job to yarn in java code?

2014-08-29 Thread Archit Thakur
Hi, I am facing the same problem. Did you find any solution or work around? Thanks and Regards, Archit Thakur. On Thu, Jan 16, 2014 at 6:22 AM, Liu, Raymond wrote: > Hi > > Regarding your question > > 1) when I run the above script, which jar is beed submitted to the yarn > server ? > > What

Re: SparkSql is slow over yarn

2014-08-29 Thread Nishkam Ravi
Can you share more details about your job, cluster properties and configuration parameters? Thanks, Nishkam On Fri, Aug 29, 2014 at 11:33 AM, Chirag Aggarwal < chirag.aggar...@guavus.com> wrote: > When I run SparkSql over yarn, it runs 2-4 times slower as compared to > when its run in local mo

Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-08-29 Thread Aris
Hi folks, I am trying to use Kafka with Spark Streaming, and it appears I cannot do the normal 'sbt package' as I do with other Spark applications, such as Spark alone or Spark with MLlib. I learned I have to build with the sbt-assembly plugin. OK, so here is my build.sbt file for my extremely si
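
For reference, a hedged build.sbt sketch for this kind of setup (versions and names are assumptions, and the sbt-assembly plugin is assumed to be declared in project/plugins.sbt): marking spark-core and spark-streaming as "provided" keeps them out of the assembly, which is the usual way to keep 'sbt assembly' times and jar sizes down, while spark-streaming-kafka still has to be bundled.

    name := "kafka-streaming-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.2" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.2" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.2"   // must be in the assembly
    )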

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
Crash again. On the driver, logs say: 14/08/29 19:04:55 INFO BlockManagerMaster: Removed 7 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6383 on host node-dn1-2-acme.com failed for unknown reason

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
1.0.2 On Friday, August 29, 2014, Michael Armbrust wrote: > What version are you using? > > > > On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa > wrote: > >> Still not working for me. I got a compilation error : *value in is not a >> member of Symbol.* Any ideas ? >> >> >> On Fri, Aug 29, 2

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
Standalone. I'd love to tell it that my one executor can simultaneously serve, say, 16 tasks at once for an arbitrary number of distinct jobs. On Fri, Aug 29, 2014 at 11:29 AM, Matei Zaharia wrote: > Yes, executors run one task per core of your machine by default. You can > also manually launch

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
I wrote a long post about how I arrived here but in a nutshell I don't see evidence of re-partitioning and workload distribution across the cluster. My new fangled way of starting the job is: run=`date +"%m-%d-%YT%T"`; \ nohup spark-submit --class logStreamNormalizer \ --master yarn log-stream-nor

SparkSql is slow over yarn

2014-08-29 Thread Chirag Aggarwal
When I run SparkSql over yarn, it runs 2-4 times slower as compared to when it's run in local mode. Please note that I have a four node yarn setup. Has anyone else witnessed the same?

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Matei Zaharia
Yes, executors run one task per core of your machine by default. You can also manually launch them with more worker threads than you have cores. What cluster manager are you on? Matei On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote: I'm thinking of local mode wher
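
As a concrete illustration of the "more worker threads than you have cores" option in standalone mode (the value 16 is just an example):

    # conf/spark-env.sh on each standalone worker
    export SPARK_WORKER_CORES=16   # advertise 16 task slots regardless of the physical core count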

Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Victor Tso-Guillen
I'm thinking of local mode where multiple virtual executors occupy the same vm. Can we have the same configuration in spark standalone cluster mode?

Re: Too many open files

2014-08-29 Thread SK
Hi, I am having the same problem reported by Michael. I am trying to open 30 files. ulimit -n shows the limit is 1024. So I am not sure why the program is failing with "Too many open files" error. The total size of all the 30 files is 230 GB. I am running the job on a cluster with 10 nodes, eac

Problem Accessing Hive Table from hiveContext

2014-08-29 Thread Zitser, Igor
Hi All, New to spark and using Spark 1.0.2 and hive 0.12. If a hive table is created as test_datatypes(testbigint bigint, ss bigint ), "select * from test_datatypes" from spark works fine. For "create table test_datatypes(testbigint bigint, testdec decimal(5,2) )" scala> val dataTypes=hiveContext.

Re: Change delimiter when collecting SchemaRDD

2014-08-29 Thread yadid ayzenberg
Thanks Michael, that makes total sense. It works perfectly. Yadid On Thu, Aug 28, 2014 at 9:19 PM, Michael Armbrust wrote: > The comma is just the way the default toString works for Row objects. > Since SchemaRDDs are also RDDs, you can do arbitrary transformations on > the Row objects that a
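
A hedged sketch of the transformation being discussed, assuming an existing SQLContext or HiveContext named sqlContext and a registered table (the table name and delimiter are arbitrary); in Spark 1.x a Row behaves like a Seq of its column values, so mkString works:

    val results = sqlContext.sql("SELECT * FROM some_table")     // a SchemaRDD, i.e. an RDD[Row]
    val tabSeparated = results.map(row => row.mkString("\t"))    // join fields with a custom delimiter
    tabSeparated.collect().foreach(println)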

Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread Michael Armbrust
Spark SQL is based on Hive 12. They must have changed the maximum key size between 12 and 13. On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > > Tried the same thing in HIVE directly without issue: > > HIVE: > hive> create table test_datatyp

Announce: Smoke - a web frontend to Spark

2014-08-29 Thread Horacio G. de Oro
Hi everyone! I've been working on Smoke, a web frontend to interactively launch Spark jobs without compiling them (it only supports Scala right now, and launches the jobs in yarn-client mode). It works by executing the Scala script using "spark-shell" on the Spark server. It's developed in Python, uses Ce

RE: Q on downloading spark for standalone cluster

2014-08-29 Thread Sagar, Sanjeev
Hello Sparkies ! Could anyone please answer this? This is not a Hadoop cluster, so which download option should I use to download for a standalone cluster? Also, what are the best practices if you have 1TB of data and want to use Spark? Do you have to use Hadoop/CDH or some other option? Apprecia

Re: Spark webUI - application details page

2014-08-29 Thread Sudha Krishna
I specified as follows: spark.eventLog.dir /mapr/spark_io We use mapr fs for sharing files. I did not provide an ip address or port number - just the directory name on the shared filesystem. On Aug 29, 2014 8:28 AM, "Brad Miller" wrote: > How did you specify the HDFS path? When i put > > spark

Re: Spark Streaming reset state

2014-08-29 Thread Eko Susilo
so "codes" currently holds RDDs containing codes and their respective counters. I would like to find a way to reset those RDDs after some period of time. On Fri, Aug 29, 2014 at 5:55 PM, Sean Owen wrote: > "codes" is a DStream, not an RDD. The remember() method controls how > long Spark Stream

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Jonathan Hodges
'this 2-node replication is mainly for failover in case the receiver dies while data is in flight. there's still chance for data loss as there's no write ahead log on the hot path, but this is being addressed.' Can you comment a little on how this will be addressed, will there be a durable WAL?

Re: Spark Streaming reset state

2014-08-29 Thread Sean Owen
"codes" is a DStream, not an RDD. The remember() method controls how long Spark Streaming holds on to the RDDs itself. Clarify what you mean by "reset"? codes provides a stream of RDDs that contain your computation over a window of time. New RDDs come with the computation over new data. On Fri, Au

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
What version are you using? On Fri, Aug 29, 2014 at 2:22 AM, Jaonary Rabarisoa wrote: > Still not working for me. I got a compilation error : *value in is not a > member of Symbol.* Any ideas ? > > > On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust > wrote: > >> To pass a list to a variadic f

/tmp/spark-events permissions problem

2014-08-29 Thread Brad Miller
Hi All, Yesterday I restarted my cluster, which had the effect of clearing /tmp. When I brought Spark back up and ran my first job, /tmp/spark-events was re-created and the job ran fine. I later learned that other users were receiving errors when trying to create a spark context. It turned out

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread bharatvenkat
Chris, I did the Dstream.repartition mentioned in the document on parallelism in receiving, as well as set "spark.default.parallelism" and it still uses only 2 nodes in my cluster. I notice there is another email thread on the same topic: http://apache-spark-user-list.1001560.n3.nabble.com/DStre

Spark Streaming reset state

2014-08-29 Thread Eko Susilo
Hi all, I would like to ask some advice about resetting a Spark stateful operation. So I tried it like this: JavaStreamingContext jssc = new JavaStreamingContext(context, new Duration(5000)); jssc.remember(new Duration(5*60*1000)); jssc.checkpoint(ApplicationConstants.HDFS_STREAM_DIRECTORIES); JavaPairRec

Re: Spark webUI - application details page

2014-08-29 Thread Brad Miller
How did you specify the HDFS path? When i put spark.eventLog.dir hdfs:// crosby.research.intel-research.net:54310/tmp/spark-events in my spark-defaults.conf file, I receive the following error: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.io.IOEx

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-29 Thread Yana Kadiyska
I understand that the DB writes are happening from the workers unless you collect. My confusion is that you believe workers recompute on recovery ("nodes computations which get redone upon recovery"). My understanding is that checkpointing dumps the RDD to disk and then cuts the RDD lineage. So I th

Re: how can I get the number of cores

2014-08-29 Thread Nicholas Chammas
What version of Spark are you running? Try calling sc.defaultParallelism. I’ve found that it is typically set to the number of worker cores in your cluster. On Fri, Aug 29, 2014 at 3:39 AM, Kevin Jung wrote: > Hi all > Spark web ui gives me the information about total cores and used cores. >
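
A minimal sketch of the suggestion (assuming an existing SparkContext named sc):

    val approxTotalCores = sc.defaultParallelism   // typically the total worker cores across the cluster
    println(s"approximate total cores: $approxTotalCores")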

Re: Where to save intermediate results?

2014-08-29 Thread huylv
Hi Daniel, Your suggestion is definitely an interesting approach. In fact, I already have another system to deal with the stream analytical processing part. So basically, the Spark job to aggregate data just accumulatively computes aggregations from historical data together with new batch, which h

Re: Running Spark On Yarn without Spark-Submit

2014-08-29 Thread Chester @work
Archit, we are using yarn-cluster mode and calling Spark via the Client class directly from the servlet server. It works fine. As for establishing a communication channel for further requests, it should be possible with the yarn client, but not with the yarn server. In yarn-client mode, the spark driver i

Re: how to filter value in spark

2014-08-29 Thread marylucy
I see it works well, thank you!!! But in the following situation, how do I do it: var a = sc.textFile("/sparktest/1/").map((_,"a")) var b = sc.textFile("/sparktest/2/").map((_,"b")) How to get (3,"a") and (4,"a") On Aug 28, 2014, at 19:54, "Matthew Farrellee" wrote: > On 08/28/2014 07:20 AM, marylucy wrote: >> f

Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread arthur.hk.c...@gmail.com
Hi, Tried the same thing in HIVE directly without issue: HIVE: hive> create table test_datatype2 (testbigint bigint ); OK Time taken: 0.708 seconds hive> drop table test_datatype2; OK Time taken: 23.272 seconds Then tried again in SPARK: scala> val hiveContext = new org.apache.spark.sql.hive

RE: The concurrent model of spark job/stage/task

2014-08-29 Thread linkpatrickliu
Hi, I think an example will help illustrate the model better. /*** SimpleApp.scala ***/ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object SimpleApp { def main(args: Array[String]) { val logFile = "$YOUR_SPARK_HOME/README.md" val sc = new SparkContext("local

Re: How to debug this error?

2014-08-29 Thread Yanbo Liang
It's not allowed to use an RDD inside a map function. RDDs can only be operated on at the driver of the Spark program. In your case, the group RDD can't be found on every executor. I guess you want to implement a subquery-like operation; try RDD.intersection() or join(). 2014-08-29 12:43 GMT+08:00 Gary Zhao : > Hello
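
A hedged sketch of the two alternatives mentioned, with toy data (the RDD names and contents are illustrative, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x

    val data  = sc.parallelize(Seq("a", "b", "c", "d"))
    val group = sc.parallelize(Seq("b", "d"))

    // subquery-like filtering without referencing an RDD inside map():
    val viaIntersection = data.intersection(group)
    val viaJoin = data.map((_, ())).join(group.map((_, ()))).keys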

Re: Running Spark On Yarn without Spark-Submit

2014-08-29 Thread Archit Thakur
including user@spark.apache.org. On Fri, Aug 29, 2014 at 2:03 PM, Archit Thakur wrote: > Hi, > > My requirement is to run Spark on Yarn without using the script > spark-submit. > > I have a servlet and a tomcat server. As and when request comes, it > creates a new SC and keeps it alive for the

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Jaonary Rabarisoa
Still not working for me. I got a compilation error : *value in is not a member of Symbol.* Any ideas ? On Fri, Aug 29, 2014 at 9:46 AM, Michael Armbrust wrote: > To pass a list to a variadic function you can use the type ascription :_* > > For example: > > val longList = Seq[Expression]("a", "

Re: how to specify columns in groupby

2014-08-29 Thread MEETHU MATHEW
Thank you Yanbo for the reply. I've another query related to cogroup. I want to iterate over the results of the cogroup operation. My code is * grp = RDD1.cogroup(RDD2) * map((lambda (x,y): (x,list(y[0]),list(y[1]))), list(grp)) My result looks like : [((u'764', u'20140826'), [0.

Running Spark On Yarn without Spark-Submit

2014-08-29 Thread Archit Thakur
Hi, My requirement is to run Spark on Yarn without using the script spark-submit. I have a servlet and a tomcat server. As and when a request comes, it creates a new SC and keeps it alive for further requests. I am setting my master in sparkConf as sparkConf.setMaster("yarn-cluster") but the

Ensuring object in spark streaming runs on specific node

2014-08-29 Thread Filip Andrei
Say you have a spark streaming setup such as JavaReceiverInputDStream<...> rndLists = jssc.receiverStream(new JavaRandomReceiver(...)); rndLists.map(new NeuralNetMapper(...)) .foreach(new JavaSyncBarrier(...)); Is there any way of ensuring that, say, a JavaRandomReceiver and Java

Re: u'' notation with pyspark output data

2014-08-29 Thread Davies Liu
u'14.0' means a unicode string; you can convert it into a str by u'14.0'.encode('utf8'), or you can convert it into a float by float(u'14.0') Davies On Thu, Aug 28, 2014 at 11:22 PM, Oleg Ruchovets wrote: > Hi , > I am working with pyspark and doing simple aggregation > > > def doSplit(x): >

Re: Spark SQL : how to find element where a field is in a given set

2014-08-29 Thread Michael Armbrust
To pass a list to a variadic function you can use the type ascription :_* For example: val longList = Seq[Expression]("a", "b", ...) table("src").where('key in (longList: _*)) Also, note that I had to explicitly specify Expression as the type parameter of Seq to ensure that the compiler converts

how can I get the number of cores

2014-08-29 Thread Kevin Jung
Hi all Spark web ui gives me the information about total cores and used cores. I want to get this information programmatically. How can I do this? Thanks Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-can-I-get-the-number-of-cores-tp13111.html Se

Re: Spark / Thrift / ODBC connectivity

2014-08-29 Thread Cheng Lian
You can use the Thrift server to access Hive tables that are located in a legacy Hive warehouse and/or those generated by Spark SQL. Simba provides a Spark SQL ODBC driver that enables applications like Tableau. But right now I'm not 100% sure whether the driver has been officially released yet. On Thu,