Re: Spark caching questions

2014-09-10 Thread Mayur Rustagi
Cached RDDs do not survive SparkContext deletion (they are scoped on a per-SparkContext basis). I am not sure what you mean by disk-based cache eviction; if you cache more RDDs than you have disk space, the result will not be very pretty :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-10 Thread Mayur Rustagi
I think she is checking for blanks? But if the RDD is blank then nothing will happen, no db connections etc. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Sep 8, 2014 at 1:32 PM, Tobias Pfeiffer wrote: > Hi, > > O

Re: RDD memory questions

2014-09-10 Thread Boxian Dong
Thank you very much for your kind help. I have some more questions: - If the RDD is stored in serialized format, does that mean that whenever the RDD is processed, it will be unpacked and packed again from and back to the JVM, even if they are located on the same machine?

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Luis Ángel Vicente Sánchez
I somehow missed that parameter when I was reviewing the documentation, that should do the trick! Thank you! 2014-09-10 2:10 GMT+01:00 Shao, Saisai : > Hi Luis, > > > > The parameter “spark.cleaner.ttl” and “spark.streaming.unpersist” can be > used to remove useless timeout streaming data, the d
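A minimal sketch of how these two settings would be applied in a streaming application; the values and app name are placeholders, not taken from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.cleaner.ttl", "3600")          // seconds after which old metadata/RDDs may be cleaned
      .set("spark.streaming.unpersist", "true")  // let Spark Streaming unpersist generated RDDs
    val ssc = new StreamingContext(conf, Seconds(10))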

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, I am using spark 1.0.0. The bug is fixed by 1.0.1. Hao -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Spark SQL -- more than two tables for join

2014-09-10 Thread boyingk...@163.com
Hi: I have a question about Spark SQL. First, I use a left join on two tables, like this: sql("SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey)").collect().foreach(println) The result is what I expect. But when I use a left join on three tables or more, like this:

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Great. And you should ask questions on the user@spark.apache.org mailing list. I believe many people don't subscribe to the incubator mailing list now. -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote: > Hi, > > I am using s

Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread Thorsten Bergler
Hello, I am writing a Spark App which is already working so far. Now I have started to build some unit tests as well, but I am running into some dependency problems and I cannot find a solution right now. Perhaps someone could help me. I build my Spark project with SBT and it seems to be configured wel

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, What is your SBT command and the parameters? Arthur On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler wrote: > Hello, > > I am writing a Spark App which is already working so far. > Now I started to build also some UnitTests, but I am running into some > dependecy problems and I cannot fin

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread Thorsten Bergler
Hi, I just called: > test or > run Thorsten Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com: Hi, What is your SBT command and the parameters? Arthur On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler wrote: Hello, I am writing a Spark App which is already working so far. Now I s
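For reference, a minimal build.sbt along the lines this thread is discussing; the versions and the "provided" scope are assumptions, not Thorsten's actual build:

    name := "spark-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.2" % "provided",
      "org.scalatest"    %% "scalatest"  % "2.2.1" % "test"
    )

With Spark marked as "provided" the packaged jar stays small and sbt test can still compile and run the suites, but sbt run may then fail to find the Spark classes, which is a common source of confusion with this setup.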

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Ah, thank you. I did not notice that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13871.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

nested rdd operation

2014-09-10 Thread Pavlos Katsogridakis
Hi, I have a question about Spark. This program in spark-shell: val filerdd = sc.textFile("NOTICE",2) val maprdd = filerdd.map( word => filerdd.map( word2 => (word2+word) ) ) maprdd.collect() throws a NullPointerException. Can somebody explain why I cannot have a nested RDD operation? --pavlos

JavaPairRDD to JavaPairRDD based on key

2014-09-10 Thread Tom
Is it possible to generate a JavaPairRDD from a JavaPairRDD, where I can also use the key values? I have looked at for instance mapToPair, but this generates a new K/V pair based on the original value, and does not give me information about the key. I need this in the initialization phase, where I

Re: groupBy gives non deterministic results

2014-09-10 Thread redocpot
Hi, Xianjin I checked user@spark.apache.org, and found my post there: http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/browser I am using nabble to send this mail, which indicates that the mail will be sent from my email address to the u...@spark.incubator.apache.org mailing list.

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arunshell87
Hi, I too had tried SQL queries with joins, MINUS, subqueries etc., but they did not work in Spark SQL. I did not find any documentation on what queries work and what do not work in Spark SQL; maybe we have to wait for the Spark book to be released in Feb 2015. I believe you can try HiveQL in

Re: nested rdd operation

2014-09-10 Thread Sean Owen
You can't use an RDD inside an operation on an RDD. Here you have filerdd in your map function. It sort of looks like you want a cartesian product of the RDD with itself, so look at the cartesian() method. It may not be a good idea to compute such a thing. On Wed, Sep 10, 2014 at 1:57 PM, Pavlos K
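A small spark-shell sketch of the cartesian() alternative Sean points to; "NOTICE" and the concatenation come from Pavlos' snippet, the rest is assumed:

    val filerdd = sc.textFile("NOTICE", 2)
    val pairs = filerdd.cartesian(filerdd)                      // every (word, word2) combination
    val combined = pairs.map { case (word, word2) => word2 + word }
    combined.collect()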

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
> Do the two mailing lists share messages? I don't think so. I didn't receive this message from the user list. I am not in Databricks, so I can't answer your other questions. Maybe Davies Liu can answer you? -- Ye Xianjin Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday

Re: JavaPairRDD to JavaPairRDD based on key

2014-09-10 Thread Sean Owen
So, each key-value pair gets a new value for the original key? you want mapValues(). On Wed, Sep 10, 2014 at 2:01 PM, Tom wrote: > Is it possible to generate a JavaPairRDD from a > JavaPairRDD, where I can also use the key values? I have > looked at for instance mapToPair, but this generates a ne
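A spark-shell sketch of the mapValues() suggestion (the question was about JavaPairRDD, whose mapValues behaves the same way); the data and names are made up:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val doubled = pairs.mapValues(_ * 2)                        // keys preserved: ("a", 2), ("b", 4)
    // if the new value must also depend on the key, map over the whole pair instead:
    val keyed = pairs.map { case (k, v) => (k, k.length + v) }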

Global Variables in Spark Streaming

2014-09-10 Thread Ravi Sharma
Hi Friends, I'm using Spark Streaming as a Kafka consumer. I want to do CEP with Spark, so for that I need to store my sequence of events so that I can detect some pattern. My question is: how can I save my events in a Java collection temporarily, so that I can detect a pattern by *processed(t

Re: Global Variables in Spark Streaming

2014-09-10 Thread Akhil Das
Have a look at Broadcasting variables http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables Thanks Best Regards On Wed, Sep 10, 2014 at 7:25 PM, Ravi Sharma wrote: > Hi Friends, > > I'm using spark streaming as for kafka consumer. I want to do the CEP by > spark. So as
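A minimal spark-shell sketch of a broadcast variable as described in the linked programming guide; the lookup map is made up, and note that broadcast values are read-only once created:

    val lookup = sc.broadcast(Map("INFO" -> 1, "WARN" -> 2))    // shipped once to each executor
    val levels = sc.parallelize(Seq("INFO", "WARN", "INFO"))
    val codes = levels.map(level => lookup.value.getOrElse(level, 0))
    codes.collect()                                             // Array(1, 2, 1)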

How to scale more consumer to Kafka stream

2014-09-10 Thread richiesgr
Hi (my previous post has been used by someone else). I'm building an application that reads events from a Kafka stream. In production we have 5 consumers that share 10 partitions. But in Spark Streaming's Kafka support only 1 worker acts as a consumer and then distributes the tasks to workers, so I can have only 1 machine

Some techniques for improving application performance

2014-09-10 Thread Will Benton
Spark friends, I recently wrote up a blog post with examples of some of the standard techniques for improving Spark application performance: http://chapeau.freevariable.com/2014/09/improving-spark-application-performance.html The idea is that we start with readable but poorly-performing code

Re: Global Variables in Spark Streaming

2014-09-10 Thread Ravi Sharma
Akhil, By using a broadcast variable, will I be able to change the values of the broadcast variable? As per my understanding it will create a final variable to access the value across the cluster. Please correct me if I'm wrong. Thanks, Cheers, Ravi Sharma On Wed, Sep 10, 2014 at 7:31 PM, Akhil Das wro

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, May be you can take a look about the following. http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html Good luck. Arthur On 10 Sep, 2014, at 9:09 pm, arunshell87 wrote: > > Hi, > > I too had tried SQL queries with joins, MINUS , subqueries etc bu

Re: Task not serializable

2014-09-10 Thread Sarath Chandra
Hi Sean, The solution of instantiating the non-serializable class inside the map is working fine, but I hit a road block. The solution is not working for singleton classes like UserGroupInformation. In my logic as part of processing a HDFS file, I need to refer to some reference files which are a

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, Some findings: 1) Spark SQL does not support multiple joins 2) Spark left join: has performance issues 3) Spark SQL's cache table: does not support two-tier queries 4) Spark SQL does not support repartition Arthur On 10 Sep, 2014, at 10:22 pm, arthur.hk.c...@gmail.com wrote: > Hi, >

Re: Task not serializable

2014-09-10 Thread Sean Owen
You mention that you are creating a UserGroupInformation inside your function, but something is still serializing it. You should show your code since it may not be doing what you think. If you instantiate an object, it happens every time your function is called. map() is called once per data eleme
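A sketch of the pattern Sean describes, with made-up names: construct the non-serializable helper inside the task so it never travels in the closure, and use mapPartitions so it is not rebuilt for every element:

    class Parser {                                  // deliberately not Serializable
      def parse(s: String): String = s.toUpperCase
    }

    val lines = sc.parallelize(Seq("a", "b", "c"))
    val parsed = lines.mapPartitions { iter =>
      val parser = new Parser()                     // one instance per partition, built on the executor
      iter.map(s => parser.parse(s))
    }
    parsed.collect()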

Spark & NLP

2014-09-10 Thread Paolo Platter
Hi all, What is your preferred Scala NLP lib? Why? Are there any items on Spark's roadmap to integrate NLP features? I basically need to perform NER line by line, so I don't need a deep integration with the distributed engine. I only want simple dependencies and the chance to build a dict

Re: Spark & NLP

2014-09-10 Thread andy petrella
never tried but might fit your need: http://www.scalanlp.org/ It's the parent project of both breeze (already part of spark) and epic. However you'll have to train for IT (not part of the supported list) (actually I never used it because for my very small needs, I generally just perform a small n

Re: Spark & NLP

2014-09-10 Thread Oleksandr Olgashko
Factorie (https://github.com/factorie/factorie) might be what you need (I'd suggest a conditional random field/maximum entropy Markov model for NER). Also you can use Java's libraries (for example, I'm using opennlp's implementation of MEMM for a close-to-NER problem in Scala) Chalk (https://github.com/

Re: Task not serializable

2014-09-10 Thread Sarath Chandra
Thanks Sean. Please find attached my code. Let me know your suggestions/ideas. Regards, *Sarath* On Wed, Sep 10, 2014 at 8:05 PM, Sean Owen wrote: > You mention that you are creating a UserGroupInformation inside your > function, but something is still serializing it. You should show your > co

Re: flattening a list in spark sql

2014-09-10 Thread gtinside
Hi , Thanks it worked, really appreciate your help. I have also been trying to do multiple Lateral Views, but it doesn't seem to be working. Query : hiveContext.sql("Select t2 from fav LATERAL VIEW explode(TABS) tabs1 as t1 LATERAL VIEW explode(t1) tabs2 as t2") Exception org.apache.spark.sql.c

Re: pyspark and cassandra

2014-09-10 Thread Oleg Ruchovets
Hi, I am trying to evaluate different options for Spark + Cassandra and I have a couple of additional questions. My aim is to use Cassandra only, without Hadoop: 1) Is it possible to use only Cassandra as an input/output parameter for PySpark? 2) In case I'll use Spark (Java, Scala), is it possible to use

Cassandra connector

2014-09-10 Thread wwilkins
Hi, I am having difficulty getting the Cassandra connector running within the Spark shell. My jars look like: [wwilkins@phel-spark-001 logs]$ ls -altr /opt/connector/ total 14588 drwxr-xr-x. 5 root root 4096 Sep 9 22:15 .. -rw-r--r-- 1 root root 242809 Sep 9 22:20 spark-cassandra-connect

Re: Cassandra connector

2014-09-10 Thread gtinside
Are you using spark 1.1 ? If yes you would have to update the datastax cassandra connector code and remove ref to log methods from CassandraConnector.scala Regards, Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-connector-tp13896p13897.htm

Re: Low Level Kafka Consumer for Spark

2014-09-10 Thread Dibyendu Bhattacharya
Hi, The latest changes with Kafka message re-play by manipulating the ZK offset seem to be working fine for us. This gives us some relief till the actual issue is fixed in Spark 1.2. I have some questions on how Spark processes the received data. The logic I used is basically to pull messages from indivi

Re: groupBy gives non deterministic results

2014-09-10 Thread Davies Liu
I think the mails to spark.incubator.apache.org will be forwarded to spark.apache.org. Here is the header of the first mail: from: redocpot to: u...@spark.incubator.apache.org date: Mon, Sep 8, 2014 at 7:29 AM subject: groupBy gives non deterministic results mailing list: user.spark.apache.org F

Re: Global Variables in Spark Streaming

2014-09-10 Thread Akhil Das
Yes, your understanding is correct. In that case one easy option would be to serialize the object and dump it somewhere in HDFS so that you will be able to recreate/update the object from the file. We have something similar which you can find over BroadCastServer

Re: flattening a list in spark sql

2014-09-10 Thread gtinside
My bad, please ignore, it works !!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/flattening-a-list-in-spark-sql-tp13300p13901.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Spark SQL -- more than two tables for join

2014-09-10 Thread Michael Armbrust
What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as t
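A sketch of the HiveContext route Michael recommends, reusing the youhao_* tables from earlier in the thread; the context setup is assumed, and hql is the Spark 1.0.x entry point (later versions fold HiveQL into sql):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.hql("""
      SELECT *
      FROM youhao_data d
      LEFT JOIN youhao_age a            ON d.rowkey = a.rowkey
      LEFT JOIN youhao_totalKiloMeter t ON a.rowkey = t.rowkey
    """).collect().foreach(println)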

Re: Spark HiveQL support plan

2014-09-10 Thread Michael Armbrust
HiveQL is the default language for the JDBC server which will be available as part of the 1.1 release (coming very soon!). Adding support for calling MLlib and other spark libraries is on the roadmap, but not possible at this moment. On Tue, Sep 9, 2014 at 1:45 PM, XUE, Xiaohui wrote: > > Hi, >

RE: Cassandra connector

2014-09-10 Thread Wade Wilkins
Thank you! I am running Spark 1.0, but your suggestion worked for me. I commented out all logDebug calls in both CassandraConnector.scala and Schema.scala and I am moving again. Regards, Wade -Original Message- From: gtinside [mailto:gtins...@gmail.com] Sent: Wednesday, September 10, 2014 8:49

Re: how to run python examples in spark 1.1?

2014-09-10 Thread freedafeng
Just want to provide more information on how I ran the examples. Environment: Cloudera quick start Vm 5.1.0 (HBase 0.98.1 installed). I created a table called 'data1', and 'put' two records in it. I can see the table and data are fine in hbase shell. I cloned spark repo and checked out to 1.1 br

Re: Global Variables in Spark Streaming

2014-09-10 Thread Santiago Mola
Hi Ravi, 2014-09-10 15:55 GMT+02:00 Ravi Sharma : > > > I'm using spark streaming as for kafka consumer. I want to do the CEP by spark. So as for that I need to store my sequence of events. so that I cant detect some pattern. > Depending on what you're trying to accomplish, you might implement th

Re: groupBy gives non deterministic results

2014-09-10 Thread Ye Xianjin
Well, that's weird. I don't see this thread in my mailbox as sent to the user list. Maybe because I also subscribe to the incubator mailing list? I do see mails sent to the incubator mailing list that no one replies to. I thought it was because people don't subscribe to the incubator list now. -- Ye Xianjin Sent wi

Re: Task not serializable

2014-09-10 Thread Marcelo Vanzin
You're using "hadoopConf", a Configuration object, in your closure. That type is not serializable. You can use " -Dsun.io.serialization.extendedDebugInfo=true" to debug serialization issues. On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra wrote: > Thanks Sean. > Please find attached my code. Let

Re: How to scale more consumer to Kafka stream

2014-09-10 Thread Tim Smith
How are you creating your kafka streams in Spark? If you have 10 partitions for a topic, you can call "createStream" ten times to create 10 parallel receivers/executors and then use "union" to combine all the dStreams. On Wed, Sep 10, 2014 at 7:16 AM, richiesgr wrote: > Hi (my previous post a
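A sketch of Tim's suggestion of one receiver per Kafka partition, then a union of the streams; the ZooKeeper quorum, group id, topic name and the existing StreamingContext ssc are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 10
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)          // one combined DStream backed by 10 receivers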

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case, the receivers die within an hour because Yarn kills the containers for high memory usage. I set ttl.cleaner to 30 seconds but that didn't help. So I don't think stale RDDs are an issue here. I did a "jmap -histo" on a couple

Re: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-10 Thread Marcelo Vanzin
Yes please pretty please. This is really annoying. On Sun, Sep 7, 2014 at 6:31 AM, Ognen Duzlevski wrote: > > I keep getting below reply every time I send a message to the Spark user > list? Can this person be taken off the list by powers that be? > Thanks! > Ognen > > Forwarded Message

Re: how to choose right DStream batch interval

2014-09-10 Thread Tim Smith
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617 Slide 39 covers it. On Tue, Sep 9, 2014 at 9:23 PM, qihong wrote: > Hi Mayur, > > Thanks for your response. I did write a simple test that set up a DStream > with > 5 batches; The batch duration

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Yana Kadiyska
Tim, I asked a similar question twice: here http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-executors-to-stay-alive-tt12940.html and here http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Executor-OOM-tt12383.html and have not yet received any responses. I noti

Re: spark-streaming "Could not compute split" exception

2014-09-10 Thread Tim Smith
I had a similar issue and many others - all were basically symptoms for yarn killing the container for high memory usage. Haven't gotten to root cause yet. On Tue, Sep 9, 2014 at 3:18 PM, Marcelo Vanzin wrote: > Your executor is exiting or crashing unexpectedly: > > On Tue, Sep 9, 2014 at 3:13 P

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
Actually, I am not doing any explicit shuffle/updateByKey or other transform functions. In my program flow, I take in data from Kafka, match each message against a list of regex and then if a msg matches a regex then extract groups, stuff them in json and push out back to kafka (different topic). S

Hadoop Distributed Cache

2014-09-10 Thread Maximo Gurmendez
Hi, As part of SparkContext.newAPIHadoopRDD(). Would Spark support an InputFormat that uses Hadoop’s distributed cache? Thanks, Máximo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mai

[spark upgrade] Error communicating with MapOutputTracker when running test cases in latest spark

2014-09-10 Thread Adrian Mocanu
I use https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala to help me with testing. In Spark 0.9.1 my tests depending on TestSuiteBase worked fine. As soon as I switched to the latest (1.0.1) all tests fail. My sbt import is: "org.apache.sp

Re: pyspark and cassandra

2014-09-10 Thread Kan Zhang
Thanks for the clarification, Yadid. By "Hadoop jobs," I meant Spark jobs that use Hadoop inputformats (as shown in the cassandra_inputformat.py example). A future possibility of accessing Cassandra from PySpark is when SparkSQL supports Cassandra as a data source. On Wed, Sep 10, 2014 at 11:37

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread alexandria1101
I used the hiveContext to register the tables and the tables are still not being found by the thrift server. Do I have to pass the hiveContext to JDBC somehow? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-found-using-jdbc-console-to-query-spark

Accumulo and Spark

2014-09-10 Thread Megavolt
I've been doing some Googling and haven't found much info on how to incorporate Spark and Accumulo. Does anyone know of some examples of how to tie Spark to Accumulo (for both fetching data and dumping results)? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.c

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread Denny Lee
Actually, when registering the table, it is only available within the sc context you are running it in. For Spark 1.1, the method name has changed to registerTempTable to better reflect that. The Thrift server runs as a different process, meaning that it cannot see any of the tables

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread Du Li
Hi Denny, There is a related question by the way. I have a program that reads in a stream of RDDs, each of which is to be loaded into a hive table as one partition. Currently I do this by first writing the RDDs to HDFS and then loading them to hive, which requires multiple passes of HDFS I/O an

java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2

2014-09-10 Thread Jeffrey Picard
Hey guys, After rebuilding from the master branch this morning, I’ve started to see these errors that I’ve never gotten before while running connected components. Anyone seen this before? 14/09/10 20:38:53 INFO collection.ExternalSorter: Thread 87 spilling in-memory batch of 1020 MB to disk (1

RE: how to setup steady state stream partitions

2014-09-10 Thread Anton Brazhnyk
Just a guess. updateStateByKey has overloaded variants with partitioner as parameter. Can it help? -Original Message- From: qihong [mailto:qc...@pivotal.io] Sent: Tuesday, September 09, 2014 9:13 PM To: u...@spark.incubator.apache.org Subject: Re: how to setup steady state stream partiti

Re: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-10 Thread Andrew Or
Great, good to know it's not just me... 2014-09-10 11:25 GMT-07:00 Marcelo Vanzin : > Yes please pretty please. This is really annoying. > > On Sun, Sep 7, 2014 at 6:31 AM, Ognen Duzlevski > wrote: > >> >> I keep getting below reply every time I send a message to the Spark user >> list? Can this

Re: Accumulo and Spark

2014-09-10 Thread Russ Weeks
It's very straightforward to set up a Hadoop RDD to use AccumuloInputFormat. Something like this will do the trick: private JavaPairRDD newAccumuloRDD(JavaSparkContext sc, AgileConf agileConf, String appName, Authorizations auths) throws IOException, AccumuloSecurityException { Job had
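A rough Scala sketch of the newAPIHadoopRDD + AccumuloInputFormat approach Russ describes; the instance, user, table and authorizations are placeholders, and the exact setter names should be checked against your Accumulo version:

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
    AccumuloInputFormat.setZooKeeperInstance(job, "instanceName", "zk-host:2181")
    AccumuloInputFormat.setInputTableName(job, "my_table")
    AccumuloInputFormat.setScanAuthorizations(job, new Authorizations())

    val accumuloRDD = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])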

Re: Spark + AccumuloInputFormat

2014-09-10 Thread Russ Weeks
To answer my own question... I didn't realize that I was responsible for telling Spark how much parallelism I wanted for my job. I figured that between Spark and Yarn they'd figure it out for themselves. Adding --executor-memory 3G --num-executors 24 to my spark-submit command took the query time

RE: how to setup steady state stream partitions

2014-09-10 Thread qihong
Thanks for your response! I found that too, and it does the trick! Here's the refined code:
    val inputDStream = ...
    val keyedDStream = inputDStream.map(...) // use sensorId as key
    val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new MyPartitioner(...)))
    val stateDStream = part
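A sketch that completes the pattern under stated assumptions: keyedDStream is a DStream of (sensorId, reading) pairs as in the mail above, and HashPartitioner stands in for MyPartitioner. Passing the same partitioner to updateStateByKey is what keeps the state partitions steady:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(8)
    val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(partitioner))
    val stateDStream = partitionedDStream.updateStateByKey(
      (readings: Seq[Double], state: Option[Double]) => Some(state.getOrElse(0.0) + readings.sum),
      partitioner)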

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen wrote: > This structure is not specific to Hadoop, but in theory works in any > JAR file. You can put JARs in JARs and refer to them with Class-Path > entries in META-INF/MANIFEST.MF. Funny that you mention that, since someone internally asked the same q

Re: RDD memory questions

2014-09-10 Thread Davies Liu
On Wed, Sep 10, 2014 at 1:05 AM, Boxian Dong wrote: > Thank you very much for your kindly help. I rise some another questions: > >- If the RDD is stored in serialized format, is that means that whenever > the RDD is processed, it will be unpacked and packed again from and back to > the JVM eve
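A spark-shell sketch of the serialized caching being discussed: with MEMORY_ONLY_SER the cached partitions are kept as serialized bytes and deserialized each time they are read (the data here is made up):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()   // first action computes and caches the serialized partitions
    rdd.count()   // later actions deserialize the cached bytes on access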

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Sean Owen
Hm, so it is: http://docs.oracle.com/javase/tutorial/deployment/jar/downman.html I'm sure I've done this before though and thought it was this mechanism. It must be something custom. What's the Hadoop jar structure in question then? Is it something special like a WAR file? I confess I had never h

Re: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2

2014-09-10 Thread Ankur Dave
On Wed, Sep 10, 2014 at 2:00 PM, Jeffrey Picard wrote: > After rebuilding from the master branch this morning, I’ve started to see > these errors that I’ve never gotten before while running connected > components. Anyone seen this before? > [...] > at > org.apache.spark.shuffle.sort.SortS

PrintWriter error in foreach

2014-09-10 Thread Arun Luthra
I have a spark program that worked in local mode, but throws an error in yarn-client mode on a cluster. On the edge node in my home directory, I have an output directory (called transout) which is ready to receive files. The spark job I'm running is supposed to write a few hundred files into that d

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Steve Lewis
In modern projects there are a bazillion dependencies - when I use Hadoop I just put them in a lib directory in the jar - If I have a project that depends on 50 jars I need a way to deliver them to Spark - maybe wordcount can be written without dependencies but real projects need to deliver depende

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen wrote: > What's the Hadoop jar structure in question then? Is it something special > like a WAR file? I confess I had never heard of this so thought this was > about generic JAR stuff. What I've been told (and Steve's e-mail alludes to) is that you can p

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Wed, Sep 10, 2014 at 3:48 PM, Steve Lewis wrote: > In modern projects there are a bazillion dependencies - when I use Hadoop I > just put them in a lib directory in the jar - If I have a project that > depends on 50 jars I need a way to deliver them to Spark - maybe wordcount > can be written w

Re: PrintWriter error in foreach

2014-09-10 Thread Daniil Osipov
Try providing full path to the file you want to write, and make sure the directory exists and is writable by the Spark process. On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra wrote: > I have a spark program that worked in local mode, but throws an error in > yarn-client mode on a cluster. On the e

GraphX : AssertionError

2014-09-10 Thread Vipul Pandey
Hi, I have a small graph with about 3.3M vertices and close to 7.5M edges. It's a pretty innocent graph with a max degree of 8. Unfortunately, graph.triangleCount is failing on me with the exception below. I'm running a spark-shell on CDH5.1 with the following params: SPARK_DRIVER_MEM=10g A

Re: PrintWriter error in foreach

2014-09-10 Thread Arun Luthra
Ok, so I don't think the workers on the data nodes will be able to see my output directory on the edge node. I don't think stdout will work either, so I'll write to HDFS via rdd.saveAsTextFile(...) On Wed, Sep 10, 2014 at 3:51 PM, Daniil Osipov wrote: > Try providing full path to the file you wa
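A minimal sketch of the saveAsTextFile route Arun settles on; the HDFS path and data are placeholders, not his actual job:

    val results = sc.parallelize(Seq("record 1", "record 2"))
    results.saveAsTextFile("hdfs:///user/someuser/transout")   // one part-* file per partition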

Re: Re: Spark SQL -- more than two tables for join

2014-09-10 Thread boyingk...@163.com
Hi Michael, I think Arthur.hk.chan isn't here now, so I can show something: 1) My Spark version is 1.0.1 2) When I use a multiple join, like this: sql("SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_

Re: How to scale more consumer to Kafka stream

2014-09-10 Thread Dibyendu Bhattacharya
Hi, You can use this Kafka Spark Consumer: https://github.com/dibbhatt/kafka-spark-consumer This does exactly that. It creates parallel receivers for every Kafka topic partition. You can see Consumer.java under the consumer.kafka.client package for an example of how to use it. There is some

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
I switched from Yarn to StandAlone mode and haven't had OOM issue yet. However, now I have Akka issues killing the executor: 2014-09-11 02:43:34,543 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters

Setting up jvm in pyspark from shell

2014-09-10 Thread Mohit Singh
Hi, I am using pyspark shell and am trying to create an rdd from numpy matrix rdd = sc.parallelize(matrix) I am getting the following error: JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2014/09/10 22:41:44 - please wait. JVMDUMP032I JVM requested Heap dump

Spark SQL Thrift JDBC server deployment for production

2014-09-10 Thread vasiliy
Hi, I have a question about the Spark SQL Thrift JDBC server. Is there a best practice for Spark SQL deployment? If I understand right, the script ./sbin/start-thriftserver.sh starts the Thrift JDBC server in local mode. Are there script options for running this server in yarn-cluster mode? -- Vie