Re: jsonRDD: NoSuchMethodError

2014-07-15 Thread SK
I am running this in standalone mode on a single machine. I built the spark jar from scratch (sbt assembly) and then included that in my application (the same process I have done for earlier versions). thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
This is (obviously) spark streaming, by the way. On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat wrote: > Hi, > > I've got a socketTextStream through which I'm reading input. I have three > Dstreams, all of which are the same window operation over that > socketTextStream. I have a four core ma

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Praveen Seluka
If you want to make Twitter* classes available in your shell, I believe you could do the following: 1. Change the parent pom module ordering - move external/twitter before assembly. 2. In assembly/pom.xml, add the external/twitter dependency - this will package twitter* into the assembly jar. Now when spa

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Hi Michael, Good to know it is being handled. I tried master branch (9fe693b5) and got another error: scala> sqlContext.parquetFile("/tmp/foo") java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(4) b at scala.sys.package$.error(package.scala:27) at org.apache.s

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread anyweil
Thank you so much for the information. Now I have merged the fix of #1411 and it seems HiveQL works with: SELECT name FROM people WHERE schools[0].time>2. But one more question: is it possible or planned to support the "schools.time" format to filter the record that there is an element inside

Re: jsonRDD: NoSuchMethodError

2014-07-15 Thread SK
The problem is resolved. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRDD-NoSuchMethodError-tp9688p9742.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Nick Pentreath
You could try the following: create a minimal project using sbt or Maven, add spark-streaming-twitter as a dependency, run sbt assembly (or mvn package) on that to create a fat jar (with Spark as provided dependency), and add that to the shell classpath when starting up. On Tue, Jul 15, 2014 at 9

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Michael Armbrust
Oh, maybe not. Please file another JIRA. On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote: > Hi Michael, > > Good to know it is being handled. I tried master branch (9fe693b5) and got > another error: > > scala> sqlContext.parquetFile("/tmp/foo") > java.lang.RuntimeException: Unsupported pa

Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Gianluca Privitera
Hi, I’ve got a problem with Spark Streaming and tshark. While I’m running locally I have no problems with this code, but when I run it on an EC2 cluster I get the exception shown just under the code. def dissection(s: String): Seq[String] = { try { Process("hadoop command to create ./lo

Kryo NoSuchMethodError on Spark 1.0.0 standalone

2014-07-15 Thread jfowkes
Hi there, I've been successfully using the precompiled Spark 1.0.0 Java api on a small cluster in standalone mode. However, when I try to use the Kryo serializer by adding conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer"); as suggested, Spark crashes out with the following error

Re: Catalyst dependency on Spark Core

2014-07-15 Thread Sean Owen
Agree. You end up with a "core" and a "corer core" to distinguish between and it ends up just being more complicated. This sounds like something that doesn't need a module. On Tue, Jul 15, 2014 at 5:59 AM, Patrick Wendell wrote: > Adding new build modules is pretty high overhead, so if this is a

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Yifan LI
Dear Ankur, Thanks so much! Btw, is there any possibility to customise the partition strategy as we expect? Best, Yifan On Jul 11, 2014, at 10:20 PM, Ankur Dave wrote: > Hi Yifan, > > When you run Spark on a single machine, it uses a local mode where one task > per core can be executed at a

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Filed SPARK-2446 2014-07-15 16:17 GMT+08:00 Michael Armbrust : > Oh, maybe not. Please file another JIRA. > > > On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote: > >> Hi Michael, >> >> Good to know it is being handled. I tried master branch (9fe693b5) and >> got another error: >> >> scala>

shared object between threads

2014-07-15 Thread Wanda Hawk
How can I declare in Spark an object shared by all threads that does not block execution by locking the entire array (the threads are supposed to access different lines of a two-dimensional array)? For example, I would like to declare a 2 dimensional array. Each thread should write on its corre

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Sorry, should be SPARK-2489 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee : > Filed SPARK-2446 > > > > 2014-07-15 16:17 GMT+08:00 Michael Armbrust : > > Oh, maybe not. Please file another JIRA. >> >> >> On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote: >> >>> Hi Michael, >>> >>> Good to know it is

Re: shared object between threads

2014-07-15 Thread Frank Austin Nothaft
Hi Wanda, What exactly is the use case for this? Nominally, you wouldn’t want to do that sort of access, as a single datum can’t be shared across machines when running distributed. Instead, you might want to use an accumulator to manage the aggregation of data in a distributed form. Regards,
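A minimal sketch of the accumulator approach Frank mentions, assuming the goal is to aggregate per-task results on the driver rather than share a mutable array; the names and values are illustrative, not from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("accumulator-sketch"))

    // Instead of every task writing into a shared 2-D array, each task
    // contributes to an accumulator and the driver reads the merged result.
    val rowSum = sc.accumulator(0.0)

    sc.parallelize(1 to 100, 4).foreach { i =>
      rowSum += i.toDouble  // applied per task, merged on the driver
    }

    println(s"sum = ${rowSum.value}")  // only the driver may read the value
    sc.stop()
  }
}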

Re: running spark from intellj

2014-07-15 Thread jamborta
fyi - the problem here was related to dependencies. SBT pulls spark-core built for hadoop1.x, but I am running it on hadoop2.x. I built the jar manually and added it to the classpath, which solved the problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/runni

Re: KMeansModel Construtor error

2014-07-15 Thread Rohit Pujari
Thanks Xiangrui for your response. Rohit Sent from my iPhone > On Jul 14, 2014, at 11:41 PM, Xiangrui Meng wrote: > > I don't think MLlib supports model serialization/deserialization. You > got the error because the constructor is private. I created a JIRA for > this: https://issues.apache.org/

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Yifan LI
Hi Ankur, I have another question, w.r.t edges/partitions scheduling: For instance, I have a 2*4 cores(L1 cache: 32K) machine, with 32GB memory, a 80GB size of local edges file on disk, when I load the file using sc.textFile (minPartitions = 16, PartitionStrategy.RandomVertexCut), Then, what ha

Need help on spark Hbase

2014-07-15 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please help me to resolve the issue. *Issue *: I'm not able to connect to HBase from spark-submit. Below is my code. When I execute the below program standalone, I'm able to connect to HBase and do the operation. When I execute the below program using spark-submit ( ./bin/spark-su

Store one to many relation ship in parquet file with spark sql

2014-07-15 Thread Jaonary Rabarisoa
Hi all, How should I store a one-to-many relationship using Spark SQL and the Parquet format? For example, the following case class case class Person(key: String, name: String, friends: Array[String]) gives an error when I try to insert the data into a parquet file. It doesn't like the Array[String]

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, can you describe your spark cluster setup? I saw localhost:2181 for zookeeper. Best Regards, Jerry On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Team, > > Could you please help me to resolve the issue. > > *Issue *: I'm not able to co

persistence state of an RDD

2014-07-15 Thread Nathan Kronenfeld
Is there a way of determining programmatically the cache state of an RDD? Not its storage level, but how much of it is actually cached? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J

Re: persistence state of an RDD

2014-07-15 Thread Praveen Seluka
Nathan, you are looking for SparkContext.getRDDStorageInfo which returns the information on how much is cached. On Tue, Jul 15, 2014 at 8:01 PM, Nathan Kronenfeld < nkronenf...@oculusinfo.com> wrote: > Is there a way of determining programatically the cache state of an RDD? > Not its storage lev
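A minimal sketch of what Praveen points to, assuming the Spark 1.0-era RDDInfo fields (id, name, numCachedPartitions, numPartitions, memSize, diskSize); the example data is illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object CacheStateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-state"))

    val rdd = sc.parallelize(1 to 1000000, 8).cache()
    rdd.count()  // materialize the cache

    // getRDDStorageInfo reports, per RDD, how many partitions are actually
    // cached and how much memory/disk they occupy.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} (${info.name}): " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }
    sc.stop()
  }
}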

Re: How to kill running spark yarn application

2014-07-15 Thread Jerry Lam
when I use yarn application -kill, both SparkSubmit and ApplicationMaster are killed. I also checked jps on the machine that has SparkSubmit running; it is terminated as well. Sorry, I cannot reproduce it. On Mon, Jul 14, 2014 at 7:36 PM, hsy...@gmail.com wrote: > Before "yarn application -kill

Re: How to kill running spark yarn application

2014-07-15 Thread Jerry Lam
For your information, the SparkSubmit runs at the host where you executed the spark-submit shell script (which in turn invokes the SparkSubmit program). Since you are running in yarn-cluster mode, the SparkSubmit program just reported the status of the job submitted to Yarn. So when you killed the Applic

Re: persistence state of an RDD

2014-07-15 Thread Nathan Kronenfeld
Thanks On Tue, Jul 15, 2014 at 10:39 AM, Praveen Seluka wrote: > Nathan, you are looking for SparkContext.getRDDStorageInfo which returns > the information on how much is cached. > > > On Tue, Jul 15, 2014 at 8:01 PM, Nathan Kronenfeld < > nkronenf...@oculusinfo.com> wrote: > >> Is there a way

Ambiguous references to id : what does it mean ?

2014-07-15 Thread Jaonary Rabarisoa
Hi all, When running a join operation with Spark SQL I got the following error : Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to id: (id#303,List()),(id#0,List()), tree: Filter ('videoId = 'id) Join Inner, None ParquetRelation

RE: writing FLume data to HDFS

2014-07-15 Thread Sundaram, Muthu X.
My intention is not to write data directly from flume to hdfs. I have to collect messages from queue using flume and send it to spark streaming for additional processing. I will try what you have suggested. Thanks, Muthu From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Monday, Jul

Re: Need help on spark Hbase

2014-07-15 Thread Madabhattula Rajesh Kumar
Hi Nathan and Jerry, Thank you for the details. Jerry, I've installed Spark, HBase and Hadoop on the same machine. Please let me know if you need more information. I'm not able to identify why it is not connecting to HBase when I use spark-submit. Do you have an example program? If yes, plea

Re: Ambiguous references to id : what does it mean ?

2014-07-15 Thread Yin Huai
Hi Jao, Seems the SQL analyzer cannot resolve the references in the Join condition. What is your query? Did you use the Hive Parser (your query was submitted through hql(...)) or the basic SQL Parser (your query was submitted through sql(...)). Thanks, Yin On Tue, Jul 15, 2014 at 8:52 AM, Jaon

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
One vector to check is the HBase libraries in the --jars as in : spark-submit --class --master --jars hbase-client-0.98.3-hadoop2.jar,commons-csv-1.0-SNAPSHOT.jar,hbase-common-0.98.3-hadoop2.jar,hbase-hadoop2-compat-0.98.3-hadoop2.jar,hbase-it-0.98.3-hadoop2.jar,hbase-protocol-0.98.3-hadoop2.jar,

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Nicholas Chammas
Hey Diana, Did you ever figure this out? I’m running into the same exception, except in my case the function I’m calling is a KMeans model.predict(). In regular Spark it works, and Spark Streaming without the call to model.predict() also works, but when put together I get this serialization exce

Driver cannot receive StatusUpdate message for FINISHED

2014-07-15 Thread 林武康
Hi all, I got a strange problem, I submit a reduce job(any one split), it finished normally on Executor, log is: 14/07/15 21:08:56 INFO Executor: Serialized size of result for 0 is 10476031 14/07/15 21:08:56 INFO Executor: Sending result for 0 directly to driver 14/07/15 21:08:56 INFO Executor: F

Re: Spark-Streaming collect/take functionality.

2014-07-15 Thread jon.burns
It works perfectly, thanks! I feel like I should have figured that out; I'll chalk it up to inexperience with Scala. Thanks again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-collect-take-functionality-tp9670p9772.html Sent from the Apach

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, I have a feeling that this is not directly related to spark but I might be wrong. The reason why is that when you do: Configuration configuration = HBaseConfiguration.create(); by default, it reads the configuration files hbase-site.xml in your classpath and ... (I don't remember a

Re: Iteration question

2014-07-15 Thread Matei Zaharia
Hi Nathan, I think there are two possible reasons for this. One is that even though you are caching RDDs, their lineage chain gets longer and longer, and thus serializing each RDD takes more time. You can cut off the chain by using RDD.checkpoint() periodically, say every 5-10 iterations. The s

Re: Store one to many relation ship in parquet file with spark sql

2014-07-15 Thread Michael Armbrust
Make the Array a Seq. On Tue, Jul 15, 2014 at 7:12 AM, Jaonary Rabarisoa wrote: > Hi all, > > How should I store a one to many relationship using spark sql and parquet > format. For example I the following case class > > case class Person(key: String, name: String, friends: Array[String]) > > g
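A minimal sketch of Michael's suggestion using the Spark 1.0.x case-class reflection path; the output path and sample rows are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Using Seq[String] instead of Array[String] lets the case-class
// reflection map the one-to-many field into Parquet.
case class Person(key: String, name: String, friends: Seq[String])

object OneToManyParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("one-to-many"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._  // brings in the RDD -> SchemaRDD conversion

    val people = sc.parallelize(Seq(
      Person("1", "Alice", Seq("Bob", "Carol")),
      Person("2", "Bob",   Seq("Alice"))))

    people.saveAsParquetFile("/tmp/people.parquet")       // hypothetical output path
    val loaded = sqlContext.parquetFile("/tmp/people.parquet")
    loaded.registerAsTable("people")
    sql("SELECT name FROM people").collect().foreach(println)
    sc.stop()
  }
}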

Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
HI folks, I'm running into the following error when trying to perform a join in my code: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.types.LongType$ I see similar errors for StringType$ and also: scala.reflect.runtime.ReflectError: value apache is n

How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
Hi all, I was curious about the details of Spark speculation. So, my understanding is that, when "speculated" tasks are newly scheduled on other machines, the original tasks are still running until the entire stage completes. This seems to leave some room for duplicated work because some spark act

Re: How to kill running spark yarn application

2014-07-15 Thread hsy...@gmail.com
Interesting. I run on my local one-node cluster using Apache Hadoop. On Tue, Jul 15, 2014 at 7:55 AM, Jerry Lam wrote: > For your information, the SparkSubmit runs at the host you executed the > spark-submit shell script (which in turns invoke the SparkSubmit program). > Since you are running in

Spark Performance Bench mark

2014-07-15 Thread Malligarjunan S
Hello All, I am a newbie to Apache Spark and I would like to know its performance benchmarks. My current requirement is as follows: I have a few files in 2 s3 buckets. Each file may have a minimum of 1 million records. The file data are tab separated. I have to compare a few columns and filter the

MLLib - Regularized logistic regression in python

2014-07-15 Thread fjeg
Hi All, I am trying to perform regularized logistic regression with mllib in python. I have seen that this is possible in the following scala example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala But I do not see an

Count distinct with groupBy usage

2014-07-15 Thread buntu
Hi -- New to Spark and trying to figure out how to generate unique counts per page by date given this raw data: timestamp,page,userId 1405377264,google,user1 1405378589,google,user2 1405380012,yahoo,user1 .. I can do a groupBy on a field and get the count: val lines=sc.textFile("data.csv") va

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Are you registering multiple RDDs of case classes as tables concurrently? You are possibly hitting SPARK-2178, which is caused by SI-6240. On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons wrote: > H

Re: Count distinct with groupBy usage

2014-07-15 Thread Nick Pentreath
You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by? — Sent from Mailbox On Tue, Jul 15, 2014 at 8:14 PM, buntu wrote: > Hi -- > New to Spark and trying to figure out how to do a generate unique counts per > page by date given this raw data: > ti

Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html ! On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath wrote: > You can use .distinct.count on your user RDD. > > What are you trying to achieve with the time group by? > — > Sent from Mailbox > > > On Tue

Spark Performance issue

2014-07-15 Thread Malligarjunan S
Hello all, I am a newbie to Spark, just analyzing the product. I am facing a performance problem with Hive and am trying to analyse whether Spark will solve it or not, but it seems that Spark is also taking a lot of time. Let me know if I miss anything. shark> select count(time) from table2; OK 6050 Time ta

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
We have CDH 5.0.2 which doesn't include Spark SQL yet and may only be available in CDH 5.1 which is yet to be released. If Spark SQL is the only option then I might need to hack around to add it into the current CDH deployment if thats possible. -- View this message in context: http://apache-s

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Nick. All I'm attempting is to report number of unique visitors per page by date. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9786.html Sent from the Apache Spark User List mailing list archive at Nabble.c

Question on Apache Spark custom InputFormat Integration

2014-07-15 Thread Nick R. Katsipoulakis
Hello, I am currently working with Spark and I have posted the following question on StackOverflow.com : http://stackoverflow.com/questions/24765063/apache-spark-with-custom-inputformat-for-hadooprdd Any advice/answer is welcome. Thank you, Nick

Re: Count distinct with groupBy usage

2014-07-15 Thread Raffael Marty
> All I'm attempting is to report number of unique visitors per page by date. But the way you are doing it currently, you will get a count per second. You have to bucketize your dates by whatever time resolution you want. -raffy

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
That is correct, Raffy. Assume I convert the timestamp field to a date in the required format; is it then possible to report it by date? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9790.html Sent from the Apache Spar

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Nan Zhu
Hi, Mingyuan, According to my understanding, Spark processes the result generated from each partition by passing it to resultHandler (SparkContext.scala L1056). This resultHandler usually just puts the result in a driver-side array, the length of which is always partitions.size. this d

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is not in the same column position in both tables, it triggers what appears to be a bug. For example, this program fails: import org.apache.spark.SparkConte

Re: Large Task Size?

2014-07-15 Thread Aaron Davidson
Ah, I didn't realize this was non-MLLib code. Do you mean to be sending stochasticLossHistory in the closure as well? On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott wrote: > It uses the standard SquaredL2Updater, and I also tried to broadcast it as > well. > > The input is a RDD created by takin

Re: Count distinct with groupBy usage

2014-07-15 Thread Sean Owen
If you are counting per time and per page, then you need to group by time and page not just page. Something more like: csv.groupBy(csv => (csv(0),csv(1))) ... This gives a list of users per (time,page). As Nick suggests, then you count the distinct values for each key: ... .mapValues(_.distinct.
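A minimal sketch combining Sean's and Raffy's suggestions — group by a (day, page) key, then count distinct users — assuming the timestamp,page,userId CSV from the original question; the input path is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object UniqueVisitorsPerPagePerDay {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unique-visitors"))

    // timestamp,page,userId  e.g. 1405377264,google,user1
    val lines = sc.textFile("data.csv")

    val counts = lines
      .map(_.split(","))
      .map { case Array(ts, page, user) =>
        // bucket the epoch-second timestamp down to a day before grouping
        val day = new java.text.SimpleDateFormat("yyyy-MM-dd")
          .format(new java.util.Date(ts.trim.toLong * 1000L))
        ((day, page), user)
      }
      .groupByKey()
      .mapValues(users => users.toSeq.distinct.size)  // unique visitors per (day, page)

    counts.collect().foreach { case ((day, page), n) => println(s"$day $page $n") }
    sc.stop()
  }
}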

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi guys, Sorry, I'm also interested in this nested json structure. I have a similar SQL query in which I need to query a nested field in a json. Does the above query work if it is used with sql(sqlText), assuming the data is coming directly from hdfs via sqlContext.jsonFile? The SPARK-2483

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
FWIW, I am unable to reproduce this using the example program locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons wrote: > Nope. All of them are registered from the driver program. > > However, I think we've found the culprit. If the join column between two > tables is not in the same colu

count vs countByValue in for/yield

2014-07-15 Thread Ognen Duzlevski
Hello, I am curious about something: val result = for { (dt,evrdd) <- evrdds val ct = evrdd.count } yield (dt->ct) works. val result = for { (dt,evrdd) <- evrdds val ct = evrdd.countByValue } yield (dt->ct) does not work. I get: 14/07/15 16:46:33 WARN TaskSetMa

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Ankur Dave
On Jul 15, 2014, at 12:06 PM, Yifan LI wrote: > Btw, is there any possibility to customise the partition strategy as we > expect? I'm not sure I understand. Are you asking about defining a custom
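If the question is indeed about defining a custom strategy: GraphX exposes PartitionStrategy as a trait that can be implemented and passed to Graph.partitionBy. A minimal sketch under that assumption; the routing rule is illustrative, not from the thread:

import org.apache.spark.graphx.{PartitionStrategy, VertexId, PartitionID}

// A toy custom strategy: route every edge by its source vertex only,
// so all out-edges of a vertex land in the same partition.
object SourceOnlyCut extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    (math.abs(src) % numParts).toInt
}

// usage, assuming an existing graph:  graph.partitionBy(SourceOnlyCut)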

Re: getting ClassCastException on collect()

2014-07-15 Thread _soumya_
Not sure I can help, but I ran into the same problem. Basically my use case is a that I have a List of strings - which I then convert into a RDD using sc.parallelize(). This RDD is then operated on by the foreach() function. Same as you, I get a runtime exception : java.lang.ClassCastException: c

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Andrew Ash
Hi Nan, Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar. For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another t

parallel stages?

2014-07-15 Thread Wei Tan
Hi, I wonder if I do wordcount on two different files, like this: val file1 = sc.textFile("/...") val file2 = sc.textFile("/...") val wc1= file.flatMap(..).reduceByKey(_ + _,1) val wc2= file.flatMap(...).reduceByKey(_ + _,1) wc1.saveAsTextFile("titles.out") wc2.saveAsTextFile("tables.out") Wou

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
To give a few more details of my environment in case that helps you reproduce: * I'm running spark 1.0.1 downloaded as a tar ball, not built myself * I'm running in stand alone mode, with 1 master and 1 worker, both on the same machine (though the same error occurs with two workers on two machines

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Sean!! That's what I was looking for -- group by on multiple fields. I'm gonna play with it now. Thanks again! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9803.html Sent from the Apache Spark User List mail

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-15 Thread Tathagata Das
On second thought I am not entirely sure whether that bug is the issue. Are you continuously appending to the file that you have copied to the directory? Because filestream works correctly when the files are atomically moved to the monitored directory. TD On Mon, Jul 14, 2014 at 9:08 PM, Madabha

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
Could you run it locally first to make sure it works, and you see output? Also, I recommend going through the previous step-by-step approach to narrow down where the problem is. TD On Mon, Jul 14, 2014 at 9:15 PM, hsy...@gmail.com wrote: > Actually, I deployed this on yarn cluster(spark-submit

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
I see you have the code to convert to the Record class but commented it out. That is the right way to go. When you convert it to a 4-tuple with "(data("type"),data("name"),data("score"),data("school"))" ... it is of type (Any, Any, Any, Any) because data("xyz") returns Any. And registerAsTable probab
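A minimal sketch of the conversion TD describes — mapping each parsed Map[String, Any] to a typed case class before registering — with hypothetical field names mirroring the JSON keys in the thread; the sample data and query are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical typed record mirroring the "type", "name", "score", "school" keys.
case class Record(recType: String, name: String, score: Double, school: String)

object TypedRecordSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("typed-record"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Pretend this came from parsing JSON into Map[String, Any] values.
    val parsed = sc.parallelize(Seq(
      Map("type" -> "exam", "name" -> "ann", "score" -> 92.0, "school" -> "mit")))

    // Convert each Map[String, Any] to a typed Record so the inferred schema
    // has concrete types instead of (Any, Any, Any, Any).
    val records = parsed.map { data =>
      Record(data("type").toString, data("name").toString,
             data("score").toString.toDouble, data("school").toString)
    }

    records.registerAsTable("records")
    sql("SELECT name, score FROM records WHERE score > 90").collect().foreach(println)
    sc.stop()
  }
}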

Help with Json array parsing

2014-07-15 Thread SK
Hi, I have a json file where the object definition in each line includes an array component "obj" that contains 0 or more elements as shown by the example below. {"name": "16287e9cdf", "obj": [{"min": 50,"max": 59 }, {"min": 20, "max": 29}]}, {"name": "17087e9cdf", "obj": [{"min": 30,"max": 3
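The MappingException later in this thread suggests json4s is being used to parse each line; under that assumption, a minimal sketch of mapping the nested "obj" array onto case classes (class names are illustrative):

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Case classes mirroring the structure: each line has a name and an
// "obj" array containing zero or more {min, max} elements.
case class Bucket(min: Int, max: Int)
case class Entry(name: String, obj: List[Bucket])

object JsonArraySketch {
  implicit val formats = DefaultFormats  // needed by extract[...]

  def main(args: Array[String]): Unit = {
    val line = """{"name": "16287e9cdf", "obj": [{"min": 50, "max": 59}, {"min": 20, "max": 29}]}"""
    val entry = parse(line).extract[Entry]  // the nested array maps to List[Bucket]
    println(entry.obj.map(b => s"${b.min}-${b.max}").mkString(", "))
  }
}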

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Tathagata Das
This sounds really really weird. Can you give me a piece of code that I can run to reproduce this issue myself? TD On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat wrote: > This is (obviously) spark streaming, by the way. > > > On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat > wrote: > >> Hi, >

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Can you print out the queryExecution? (i.e. println(sql().queryExecution)) On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons wrote: > To give a few more details of my environment in case that helps you > reproduce: > > * I'm running spark 1.0.1 downloaded as a tar ball, not built myself > *

Re: Ideal core count within a single JVM

2014-07-15 Thread lokesh.gidra
It makes sense what you said. But, when I proportionately reduce the heap size, then also the problem persists. For instance, if I use 160 GB heap for 48 cores, whereas 80 GB heap for 24 cores, then also with 24 cores the performance is better (although worse than 160 GB with 24 cores) than 48-core

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Will do. On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das wrote: > This sounds really really weird. Can you give me a piece of code that I > can run to reproduce this issue myself? > > TD > > > On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat > wrote: > >> This is (obviously) spark streaming, by

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Bertrand Dechoux
I haven't looked at the implementation, but what you would do with any filesystem is write to a file inside the workspace directory of the task. And then only the attempt of the task that should be kept will perform a move to the final path. The other attempts are simply discarded. For most filesystem

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Sure thing. Here you go: == Logical Plan == Sort [key#0 ASC] Project [key#0,value#1,value#2] Join Inner, Some((key#0 = key#3)) SparkLogicalPlan (ExistingRdd [key#0,value#1], MapPartitionsRDD[2] at mapPartitions at basicOperators.scala:176) SparkLogicalPlan (ExistingRdd [value#2,key#3], M

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
No, that is why I included the link to SPARK-2096 as well. You'll need to use HiveQL at this time. Is it possible or planed to support the "schools.time" format to filter the >> record that there is an element inside array of schools satisfy time

Re: parallel stages?

2014-07-15 Thread Sean Owen
The last two lines are what trigger the operations, and they will each block until the result is computed and saved. So if you execute this code as-is, no. You could write a Scala program that invokes these two operations in parallel, like: Array((wc1,"titles.out"), (wc2,"tables.out")).par.foreach
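A minimal sketch of the pattern Sean describes — submitting the two blocking saveAsTextFile jobs from a parallel collection so their stages can run concurrently — with illustrative input/output paths:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object ParallelSaves {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("parallel-saves"))

    val wc1 = sc.textFile("/tmp/titles.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1)
    val wc2 = sc.textFile("/tmp/tables.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1)

    // Each saveAsTextFile blocks until its job finishes, so launch the two
    // jobs from a parallel collection; the scheduler interleaves their stages.
    Array((wc1, "/tmp/titles.out"), (wc2, "/tmp/tables.out")).par.foreach {
      case (rdd, path) => rdd.saveAsTextFile(path)
    }
    sc.stop()
  }
}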

Re: Spark Streaming Json file groupby function

2014-07-15 Thread srinivas
I am still getting the error... even if I convert it to a Record. object KafkaWordCount { def main(args: Array[String]) { if (args.length < 4) { System.err.println("Usage: KafkaWordCount ") System.exit(1) } //StreamingExamples.setStreamingLogLevels() val Array(zkQuorum

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread gorenuru
Just my "few cents" on this. I having the same problems with v 1.0.1 but this bug is sporadic and looks like is relayed to object initialization. Even more, i'm not using any SQL or something. I just have utility class like this: object DataTypeDescriptor { type DataType = String val BOOLE

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Thanks for the extra info. At a quick glance the query plan looks fine to me. The class IntegerType does build a type tag. I wonder if you are seeing the Scala issue manifest in some new way. We will attempt to reproduce locally. On Tue, Jul 15, 2014 at 1:41 PM, gorenuru wrote: > Just my

Multiple streams at the same time

2014-07-15 Thread gorenuru
Hi everyone. I have some problems running multiple streams at the same time. What i am doing is: object Test { import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.api.java.function._ import org.apache.spark.streaming._ import

Re: Help with Json array parsing

2014-07-15 Thread SK
To add to my previous post, the error at runtime is the following: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: org.json4s.package$MappingException: Expect

Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Tom
Hi, I would like to use the dataset used in the Big Data Benchmark on my own cluster, to run some tests between Hadoop and Spark. The dataset should be available at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix],

Re: Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Burak Yavuz
Hi Tom, If you wish to load the file in Spark directly, you can use sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your SparkContext. This can be done because the files should be publicly available and you don't need AWS Credentials to access them. If you want to download the fi

Re: Cassandra driver Spark question

2014-07-15 Thread Tathagata Das
Can you find out what is the class that is causing the NotSerializable exception? In fact, you can enabled extended serialization debugging to figure out object structure through the foreachRDD's

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
Thanks for the explanation, guys. I looked into the saveAsHadoopFile implementation a little bit. If you see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala at line 843, the HDFS write happens at per-partition processing, not at the resu

Re: Recommended pipeline automation tool? Oozie?

2014-07-15 Thread Dean Wampler
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie On Thu, Jul 10, 2014 at 5:52 PM, Andrei wrote: > I used both - Oozie and Luigi - but found them inflexible and stil

can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Matt Work Coarr
Hello spark folks, I have a simple spark cluster setup but I can't get jobs to run on it. I am using standalone mode. One master, one slave. Both machines have 32GB ram and 8 cores. The slave is set up with one worker that has 8 cores and 24GB memory allocated. My application requires 2 cor

Re: Large Task Size?

2014-07-15 Thread Kyle Ellrott
Yes, this is a proposed patch to MLLib so that you can use 1 RDD to train multiple models at the same time. I am hoping that multiplexing several models in the same RDD will be more efficient than trying to get the Spark scheduler to manage a few 100 tasks simultaneously. I don't think I see st

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Marcelo Vanzin
Have you looked at the slave machine to see if the process has actually launched? If it has, have you tried peeking into its log file? (That error is printed whenever the executors fail to report back to the driver. Insufficient resources to launch the executor is the most common cause of that, bu

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Tathagata Das
The way the HDFS file writing works at a high level is that each attempt to write a partition to a file starts writing to unique temporary file (say, something like targetDirectory/_temp/part-X_attempt-). If the writing into the file successfully completes, then the temporary file is moved

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi Michael, I don't understand the difference between hql (HiveContext) and sql (SQLContext). My previous understanding was that hql is hive specific. Unless the table is managed by Hive, we should use sql. For instance, RDD (hdfsRDD) created from files in HDFS and registered as a table should use
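For context: in Spark 1.0.x, HiveContext extends SQLContext, so hql() vs sql() is a choice of parser rather than of where the data is managed; tables registered from plain HDFS data remain queryable through hql(). A minimal sketch under that assumption (the path is illustrative, and at the time of this thread the nested-array filter needed the #1411 fix mentioned earlier):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HqlVsSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("hql-vs-sql"))

    // HiveContext only changes which parser/analyzer is used; the data can
    // still come from plain files, it does not have to be a Hive-managed table.
    val hive = new HiveContext(sc)

    val people = hive.jsonFile("/tmp/people.json")  // hypothetical nested-JSON input
    people.registerAsTable("people")

    // The HiveQL parser understands nested/array element access,
    // which the basic sql() parser did not at the time of this thread.
    hive.hql("SELECT name FROM people WHERE schools[0].time > 2").collect().foreach(println)
    sc.stop()
  }
}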

Re: Error when testing with large sparse svm

2014-07-15 Thread crater
I got a bit progress. I think the problem is with the "BinaryClassificationMetrics", as long as I comment out all the prediction related metrics, I can run the svm example with my data. So the problem should be there I guess. -- View this message in context: http://apache-spark-user-list.1001

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am not entirely sure off the top of my head. But a possible (usually works) workaround is to define the function as a val instead of a def. For example def func(i: Int): Boolean = { true } can be written as val func = (i: Int) => { true } Hope this helps for now. TD On Tue, Jul 15, 2014 at 9
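A minimal sketch of TD's suggestion; whether it fixes the streaming case depends on what the enclosing class captures, and the names here are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object ValVsDefClosure {
  // A method (def) belongs to its enclosing object/class, so referencing it
  // inside a closure can drag the whole enclosing instance into serialization.
  def isPositiveDef(i: Int): Boolean = i > 0

  // A function value (val) is a standalone, serializable function object.
  val isPositiveVal: Int => Boolean = (i: Int) => i > 0

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("val-vs-def"))
    val n = sc.parallelize(-5 to 5).filter(isPositiveVal).count()
    println(n)
    sc.stop()
  }
}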

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am very curious though. Can you post a concise code example which we can run to reproduce this problem? TD On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das wrote: > I am not entire sure off the top of my head. But a possible (usually > works) workaround is to define the function as a val inste

Spark misconfigured? Small input split sizes in shark query

2014-07-15 Thread David Rosenstrauch
Got a spark/shark cluster up and running recently, and have been kicking the tires on it. However, been wrestling with an issue on it that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.) I ran a simple Hive query (select count ...) against a

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Matei Zaharia
Yeah, this is handled by the "commit" call of the FileOutputFormat. In general Hadoop OutputFormats have a concept called "committing" the output, which you should do only once per partition. In the file ones it does an atomic rename to make sure that the final output is a complete file. Matei

Re: Need help on spark Hbase

2014-07-15 Thread Tathagata Das
Also, it helps if you post us logs, stacktraces, exceptions, etc. TD On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam wrote: > Hi Rajesh, > > I have a feeling that this is not directly related to spark but I might be > wrong. The reason why is that when you do: > >Configuration configuration =

Re: Can we get a spark context inside a mapper

2014-07-15 Thread Rahul Bhojwani
Thanks a lot Sean, Daniel, Matei and Jerry. I really appreciate your reply. And I also understand that I should be a little more patient. When I myself am not able to reply within the next 5 hours, how can I expect the question to be answered in that time? And yes, the idea of using a separate Clusteri
