Re: Ambiguous references to id : what does it mean ?

2014-07-15 Thread Jaonary Rabarisoa
My query is just a simple query that uses the Spark SQL DSL: tagCollection.join(selectedVideos).where('videoId === 'id) On Tue, Jul 15, 2014 at 6:03 PM, Yin Huai wrote: > Hi Jao, > > Seems the SQL analyzer cannot resolve the references in the Join > condition. What is your query? Did you use

Re: Error: No space left on device

2014-07-15 Thread Chris Gore
Hi Chris, I've encountered this error when running Spark’s ALS methods too. In my case, it was because I set spark.local.dir improperly, and every time there was a shuffle, it would spill many GB of data onto the local drive. What fixed it was setting it to use the /mnt directory, where a net
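
A minimal sketch of that kind of fix for a standalone app, assuming an EC2-style instance-store mount; the paths are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point shuffle spill files at the large instance-store volume
    // instead of the small root disk. A comma-separated list is allowed.
    val conf = new SparkConf()
      .setAppName("ALSJob")
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
    val sc = new SparkContext(conf)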

Re: Error: No space left on device

2014-07-15 Thread Chris DuBois
df -i # on a slave Filesystem Inodes IUsed IFree IUse% Mounted on /dev/xvda1 524288 277701 246587 53% / tmpfs 1917974 1 1917973 1% /dev/shm On Tue, Jul 15, 2014 at 11:39 PM, Xiangrui Meng wrote: > Check the number of inodes (df -i). The as

Re: Error: No space left on device

2014-07-15 Thread Xiangrui Meng
Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois wrote: > Hi all, > > I am encountering the following error: > > INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space > left on devic

Error: No space left on device

2014-07-15 Thread Chris DuBois
Hi all, I am encountering the following error: INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space left on device [duplicate 4] For each slave, df -h looks roughly like this, which makes the above error surprising. Filesystem Size Used Avail Use% Mounted on

Re: Error when testing with large sparse svm

2014-07-15 Thread Xiangrui Meng
Then it may be a new issue. Do you mind creating a JIRA to track this issue? It would be great if you can help locate the line in BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui On Tue, Jul 15, 2014 at 10:56 PM, crater wrote: > I don't really have "my code", I was just runn

Re: akka disassociated on GC

2014-07-15 Thread Xiangrui Meng
Hi Makoto, I don't remember writing that, but thanks for bringing this issue up! There are two important settings to check: 1) driver memory (you can see it from the executor tab), 2) number of partitions (try to use a small number of partitions). I put up two PRs to fix the problem: 1) use broadcast i
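
A rough illustration of the second point only (driver memory itself is usually raised via spark-submit's --driver-memory flag); the loader and numbers below are placeholders:

    import org.apache.spark.mllib.util.MLUtils

    // Fewer partitions means fewer partial aggregates held on the driver at once.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data")  // placeholder path
    val training = data.repartition(32)                            // try a small partition count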

Re: Error when testing with large sparse svm

2014-07-15 Thread crater
I don't really have "my code", I was just running example program in : examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala What I did was simple try this example on a 13M sparse data, and I got the error I posted. Today I managed to ran it after I commented out th

akka disassociated on GC

2014-07-15 Thread Makoto Yui
Hello, (2014/06/19 23:43), Xiangrui Meng wrote: The execution was slow for the larger KDD Cup 2012, Track 2 dataset (235M+ records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the sequential aggregation of dense vectors on a single driver node. It took about 7.6m for aggregation f

Re: Kyro deserialisation error

2014-07-15 Thread Tathagata Das
Are you using classes from external libraries that have not been added to the SparkContext using sparkContext.addJar()? TD On Tue, Jul 15, 2014 at 8:36 PM, Hao Wang wrote: > I am running the WikipediaPageRank in Spark example and share the same > problem with you: > > 4/07/16 11:31:06 DEBUG D
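
If that is the case, a minimal sketch (the jar path is a placeholder):

    // Ship the external library to the executors so Kryo can find the
    // classes when deserializing task results.
    sc.addJar("/path/to/external-library.jar")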

Re: Error when testing with large sparse svm

2014-07-15 Thread Xiangrui Meng
crater, was the error message the same as what you posted before: 14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote Akka client disassociated 14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0) 14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on nod

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
Aah, glad you found it out. TD On Tue, Jul 15, 2014 at 7:52 PM, hsy...@gmail.com wrote: > Thanks Tathagata, we actually found the problem. I created SQLContext and > StreamContext from different SparkContext. But thanks for your help > > Best, > Siyuan > > > On Tue, Jul 15, 2014 at 6:53 PM, T

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
Can you try defining the case class outside the main function, or in fact outside the object? TD On Tue, Jul 15, 2014 at 8:20 PM, srinivas wrote: > Hi TD, > > I uncomment import sqlContext._ and tried to compile the code > > import java.util.Properties > import kafka.producer._ > import org.apac
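
For illustration, a hedged sketch of the layout being suggested (names are made up):

    // Top-level case class: visible to the SQL implicits and serializable
    // without dragging the enclosing object into the closure.
    case class Record(name: String, score: Int)

    object KafkaWordCount {
      def main(args: Array[String]): Unit = {
        // ... build contexts, map the parsed JSON to Record, registerAsTable, etc.
      }
    }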

executor-cores vs. num-executors

2014-07-15 Thread innowireless TaeYun Kim
Hi, when running in yarn-client mode, the following options can be specified: --executor-cores and --num-executors. If we have the following machines: 3 data nodes, 8 cores each node. Which is better? 1. --executor-cores 7 --num-executors 3 (more cores for each executor, leavi

Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Matei Zaharia
Yup, as mentioned in the FAQ, we are aware of multiple deployments running jobs on over 1000 nodes. Some of our proof of concepts involved people running a 2000-node job on EC2. I wouldn't confuse buzz with FUD :). Matei On Jul 15, 2014, at 9:17 PM, Sonal Goyal wrote: > Hi Rohit, > > I thin

Re: can't print DStream after reduce

2014-07-15 Thread Tobias Pfeiffer
Hi, thanks for creating the issue. It feels like in the last week, more or less half of the questions about Spark Streaming rooted in setting the master to "local" ;-) Tobias On Wed, Jul 16, 2014 at 11:03 AM, Tathagata Das wrote: > Aah, right, copied from the wrong browser tab i guess. Thanks

Re: Spark 1.0.1 akka connection refused

2014-07-15 Thread Kevin Jung
UPDATES: It happens only when I use 'case class' and map an RDD to this class in spark-shell. The other RDD transforms, SchemaRDD with parquet files and any SparkSQL operations work fine. Are there any changes related to case class operation between 1.0.0 and 1.0.1? Best regards Kevin

Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Sonal Goyal
Hi Rohit, I think the 3rd question on the FAQ may help you. https://spark.apache.org/faq.html Some other links that talk about building bigger clusters and processing more data: http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf http://apache-spark-us

Re: MLLib - Regularized logistic regression in python

2014-07-15 Thread Yanbo Liang
1) AFAIK the Spark Python API does not supply an interface to set regType and regParam. If you want to customize your own LR model with proper regularization parameters, I strongly recommend using the Scala API. You can reference the following code at spark-1.0.0/python/pyspark/mllib/classification.py. class Log
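
As a rough sketch of that Scala route, assuming the Spark 1.0 MLlib API (the iteration count and regularization value are placeholders):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.optimization.SquaredL2Updater
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def trainRegularizedLR(training: RDD[LabeledPoint]) = {
      val lr = new LogisticRegressionWithSGD()
      // Configure the optimizer with an L2 penalty and a regularization parameter.
      lr.optimizer
        .setNumIterations(100)
        .setRegParam(0.1)
        .setUpdater(new SquaredL2Updater)
      lr.run(training)
    }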

Re: parallel stages?

2014-07-15 Thread Wei Tan
Thanks Sean. In Oozie you can use fork-join; however, when using Oozie to drive Spark jobs, the jobs will not be able to share RDDs (am I right? I think multiple jobs submitted by Oozie will have different contexts). I wonder if Spark will add more workflow features in the future. Best regards, Wei -

Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Rohit Pujari
Hello Folks: There is a lot of buzz in the Hadoop community around Spark's inability to scale beyond 1 TB datasets (or 10-20 nodes). It is being regarded as great tech for CPU-intensive workloads on smaller data (less than a TB) but one that fails to scale and perform effectively on larger datasets. How t

Re: No parallelism in map transformation

2014-07-15 Thread Roch Denis
Well, for what it's worth I found the answer in the Mesos spark documentation: https://github.com/mesos/spark/wiki/Spark-Programming-Guide The quick start guide says to use "--master local[4]" with spark-submit, which implies that it would use more than one processor. However that doesn

Re: Kyro deserialisation error

2014-07-15 Thread Hao Wang
I am running the WikipediaPageRank Spark example and share the same problem with you: 14/07/16 11:31:06 DEBUG DAGScheduler: submitStage(Stage 6) 14/07/16 11:31:06 ERROR TaskSetManager: Task 6.0:450 failed 4 times; aborting job 14/07/16 11:31:06 INFO DAGScheduler: Failed to run foreach at Bagel.s

Re: Spark Streaming Json file groupby function

2014-07-15 Thread srinivas
Hi TD, I uncomment import sqlContext._ and tried to compile the code import java.util.Properties import kafka.producer._ import org.apache.spark.streaming._ import org.apache.spark.streaming.kafka._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.SparkConf import scal

Re: SQL + streaming

2014-07-15 Thread hsy...@gmail.com
Thanks Tathagata, we actually found the problem. I created the SQLContext and StreamingContext from different SparkContexts. But thanks for your help. Best, Siyuan On Tue, Jul 15, 2014 at 6:53 PM, Tathagata Das wrote: > Oh yes, we have run sql, streaming and mllib all together. > > You can take a look
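
For reference, a minimal sketch of the arrangement that works, with both contexts built from the one SparkContext (app name and batch interval are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext(new SparkConf().setAppName("SqlPlusStreaming"))
    val ssc = new StreamingContext(sc, Seconds(10))  // share the one SparkContext...
    val sqlContext = new SQLContext(sc)              // ...instead of creating a second one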

Re: Spark 1.0.1 akka connection refused

2014-07-15 Thread Walrus theCat
I'm getting similar errors on spark streaming -- but at this point in my project I don't need a cluster and can develop locally. Will write it up later, though, if it persists. On Tue, Jul 15, 2014 at 7:44 PM, Kevin Jung wrote: > Hi, > I recently upgrade my spark 1.0.0 cluster to 1.0.1. > But

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Tathagata determined that the reason it was failing was the accidental creation of multiple input streams. Thanks! On Tue, Jul 15, 2014 at 1:09 PM, Walrus theCat wrote: > Will do. > > > On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> This sounds rea

Spark 1.0.1 akka connection refused

2014-07-15 Thread Kevin Jung
Hi, I recently upgraded my spark 1.0.0 cluster to 1.0.1. But it gives me "ERROR remote.EndpointWriter: AssociationError" when I run a simple SparkSQL job in spark-shell. Here is the code: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ case class Person(name:String, Age:Int,

No parallelism in map transformation

2014-07-15 Thread Roch Denis
Hello, Obviously I'm new to spark and I assume I'm missing something really obvious but all my map operations are run on only one processor even if they have many partitions. I've tried to google for the issue but everything seems good, I use local[8] and my file has more than one partition ( chec

Re: can't print DStream after reduce

2014-07-15 Thread Tathagata Das
Aah, right, copied from the wrong browser tab i guess. Thanks! TD On Tue, Jul 15, 2014 at 5:57 PM, Michael Campbell < michael.campb...@gmail.com> wrote: > I think you typo'd the jira id; it should be > https://issues.apache.org/jira/browse/SPARK-2475 "Check whether #cores > > #receivers in loc

Re: Spark Streaming w/ tshark exception problem on EC2

2014-07-15 Thread Tathagata Das
Quick google search of that exception says this occurs when there is an error in the initialization of static methods. Could be some issue related to how dissection is defined. Maybe try putting the function in a different static class that is unrelated to the Main class, which may have other stati

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
You could set them up in the same streaming context. Use a batch interval of 10 seconds, do the 10-second operations on the input stream, then apply a window of 5 minutes on the input stream (assuming the same input stream in both cases) and do the 5-minute operation on the windowed stream. TD
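
A hedged sketch of that setup; the source and the per-batch/per-window operations are placeholders:

    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val input = ssc.socketTextStream("localhost", 9999)  // placeholder source

    // 10-second operation: runs on every batch
    input.count().print()

    // 5-minute operation: runs on a 5-minute window of the same stream
    input.window(Minutes(5), Minutes(5)).count().print()

    ssc.start()
    ssc.awaitTermination()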

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
Oh yes, we have run SQL, streaming and MLlib all together. You can take a look at the demo that Databricks gave at the Spark Summit. I think I see what the problem is. sql("") returns an RDD, and println(rdd) prints only the RDD's name, while rdd.foreach(println) prints
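
To illustrate the distinction, a small sketch; sqlContext and the "records" table name are assumptions:

    val results = sqlContext.sql("SELECT * FROM records")  // assumed table name
    println(results)                     // only prints the RDD's name/description
    results.collect().foreach(println)   // brings the rows to the driver and prints them
    // results.foreach(println) would print on the executors, not in the driver console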

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
You need to have import sqlContext._ so just uncomment that and it should work. TD On Tue, Jul 15, 2014 at 1:40 PM, srinivas wrote: > I am still getting the error...even if i convert it to record > object KafkaWordCount { > def main(args: Array[String]) { > if (args.length < 4) { >

Re: hdfs replication on saving RDD

2014-07-15 Thread Kan Zhang
Andrew, there are overloaded versions of saveAsHadoopFile or saveAsNewAPIHadoopFile that allow you to pass in a per-job Hadoop conf. saveAsTextFile is just a convenience wrapper on top of saveAsHadoopFile. On Mon, Jul 14, 2014 at 11:22 PM, Andrew Ash wrote: > In general it would be nice to be a
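
A hedged sketch of what that can look like for the replication question in this thread; the input RDD, output path and replication factor are placeholders:

    import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3 implicits)
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

    // Per-job Hadoop conf carrying a custom HDFS replication factor.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dfs.replication", "2")

    // Roughly what saveAsTextFile does, but with the custom conf passed through.
    // rdd is an assumed RDD[String].
    rdd.map(line => (NullWritable.get(), new Text(line)))
       .saveAsHadoopFile("hdfs:///path/out",
         classOf[NullWritable], classOf[Text],
         classOf[TextOutputFormat[NullWritable, Text]], jobConf)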

Re: can't print DStream after reduce

2014-07-15 Thread Michael Campbell
I think you typo'd the jira id; it should be https://issues.apache.org/jira/browse/SPARK-2475 "Check whether #cores > #receivers in local mode" On Mon, Jul 14, 2014 at 3:57 PM, Tathagata Das wrote: > The problem is not really for local[1] or local. The problem arises when > there are more inpu

Re: can't print DStream after reduce

2014-07-15 Thread Michael Campbell
Thank you Tathagata. This had me going for far too long. On Mon, Jul 14, 2014 at 3:57 PM, Tathagata Das wrote: > The problem is not really for local[1] or local. The problem arises when > there are more input streams than there are cores. > But I agree, for people who are just beginning to use

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently using a tarball build, but I'll do a spark build with the patch as soon as I have a chance and test it out. Keith On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang wrote: > Hi Keith & gorenuru, > > This patch (https://git

Re: Multiple streams at the same time

2014-07-15 Thread gorenuru
Because I want to have different streams with different durations. For example, one triggers snapshot analysis every 5 minutes and another every 10 seconds. On Tue, Jul 15, 2014 at 3:59 pm, Tathagata Das [via Apache Spark User List] wrote:

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
Good catch. I thought the largest port number is 65535. Cheers On Tue, Jul 15, 2014 at 4:33 PM, Spark DevUser wrote: > Are you able to launch hbase shell and run some commands (list, > describe, scan, etc)? Seems configuration.set("hbase.master", > "localhost:60") is wrong. > > > On

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-15 Thread Tathagata Das
Yes, what Nick said is the recommended way. In most use cases, a Spark Streaming program in production is not run from the shell. Hence, we chose not to make the external stuff (twitter, kafka, etc.) available to the spark shell, to avoid the dependency conflicts brought in by them with Spark's depen

Re: Need help on spark Hbase

2014-07-15 Thread Spark DevUser
Are you able to launch hbase shell and run some commands (list, describe, scan, etc.)? Seems configuration.set("hbase.master", "localhost:60") is wrong. On Tue, Jul 15, 2014 at 3:00 PM, Tathagata Das wrote: > Also, it helps if you post us logs, stacktraces, exceptions, etc. > > TD > >
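
For comparison, a hedged sketch of a more typical HBase client configuration; the ports and quorum are assumptions for a default local install of that era:

    import org.apache.hadoop.hbase.HBaseConfiguration

    val hbaseConf = HBaseConfiguration.create()
    // The HMaster RPC port defaults to 60000, not 60.
    hbaseConf.set("hbase.master", "localhost:60000")
    // Region lookups normally go through ZooKeeper rather than hbase.master.
    hbaseConf.set("hbase.zookeeper.quorum", "localhost")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")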

Re: SQL + streaming

2014-07-15 Thread hsy...@gmail.com
By the way, have you ever run SQL and streaming together? Do you know of any example that works? Thanks! On Tue, Jul 15, 2014 at 4:28 PM, hsy...@gmail.com wrote: > Hi Tathagata, > > I could see the output of count, but no sql results. Run in standalone is > meaningless for me and I just run in my loc

Re: SQL + streaming

2014-07-15 Thread hsy...@gmail.com
Hi Tathagata, I could see the output of count, but no SQL results. Running standalone is meaningless for me, and I just run on my local single-node YARN cluster. Thanks On Tue, Jul 15, 2014 at 12:48 PM, Tathagata Das wrote: > Could you run it locally first to make sure it works, and you see outp

Re: Submitting to a cluster behind a VPN, configuring different IP address

2014-07-15 Thread Aris Vlasakakis
Hello! Just curious if anybody could respond to my original message, if anybody knows how to set the configuration variables that are handled by Jetty and not Spark's native framework, which is Akka I think? Thanks On Thu, Jul 10, 2014 at 4:04 PM, Aris Vlasakakis wrote: > Hi Spark folks

Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-15 Thread Aris
Thanks Soumya - I guess the next step from here is to move the MLlib model out of the Spark application which simply does the training, and give it to the client application which simply does the predictions. I will try the Kryo library to physically serialize the object and trade it across machines /

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
Hi Keith & gorenuru, This patch (https://github.com/apache/spark/pull/1423) solves the errors for me in my local tests. If possible, can you guys test this out to see if it solves your test programs? Thanks, Zongheng On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang wrote: > - user@incubator > > H

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Why do you need to create multiple streaming contexts at all? TD On Tue, Jul 15, 2014 at 3:43 PM, gorenuru wrote: > Oh, sad to hear that :( > From my point of view, creating separate spark context for each stream is > to > expensive. > Also, it's annoying because we have to be responsible for

Re: Multiple streams at the same time

2014-07-15 Thread gorenuru
Oh, sad to hear that :( From my point of view, creating a separate spark context for each stream is too expensive. Also, it's annoying because we have to be responsible for proper akka and UI port determination for each context. Do you know about any plans to support it?

Re: Multiple streams at the same time

2014-07-15 Thread Tathagata Das
Creating multiple StreamingContexts using the same SparkContext is currently not supported. :) Guess it was not clear in the docs. Note to self. TD On Tue, Jul 15, 2014 at 1:50 PM, gorenuru wrote: > Hi everyone. > > I have some problems running multiple streams at the same time. > > What i am

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
hql and sql are just two different dialects for interacting with data. After parsing is complete and the logical plan is constructed, the execution is exactly the same. On Tue, Jul 15, 2014 at 2:50 PM, Jerry Lam wrote: > Hi Michael, > > I don't understand the difference between hql (HiveContex

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
- user@incubator Hi Keith, I did reproduce this using local-cluster[2,2,1024], and the errors look almost the same. Just wondering, despite the errors did your program output any result for the join? On my machine, I could see the correct output. Zongheng On Tue, Jul 15, 2014 at 1:46 PM, Micha

Re: Can we get a spark context inside a mapper

2014-07-15 Thread Rahul Bhojwani
Thanks a lot Sean, Daniel, Matei and Jerry. I really appreciate your replies. And I also understand that I should be a little more patient. When I myself am not able to reply within the next 5 hours, how can I expect a question to be answered in that time? And yes, the idea of using a separate Clusteri

Re: Need help on spark Hbase

2014-07-15 Thread Tathagata Das
Also, it helps if you post us logs, stacktraces, exceptions, etc. TD On Tue, Jul 15, 2014 at 10:07 AM, Jerry Lam wrote: > Hi Rajesh, > > I have a feeling that this is not directly related to spark but I might be > wrong. The reason why is that when you do: > >Configuration configuration =

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Matei Zaharia
Yeah, this is handled by the "commit" call of the FileOutputFormat. In general Hadoop OutputFormats have a concept called "committing" the output, which you should do only once per partition. In the file ones it does an atomic rename to make sure that the final output is a complete file. Matei

Spark misconfigured? Small input split sizes in shark query

2014-07-15 Thread David Rosenstrauch
Got a spark/shark cluster up and running recently, and have been kicking the tires on it. However, been wrestling with an issue on it that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.) I ran a simple Hive query (select count ...) against a

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am very curious though. Can you post a concise code example which we can run to reproduce this problem? TD On Tue, Jul 15, 2014 at 2:54 PM, Tathagata Das wrote: > I am not entire sure off the top of my head. But a possible (usually > works) workaround is to define the function as a val inste

Re: NotSerializableException in Spark Streaming

2014-07-15 Thread Tathagata Das
I am not entirely sure off the top of my head. But a possible (usually works) workaround is to define the function as a val instead of a def. For example def func(i: Int): Boolean = { true } can be written as val func = (i: Int) => { true } Hope this helps for now. TD On Tue, Jul 15, 2014 at 9
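
A slightly fuller sketch of the same workaround in a streaming job; lines is an assumed DStream[String]:

    // def version: calling this from a closure can capture the enclosing
    // (possibly non-serializable) object.
    def isLong(s: String): Boolean = s.length > 10

    // val version: a plain function object, serialized on its own.
    val isLongF: String => Boolean = (s: String) => s.length > 10

    lines.filter(isLongF).print()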

Re: Error when testing with large sparse svm

2014-07-15 Thread crater
I made a bit of progress. I think the problem is with "BinaryClassificationMetrics": as long as I comment out all the prediction-related metrics, I can run the SVM example with my data. So the problem should be there, I guess.

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi Michael, I don't understand the difference between hql (HiveContext) and sql (SQLContext). My previous understanding was that hql is Hive-specific. Unless the table is managed by Hive, we should use sql. For instance, an RDD (hdfsRDD) created from files in HDFS and registered as a table should use

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Tathagata Das
The way the HDFS file writing works at a high level is that each attempt to write a partition to a file starts writing to unique temporary file (say, something like targetDirectory/_temp/part-X_attempt-). If the writing into the file successfully completes, then the temporary file is moved

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Marcelo Vanzin
Have you looked at the slave machine to see if the process has actually launched? If it has, have you tried peeking into its log file? (That error is printed whenever the executors fail to report back to the driver. Insufficient resources to launch the executor is the most common cause of that, bu

Re: Large Task Size?

2014-07-15 Thread Kyle Ellrott
Yes, this is a proposed patch to MLlib so that you can use one RDD to train multiple models at the same time. I am hoping that multiplexing several models in the same RDD will be more efficient than trying to get the Spark scheduler to manage a few hundred tasks simultaneously. I don't think I see st

can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Matt Work Coarr
Hello spark folks, I have a simple spark cluster setup but I can't get jobs to run on it. I am using the standalone mode. One master, one slave. Both machines have 32GB of RAM and 8 cores. The slave is set up with one worker that has 8 cores and 24GB of memory allocated. My application requires 2 cor

Re: Recommended pipeline automation tool? Oozie?

2014-07-15 Thread Dean Wampler
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie On Thu, Jul 10, 2014 at 5:52 PM, Andrei wrote: > I used both - Oozie and Luigi - but found them inflexible and stil

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Mingyu Kim
Thanks for the explanation, guys. I looked into the saveAsHadoopFile implementation a little bit. If you see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala at line 843, the HDFS write happens at per-partition processing, not at the resu

Re: Cassandra driver Spark question

2014-07-15 Thread Tathagata Das
Can you find out what class is causing the NotSerializable exception? In fact, you can enable extended serialization debugging to figure out the object structure through the foreachRDD's
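
One hedged way to turn that on, assuming Spark 1.0-style extraJavaOptions and a Sun/Oracle JVM:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")
      // For the driver JVM this flag is usually passed via spark-submit or
      // spark-defaults.conf, since the driver is already running when SparkConf is read.
      .set("spark.driver.extraJavaOptions",
           "-Dsun.io.serialization.extendedDebugInfo=true")
    // The resulting NotSerializableException then includes the object graph
    // that dragged the offending class into the closure.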

Re: Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Burak Yavuz
Hi Tom, If you wish to load the file in Spark directly, you can use sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your SparkContext. This can be done because the files should be publicly available and you don't need AWS Credentials to access them. If you want to download the fi

Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Tom
Hi, I would like to use the dataset used in the Big Data Benchmark on my own cluster, to run some tests between Hadoop and Spark. The dataset should be available at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix],

Re: Help with Json array parsing

2014-07-15 Thread SK
To add to my previous post, the error at runtime is the following: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: org.json4s.package$MappingException: Expect

Multiple streams at the same time

2014-07-15 Thread gorenuru
Hi everyone. I have some problems running multiple streams at the same time. What I am doing is: object Test { import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.api.java.function._ import org.apache.spark.streaming._ import

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Thanks for the extra info. At a quick glance the query plan looks fine to me. The class IntegerType does build a type tag. I wonder if you are seeing the Scala issue manifest in some new way. We will attempt to reproduce locally. On Tue, Jul 15, 2014 at 1:41 PM, gorenuru wrote: > Just my

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread gorenuru
Just my "few cents" on this. I having the same problems with v 1.0.1 but this bug is sporadic and looks like is relayed to object initialization. Even more, i'm not using any SQL or something. I just have utility class like this: object DataTypeDescriptor { type DataType = String val BOOLE

Re: Spark Streaming Json file groupby function

2014-07-15 Thread srinivas
I am still getting the error... even if I convert it to Record. object KafkaWordCount { def main(args: Array[String]) { if (args.length < 4) { System.err.println("Usage: KafkaWordCount ") System.exit(1) } //StreamingExamples.setStreamingLogLevels() val Array(zkQuorum

Re: parallel stages?

2014-07-15 Thread Sean Owen
The last two lines are what trigger the operations, and they will each block until the result is computed and saved. So if you execute this code as-is, no. You could write a Scala program that invokes these two operations in parallel, like: Array((wc1,"titles.out"), (wc2,"tables.out")).par.foreach
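
Filling that out into a hedged, self-contained sketch; the input paths are placeholders:

    import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3 implicits)

    val file1 = sc.textFile("/data/titles")  // placeholder inputs
    val file2 = sc.textFile("/data/tables")
    val wc1 = file1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1)
    val wc2 = file2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 1)

    // Each saveAsTextFile blocks, so submit the two jobs from parallel driver
    // threads; the Spark scheduler can then run their stages concurrently.
    Array((wc1, "titles.out"), (wc2, "tables.out")).par.foreach {
      case (rdd, path) => rdd.saveAsTextFile(path)
    }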

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Michael Armbrust
No, that is why I included the link to SPARK-2096 as well. You'll need to use HiveQL at this time. Is it possible or planned to support the "schools.time" format to filter the >> records such that there is an element inside the array of schools satisfying the time

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Sure thing. Here you go: == Logical Plan == Sort [key#0 ASC] Project [key#0,value#1,value#2] Join Inner, Some((key#0 = key#3)) SparkLogicalPlan (ExistingRdd [key#0,value#1], MapPartitionsRDD[2] at mapPartitions at basicOperators.scala:176) SparkLogicalPlan (ExistingRdd [value#2,key#3], M

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Bertrand Dechoux
I haven't looked at the implementation, but what you would do with any filesystem is write to a file inside the workspace directory of the task. Then only the attempt of the task that should be kept performs a move to the final path. The other attempts are simply discarded. For most filesystem

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Will do. On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das wrote: > This sounds really really weird. Can you give me a piece of code that I > can run to reproduce this issue myself? > > TD > > > On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat > wrote: > >> This is (obviously) spark streaming, by

Re: Ideal core count within a single JVM

2014-07-15 Thread lokesh.gidra
What you said makes sense. But when I proportionately reduce the heap size, the problem still persists. For instance, if I use a 160 GB heap for 48 cores and an 80 GB heap for 24 cores, the 24-core run still performs better than the 48-core run (although worse than 24 cores with the 160 GB heap)

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Michael Armbrust
Can you print out the queryExecution? (i.e. println(sql().queryExecution)) On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons wrote: > To give a few more details of my environment in case that helps you > reproduce: > > * I'm running spark 1.0.1 downloaded as a tar ball, not built myself > *

Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Tathagata Das
This sounds really really weird. Can you give me a piece of code that I can run to reproduce this issue myself? TD On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat wrote: > This is (obviously) spark streaming, by the way. > > > On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat > wrote: > >> Hi, >

Help with Json array parsing

2014-07-15 Thread SK
Hi, I have a json file where the object definition in each line includes an array component "obj" that contains 0 or more elements as shown by the example below. {"name": "16287e9cdf", "obj": [{"min": 50,"max": 59 }, {"min": 20, "max": 29}]}, {"name": "17087e9cdf", "obj": [{"min": 30,"max": 3

Re: Spark Streaming Json file groupby function

2014-07-15 Thread Tathagata Das
I see you have the code to convert to the Record class but commented it out. That is the right way to go. When you convert it to a 4-tuple with "(data("type"), data("name"), data("score"), data("school"))" ... it's of type (Any, Any, Any, Any) since data("xyz") returns Any. And registerAsTable probab
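
A hedged sketch of the conversion being described; the field names, JSON shape, and the parsed input RDD are assumptions based on the snippet:

    import org.apache.spark.sql.SQLContext

    // Top-level case class, as suggested earlier in the thread.
    // `type` is a reserved word in Scala, so that field is renamed to kind;
    // score is assumed to be integral.
    case class Record(kind: String, name: String, score: Int, school: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

    // parsed is an assumed RDD[Map[String, Any]] built from the JSON lines.
    val records = parsed.map { data =>
      Record(data("type").toString, data("name").toString,
             data("score").toString.toInt, data("school").toString)
    }
    records.registerAsTable("records")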

Re: SQL + streaming

2014-07-15 Thread Tathagata Das
Could you run it locally first to make sure it works, and you see output? Also, I recommend going through the previous step-by-step approach to narrow down where the problem is. TD On Mon, Jul 14, 2014 at 9:15 PM, hsy...@gmail.com wrote: > Actually, I deployed this on yarn cluster(spark-submit

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-15 Thread Tathagata Das
On second thought I am not entirely sure whether that bug is the issue. Are you continuously appending to the file that you have copied to the directory? Because filestream works correctly when the files are atomically moved to the monitored directory. TD On Mon, Jul 14, 2014 at 9:08 PM, Madabha

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Sean!! That's what I was looking for -- group by on multiple fields. I'm gonna play with it now. Thanks again!

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
To give a few more details of my environment in case that helps you reproduce: * I'm running spark 1.0.1 downloaded as a tar ball, not built myself * I'm running in stand alone mode, with 1 master and 1 worker, both on the same machine (though the same error occurs with two workers on two machines

parallel stages?

2014-07-15 Thread Wei Tan
Hi, I wonder, if I do wordcount on two different files, like this: val file1 = sc.textFile("/...") val file2 = sc.textFile("/...") val wc1 = file1.flatMap(...).reduceByKey(_ + _, 1) val wc2 = file2.flatMap(...).reduceByKey(_ + _, 1) wc1.saveAsTextFile("titles.out") wc2.saveAsTextFile("tables.out") Wou

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Andrew Ash
Hi Nan, Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar. For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another t

Re: getting ClassCastException on collect()

2014-07-15 Thread _soumya_
Not sure I can help, but I ran into the same problem. Basically my use case is that I have a List of strings, which I then convert into an RDD using sc.parallelize(). This RDD is then operated on by the foreach() function. Same as you, I get a runtime exception: java.lang.ClassCastException: c

Re: "the default GraphX graph-partition strategy on multicore machine"?

2014-07-15 Thread Ankur Dave
On Jul 15, 2014, at 12:06 PM, Yifan LI wrote: > Btw, is there any possibility to customise the partition strategy as we > expect? I'm not sure I understand. Are you asking about defining a custom

count vs countByValue in for/yield

2014-07-15 Thread Ognen Duzlevski
Hello, I am curious about something: val result = for { (dt,evrdd) <- evrdds val ct = evrdd.count } yield (dt->ct) works. val result = for { (dt,evrdd) <- evrdds val ct = evrdd.countByValue } yield (dt->ct) does not work. I get: 14/07/15 16:46:33 WARN TaskSetMa

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
FWIW, I am unable to reproduce this using the example program locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons wrote: > Nope. All of them are registered from the driver program. > > However, I think we've found the culprit. If the join column between two > tables is not in the same colu

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
Hi guys, Sorry, I'm also interested in this nested json structure. I have a similar SQL in which I need to query a nested field in a json. Does the above query works if it is used with sql(sqlText) assuming the data is coming directly from hdfs via sqlContext.jsonFile? The SPARK-2483

Re: Count distinct with groupBy usage

2014-07-15 Thread Sean Owen
If you are counting per time and per page, then you need to group by time and page not just page. Something more like: csv.groupBy(csv => (csv(0),csv(1))) ... This gives a list of users per (time,page). As Nick suggests, then you count the distinct values for each key: ... .mapValues(_.distinct.
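
Spelling that out as a hedged sketch; the column layout (0 = time, 1 = page, 2 = user) is assumed:

    import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3 implicits)

    // csv is an assumed RDD[Array[String]] of parsed rows.
    val distinctUsers = csv
      .groupBy(fields => (fields(0), fields(1)))      // key by (time, page)
      .mapValues(rows => rows.map(_(2)).toSet.size)   // distinct users per key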

Re: Large Task Size?

2014-07-15 Thread Aaron Davidson
Ah, I didn't realize this was non-MLLib code. Do you mean to be sending stochasticLossHistory in the closure as well? On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott wrote: > It uses the standard SquaredL2Updater, and I also tried to broadcast it as > well. > > The input is a RDD created by takin

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is not in the same column position in both tables, it triggers what appears to be a bug. For example, this program fails: import org.apache.spark.SparkConte

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Nan Zhu
Hi Mingyuan, According to my understanding, Spark processes the result generated from each partition by passing it to a resultHandler (SparkContext.scala L1056). This resultHandler usually just puts the result in a driver-side array, the length of which is always partitions.size. this d

Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
That is correct, Raffy. Assume I convert the timestamp field to date and in the required format; is it possible to report it by date?

  1   2   >