How to join two PairRDDs together?

2014-08-24 Thread Gefei Li
Hello everyone, I am porting a clustering algorithm to the Spark platform, and I have run into a problem that has confused me for a long time; can someone help me? I have a PairRDD named patternRDD, in which the key represents a number and the value stores information about that key. And I want to use two of
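Joining two pair RDDs on a common key is normally done with RDD.join (via PairRDDFunctions). A minimal Scala sketch with made-up data, since patternRDD's real value type is unknown:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("JoinExample").setMaster("local[2]"))

    // Two pair RDDs keyed by the same numbers; the values are placeholders.
    val patternRDD = sc.parallelize(Seq((1, "info-1"), (2, "info-2")))
    val otherRDD   = sc.parallelize(Seq((1, "extra-1"), (3, "extra-3")))

    // join is an inner join on the key: only keys present in both RDDs survive.
    val joined = patternRDD.join(otherRDD)   // RDD[(Int, (String, String))]
    joined.collect().foreach(println)        // prints (1,(info-1,extra-1))

leftOuterJoin, rightOuterJoin, and cogroup are the variants to reach for when keys may be missing on one side.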

Re: Printing the RDDs in SparkPageRank

2014-08-24 Thread Deep Pradhan
When I add parts(0).collect().foreach(println) and parts(1).collect().foreach(println) to print parts, I get the following error: not enough arguments for method collect: (pf: PartialFunction[Char,B])(implicit bf: scala.collection.generic.CanBuildFrom[String,B,That])That. Unspecified value parame
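For context, the error happens because parts(0) is a String, not an RDD: collect() on a String resolves to Scala's collection method collect(pf: PartialFunction[Char, B]), which requires an argument. A sketch of the distinction, assuming parts comes from splitting an input line:

    val line = "pageA pageB"        // one line of PageRank input, for illustration
    val parts = line.split("\\s+")  // parts: Array[String]

    // Wrong: this is String.collect, Scala's collection method, hence
    // "not enough arguments for method collect":
    // parts(0).collect().foreach(println)

    // Strings are printed directly:
    println(parts(0))
    println(parts(1))

    // A no-argument collect() only exists on RDDs, e.g. linesRDD.collect()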

Re: multiple windows from the same DStream ?

2014-08-24 Thread Tobias Pfeiffer
Hi, computations are triggered by an output operation. No output operation, no computation. Therefore in your code example, On Thu, Aug 21, 2014 at 11:58 PM, Josh J wrote: > > JavaPairReceiverInputDStream messages = > KafkaUtils.createStream(jssc, args[0], args[1], topicM
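Tobias's rule of thumb, sketched in Scala (the thread's code is Java; the socket source here is just a stand-in for the Kafka stream): each windowed stream needs its own output operation, or it is never computed.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TwoWindows").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Two windows over the same DStream; each gets its own output
    // operation (print), because only output operations trigger work.
    lines.window(Seconds(10), Seconds(10)).count().print()
    lines.window(Seconds(60), Seconds(10)).count().print()

    ssc.start()
    ssc.awaitTermination()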

Re: Spark Stream + HDFS Append

2014-08-24 Thread Tobias Pfeiffer
Hi, On Mon, Aug 25, 2014 at 9:56 AM, Dean Chen wrote: > We are using HDFS for log storage where logs are flushed to HDFS every > minute, with a new file created for each hour. We would like to consume > these logs using spark streaming. > > The docs state that new HDFS files will be picked up, but d

Spark Stream + HDFS Append

2014-08-24 Thread Dean Chen
We are using HDFS for log storage where logs are flushed to HDFS every minute, with a new file created for each hour. We would like to consume these logs using Spark Streaming. The docs state that new HDFS files will be picked up, but does Spark Streaming support HDFS appends? — Dean Chen
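For reference, the directory-monitoring API looks like the sketch below (path made up). As far as I know it only picks up files newly created in the directory, not content appended to existing files, so the usual workaround is to write each file elsewhere and atomically move it in once it is complete.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("HdfsLogs")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Sees files created under this directory after the stream starts;
    // appends to already-seen files are not re-read.
    val logs = ssc.textFileStream("hdfs:///logs/incoming")
    logs.count().print()

    ssc.start()
    ssc.awaitTermination()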

pipe raw binary data

2014-08-24 Thread Emeric, Viel
Hello, I am trying to use the RDD pipe method to integrate Spark with external commands to be executed on each partition. My program roughly looks like: rdd.pipe(cmd1).pipe(cmd2) The output of cmd1 and input of cmd2 is raw binary data. However, the pipe method in RDD requires converting data to
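For context, pipe is line-oriented: each element is written to the command's stdin as a line of text, and each line the command prints becomes an element of the resulting RDD, which is why raw binary data does not fit through it directly. A minimal text sketch (the command is just a stand-in):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PipeExample").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq("hello", "world"))

    // One input line per element in, one output element per line out.
    val upper = rdd.pipe("tr a-z A-Z")
    upper.collect().foreach(println)   // HELLO, WORLD

For binary data, a common workaround is a mapPartitions that launches the external process itself and talks to its stdin/stdout as raw byte streams.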

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-24 Thread Xiangrui Meng
Thanks for the reference! Many tests are not designed for big data: http://magazine.amstat.org/blog/2010/09/01/statrevolution/ . So we need to understand which tests are proper. Feel free to create a JIRA and let's move our discussion there. -Xiangrui On Fri, Aug 22, 2014 at 8:44 PM, guxiaobo1982

Re: Return multiple [K,V] pairs from a Java Function

2014-08-24 Thread Sean Owen
You are looking for the method "flatMapToPair". It takes a PairFlatMapFunction, which is something that returns an Iterable of Tuple2 of K,V. You end up with a JavaPairRDD of K and V as desired. On Sun, Aug 24, 2014 at 9:15 PM, Tom wrote: > Hi, > > I would like to create multiple key-value pairs,

Return multiple [K,V] pairs from a Java Function

2014-08-24 Thread Tom
Hi, I would like to create multiple key-value pairs, where all keys can still be reduced. For instance, I have the following 2 lines: "A,B,C" and "B,D". I would like to return the following pairs for the first line: (A,B), (A,C), (B,A), (B,C), (C,A), (C,B), and for the second: (B,D), (D,B). After a reduce by key, I want to end u
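Sean's flatMapToPair suggestion from the reply above, sketched in Scala, where a plain flatMap returning tuples plays the same role:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("Pairs").setMaster("local[2]"))
    val lines = sc.parallelize(Seq("A,B,C", "B,D"))

    // Emit every ordered pair of distinct tokens on each line.
    val pairs = lines.flatMap { line =>
      val tokens = line.split(",")
      for (a <- tokens; b <- tokens if a != b) yield (a, b)
    }

    pairs.collect().foreach(println)  // (A,B), (A,C), (B,A), ..., (B,D), (D,B)

The resulting RDD[(String, String)] can then go straight into reduceByKey.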

Re: amp lab spark streaming twitter example

2014-08-24 Thread Jonathan Haddad
Could you be hitting this? https://issues.apache.org/jira/browse/SPARK-3178 On Sun, Aug 24, 2014 at 10:21 AM, Forest D wrote: > Hi folks, > > I have been trying to run the AMPLab’s twitter streaming example > (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-stream

amp lab spark streaming twitter example

2014-08-24 Thread Forest D
Hi folks, I have been trying to run the AMPLab’s twitter streaming example (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html) in the last 2 days. I have encountered the same error messages as shown below: 14/08/24 17:14:22 ERROR client.AppClient$Clien

Spark Streaming API and Performance Clarifications

2014-08-24 Thread didi
bs"d I am new to the Spark Streaming and have some issues which i can't find any documentation "stuff" to answer them I believe a lot of Spark users in general and Spark Streaming in particular use it for analysis of events by calculation of distributed large aggregations. In case i have to "dige

Re: Printing the RDDs in SparkPageRank

2014-08-24 Thread Jörn Franke
Hi, What kind of error do you receive? Best regards, Jörn On 24 Aug 2014 08:29, "Deep Pradhan" wrote: > Hi, > I was going through the SparkPageRank code and want to see the > intermediate steps, like the RDDs formed in the intermediate steps. > Here is a part of the code along with the lin

Re: Spark SQL Parser error

2014-08-24 Thread S Malligarjunan
Hello Yin, An additional note: when I run ./bin/spark-shell --jars "s3n:/mybucket/myudf.jar", I get the following message in the console: Warning: skipped external jar... Thanks and Regards, Sankar S. On , S Malligarjunan wrote: Hello Yin, I have tried using sc.addJar and hiveContext.sparkContext.addJar

Re: Spark SQL Parser error

2014-08-24 Thread S Malligarjunan
Hello Yin, I have tried using sc.addJar, hiveContext.sparkContext.addJar, and the ./bin/spark-shell --jars option. With all three, when I try to create a temporary function I get a ClassNotFoundException. What would be the issue here? Thanks and Regards, Sankar S. On Saturday, 23 August 2
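For what it's worth, a sketch of the usual sequence with a local copy of the jar (paths, the function name, and the class name are made up); the jar has to reach both the driver and the executors before CREATE TEMPORARY FUNCTION can resolve the class:

    // launched as: ./bin/spark-shell --jars /local/path/myudf.jar
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc is provided by spark-shell

    // Ship the jar to the executors too.
    sc.addJar("/local/path/myudf.jar")

    // The class name is hypothetical; it must exist inside myudf.jar.
    hiveContext.hql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF'")
    hiveContext.hql("SELECT my_udf(col) FROM my_table").collect().foreach(println)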

RE: How to make Spark Streaming write its output so that Impala can read it?

2014-08-24 Thread Silvio Fiorito
One option is to use SparkSQL with HiveContext to insert into a table. That's worked well for me, but you still need to periodically run a refresh on the table in Impala so it sees the new data. From: rafeeq s Sent: 8/24/2014 4:20
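A sketch of that approach (table, field, and stream names are made up); the final REFRESH is run in impala-shell, not in Spark:

    import org.apache.spark.sql.hive.HiveContext

    case class Event(id: Long, msg: String)   // hypothetical record type

    val hive = new HiveContext(sc)            // sc: the job's SparkContext
    import hive.createSchemaRDD               // implicits for case-class RDDs

    // events: DStream[Event] from the preprocessing step (assumed)
    events.foreachRDD { rdd =>
      rdd.registerAsTable("events_batch")     // per-batch temp table
      // Append this batch to a Hive table that Impala can also read.
      hive.hql("INSERT INTO TABLE events SELECT * FROM events_batch")
    }

    // Afterwards, periodically in impala-shell: REFRESH events;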

Re: How to make Spark Streaming write its output so that Impala can read it?

2014-08-24 Thread Sean Owen
As for Impala, subdirectories are typically used with partitions, and so this is a way to read the subdirectories: http://grokbase.com/p/cloudera/impala-user/1387dvdzev/creating-impala-external-tables-from-partitioned-dir-file-structures The catch is that you have to create those partitions at som
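A sketch of registering one such subdirectory as a partition through HiveQL (table name, partition columns, and path are made up); once it is in the shared metastore, Impala sees it after a REFRESH:

    import org.apache.spark.sql.hive.HiveContext

    val hive = new HiveContext(sc)
    // Attach an hour's subdirectory to an external, partitioned table.
    hive.hql("ALTER TABLE logs ADD IF NOT EXISTS " +
      "PARTITION (year=2014, month=8, day=24, hour=17) " +
      "LOCATION 'hdfs:///logs/2014/08/24/17'")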

How to make Spark Streaming write its output so that Impala can read it?

2014-08-24 Thread rafeeq s
I have the following problem with the Spark Streaming API. I am currently streaming input data via Kafka into Spark Streaming, with which I plan to do some preprocessing of the data. Then, I'd like to save the data as Parquet files and query them with Impala. However, Spark is writing the data file
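For reference, a minimal sketch of the write side under those assumptions (record type and paths are made up): writing each batch to its own Parquet directory gives Impala whole files to pick up, e.g. as partitions of an external table.

    import org.apache.spark.sql.SQLContext

    case class Record(key: String, value: String)   // hypothetical schema

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD               // implicits for case-class RDDs

    // records: DStream[Record] after the Kafka preprocessing (assumed)
    records.foreachRDD { (rdd, time) =>
      if (rdd.take(1).nonEmpty) {                   // skip empty batches
        rdd.saveAsParquetFile("hdfs:///data/records/batch-" + time.milliseconds)
      }
    }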