Re: Spark 1.4.0 compute-classpath.sh

2015-07-15 Thread Lokesh Kumar Padhnavis
Thanks a lot :) On Wed, Jul 15, 2015 at 11:48 PM Marcelo Vanzin wrote: > That has never been the correct way to set your app's classpath. > > Instead, look at http://spark.apache.org/docs/latest/configuration.html > and search for "extraClassPath". > > On Wed, Jul 15, 2015 at 9:43 AM, lokeshkumar
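
A minimal sketch of the extraClassPath approach Marcelo points to (the jar path below is a placeholder, not from the original thread):

    # spark-defaults.conf
    spark.driver.extraClassPath    /opt/extra-libs/my-dep.jar
    spark.executor.extraClassPath  /opt/extra-libs/my-dep.jar

    # or equivalently on the spark-submit command line
    spark-submit \
      --conf spark.driver.extraClassPath=/opt/extra-libs/my-dep.jar \
      --conf spark.executor.extraClassPath=/opt/extra-libs/my-dep.jar \
      ...

These properties are read at launch time, which is why they belong in spark-defaults.conf or on the command line rather than in compute-classpath.sh edits.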

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread madhu phatak
Hi, Thank you. On Wed, Jul 15, 2015 at 9:07 PM, Burak Yavuz wrote: > Hi, > There is no MLlib support in SparkR in 1.4. There will be some support in > 1.5. You can check these JIRAs for progress: > https://issues.apache.org/jira/browse/SPARK-6805 > https://issues.apache.org/jira/browse/SPARK-682

RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
You can try selectExpr() of DataFrame. For example, y <- selectExpr(df, "concat(hashtags.text[0],hashtags.text[1])") # the [] operator extracts an item from an array. Or: sql(hiveContext, "select concat(hashtags.text[0],hashtags.text[1]) from table"). Yeah, the documentation of SparkR is
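
For readers on the Scala DataFrame API, the same expression works there too; a sketch, assuming df was loaded from the JSON file and has a hashtags.text array column ("tweets" is a placeholder table name):

    // extract and concatenate two array elements with selectExpr
    val y = df.selectExpr("concat(hashtags.text[0], hashtags.text[1])")

    // or via SQL after registering a temp table
    df.registerTempTable("tweets")
    val z = sqlContext.sql("select concat(hashtags.text[0], hashtags.text[1]) from tweets")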

Spark streaming Processing time keeps increasing

2015-07-15 Thread N B
Hello, We have a Spark streaming application and the problem that we are encountering is that the batch processing time keeps on increasing and eventually causes the application to start lagging. I am hoping that someone here can point me to any underlying cause of why this might happen. The batc

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread N B
Hi Jon, In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used interchangeably. If you are trying to collect multiple batches across a DStream into a single RDD, look at the window() operations. Hope this helps. Nikunj On Wed, Jul 15, 2015 at 7:00 PM, Jon Chase wrote: > I should
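
A sketch of the window() suggestion, with illustrative durations and a placeholder source (none of these values come from the original thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("window-example")   // placeholder app name
    val ssc = new StreamingContext(sparkConf, Seconds(10))         // 10-second batches
    val lines = ssc.socketTextStream("localhost", 9999)            // placeholder source
    // each foreachRDD call now sees one RDD covering the last six batches
    lines.window(Seconds(60), Seconds(60)).foreachRDD { rdd =>
      println(rdd.count())
    }
    ssc.start()
    ssc.awaitTermination()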

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Ted Yu
Looks like this method should serve Jon's needs: def reduceByWindow( reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration On Wed, Jul 15, 2015 at 8:23 PM, N B wrote: > Hi Jon, > > In Spark streaming, 1 batch = 1 RDD. Essentially, the terms are used > in
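
A sketch of how that signature is used, assuming counts is a DStream[Long] built from earlier transformations (the durations are illustrative):

    import org.apache.spark.streaming.Seconds

    val summed = counts.reduceByWindow(
      (a, b) => a + b,      // reduceFunc: (T, T) => T
      Seconds(60),          // windowDuration
      Seconds(60))          // slideDuration
    // summed holds one aggregated value per window interval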

Re: fileStream with old files

2015-07-15 Thread Terry Hole
Hi Hunter, What behavior do you see with HDFS? The local file system and HDFS should have the same behavior. Thanks! - Terry. Hunter Morgan wrote on Thursday, July 16, 2015 at 2:04 AM: > After moving the setting of the parameter to SparkConf initialization > instead of after the context is already i
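
For reference, a sketch of the fileStream variant that also picks up pre-existing files, assuming ssc is the StreamingContext (the directory and input types are placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///data/incoming",        // placeholder directory
      (path: Path) => true,           // accept every file
      newFilesOnly = false)           // also process files already present at start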

Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread Ted Yu
From log file: 15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Creating symlink: /tmp/hadoop-root/mapred/local/1437016615898/user_agents <- /opt/HiBench-master/user_agents 15/07/16 11:16:56 INFO mapred.LocalDistributedCacheManager: Localized hdfs://spark-study:9000/HiBench/Aggregation

Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread luohui20001
Hi Ted, Thanks for your advice. I found that there is something wrong with the "hadoop fs -get" command, because I believe the localization of hdfs://spark-study:9000/HiBench/Aggregation/temp/user_agents to /tmp/hadoop-root/mapred/local/1437016615898/user_agents is a behaviour like "hadoop fs -g

RE: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Cheng, Hao
Have you ever tried querying “select * from temp_table” from the spark shell? Or can you try the option --jars while starting the spark shell? From: Srikanth [mailto:srikanth...@gmail.com] Sent: Thursday, July 16, 2015 9:36 AM To: user Subject: Re: HiveThriftServer2.startWithContext error with reg
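
A sketch of the setup under discussion, as it would look inside spark-shell on 1.4 (the input path and table name are placeholders; extra jars would be passed via spark-shell --jars as Hao suggests):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.read.json("/path/to/input.json")     // placeholder input
    df.registerTempTable("temp_table")
    hiveContext.sql("select * from temp_table").show()        // sanity check in the shell
    HiveThriftServer2.startWithContext(hiveContext)           // expose temp_table over JDBC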

Re: Java 8 vs Scala

2015-07-15 Thread spark user
I struggled a lot with Scala, almost 10 days with no improvement, but when I switched to Java 8, things were so smooth, and I used DataFrames with Redshift and Hive and all are looking good. If you are very good in Scala then go with Scala, otherwise Java is the best fit. This is just my opinion because I am Ja

Running foreach on a list of rdds in parallel

2015-07-15 Thread Brandon White
Hello, I have a list of rdds List(rdd1, rdd2, rdd3,rdd4) I would like to save these rdds in parallel. Right now, it is running each operation sequentially. I tried using a rdd of rdd but that does not work. list.foreach { rdd => rdd.saveAsTextFile("/tmp/cache/") } Any ideas?

Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Davies Liu
sc.union(rdds).saveAsTextFile() On Wed, Jul 15, 2015 at 10:37 PM, Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried using a rdd of rdd but that do
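
A sketch of that suggestion, assuming rdd1 through rdd4 are the RDD[String] values from the original question (the output path is taken from the original post):

    val rdds = List(rdd1, rdd2, rdd3, rdd4)
    // one union, one job, one output directory, instead of four sequential saves
    sc.union(rdds).saveAsTextFile("/tmp/cache/")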

Re: Running foreach on a list of rdds in parallel

2015-07-15 Thread Vetle Leinonen-Roeim
On Thu, Jul 16, 2015 at 7:37 AM Brandon White wrote: > Hello, > > I have a list of rdds > > List(rdd1, rdd2, rdd3,rdd4) > > I would like to save these rdds in parallel. Right now, it is running each > operation sequentially. I tried using a rdd of rdd but that does not work. > > list.foreach { rd
