Re: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
I cache the table with hiveContext.cacheTable("tableName") On Tue, Jul 14, 2015 at 5:43 PM, Cheng, Hao wrote: > Can you describe how you cached the tables? In another HiveContext? > AFAIK, cached tables are only visible within the same HiveContext; you > probably need to execute the sql query
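For reference, a minimal sketch of the pattern under discussion, assuming the Thrift server is started on the same HiveContext (the startWithContext approach); the table name is illustrative, not from the thread:

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
hiveContext.cacheTable("tableName")             // cached only within this HiveContext
HiveThriftServer2.startWithContext(hiveContext) // JDBC clients now share that context
```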

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Tathagata Das
I do not recommend using IndexedRDD for state management in Spark Streaming. What it does not solve out of the box is checkpointing of IndexedRDDs, which is important because long-running streaming jobs can lead to an infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey operation which
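For contrast, a minimal updateStateByKey sketch; Spark Streaming periodically checkpoints the state RDDs, cutting the lineage chain that IndexedRDD would not handle out of the box. The stream source and update logic are illustrative:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // required for stateful ops

// keyValueStream: a DStream[(String, Long)] from any source
val runningCounts = keyValueStream.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)
}
```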

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Ted Yu
bq. that is, key-value stores Please consider HBase for this purpose :-) On Tue, Jul 14, 2015 at 5:55 PM, Tathagata Das wrote: > I do not recommend using IndexRDD for state management in Spark Streaming. > What it does not solve out-of-the-box is checkpointing of indexRDDs, which > important be

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
So you’re using different HiveContext instances for the caching. Tables cached in one HiveContext instance are not expected to be visible from another. From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Wednesday, July 15, 2015 8:48 AM To: Cheng, Hao Cc: user Subject: Re: How do you acc

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD and Cody. I saw that. 1. By doing that (foreachRDD), does KafkaDStream checkpoint its offsets on HDFS at the end of each batch interval? 2. In the code, if I first apply transformations and actions on the directKafkaStream and then use foreachRDD on the original KafkaDStream to commit o
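For question 1, a hedged sketch of reading the direct stream's offsets inside foreachRDD: the direct stream does not commit offsets to ZooKeeper, so storing them on your own schedule is up to the application.

```scala
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] =
    rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    // store (topic, partition, fromOffset, untilOffset) somewhere durable
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
```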

rest on streaming

2015-07-14 Thread Chen Song
I have been doing a POC of adding a REST service to a Spark Streaming job. Say I create a stateful DStream X by using updateStateByKey, and each time there is an HTTP request, I want to apply some transformations/actions on the latest RDD of X and collect the results immediately but not scheduled by streaming

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote: > Thanks TD and Cody. I saw that. > > 1. By doing that (foreachRDD), does KafkaDStream checkpoints its offsets > on HDFS at the end of each batch interval? > The timing is not guaranteed. > 2. In the code, if I first apply transformations and a

Re: rest on streaming

2015-07-14 Thread Tathagata Das
You can do this. // global variable to keep track of latest stuff var latestTime = _ var latestRDD = _ dstream.foreachRDD((rdd: RDD[..], time: Time) => { latestTime = time latestRDD = rdd }) Now you can asynchronously access the latest RDD. However if you are going to run jobs on the la
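A compilable version of that sketch, with the assumptions spelled out: @volatile because a REST handler reads from another thread, remember() so the RDD is not cleaned up before an ad-hoc job uses it, and an illustrative element type.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Time}

@volatile var latestTime: Time = null
@volatile var latestRDD: RDD[String] = null

dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
  latestTime = time
  latestRDD = rdd
}
ssc.remember(Minutes(5)) // keep recent RDDs alive for asynchronous jobs
```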

MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Vi Ngo Van
Hi All, I've hit an issue with MLlib when I use LogisticRegressionWithLBFGS. My sample data: 0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1 5689:1 18493:1 44187:1 5694:1 27799:1 12010:1 0 863:1 40646:1 37697:1 1423:1 38648:1 4230:1 23823:1 41594:1 27614:1 5689:1 18493:1 4

Re: rest on streaming

2015-07-14 Thread Chen Song
Thanks TD, that is very useful. On Tue, Jul 14, 2015 at 10:19 PM, Tathagata Das wrote: > You can do this. > > // global variable to keep track of latest stuff > var latestTime = _ > var latestRDD = _ > > > dstream.foreachRDD((rdd: RDD[..], time: Time) => { > latestTime = time > latestRDD

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
Thanks TD. As for 1), if timing is not guaranteed, how are exactly-once semantics supported? It feels like exactly-once receiving is not necessarily exactly-once processing. Chen On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das wrote: > > > On Tue, Jul 14, 2015 at 6:42 PM, Chen Song wrote: >

Re: Spark Streaming - Inserting into Tables

2015-07-14 Thread Tathagata Das
Why is .remember not ideal? On Sun, Jul 12, 2015 at 7:22 PM, Brandon White wrote: > Hi Yin, > > Yes there were no new rows. I fixed it by doing a .remember on the > context. Obviously, this is not ideal. > > On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote: > >> Hi Brandon, >> >> Can you explai
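For reference, the call in question as a sketch (the duration is illustrative): it tells the StreamingContext to keep each batch's RDDs for at least that long, so jobs issued outside the streaming schedule can still use them.

```scala
import org.apache.spark.streaming.Minutes

ssc.remember(Minutes(5))
```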

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Of course, exactly-once receiving is not the same as end-to-end exactly-once semantics. In the case of the direct Kafka stream, the data may actually be pulled multiple times. But even if the data of a batch is pulled twice because of some failure, the final result (that is, transformed data accessed through foreachRDD) will always

Using reference for RDD is safe?

2015-07-14 Thread Abarah
Hello, I am wondering what will happen if I use a reference when transforming an RDD, for example: def func1(rdd: RDD[Int]): RDD[Int] = { rdd.map(x => x * 2) // example transformation, but I am using a more complex function } def main() { val myrdd = sc.parallelize(1 to 100) va
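A self-contained version of that code: passing an RDD by reference is safe, since RDDs are immutable and func1 merely defines a new lineage; nothing executes until an action runs.

```scala
import org.apache.spark.rdd.RDD

def func1(rdd: RDD[Int]): RDD[Int] =
  rdd.map(x => x * 2) // example transformation

val myrdd  = sc.parallelize(1 to 100)
val result = func1(myrdd) // lazy: just a new RDD referencing myrdd
println(result.count())   // the action triggers computation
```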

Re: Stopping StreamingContext before receiver has started

2015-07-14 Thread Tathagata Das
This is a known race condition - root cause of SPARK-5681 On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá < juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I have noticed that when StreamingContext.stop is called when no receiver > has

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread prosp4300
What's the result of "list jar" in both 1.3.1 and 1.4.0? Please check if there is any difference. At 2015-07-15 08:10:44, "ogoh" wrote: >Hello, >I am using SparkSQL along with ThriftServer so that we can access using Hive >queries. >With Spark 1.3.1, I can register a UDF function. But, Spark

Re: Spark Intro

2015-07-14 Thread vinod kumar
Thank you Hafsa On Tue, Jul 14, 2015 at 11:09 AM, Hafsa Asif wrote: > Hi, > I was also in the same situation as we were using MySQL. Let me give some > clarifications: > 1. Spark provides a great methodology for big data analysis. So, if you > want to make your system more analytical and want de

Re: fileStream with old files

2015-07-14 Thread Tathagata Das
It was added, but it's not documented publicly. I am planning to change the name of the conf to spark.streaming.fileStream.minRememberDuration to make it easier to understand. On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole wrote: > A new configuration named *spark.streaming.minRememberDuration* was a
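A hedged sketch of setting the conf under the Spark 1.4 name given in this thread; the duration value and app name are illustrative assumptions.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FileStreamApp")
  .set("spark.streaming.minRememberDuration", "600s")
```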

Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I am trying to join two RDDs, named rdd1 and rdd2. rdd1 is loaded from a text file with about 33000 records. rdd2 is loaded from a table in Cassandra which has about 3 billion records. I tried the following code: ```scala val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...) import o

Re: Ordering of Batches in Spark streaming

2015-07-14 Thread Tathagata Das
This has been discussed in a number of threads on this mailing list. Here is a summary. 1. Processing of batch T+1 always starts after all the processing of batch T has completed. But here a "batch" is defined by the data that all the receivers running in the system receive within the batch interv

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread Okehee Goh
The command "list jar" doesn't seem accepted in beeline with Spark's ThriftServer in both Spark 1.3.1 and Spark1.4. 0: jdbc:hive2://localhost:1> list jar; Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 'list' 'jar' ''; line 1 pos 0 (state=,code=0) Thanks On Tue,

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
Dear all, I have found a post discussing the same thing: https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J The solution is using "joinWithCassandraTable" and the documentation is here: https://github.com/datast
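A sketch of that approach, per the linked spark-cassandra-connector docs; the keyspace, table, and key layout are illustrative, and the RDD elements must match the table's partition key.

```scala
import com.datastax.spark.connector._

val keys = sc.textFile("hdfs:///input/keys.txt").map(line => Tuple1(line.trim))
val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")
// Fetches only rows matching the ~33k keys instead of scanning ~3B rows.
```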

spark cache issue while doing saveAsTextFile and saveAsParquetFile

2015-07-14 Thread mathewvinoj
Hi There, I am using mapPartitions to do some processing and caching the result as below. I am storing the output in both formats (Parquet and text file), and recomputation is happening both times. Even though I put the cache in, it's not working as expected. Below is the code snippet. Any help is really
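A hedged sketch of the intended pattern: persist once, then run both actions. Two caveats that often look like the cache "not working": persist() is lazy, so the first action pays the full cost, and MEMORY_ONLY silently recomputes partitions that don't fit. processPartition and the Record schema are assumptions, not from the original post.

```scala
import org.apache.spark.storage.StorageLevel

case class Record(id: Int, value: String) // illustrative schema

val processed = input.mapPartitions(processPartition) // RDD[Record]
  .persist(StorageLevel.MEMORY_AND_DISK)

processed.saveAsTextFile("hdfs:///out/text") // first action materializes the cache

import sqlContext.implicits._
processed.toDF().write.parquet("hdfs:///out/parquet") // second action reuses it
```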

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread ๏̯͡๏
I have explored Spark joins for the last few months (you can search my posts) and it's frustratingly useless. On Tue, Jul 14, 2015 at 9:35 PM, Wush Wu wrote: > Dear all, > > I have found a post discussing the same thing: > > https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connecto

Re: Java 8 vs Scala

2015-07-14 Thread Tristan Blakers
We have had excellent results operating on RDDs using Java 8 with Lambdas. It’s slightly more verbose than Scala, but I haven’t found this an issue, and haven’t missed any functionality. The new DataFrame API makes the Spark platform even more language agnostic. Tristan On 15 July 2015 at 06:40,

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Wush Wu
I don't understand. By the way, the `joinWithCassandraTable` does improve my query time from 40 mins to 3 mins. 2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) : > I have explored spark joins for last few months (you can search my posts) > and its frustrating useless. > > On Tue, Jul 14, 2015 at 9:35 P

Strange behavior of pyspark with --jars option

2015-07-14 Thread gen tang
Hi, I met some interesting problems with the --jars option. As I use the third-party dependency elasticsearch-spark, I pass this jar with the following command: ./bin/spark-submit --jars path-to-dependencies ... It works well. However, if I use HiveContext.sql, Spark will lose the dependencies that

Re: Research ideas using spark

2015-07-14 Thread Akhil Das
Try to repartition it to a higher number (at least 3-4 times the total # of cpu cores). What operation are you doing? It may happen that if you are doing a join/groupBy sort of operation, the task that is taking a long time has all the values; in that case you need to use a Partitioner which will e
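A sketch of those suggestions; the core count and partition multipliers are illustrative.

```scala
import org.apache.spark.HashPartitioner

val totalCores = 16 // e.g. 4 executors x 4 cores
val spread = rdd.repartition(totalCores * 4)

// For skewed join/groupBy workloads, an explicit partitioner with more
// partitions spreads the shuffle over more tasks:
val grouped = pairRdd.groupByKey(new HashPartitioner(totalCores * 4))
```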

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-14 Thread Deepak Jain
leftOuterJoin and join APIs are super slow in Spark, 100x slower than Hadoop. Sent from my iPhone > On 14-Jul-2015, at 10:59 PM, Wush Wu wrote: > > I don't understand. > > By the way, the `joinWithCassandraTable` does improve my query time > from 40 mins to 3 mins. > > > 2015-07-15 13:19 GMT

Re: Strange behavior of pyspark with --jars option

2015-07-14 Thread Burak Yavuz
Hi, I believe the HiveContext uses a different class loader. It then falls back to the system class loader if it can't find the classes in the context class loader. The system class loader contains the classpath passed through --driver-class-path and spark.executor.extraClassPath. The JVM is alread

Re: creating a distributed index

2015-07-14 Thread Burak Yavuz
Hi Swetha, IndexedRDD is available as a package on Spark Packages. Best, Burak On Tue, Jul 14, 2015 at 5:23 PM, swetha wrote: > Hi Ankur, > > Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark > Streaming to do

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Burak Yavuz
Hi, Is this in LibSVM format? If so, the indices should be sorted in increasing order. It seems like they are not sorted. Best, Burak On Tue, Jul 14, 2015 at 7:31 PM, Vi Ngo Van wrote: > Hi All, > I've met a issue with MLlib when i use LogisticRegressionWithLBFGS > > my sample data : > > *0 86
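A sketch of the standard loading path once the data is fixed as suggested (LibSVM indices must be one-based and sorted ascending per line); the file path and parameters are illustrative.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm.txt")
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(data)
```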

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Vi Ngo Van
This is LibSVM format. I can use this data with the libsvm library. In this sample, they are not sorted. I will sort them and try again. Thank you, On Wed, Jul 15, 2015 at 1:47 PM, Burak Yavuz wrote: > Hi, > > Is this in LibSVM format? If so, the indices should be sorted in > increasing order.

Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Akhil Das
I think any request going to s3*:// requires credentials. If they have made it public (via http) then you won't require the keys. Thanks Best Regards On Wed, Jul 15, 2015 at 2:26 AM, Pagliari, Roberto wrote: > Hi Sujit, > > I just wanted to access public datasets on Amazon. Do I still need
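A hedged sketch of supplying the keys for s3n:// access; the property names are the standard Hadoop s3n ones, and the key values and path are placeholders.

```scala
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
val lines = sc.textFile("s3n://some-public-bucket/path/data.txt")
```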
