You can do a join between a streaming dataset and a static dataset, and I
would prefer your first approach. The problem with this approach is
performance: unless you cache the static dataset, every time you fire the
join it will fetch the latest records from the table.
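A rough sketch of that idea (the table name, stream source and Event class
are my own assumptions, not your code): cache the static side once, so each
micro-batch join reuses it instead of re-reading the table.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Event(id: Int, value: String)

val conf = new SparkConf().setAppName("stream-static-join").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.implicits._

// Static side: read once, cache, and force materialization with count().
val staticDf = sqlContext.table("lookup").cache()
staticDf.count()

// Hypothetical stream of "id,value" lines; join each micro-batch against the cached table.
ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(id, v) = line.split(","); Event(id.toInt, v) }
  .foreachRDD { rdd => rdd.toDF().join(staticDf, "id").show() }

ssc.start()
ssc.awaitTermination()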
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
Hi Sushma,
Can you try a left anti join as below? In my example, name and id together
make up the key.
df1.alias("a").join(df2.alias("b"),
col("a.name").equalTo(col("b.name"))
.and(col("a.id").equalTo(col("b.id"))) ,
"left_anti").selectExpr("name", "id").show(10, false)
Alternatively, you can extend SQLContext to add your own plans. I am not
sure it fits your requirement, but I am mentioning it to highlight an
option.
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Thu, Jul 7, 2016 at 8:39 PM, tan shai wrote:
> That w
Yes, ultimately it will be converted to an RDD internally. However,
DataFrame queries are passed through Catalyst, which provides several
optimizations (e.g. code generation and smarter shuffles) that you do not
get with plain RDDs.
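A quick way to see Catalyst at work is explain(); a small sketch, assuming
a spark-shell style sqlContext and placeholder data:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(Seq(Person("a", 30), Person("b", 20)))
// The extended output shows the analyzed, optimized and physical plans;
// note how the filter ends up below the projection after optimization.
df.select("name", "age").filter(df("age") > 25).explain(true)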
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
unhandled predicates?
>
> Telmo Rodrigues
>
> On 11/05/2016, at 09:49, Rishi Mishra
> wrote:
>
> It does push the predicate. But as relations are generic and may or may
> not handle some of the predicates, Spark needs to apply a filter for the
> un-handled predicates.
It does push the predicate. But as relations are generic and may or may
not handle some of the predicates, Spark needs to apply a filter for the
un-handled predicates.
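For anyone curious how a source reports this, here is a minimal sketch of a
hypothetical relation (not from this thread) that claims to handle only
equality on "id"; whatever it returns from unhandledFilters is re-applied by
Spark as a Filter above the scan:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class MyRelation(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // Predicates we cannot evaluate ourselves; Spark filters these again.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot { case EqualTo("id", _) => true; case _ => false }

  // A real source would apply the handled filters while scanning.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}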
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Wed, May 11, 2016
As you have the same partitioner and number of partitions, you can probably
use zipPartitions and provide a user-defined function to subtract.
A very primitive example is below.
val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7)
val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6)
val rdd1 = sc.parallelize(data1, 2)
val rdd2 = sc.parallelize(data2, 2)
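A sketch of the rest, assuming a per-partition set difference is what is
wanted:
// For each pair of co-located partitions, drop the pairs of rdd1 that
// also appear in the corresponding partition of rdd2.
val diff = rdd1.zipPartitions(rdd2) { (it1, it2) =>
  val other = it2.toSet
  it1.filterNot(other.contains)
}
diff.collect().foreach(println)   // leaves only (7,7) with this data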
Hi Harsh,
You probably need to maintain some state for your values, since you are
updating some of the keys in each batch and checking a global state of your
equation.
Can you look at the mapWithState API of DStream?
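A minimal sketch of mapWithState (the socket source and the running-sum
logic are just stand-ins for your case):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("state").setMaster("local[2]"), Seconds(5))
ssc.checkpoint("/tmp/state-checkpoint")   // mapWithState needs checkpointing

// Hypothetical input: "key,value" lines arriving on a socket.
val pairs = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(k, v) = line.split(","); (k, v.toInt) }

// Keep a running sum per key; each batch emits the updated sum.
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}
pairs.mapWithState(StateSpec.function(mappingFunc)).print()

ssc.start()
ssc.awaitTermination()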
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Your mail does not describe much, but won't a simple reduce function help
you?
Something like the below:
val data = Seq(1,2,3,4,5,6,7)
val rdd = sc.parallelize(data, 2)
val sum = rdd.reduce((a,b) => a+b)
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Hi Shekhar,
As both of your state functions do the same thing, can't you do a union of
the DStreams before applying mapWithState()? It might be difficult if one
state function depends on the other's state; that would require a named
state which can be accessed from the other state functions. I have not gone thr
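A rough sketch of the union idea, assuming stream1 and stream2 (hypothetical
names) carry the same (String, Int) pairs and mappingFunc is a state
function like the one sketched earlier in this digest:
// One shared state function over the combined stream.
val combined = stream1.union(stream2)
combined.mapWithState(StateSpec.function(mappingFunc)).print()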
As mentioned earlier, you can create a broadcast variable containing all
the small RDD's elements. I hope they are really small. Then you can fire
A.update(broadcastVariable).
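Something along these lines (smallRdd and bigRdd are placeholder names; the
collected set must genuinely fit in driver memory):
// Collect the small RDD once on the driver and broadcast it to every executor.
val small = sc.broadcast(smallRdd.collect().toSet)

// Use the broadcast value inside the transformation over the big RDD.
val updated = bigRdd.map(x => (x, small.value.contains(x)))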
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Fri, Ap
rdd.count() is a fairly straightforward operation which can be calculated
on the driver, and the resulting value can then be included in the map
function.
Is your goal to write a generic function which operates on two RDDs, with
one RDD being evaluated for each partition of the other?
Here also you can use broadcast.
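A sketch of both options (rddA and rddB are placeholder RDD[Int]s):
// count() runs once on the driver; only the resulting Long is captured by the closure.
val total = rddB.count()
val fractions = rddA.map(x => x.toDouble / total)

// If the partitions of rddA need rddB's actual contents, broadcast them instead.
val bAll = sc.broadcast(rddB.collect())
val tagged = rddA.mapPartitions { it =>
  val b = bAll.value
  it.map(x => (x, b.length))
}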
What we have observed so far is that Spark picks the join order in the same
order as the tables are specified in the FROM clause. Sometimes reordering
benefits the join query.
This could be a built-in optimization in Spark. But again, it is not going
to be straightforward, where rather than table size, selectivity of
Hi Alexy,
We are also trying to solve similar problems using approximation. Would
like to hear more about your usage. We can discuss this offline without
boring others. :)
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Tue, Mar
My suspicion is that your input file partitions are small, hence only a
small number of tasks are started. Can you provide some more details, such
as how you load the files and how the result size comes to around 500 GB?
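Two quick things to check while you gather those details (the path and
numbers here are hypothetical):
val rdd = sc.textFile("/data/input", 200)   // minPartitions hint at load time
println(rdd.partitions.length)              // how many tasks the first stage will get

// Or widen an already-loaded, under-partitioned dataset:
val wider = rdd.repartition(200)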
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Michael,
Is there any specific reason why DataFrames do not have partitioners like
RDDs? This would be very useful if one is writing custom data sources which
keep data in partitions. While storing data, one could pre-partition the
data at the Spark level rather than at the data source.
Regards,
Rishi
Unfortunately there is not any, at least up to 1.5. I have not gone through
the new Dataset API in 1.6. There is some basic support for Parquet, like
partitionByColumn.
If you want to partition your dataset in a certain way, you have to use an
RDD to partition it and convert that into a DataFrame before storing.
I am not sure why all 3 nodes should query. If you have not specified any
partitioning, there should be only one JDBCRDD partition, in which the
whole dataset will reside.
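For comparison, a sketch of a partitioned JDBC read (URL, table, bounds and
credentials are all made up): splitting on a numeric column gives each
worker its own range to fetch.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

// 3 partitions over the numeric column "id" in the range [0, 300000).
val df = sqlContext.read.jdbc(
  "jdbc:oracle:thin:@host:1521:orcl", "MY_TABLE",
  "id", 0L, 300000L, 3, props)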
On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I have a spark cluster with One Ma
AFAIK sc.addJar() will add the jars to the executors' classpath. The
datasource resolution (createRelation) happens on the driver side, and the
driver classpath should contain ojdbc6.jar. You can use the
"spark.driver.extraClassPath" config parameter to set it.
On Wed, Feb 10, 2016 at 3:08 PM, Jorge M
You would probably like to see
http://spark.apache.org/docs/latest/configuration.html#memory-management.
Other config parameters are also explained there.
On Fri, Feb 5, 2016 at 10:56 AM, charles li wrote:
> if set spark.executor.memory = 2G for each worker [ 10 in total ]
>
> does it mean I can
Hi Steve,
Have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The
error suggests you are creating more than one SparkContext.
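A minimal sketch of that cleanup with ScalaTest (the suite name and test
are made up):
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class MySparkSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()   // ensures no second SparkContext is ever created
    sc = null
  }

  test("count") {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}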
On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau wrote:
> Thanks for recommending spark-testing-base :) Just wanted to add if anyone
> has feature reques
Agree with Koert that UnionRDD should have narrow dependencies, although a
union of two RDDs increases the number of tasks to be executed
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which results in a smaller number of tasks; he
Your list is defined on the driver, whereas the function specified in
forEach will be evaluated on each executor.
You might want to add an accumulator, or handle a sequence of lists, one
from each partition.
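A sketch of both routes with the Spark 1.x API (rdd is a placeholder):
// The driver-side list never sees executor updates; an accumulator does.
val counter = sc.accumulator(0)
rdd.foreach(_ => counter += 1)    // runs on executors, merged back on the driver
println(counter.value)

// Or gather one list per partition and bring them back with collect().
val perPartition = rdd.mapPartitions(it => Iterator(it.toList)).collect()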
On Wed, Dec 9, 2015 at 11:19 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I
As long as all your data is being inserted by Spark, and hence uses the
same hash partitioner, what Fengdong mentioned should work.
On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu
wrote:
> Hi
> you can try:
>
> if your table under location “/test/table/“ on HDFS
> and has partitions:
>
> “/test/tabl
Did you try to use *spark.executor.extraClassPath*? The classpath resources
will be accessible through the executor's classloader, which executes your
job.
On Wed, Dec 2, 2015 at 2:15 AM, Charles Allen wrote:
> Is there a way to pass configuration file resources to be resolvable
> through the cla
AFAIK, and from what I can see in the code, both of them should behave the
same.
On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov
wrote:
> Hi Everyone
>
> Is there any difference in performance btw the following two joins?
>
>
> val r1: RDD[(String, String)] = ???
> val r2: RDD[(String, String)] = ???
>
> val