You can do a join between a streaming dataset and a static dataset, and I
would prefer your first approach. The problem with this approach is
performance: unless you cache the static dataset, every time you fire the
join it will fetch the latest records from the table.
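A rough sketch of that idea (the table name, stream source and Event class
are my own assumptions, not your code): cache the static side once, so each
micro-batch join reuses it instead of re-reading the table.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Event(id: Int, value: String)

val conf = new SparkConf().setAppName("stream-static-join").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.implicits._

// Static side: read once, cache, and force materialization with count().
val staticDf = sqlContext.table("lookup").cache()
staticDf.count()

// Hypothetical stream of "id,value" lines; join each micro-batch against the cached table.
ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(id, v) = line.split(","); Event(id.toInt, v) }
  .foreachRDD { rdd => rdd.toDF().join(staticDf, "id").show() }

ssc.start()
ssc.awaitTermination()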
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
Hi Sushma,
Can you try a left anti join as below? In my example, name and id together
make up the key.
df1.alias("a").join(df2.alias("b"),
col("a.name").equalTo(col("b.name"))
.and(col("a.id").equalTo(col("b.id"))) ,
"left_anti").selectExpr("name", "id").show(10, false)
Alternatively, you can extend SQLContext to add your own plans. I am not
sure it fits your requirement, but I am mentioning it to highlight an
option.
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Thu, Jul 7, 2016 at 8:39 PM, tan shai wrote:
> That w
Yes, ultimately it will be converted to an RDD internally. However,
DataFrame queries are passed through Catalyst, which provides several
optimizations (e.g. code generation and smarter shuffles) that you do not
get with plain RDDs.
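A quick way to see Catalyst at work is explain(); a small sketch, assuming
a spark-shell style sqlContext and placeholder data:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(Seq(Person("a", 30), Person("b", 20)))
// The extended output shows the analyzed, optimized and physical plans;
// note how the filter ends up below the projection after optimization.
df.select("name", "age").filter(df("age") > 25).explain(true)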
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
unhandled predicates?
>
> Telmo Rodrigues
>
> On 11/05/2016, at 09:49, Rishi Mishra
> wrote:
>
> It does push the predicate. But as relations are generic and may or may
> not handle some of the predicates, Spark needs to apply a filter for the
> un-handled predicates.
It does push the predicate. But as relations are generic and may or may
not handle some of the predicates, Spark needs to apply a filter for the
un-handled predicates.
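For anyone curious how a source reports this, here is a minimal sketch of a
hypothetical relation (not from this thread) that claims to handle only
equality on "id"; whatever it returns from unhandledFilters is re-applied by
Spark as a Filter above the scan:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class MyRelation(@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  // Predicates we cannot evaluate ourselves; Spark filters these again.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot { case EqualTo("id", _) => true; case _ => false }

  // A real source would apply the handled filters while scanning.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}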
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Wed, May 11, 2016
As you have the same partitioner and number of partitions, you can probably
use zipPartitions and provide a user-defined function to subtract.
A very primitive example is below.
val data1 = Seq(1->1,2->2,3->3,4->4,5->5,6->6,7->7)
val data2 = Seq(1->1,2->2,3->3,4->4,5->5,6->6)
val rdd1 = sc.parallelize(data1, 2)
val rdd2 = sc.parallelize(data2, 2)
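A sketch of the rest, assuming a per-partition set difference is what is
wanted:
// For each pair of co-located partitions, drop the pairs of rdd1 that
// also appear in the corresponding partition of rdd2.
val diff = rdd1.zipPartitions(rdd2) { (it1, it2) =>
  val other = it2.toSet
  it1.filterNot(other.contains)
}
diff.collect().foreach(println)   // leaves only (7,7) with this data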
Hi Harsh,
You probably need to maintain some state for your values, since you are
updating some of the keys in each batch and checking a global state of your
equation.
Can you look at the mapWithState API of DStream?
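A minimal sketch of mapWithState (the socket source and the running-sum
logic are just stand-ins for your case):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("state").setMaster("local[2]"), Seconds(5))
ssc.checkpoint("/tmp/state-checkpoint")   // mapWithState needs checkpointing

// Hypothetical input: "key,value" lines arriving on a socket.
val pairs = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(k, v) = line.split(","); (k, v.toInt) }

// Keep a running sum per key; each batch emits the updated sum.
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}
pairs.mapWithState(StateSpec.function(mappingFunc)).print()

ssc.start()
ssc.awaitTermination()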
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Your mail does not describe much, but won't a simple reduce function help
you?
Something like the below:
val data = Seq(1,2,3,4,5,6,7)
val rdd = sc.parallelize(data, 2)
val sum = rdd.reduce((a,b) => a+b)
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Hi Shekhar,
As both of your state functions do the same thing, can't you do a union of
the DStreams before applying mapWithState()? It might be difficult if one
state function depends on the other's state; that would require a named
state which can be accessed from the other state functions. I have not gone thr
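A rough sketch of the union idea, assuming stream1 and stream2 (hypothetical
names) carry the same (String, Int) pairs and mappingFunc is a state
function like the one sketched earlier in this digest:
// One shared state function over the combined stream.
val combined = stream1.union(stream2)
combined.mapWithState(StateSpec.function(mappingFunc)).print()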
As mentioned earlier, you can create a broadcast variable containing all
the small RDD's elements. I hope they are really small. Then you can fire
A.update(broadcastVariable).
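Something along these lines (smallRdd and bigRdd are placeholder names; the
collected set must genuinely fit in driver memory):
// Collect the small RDD once on the driver and broadcast it to every executor.
val small = sc.broadcast(smallRdd.collect().toSet)

// Use the broadcast value inside the transformation over the big RDD.
val updated = bigRdd.map(x => (x, small.value.contains(x)))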
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Fri, Ap
rdd.count() is a fairly straightforward operation which can be calculated
on the driver, and the resulting value can then be included in the map
function.
Is your goal to write a generic function which operates on two RDDs, with
one RDD being evaluated for each partition of the other?
Here also you can use broadcast.
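A sketch of both options (rddA and rddB are placeholder RDD[Int]s):
// count() runs once on the driver; only the resulting Long is captured by the closure.
val total = rddB.count()
val fractions = rddA.map(x => x.toDouble / total)

// If the partitions of rddA need rddB's actual contents, broadcast them instead.
val bAll = sc.broadcast(rddB.collect())
val tagged = rddA.mapPartitions { it =>
  val b = bAll.value
  it.map(x => (x, b.length))
}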
What we have observed so far is that Spark picks the join order in the same
order as the tables are specified in the FROM clause. Sometimes reordering
benefits the join query.
This could be a built-in optimization in Spark. But again, it is not going
to be straightforward, where rather than table size, selectivity of
Hi Alexy,
We are also trying to solve similar problems using approximation. Would
like to hear more about your usage. We can discuss this offline without
boring others. :)
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
On Tue, Mar
My suspicion is that your input file partitions are small, hence only a
small number of tasks are started. Can you provide some more details, such
as how you load the files and how the result size comes to around 500 GB?
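Two quick things to check while you gather those details (the path and
numbers here are hypothetical):
val rdd = sc.textFile("/data/input", 200)   // minPartitions hint at load time
println(rdd.partitions.length)              // how many tasks the first stage will get

// Or widen an already-loaded, under-partitioned dataset:
val wider = rdd.repartition(200)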
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Michael,
Is there any specific reason why DataFrames do not have partitioners like
RDDs? This would be very useful if one is writing custom data sources which
keep data in partitions. While storing data, one could pre-partition the
data at the Spark level rather than at the data source.
Regards,
Rishi
Unfortunately there is not any, at least up to 1.5. I have not gone through
the new Dataset API in 1.6. There is some basic support for Parquet, like
partitionByColumn.
If you want to partition your dataset in a certain way, you have to use an
RDD to partition it and convert that into a DataFrame before storing.
I am not sure why all 3 nodes should query. If you have not specified any
partitioning, there should be only one JDBCRDD partition, in which the
whole dataset will reside.
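For comparison, a sketch of a partitioned JDBC read (URL, table, bounds and
credentials are all made up): splitting on a numeric column gives each
worker its own range to fetch.
import java.util.Properties

val props = new Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")

// 3 partitions over the numeric column "id" in the range [0, 300000).
val df = sqlContext.read.jdbc(
  "jdbc:oracle:thin:@host:1521:orcl", "MY_TABLE",
  "id", 0L, 300000L, 3, props)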
On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I have a spark cluster with One Ma
AFAIK sc.addJar() will add the jars to the executors' classpath. The
datasource resolution (createRelation) happens on the driver side, and the
driver classpath should contain ojdbc6.jar. You can use the
"spark.driver.extraClassPath" config parameter to set it.
On Wed, Feb 10, 2016 at 3:08 PM, Jorge M
You would probably like to see
http://spark.apache.org/docs/latest/configuration.html#memory-management.
Other config parameters are also explained there.
On Fri, Feb 5, 2016 at 10:56 AM, charles li wrote:
> if set spark.executor.memory = 2G for each worker [ 10 in total ]
>
> does it mean I can
Hi Steve,
Have you cleaned up your SparkContext (sc.stop()) in an afterAll()? The
error suggests you are creating more than one SparkContext.
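A minimal sketch of that cleanup with ScalaTest (the suite name and test
are made up):
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class MySparkSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()   // ensures no second SparkContext is ever created
    sc = null
  }

  test("count") {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}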
On Fri, Feb 5, 2016 at 10:04 AM, Holden Karau wrote:
> Thanks for recommending spark-testing-base :) Just wanted to add if anyone
> has feature reques
Agree with Koert that UnionRDD should have narrow dependencies, although a
union of two RDDs increases the number of tasks to be executed
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which results in a smaller number of tasks; he
Your list is defined on the driver, whereas the function specified in
forEach will be evaluated on each executor.
You might want to add an accumulator, or handle a sequence of lists, one
from each partition.
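A sketch of both routes with the Spark 1.x API (rdd is a placeholder):
// The driver-side list never sees executor updates; an accumulator does.
val counter = sc.accumulator(0)
rdd.foreach(_ => counter += 1)    // runs on executors, merged back on the driver
println(counter.value)

// Or gather one list per partition and bring them back with collect().
val perPartition = rdd.mapPartitions(it => Iterator(it.toList)).collect()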
On Wed, Dec 9, 2015 at 11:19 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I
As long as all your data is being inserted by Spark, and hence uses the
same hash partitioner, what Fengdong mentioned should work.
On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu
wrote:
> Hi
> you can try:
>
> if your table under location “/test/table/“ on HDFS
> and has partitions:
>
> “/test/tabl
Did you try to use *spark.executor.extraClassPath*? The classpath resources
will be accessible through the executor's classloader, which executes your
job.
On Wed, Dec 2, 2015 at 2:15 AM, Charles Allen wrote:
> Is there a way to pass configuration file resources to be resolvable
> through the cla
AFAIK, and from what I can see in the code, both of them should behave the
same.
On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov
wrote:
> Hi Everyone
>
> Is there any difference in performance btw the following two joins?
>
>
> val r1: RDD[(String, String)] = ???
> val r2: RDD[(String, String)] = ???
>
> val