You need to understand how a join works to make sense of this. Logically, a join
takes the cartesian product of the two tables and then keeps only the rows that
satisfy the join condition, in this case the contains UDF. So, let's say you have
Input
Allen Armstrong nishanth hemanth Allen
shivu Armstrong nishanth
shree shivu DeWALT
Replacement
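Because the join condition is an arbitrary contains UDF, Spark cannot use an
equi-join strategy; it has to test every (line, word) pair, which is exactly the
cartesian-product-then-filter behavior described above. A minimal Scala sketch
(the Replacement rows here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("contains-join").getOrCreate()
import spark.implicits._

val input = Seq(
  "Allen Armstrong nishanth hemanth Allen",
  "shivu Armstrong nishanth",
  "shree shivu DeWALT"
).toDF("line")

// Hypothetical replacement table: (word to find, replacement).
val replacement = Seq(("Allen", "A"), ("shivu", "S")).toDF("word", "replace")

// The join condition is a UDF, so Spark must evaluate it for every pair of rows.
val containsWord = udf((line: String, word: String) => line.contains(word))

input.join(replacement, containsWord($"line", $"word")).show(false)
```

Every input line is paired with every replacement row, and only the pairs where
the UDF returns true survive; that is why this kind of join gets expensive as
the tables grow.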
In any distributed application, you scale by splitting execution across
multiple machines. The way Spark does this is by slicing the data into
partitions and spreading them across multiple machines. Logically, an RDD is
exactly that: the data, split up and spread around on multiple machines. When
you perform an operation on an RDD, Spark runs it on each partition in parallel.
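A minimal sketch of the slicing in a Scala shell (the data and the partition
count are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
val sc = spark.sparkContext

// Slice 12 numbers into 4 partitions; on a cluster, each partition can
// live on, and be processed by, a different machine.
val rdd = sc.parallelize(1 to 12, numSlices = 4)

println(rdd.getNumPartitions) // 4

// glom() collects each partition into an array so the slicing is visible.
rdd.glom().collect().foreach(part => println(part.mkString(", ")))
```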
Option A
If you can get all the messages in a session into the same Spark partition,
you can use df.mapPartitions to process the whole partition. This lets you
control the order in which the messages are processed within the partition.
This will work if messages are posted to Kafka keyed by session id, so that all
of a session's messages land in the same Kafka partition, in order; the Spark
Kafka source maps each Kafka partition to one Spark partition.
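A minimal batch-style sketch of Option A, assuming a hypothetical
Message(sessionId, timestamp, body) schema and an assumed Parquet source:
repartitioning by session id keeps a session in one partition, and sorting
within partitions fixes the processing order:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical message schema; the real fields depend on your topic.
case class Message(sessionId: String, timestamp: Long, body: String)

val spark = SparkSession.builder.appName("session-order").getOrCreate()
import spark.implicits._

// Assumed source for the sketch.
val messages: Dataset[Message] =
  spark.read.parquet("/path/to/messages").as[Message]

val processed = messages
  .repartition($"sessionId")                        // a session never spans two partitions
  .sortWithinPartitions($"sessionId", $"timestamp") // fix in-session order
  .mapPartitions { iter =>
    // Each session's messages now arrive contiguously and in timestamp
    // order, so per-session, order-sensitive logic can run here.
    iter.map(m => m.copy(body = m.body.trim))       // placeholder processing
  }

processed.show()
```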
We have a Structured Streaming application that gets accounts from Kafka into
a streaming data frame. We have a blacklist of accounts stored in S3 and we
want to filter out all the accounts that are blacklisted. So, we are loading
the blacklisted accounts into a batch data frame and joining it with the
streaming data frame.
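A minimal sketch of that setup (bucket, topic, broker address, and column names
are all assumptions); a left_anti stream-static join keeps only the accounts
that do not appear in the blacklist:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("blacklist-filter").getOrCreate()
import spark.implicits._

// Static (batch) data frame: the blacklist, read once from S3.
val blacklist = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/blacklist/")
  .select($"account")

// Streaming data frame: the account id is assumed to be the Kafka message value.
val accounts = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "accounts")
  .load()
  .select($"value".cast("string").as("account"))

// Drop every streaming row whose account appears in the static blacklist.
val allowed = accounts.join(blacklist, Seq("account"), "left_anti")

allowed.writeStream.format("console").start().awaitTermination()
```

left_anti is the streaming-friendly way to express "NOT IN" here; with the
streaming side on the left and the static side on the right, this join shape
is supported by Structured Streaming.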
What is a good way to make a Structured Streaming application deal with bad
input? Right now, the problem is that bad input kills the Structured
Streaming application. This is highly undesirable, because a Structured
Streaming application has to be always on.
For example, here is a very simple Structured Streaming application that reads
JSON records from Kafka; without handling, a single malformed record that
throws during parsing terminates the whole query.
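A minimal sketch, with an assumed topic, broker address, and schema: from_json
returns null for malformed records instead of throwing, so bad records can be
filtered into a dead-letter sink rather than crashing the query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder.appName("bad-input").getOrCreate()

// Assumed schema of the expected JSON payload.
val schema = new StructType()
  .add("account", StringType)
  .add("amount", LongType)

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(col("value").cast("string").as("json"))

// from_json yields null for records that do not parse, instead of throwing.
val parsed = raw.select(col("json"), from_json(col("json"), schema).as("data"))

val good = parsed.filter(col("data").isNotNull).select("data.*")
val bad  = parsed.filter(col("data").isNull).select("json") // dead-letter stream

good.writeStream.format("console").start()
bad.writeStream.format("console").start()
spark.streams.awaitAnyTermination()
```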