Hi,
I have a Spark Streaming use-case (Spark 2.2.1), and my Spark job has
multiple actions. I run them in parallel by executing the actions in
separate threads. I call rdd.persist(), after which the DAG forks into
multiple actions, but I see that RDD caching is not happening.
It is, but it happens asynchronously. If you access the same block twice
in quick succession, the cached block may not yet be available the second time.
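To picture why a second action can still recompute, here is a minimal analogy in plain Java (not Spark internals): the cache is filled asynchronously after the first computation, so a reader arriving before the fill lands takes the recompute path. All names here are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Analogy only: a value is cached asynchronously, so a reader that
// arrives before the cache write completes simply recomputes it.
class AsyncCacheDemo {
    static final AtomicReference<Integer> cache = new AtomicReference<>();
    static final AtomicInteger computations = new AtomicInteger();

    // Stand-in for an expensive RDD evaluation.
    static int compute() {
        computations.incrementAndGet();
        return 42;
    }

    static int get() {
        Integer v = cache.get();
        if (v != null) return v;                             // cache hit
        int result = compute();                              // cache miss: recompute
        CompletableFuture.runAsync(() -> cache.set(result)); // fill cache async
        return result;
    }

    public static void main(String[] args) {
        // Two "actions" back to back: the second may still miss the cache
        // because the asynchronous fill has not necessarily landed yet.
        int a = get();
        int b = get();
        System.out.println(a + " " + b + " computations=" + computations.get());
    }
}
```

Both calls return the same value either way; only the amount of recomputation differs, which matches the behaviour described above.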
On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote:
Hello,
I need to filter one huge table using other huge tables.
I could not avoid a sort operation using `WHERE IN` or `INNER JOIN`.
Can the sort be avoided?
Since I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that Scala has a built-in DataFrame function
(https://spark.apache.o
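To make the trade-off concrete, here is a minimal stand-alone Bloom filter sketch in plain Java (a toy, not Spark's built-in): k bit positions per item are derived from two base hashes, membership tests never give false negatives but can give false positives. Sizes and hash scheme are illustrative assumptions.

```java
import java.util.BitSet;

// Toy Bloom filter: k indexes per item via double hashing
// (h1 + i*h2), so lookups may yield false positives but never
// false negatives.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SimpleBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // i-th derived hash index for an item.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    // "Might contain": true can be a false positive; false is definitive.
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter bf = new SimpleBloomFilter(1 << 16, 5);
        bf.add("key-1");
        bf.add("key-2");
        System.out.println(bf.mightContain("key-1")); // true (no false negatives)
    }
}
```

Built over the keys of the small side, such a filter lets you cheaply discard most non-matching rows of the huge table before any sort or shuffle, at the cost of passing through a small fraction of false positives.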
Hi,
I am trying to aggregate Spark time-stamped structured stream to get
per-device (source) averages for every second of incoming data.
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
    .withWatermark("timestamp", "1 second")
Hi,
try this:
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
    .withWatermark("timestamp", "1 second")
    .groupBy(
        functions.window(col("timestamp"), "1 second", "1 second"),
        col("source"))
Thanks!
It appears one should use not dataset.col("timestamp")
but rather functions.col("timestamp").
Thanks, we were able to validate the same behaviour.
On Wed, 23 Sep 2020 at 18:05, Sean Owen wrote: