Thanks, we were able to validate the same behaviour.
On Wed, 23 Sep 2020 at 18:05, Sean Owen wrote:
> It is, but it happens asynchronously. If you access the same block twice
> quickly, the cached block may not be available yet the second time.
>
> On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote:
Thanks!
It appears one should use *functions.col("timestamp")*
rather than *dataset.col("timestamp")*.
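For anyone who hits the same thing, the working shape is roughly this
(a sketch; "value" stands in for whatever metric column is being averaged):

// Presumably dataset.col("timestamp") binds to the plan from before
// withWatermark, so the watermark metadata is lost; the unresolved
// functions.col("timestamp") resolves against the watermarked plan.
Dataset<Row> ds1 = dataset
        .withWatermark("timestamp", "1 second")
        .groupBy(
            functions.window(functions.col("timestamp"), "1 second", "1 second"),
            functions.col("source"))
        .avg("value");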
Hi,
try this:
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
    .withWatermark("timestamp", "1 second")
    .groupBy(
        functions.window(col("timestamp"), "1 second", "1 second"), // col = functions.col, statically imported
        col("source"))
    .avg("value"); // "value" is a placeholder for the metric column
Hi,
I am trying to aggregate a time-stamped Spark structured stream to get
per-device ("source") averages for every second of incoming data.
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
.withWatermark("timestamp", "1 second")
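End to end, what I am after looks roughly like this sketch (I use the
built-in "rate" source so it is self-contained; it emits (timestamp, value),
and I derive a fake "source" id from it, while my real job reads elsewhere):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("per-source-averages")
        .getOrCreate();

Dataset<Row> dataset = spark.readStream()
        .format("rate")                 // emits (timestamp, value)
        .option("rowsPerSecond", 100)
        .load()
        .withColumn("source", functions.expr("value % 3")); // fake device id

Dataset<Row> averages = dataset
        .withWatermark("timestamp", "1 second")
        .groupBy(
            functions.window(functions.col("timestamp"), "1 second"),
            functions.col("source"))
        .avg("value");

StreamingQuery query = averages.writeStream()
        .outputMode("update")   // per-second averages, updated as data arrives
        .format("console")
        .start();
query.awaitTermination(); // declare or handle StreamingQueryException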
Hello,
I need to filter one huge table using other huge tables.
I could not avoid a sort operation when using `WHERE IN` or `INNER JOIN`.
Can this be avoided?
Since I am OK with false positives, maybe a Bloom filter is an alternative.
I saw that the Scala API has a built-in DataFrame function
(https://spark.apache.o
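For reference, the Java API exposes the same thing as
Dataset.stat().bloomFilter, and what I have in mind is roughly this sketch
(the names "events", "lookup" and "id", and the sizing numbers, are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.util.sketch.BloomFilter;

SparkSession spark = SparkSession.builder().appName("bloom-prune").getOrCreate();

// "events" is the huge table to filter; "lookup" holds the wanted keys.
Dataset<Row> events = spark.table("events");
Dataset<Row> lookup = spark.table("lookup");

// Build a Bloom filter over the lookup keys; 0.03 is the accepted
// false-positive rate, 10M the expected number of distinct keys.
BloomFilter wanted = lookup.stat().bloomFilter("id", 10_000_000L, 0.03);

// The filter travels to the executors inside the UDF closure, so the
// membership test runs as a plain scan: no shuffle, no sort.
spark.udf().register(
        "maybe_wanted",
        (UDF1<Long, Boolean>) wanted::mightContain,
        DataTypes.BooleanType);

Dataset<Row> pruned = events.filter(
        functions.callUDF("maybe_wanted", functions.col("id")));

Because of the false positives, pruned is a superset of the exact result;
if exactness matters, the expensive join only has to run on this much
smaller set.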
It is, but it happens asynchronously. If you access the same block twice
quickly, the cached block may not be available yet the second time.
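If you need the second access to hit the cache, one option (a minimal
sketch with toy data, not from the original job) is to materialize the RDD
with a blocking action before anything races on it:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

JavaSparkContext sc = new JavaSparkContext("local[*]", "cache-warmup");
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
        .persist(StorageLevel.MEMORY_ONLY());

rdd.count(); // blocking action: every partition is computed and cached here

// Later actions, even concurrent ones, now read the cached blocks.
long n = rdd.count();
int sum = rdd.reduce(Integer::sum);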
On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote:
> Hi,
> I have a spark streaming use-case ( spark 2.2.1 ). And in my spark job, I
> have multiple action
Hi,
I have a Spark Streaming use case (Spark 2.2.1), and in my Spark job I
have multiple actions. I am running them in parallel by executing the
actions in separate threads. I have an rdd.persist(), after which the DAG
forks into multiple actions,
but I see that RDD caching is not happening and the RDD is recomputed by each action.
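Roughly, the job looks like this condensed sketch (toy data; the real DAG
is much bigger):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

JavaSparkContext sc = new JavaSparkContext("local[*]", "forked-actions");

JavaRDD<Integer> shared = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
        .map(x -> x * 2)                      // shared upstream work
        .persist(StorageLevel.MEMORY_ONLY()); // cache at the fork point

// The DAG forks here: each action runs in its own thread.
new Thread(() -> System.out.println("count = " + shared.count())).start();
new Thread(() -> System.out.println("sum = " + shared.reduce(Integer::sum))).start();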