Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-23 Thread Arya Ketan
Thanks, we were able to validate the same behaviour. On Wed, 23 Sep 2020 at 18:05, Sean Owen wrote: > It is, but it happens asynchronously. If you access the same block twice quickly, the cached block may not yet be available the second time. > On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wr
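Since caching happens asynchronously, one workaround consistent with the explanation above is to materialize the persisted RDD with a cheap action before forking the parallel actions, so both threads read cached blocks instead of recomputing the lineage. A minimal Java sketch; the rdd, the output path and the two concurrent actions are hypothetical, not taken from the thread:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    class PersistThenFork {
        static void run(JavaRDD<String> rdd) {
            // Persist, then force one evaluation so the blocks are actually cached
            // before any of the parallel actions start.
            rdd.persist(StorageLevel.MEMORY_AND_DISK());
            rdd.count(); // cheap action that materializes the cache

            ExecutorService pool = Executors.newFixedThreadPool(2);
            // Hypothetical parallel actions; both should now read the cached blocks.
            pool.submit(() -> rdd.saveAsTextFile("/tmp/out-a"));
            pool.submit(() -> System.out.println("rows: " + rdd.count()));
            pool.shutdown();
        }
    }

The extra count() costs one pass over the data, but it guarantees the blocks are in the block manager before the concurrent actions begin.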

Re: Spark watermarked aggregation query and append output mode

2020-09-23 Thread Sergey Oboguev
Thanks! It appears one should use not *dataset.col("timestamp")* but rather *functions.col("timestamp")*.

Re: Spark watermarked aggregation query and append output mode

2020-09-23 Thread German Schiavon
Hi, try this: dataset.printSchema(); // see the output below Dataset ds1 = dataset .withWatermark("timestamp", "1 second") .groupBy( functions.window(*col("timestamp")*, "1 second", "1 second"), *col("sour
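Completing the suggestion as a minimal Java sketch: the grouping columns are resolved with functions.col(...) (equivalently a static import of col) rather than dataset.col(...), which is the detail Sergey confirms above. The "value" column, the alias and the console sink are assumptions for illustration; the per-source average over 1-second windows follows the original question below.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.functions;
    import org.apache.spark.sql.streaming.OutputMode;
    import org.apache.spark.sql.streaming.StreamingQuery;

    class WatermarkedAverages {
        // dataset: the incoming streaming Dataset<Row>; schema assumed to be
        // "timestamp" (timestamp), "source" (string), "value" (double).
        static StreamingQuery start(Dataset<Row> dataset) throws Exception {
            Dataset<Row> ds1 = dataset
                .withWatermark("timestamp", "1 second")
                .groupBy(
                    functions.window(functions.col("timestamp"), "1 second", "1 second"),
                    functions.col("source"))
                .agg(functions.avg("value").alias("avg_value")); // per-source average per 1-second window

            return ds1.writeStream()
                .outputMode(OutputMode.Append()) // append is accepted because the watermark is attached
                .format("console")               // sink chosen only for the sketch
                .start();
        }
    }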

Spark watermarked aggregation query and append output mode

2020-09-23 Thread Sergey Oboguev
Hi, I am trying to aggregate a time-stamped Spark structured stream to get per-device (source) averages for every second of incoming data. dataset.printSchema(); // see the output below Dataset ds1 = dataset .withWatermark("timestamp", "1 second")

Bloom Filter to filter huge dataframes with PySpark

2020-09-23 Thread Breno Arosa
Hello, I need to filter one huge table using other huge tables. I could not avoid a sort operation when using `WHERE IN` or `INNER JOIN`. Can this be avoided? As I'm OK with false positives, maybe a Bloom filter is an alternative. I saw that Scala has a builtin dataframe function (https://spark.apache.o
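The builtin being referred to is presumably DataFrameStatFunctions.bloomFilter in the Scala/Java API. A minimal JVM-side sketch of the idea, with the table names, the key column "id" and the sizing parameters all assumed for illustration: build the filter from the keys of the filtering table, then pre-filter the huge table through a UDF membership test, trading some false positives for avoiding the shuffle/sort of a join.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.functions;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.util.sketch.BloomFilter;

    class BloomPrefilter {
        // hugeTable: the table to be filtered; keysTable: the table whose keys we filter by.
        static Dataset<Row> apply(SparkSession spark, Dataset<Row> hugeTable, Dataset<Row> keysTable) {
            // Build a Bloom filter over the join keys: ~100M expected items, 1% false-positive rate (assumed sizing).
            BloomFilter bloom = keysTable.stat().bloomFilter("id", 100_000_000L, 0.01);

            // Wrap the membership test in a UDF; the filter object ships with the closure.
            spark.udf().register("mightContainId",
                (UDF1<Long, Boolean>) id -> id != null && bloom.mightContain(id),
                DataTypes.BooleanType);

            // Keep only rows whose key might be in keysTable; false positives remain,
            // but the huge table is never shuffled or sorted for a join.
            return hugeTable.filter(functions.callUDF("mightContainId", functions.col("id")));
        }
    }

As far as I know, the bloomFilter stat function was not exposed in PySpark's DataFrame API in the Spark versions current at the time, so from PySpark this would need a JVM-side detour or a different approximate approach.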

Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-23 Thread Sean Owen
It is, but it happens asynchronously. If you access the same block twice quickly, the cached block may not yet be available the second time. On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote: > Hi, I have a spark streaming use-case (Spark 2.2.1). And in my spark job, I have multiple action

Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-23 Thread Arya Ketan
Hi, I have a spark streaming use-case (Spark 2.2.1). And in my spark job, I have multiple actions. I am running them in parallel by executing the actions in separate threads. I have an rdd.persist() after which the DAG forks into multiple actions, but I see that rdd caching is not happening and th