Hi,
I have a Spark Streaming use-case (Spark 2.2.1), and my Spark job has
multiple actions. I run them in parallel by executing the actions in
separate threads. I call rdd.persist(), after which the DAG forks into
multiple actions, but I see that RDD caching is not happening.
It is, but it happens asynchronously. If you access the same block twice
in quick succession, the cached block may not yet be available the second time.
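To picture why a second action can still recompute, here is a minimal analogy in plain Java (not Spark internals): the cache is filled asynchronously after the first computation, so a reader arriving before the fill lands takes the recompute path. All names here are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Analogy only: a value is cached asynchronously, so a reader that
// arrives before the cache write completes simply recomputes it.
class AsyncCacheDemo {
    static final AtomicReference<Integer> cache = new AtomicReference<>();
    static final AtomicInteger computations = new AtomicInteger();

    // Stand-in for an expensive RDD evaluation.
    static int compute() {
        computations.incrementAndGet();
        return 42;
    }

    static int get() {
        Integer v = cache.get();
        if (v != null) return v;                             // cache hit
        int result = compute();                              // cache miss: recompute
        CompletableFuture.runAsync(() -> cache.set(result)); // fill cache async
        return result;
    }

    public static void main(String[] args) {
        // Two "actions" back to back: the second may still miss the cache
        // because the asynchronous fill has not necessarily landed yet.
        int a = get();
        int b = get();
        System.out.println(a + " " + b + " computations=" + computations.get());
    }
}
```

Both calls return the same value either way; only the amount of recomputation differs, which matches the behaviour described above.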
On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote:
Hello,
I need to filter one huge table using other huge tables.
I could not avoid a sort operation using `WHERE IN` or `INNER JOIN`.
Can the sort be avoided?
Since I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that Scala has a built-in DataFrame function
(https://spark.apache.o
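To make the trade-off concrete, here is a minimal stand-alone Bloom filter sketch in plain Java (a toy, not Spark's built-in): k bit positions per item are derived from two base hashes, membership tests never give false negatives but can give false positives. Sizes and hash scheme are illustrative assumptions.

```java
import java.util.BitSet;

// Toy Bloom filter: k indexes per item via double hashing
// (h1 + i*h2), so lookups may yield false positives but never
// false negatives.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SimpleBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // i-th derived hash index for an item.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    // "Might contain": true can be a false positive; false is definitive.
    boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter bf = new SimpleBloomFilter(1 << 16, 5);
        bf.add("key-1");
        bf.add("key-2");
        System.out.println(bf.mightContain("key-1")); // true (no false negatives)
    }
}
```

Built over the keys of the small side, such a filter lets you cheaply discard most non-matching rows of the huge table before any sort or shuffle, at the cost of passing through a small fraction of false positives.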
Hi,
I am trying to aggregate Spark time-stamped structured stream to get
per-device (source) averages for every second of incoming data.
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
    .withWatermark("timestamp", "1 second")
Hi,
try this:
dataset.printSchema(); // see the output below
Dataset<Row> ds1 = dataset
    .withWatermark("timestamp", "1 second")
    .groupBy(
        functions.window(col("timestamp"), "1 second", "1 second"),
        col("source"))
Thanks!
It appears one should use not dataset.col("timestamp")
but rather functions.col("timestamp").
Thanks, we were able to validate the same behaviour.
On Wed, 23 Sep 2020 at 18:05, Sean Owen wrote: