Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-28 Thread Gourav Sengupta
Hi Karan, If you are running an at-least-once operation, then you can restart the failed job with a new checkpoint location, and you will end up with duplicates in your target, but the job will run fine. Since you are using stateful operations, if your keys are too large to manage in state, try to use Rocks
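For reference, a minimal sketch of what that suggestion looks like in code. The stream, broker, and paths below are hypothetical placeholders, and the RocksDB state store provider is the built-in one available from Spark 3.2 onward:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("restart-with-new-checkpoint")
      // Keep keyed state in RocksDB instead of the default in-memory store,
      // which helps when the state is too large to hold comfortably in memory.
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()

    val events = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                     // placeholder
      .load()

    // Pointing the restarted query at a *new* checkpoint directory abandons the
    // missing batch metadata; with at-least-once sinks this can replay
    // (duplicate) some already-delivered records.
    val query = events.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/output")                       // placeholder
      .option("checkpointLocation", "s3a://bucket/checkpoints/v2") // new location
      .start()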

Re: Issue while creating spark app

2022-02-28 Thread ashok34...@yahoo.com.INVALID
Thanks for all this useful info. Hi all, what is the current trend? Is it Spark on Scala with IntelliJ, or Spark on Python with PyCharm? I am curious because I have moderate experience with Spark in both Scala and Python and want to focus on Scala OR Python going forward, with the intention of jo

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
Oh I see now, using currentRow will give the correlation per ID within the group based on its ordering, and using unbounded on both sides will result in the overall correlation value for the whole group? On Mon, 28 Feb 2022 at 16:33, Sean Owen () wrote: > The results make sense then. You want a corre

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
The results make sense then. You want a correlation per group, right? Because it's over the sums by ID within the group. Then currentRow is wrong; it needs to be unbounded preceding and following. On Mon, Feb 28, 2022 at 9:22 AM Edgar H wrote: > The window is defined as you said yes, unboundedPrece
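For readers skimming the archive, a minimal sketch of the two frame definitions being discussed. The column names (orderCountSum, count1Sum, count2Sum) are assumed from the thread, and df stands for the per-id aggregated DataFrame:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.corr

    // With currentRow the frame grows row by row, so corr() is evaluated on
    // 1, then 2, then 3 values within the group (hence the varying results).
    val growingWindow = Window
      .partitionBy("group")
      .orderBy("orderCountSum")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    // With an unbounded frame on both sides, corr() sees every row in the
    // group at once, giving a single correlation value per group.
    val wholeGroupWindow = Window
      .partitionBy("group")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    // val withCorr = df.withColumn("corr", corr("count1Sum", "count2Sum").over(wholeGroupWindow))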

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
The window is defined as you said, yes: unboundedPreceding and currentRow, ordering by orderCountSum. val initialSetWindow = Window.partitionBy("group").orderBy("orderCountSum").rowsBetween(Window.unboundedPreceding, Window.currentRow) I'm trying to obtain the correlation for each of the m

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
How are you defining the window? It looks like it's something like "rows unbounded preceding, current" or the reverse, as the correlation varies across the elements of the group as if it's computing them on 1, then 2, then 3 elements. Don't you want the correlation across the group? Otherwise this

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
My bad completely, missed the example by a mile, sorry for that; let me change a couple of things. - Got to add "id" to the initial grouping and also add more elements to the initial set; val sampleSet = Seq( ("group1", "id1", 1, 1, 6), ("group1", "id1", 4, 4, 6), ("group1", "id2", 2, 2, 5),

[Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Edgar H
Morning all, been struggling with this for a while and can't really seem to understand what I'm doing wrong... I want to apply the corr function over the following DataFrame; val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount") val sampleSet =
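A self-contained sketch of this kind of setup, in case it helps reproduce the issue. The sample rows are partly illustrative since the original mail is truncated in this archive (the first three rows come from the thread, the group2 row is made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{corr, sum}

    val spark = SparkSession.builder().appName("corr-window").master("local[*]").getOrCreate()
    import spark.implicits._

    val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
    val sampleSet = Seq(
      ("group1", "id1", 1, 1, 6),
      ("group1", "id1", 4, 4, 6),
      ("group1", "id2", 2, 2, 5),
      ("group2", "id3", 3, 3, 7)  // illustrative row, not from the original mail
    )
    val df = sampleSet.toDF(sampleColumns: _*)

    // Sum per (group, id) first, then correlate the sums within each group.
    val sums = df.groupBy("group", "id")
      .agg(sum("count1").as("count1Sum"),
           sum("count2").as("count2Sum"),
           sum("orderCount").as("orderCountSum"))

    val wholeGroupWindow = Window
      .partitionBy("group")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    // group2 has a single id, so its correlation is undefined and comes back null.
    sums.withColumn("corr", corr("count1Sum", "count2Sum").over(wholeGroupWindow)).show()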

Accumulator null pointer exception

2022-02-28 Thread Abhimanyu Kumar Singh
I'm getting an interesting null pointer exception when trying to add any value in a custom accumulator. *code*: object StagingFacade { //Accumulator declared var filesAccumulator : CollectionAccumulator[(String, String, String)] = _ def apply(appArgs: Array[String], spark: SparkSe
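Without the full code it is hard to be certain, but a common cause of this NPE is that the var is never assigned a registered accumulator, or the executor-side copy of the singleton is read while still null. A rough sketch of how a CollectionAccumulator is usually registered and used, with names borrowed from the snippet above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.util.CollectionAccumulator

    object StagingFacade {
      // Initialised to _ (null); add() will throw a NullPointerException until
      // it is replaced with an accumulator registered on the driver.
      var filesAccumulator: CollectionAccumulator[(String, String, String)] = _

      def apply(appArgs: Array[String], spark: SparkSession): Unit = {
        // Register on the driver before any action uses it.
        filesAccumulator =
          spark.sparkContext.collectionAccumulator[(String, String, String)]("filesAccumulator")

        // Capture the accumulator in a local val so the closure serialises the
        // registered instance instead of re-reading the object field on executors,
        // where the singleton is re-initialised and the var would still be null.
        val acc = filesAccumulator
        spark.sparkContext.parallelize(Seq("a", "b")).foreach { f =>
          acc.add((f, "stage", "ok"))
        }
      }
    }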

Re: [Spark SQL] Null when trying to use corr() with a Window

2022-02-28 Thread Sean Owen
You're computing correlations of two series of values, but each series has one value, a sum. Correlation is not defined in this case (both variances are undefined). Note that this is the sample correlation. On Mon, Feb 28, 2022 at 7:06 AM Edgar H wrote: > Morning all, been struggling with this for a whi

Spark and Hive Metastore Authorization

2022-02-28 Thread Hartwig, Jonas
Hi everyone, Here is what we have: We have deployed the Hive metastore on k8s. We use Spark and Presto for DDL and DQL. Both also run on k8s. As the backend store we use MinIO. For rule-based authorization we use Ranger. We have connected everything successfully. Here is what we need: Spark doe
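Not a full answer, but for context a minimal sketch of how Spark is typically pointed at a shared Hive metastore. Note that plain Apache Spark does not enforce Ranger policies on its own; an extra SQL extension is usually needed, and the commented class name below (the Apache Kyuubi Spark AuthZ plugin) is only an assumption about which plugin might be used, not something from the original mail:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-with-shared-metastore")
      .enableHiveSupport()
      // Metastore service address is a placeholder for the k8s service in use.
      .config("hive.metastore.uris", "thrift://hive-metastore.default.svc:9083")
      // Ranger enforcement would typically come from a plugin such as:
      // .config("spark.sql.extensions", "org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension")
      .getOrCreate()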
