Re: Mock spark reads and writes

2020-07-15 Thread ed
Hi, For testing things like this you have a couple of options. You could isolate all your business logic from your read/write/Spark code, which, in my experience, makes the code harder to write and manage. The other option is to accept that tests will be slower than you would expect
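[A minimal sketch of the first option (isolating business logic from the Spark I/O layer); the function, rule, and column names below are hypothetical, and the Spark wiring is shown only in comments:]

```python
# Pure business logic: no Spark imports, so it can be unit-tested instantly.
# (Hypothetical rule and names, for illustration only.)
def classify_amount(amount: float) -> str:
    return "large" if amount >= 1000 else "small"

# Thin Spark layer kept at the edges (illustrative; assumes a SparkSession):
#   from pyspark.sql import functions as F, types as T
#   classify_udf = F.udf(classify_amount, T.StringType())
#   (spark.read.parquet("in/")
#         .withColumn("bucket", classify_udf("amount"))
#         .write.parquet("out/"))
```

[With this split, tests of the logic itself need no SparkSession at all; only the thin I/O layer needs the slower integration tests.]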

Why can window functions only have fixed window sizes?

2020-07-15 Thread Daniel Stojanov
Hi, My understanding of window functions is that they can only operate on fixed window sizes. For example, I can create a window like the following:     Window.partitionBy("group_identifier").orderBy("sequencial_counter").rowsBetween(-4, 5) or even:     Window.partitionBy("group_identifier").o
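[For reference, the frame rowsBetween(-4, 5) covers the 4 rows before through the 5 rows after the current row, clamped to the partition. The pure-Python sketch below mirrors those row-frame semantics; the equivalent PySpark call is shown only as a comment, with the column names taken from the snippet above:]

```python
# Row-frame semantics of Window.rowsBetween(start, end) for row i of an
# n-row partition, sketched in plain Python (both bounds inclusive).
def frame_indices(i: int, n: int, start: int = -4, end: int = 5) -> list:
    return list(range(max(0, i + start), min(n, i + end + 1)))

# The equivalent PySpark frame (illustrative):
#   from pyspark.sql import Window
#   w = (Window.partitionBy("group_identifier")
#              .orderBy("sequencial_counter")
#              .rowsBetween(-4, 5))
# Frames need not be fixed at both ends; PySpark also provides
#   Window.unboundedPreceding, Window.currentRow, Window.unboundedFollowing
# for growing/shrinking frames.
```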

download of spark

2020-07-15 Thread Ming Liao
To whom it may concern, Hope this email finds you well. I am trying to download Spark but I was not able to select the release and package type. Could you please help me with this? Thank you. Best, Ming [image: screenshot.png]

Re: download of spark

2020-07-15 Thread Sean Owen
Works for me - do you have JavaScript disabled? It will be necessary. On Wed, Jul 15, 2020 at 11:52 AM Ming Liao wrote: > To whom it may concern, > > Hope this email finds you well. > I am trying to download spark but I was not able to select the release and > package type. Could you please help

Re: Mock spark reads and writes

2020-07-15 Thread Jeff Evans
Why do you need to mock the read/write at all? Why not keep a test CSV file, invoke the job (which will perform the real Spark DataFrame read of the CSV), write the result, and assert on the output? On Tue, Jul 14, 2020 at 12:19 PM Dark Crusader wrote: > Sorry I wasn't very clear in my last email. > > I have a
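[A sketch of that fixture-based approach. Only the fixture-writing part is concrete here; the job under test (my_job) and the Spark read/write are hypothetical and shown in comments:]

```python
import csv
import os
import tempfile

# Write a tiny CSV fixture for the job under test (hypothetical columns).
def write_fixture(path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "amount"])
        writer.writerows([[1, 10], [2, 20]])

# In the real test, run the actual job against the fixture (illustrative):
#   df = spark.read.option("header", True).csv(fixture_path)
#   my_job(df).write.csv(out_path)   # my_job is the hypothetical job under test
# then read out_path back and assert on its contents -- no mocking needed.
```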

PySpark aggregation w/pandas_udf

2020-07-15 Thread Andrew Melo
Hi all, For our use case, we would like to perform an aggregation using a pandas_udf with dataframes that have O(100m) rows and a few 10s of columns. Conceptually, this looks a bit like pyspark.RDD.aggregate, where the user provides: * A "seqOp" which accepts pandas series(*) and outputs an inter
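[The seqOp/combOp shape being described can be sketched in plain pandas. The names seq_op/comb_op and the sum-based reduction are assumptions here, mirroring pyspark.RDD.aggregate; the per-group Spark wiring is shown only as a comment:]

```python
import pandas as pd

# "seqOp": fold one batch of rows (a pandas Series) into an intermediate
# accumulator -- here a plain running sum, purely for illustration.
def seq_op(acc: float, s: pd.Series) -> float:
    return acc + float(s.sum())

# "combOp": merge two intermediate accumulators from different partitions.
def comb_op(a: float, b: float) -> float:
    return a + b

# On Spark, a per-group version of this could be run with applyInPandas
# (illustrative; assumes a DataFrame df with columns "key" and "value"):
#   out = df.groupBy("key").applyInPandas(
#       lambda pdf: pd.DataFrame({"key": pdf["key"].iloc[:1],
#                                 "total": [pdf["value"].sum()]}),
#       schema="key long, total double")
```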