Hi Gourav,
The static table is broadcast prior to the join, so the shuffle is primarily to
avoid OOMs during the UDF.
It's not quite a Cartesian product, but yes, the join results in multiple records
per input record. The number of output records varies depending on the number of
duplicates in the static table.
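For anyone skimming the thread, here is a minimal sketch of the pattern being
described: broadcast the static side, then add an explicit shuffle between the
join and the Pandas UDF so the much larger join output is spread over more
tasks. The paths, join key, and partition count are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# streaming fact data and the small, non-unique-keyed static table (paths invented)
stream_df = spark.readStream.format("delta").load("/delta/events")
static_df = spark.read.format("delta").load("/delta/static_dim")

# broadcast join: no shuffle is needed for the join itself
joined = stream_df.join(broadcast(static_df), "key")

# explicit repartition so the exploded rows are split across more, smaller tasks
# before pandas materializes each partition in memory
exploded = joined.repartition(400)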
> We set spark.hadoop.parquet.block.size in our spark config for writing to Delta.
>
> Adam
>
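(For reference, a sketch of what that session-level setting looks like; whether
it is actually honored when writing to Delta is the open question here, and the
16 MB value and path are only examples.)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.parquet.block.size", 16 * 1024 * 1024)  # example value
         .getOrCreate())

df = spark.range(1000)  # stand-in for the real DataFrame
df.write.format("delta").mode("append").save("/delta/output")  # placeholder path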
> On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho
> wrote:
>
>> I tried re-writing the table with the updated block size but it doesn't
>> appear to have an effect
483
num_row_groups: 1
format_version: 1.0
serialized_size: 6364
Columns
...
Chris
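As an aside, that dump looks like pyarrow's parquet metadata output; for anyone
wanting to reproduce the check, inspecting the row groups of one of the
underlying data files is roughly the following (the file name is a placeholder):

import pyarrow.parquet as pq

# one of the Delta table's parquet data files (placeholder name)
md = pq.ParquetFile("/delta/output/part-00000.snappy.parquet").metadata

print(md)  # shows num_columns, num_rows, num_row_groups, format_version, serialized_size
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(i, rg.num_rows, rg.total_byte_size)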
On Fri, Feb 11, 2022 at 3:37 PM Sean Owen wrote:
> It should just be parquet.block.size indeed.
> spark.write.option("parquet.block.size", "16m").parquet(...)
> This is an issue
> (or rewriting the
> input Delta table with a much smaller row group size which wouldn't be
> ideal performance wise).
>
> Adam
>
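(If rewriting the input table with smaller row groups does get tried, a rough
sketch is below; the paths and the 16 MB target are placeholders, and whether
the per-write option is respected by the Delta writer is exactly what is being
debugged in this thread.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = spark.read.format("delta").load("/delta/input")      # placeholder path

(src.write
    .format("delta")
    .mode("overwrite")
    .option("parquet.block.size", 16 * 1024 * 1024)        # target ~16 MB row groups
    .save("/delta/input_small_rowgroups"))                 # placeholder path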
> On Fri, Feb 11, 2022 at 7:23 AM Chris Coutinho
> wrote:
>
>> Hello,
>>
>> We have a spark structured streaming job that
Hello,
We have a spark structured streaming job that includes a stream-static join
and a Pandas UDF, streaming to/from delta tables. The primary key of the
static table is non-unique, meaning that the streaming join results in
multiple records per input record - in our case a 100x increase. The Pandas
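For concreteness, a stripped-down skeleton of the kind of job described above
(stream-static join followed by a Pandas UDF, Delta in and out); the schema,
column names, paths, and UDF body are all invented:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, pandas_udf

spark = SparkSession.builder.getOrCreate()

stream_df = spark.readStream.format("delta").load("/delta/events")   # placeholder path
static_df = spark.read.format("delta").load("/delta/static_dim")     # placeholder path

@pandas_udf("double")
def score(v: pd.Series) -> pd.Series:          # invented scalar Pandas UDF
    return v * 2.0

result = (stream_df
          .join(broadcast(static_df), "key")   # non-unique key => many more rows
          .withColumn("score", score("value")))

query = (result.writeStream
         .format("delta")
         .option("checkpointLocation", "/delta/_checkpoints/job")    # placeholder path
         .start("/delta/output"))                                    # placeholder path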
I'm also curious whether this is possible, so while I can't offer a solution,
maybe you could try the following.
The driver and executor nodes need to have access to the same
(distributed) file system, so you could try to mount the file system to
your laptop, locally, and then try to submit jobs and/or
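Not sure whether this helps, but as a concrete starting point, pointing a
locally-created SparkSession at the remote master looks roughly like this; the
master URL and path are placeholders, and it assumes the laptop can reach both
the cluster and the shared file system:

from pyspark.sql import SparkSession

# local driver on the laptop, executors on the cluster (master URL is a placeholder)
spark = (SparkSession.builder
         .master("spark://cluster-master:7077")
         .appName("laptop-submit-test")
         .getOrCreate())

# inputs/outputs must live on storage reachable by both the driver and the executors
df = spark.read.parquet("hdfs://namenode:8020/data/example")   # placeholder path
df.show()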