Hi Gourav,
The static table is broadcast prior to the join, so the shuffle is primarily to
avoid OOMs during the UDF.
It's not quite a Cartesian product, but yes, the join results in multiple records
per input record. The number of output records varies depending on the number of
duplicates in the static table.
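For anyone skimming the thread, here is a minimal sketch of the pattern being
described: broadcast the static side, then add an explicit shuffle between the
join and the Pandas UDF so the much larger join output is spread over more
tasks. The paths, join key, and partition count are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# streaming fact data and the small, non-unique-keyed static table (paths invented)
stream_df = spark.readStream.format("delta").load("/delta/events")
static_df = spark.read.format("delta").load("/delta/static_dim")

# broadcast join: no shuffle is needed for the join itself
joined = stream_df.join(broadcast(static_df), "key")

# explicit repartition so the exploded rows are split across more, smaller tasks
# before pandas materializes each partition in memory
exploded = joined.repartition(400)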
> We set spark.hadoop.parquet.block.size in our spark config for writing to Delta.
>
> Adam
>
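(For reference, a sketch of what that session-level setting looks like; whether
it is actually honored when writing to Delta is the open question here, and the
16 MB value and path are only examples.)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.parquet.block.size", 16 * 1024 * 1024)  # example value
         .getOrCreate())

df = spark.range(1000)  # stand-in for the real DataFrame
df.write.format("delta").mode("append").save("/delta/output")  # placeholder path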
> On Fri, Feb 11, 2022, 10:15 AM Chris Coutinho
> wrote:
>
>> I tried re-writing the table with the updated block size but it doesn't
>> appear to have an effect
483
num_row_groups: 1
format_version: 1.0
serialized_size: 6364
Columns
...
Chris
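As an aside, that dump looks like pyarrow's parquet metadata output; for anyone
wanting to reproduce the check, inspecting the row groups of one of the
underlying data files is roughly the following (the file name is a placeholder):

import pyarrow.parquet as pq

# one of the Delta table's parquet data files (placeholder name)
md = pq.ParquetFile("/delta/output/part-00000.snappy.parquet").metadata

print(md)  # shows num_columns, num_rows, num_row_groups, format_version, serialized_size
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(i, rg.num_rows, rg.total_byte_size)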
On Fri, Feb 11, 2022 at 3:37 PM Sean Owen wrote:
> It should just be parquet.block.size indeed.
> spark.write.option("parquet.block.size", "16m").parquet(...)
> This is an issue
> (or rewriting the
> input Delta table with a much smaller row group size which wouldn't be
> ideal performance wise).
>
> Adam
>
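(If rewriting the input table with smaller row groups does get tried, a rough
sketch is below; the paths and the 16 MB target are placeholders, and whether
the per-write option is respected by the Delta writer is exactly what is being
debugged in this thread.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = spark.read.format("delta").load("/delta/input")      # placeholder path

(src.write
    .format("delta")
    .mode("overwrite")
    .option("parquet.block.size", 16 * 1024 * 1024)        # target ~16 MB row groups
    .save("/delta/input_small_rowgroups"))                 # placeholder path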
> On Fri, Feb 11, 2022 at 7:23 AM Chris Coutinho
> wrote:
>
>> Hello,
>>
>> We have a spark structured streaming job that
Hello,
We have a spark structured streaming job that includes a stream-static join
and a Pandas UDF, streaming to/from delta tables. The primary key of the
static table is non-unique, meaning that the streaming join results in
multiple records per input record - in our case a 100x increase. The Pandas
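For concreteness, a stripped-down skeleton of the kind of job described above
(stream-static join followed by a Pandas UDF, Delta in and out); the schema,
column names, paths, and UDF body are all invented:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, pandas_udf

spark = SparkSession.builder.getOrCreate()

stream_df = spark.readStream.format("delta").load("/delta/events")   # placeholder path
static_df = spark.read.format("delta").load("/delta/static_dim")     # placeholder path

@pandas_udf("double")
def score(v: pd.Series) -> pd.Series:          # invented scalar Pandas UDF
    return v * 2.0

result = (stream_df
          .join(broadcast(static_df), "key")   # non-unique key => many more rows
          .withColumn("score", score("value")))

query = (result.writeStream
         .format("delta")
         .option("checkpointLocation", "/delta/_checkpoints/job")    # placeholder path
         .start("/delta/output"))                                    # placeholder path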
I'm also curious whether this is possible, so while I can't offer a solution,
maybe you could try the following.
The driver and executor nodes need to have access to the same
(distributed) file system, so you could try to mount the file system to
your laptop, locally, and then try to submit jobs and/or
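Not sure whether this helps, but as a concrete starting point, pointing a
locally-created SparkSession at the remote master looks roughly like this; the
master URL and path are placeholders, and it assumes the laptop can reach both
the cluster and the shared file system:

from pyspark.sql import SparkSession

# local driver on the laptop, executors on the cluster (master URL is a placeholder)
spark = (SparkSession.builder
         .master("spark://cluster-master:7077")
         .appName("laptop-submit-test")
         .getOrCreate())

# inputs/outputs must live on storage reachable by both the driver and the executors
df = spark.read.parquet("hdfs://namenode:8020/data/example")   # placeholder path
df.show()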