
Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh

OK, the number of partitions n, or more to the point the "optimum" number of partitions, depends among other things on the size of your batch DataFrame and on the degree of parallelism at the endpoint where you will be writing to the sink. If you require high parallelism because your tasks are fine-grained, then

Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh

Is this the point you are trying to implement? I have a state data source which enables the state in SS (Structured Streaming) to be rewritten, which enables repartitioning, schema evolution, etc. via a batch query. The writer requires hash partitioning against the group key, with the "desired number of
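
To make the distinction in the thread subject concrete, here is a minimal plain-Python sketch (not Spark code) contrasting shuffle-style hash partitioning on a group key with coalesce-style merging of whole partitions. The function names, the `(key, value)` record shape, and the use of `zlib.crc32` are assumptions made for illustration only; Spark uses its own hash function and its own logic for grouping parent partitions.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, int]  # hypothetical record shape: (group key, value)


def hash_partition(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Shuffle-style: every record is routed by a hash of its group key."""
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is chosen only because it is stable across runs (an assumption
        # for this sketch; it is not the hash Spark uses).
        idx = zlib.crc32(key.encode("utf-8")) % num_partitions
        parts[idx].append((key, value))
    return parts


def coalesce(partitions: List[list], num_partitions: int) -> List[list]:
    """Coalesce-style: merge whole input partitions without splitting any of them."""
    if num_partitions >= len(partitions):
        # Without a shuffle the partition count can only go down.
        return [list(p) for p in partitions]
    merged: List[list] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Consecutive input partitions are grouped into the same output partition.
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


if __name__ == "__main__":
    records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
    print(hash_partition(records, 3))
    print(coalesce([[r] for r in records], 2))
```

In this sketch the hash-partitioned output always places both `"a"` records in one partition, while the coalesced output can leave them in different partitions. That is the gap a writer faces if it requires hash partitioning on the group key but wants to avoid a shuffle.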