Re: [pyspark 2.4] maxrecordsperfile option

2019-11-30 Thread Shraddha Shah
After digging in a bit more, it looks like maxrecordsperfile does not provide full parallelism as expected. Any thoughts on this would be really helpful.

On Sat, Nov 23, 2019 at 11:36 PM Rishi Shah wrote:
> Hi All,
>
> Version 2.2 introduced maxrecordsperfile option while writing data, could
> s
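
One possible reading of the observation above, sketched in plain Python rather than Spark: the option (spelled `maxRecordsPerFile` in the Spark docs) caps how many rows a single write task puts into one file, and the task then rolls over to a new file. Under that reading, the number of concurrent writers is still the number of partitions, so a skewed or small partition count limits parallelism regardless of the cap. The function name `plan_output_files` and the partition sizes are made up for illustration, and the sketch assumes empty partitions produce no file.

```python
import math


def plan_output_files(partition_sizes, max_records_per_file):
    """Model how many files each write task would produce.

    Assumption: one write task per partition, and each task writes
    its files one after another (no extra parallelism from the cap).
    """
    if max_records_per_file <= 0:
        # Zero or negative means "no limit": one file per non-empty partition.
        return [1 if n > 0 else 0 for n in partition_sizes]
    return [math.ceil(n / max_records_per_file) for n in partition_sizes]


# Hypothetical, heavily skewed partition sizes.
sizes = [1_000_000, 10, 0]
files = plan_output_files(sizes, 250_000)

# Tasks that actually have rows to write; this is the write parallelism
# in this model, however many files come out.
parallel_writers = sum(1 for n in sizes if n > 0)

print(files, parallel_writers)
```

In this model the large partition yields four files, but they are all written by the same task, which would match the "no full parallelism" behaviour described above.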

[pyspark 2.4] maxrecordsperfile option

2019-11-23 Thread Rishi Shah
Hi All,

Version 2.2 introduced the maxrecordsperfile option for writing data. Could someone help me understand the performance impact of using maxrecordsperfile (a single pass that writes the data with this option) versus repartitioning (a two-stage process where we write the data and then consolidate it later)?

-- R
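
For reference, the two approaches being compared could look roughly like the sketch below. It assumes Parquet output and overwrite mode, neither of which is stated in the question; the function names, paths, and default values are hypothetical. The calls themselves (`DataFrameWriter.option`, `mode`, `parquet`, `DataFrame.repartition`, `spark.read.parquet`) are standard PySpark.

```python
def write_with_cap(df, path, max_records=1_000_000):
    """Single pass: keep the current partitioning, cap rows per output file."""
    (df.write
       .option("maxRecordsPerFile", max_records)
       .mode("overwrite")
       .parquet(path))


def write_then_consolidate(spark, df, tmp_path, final_path, num_files):
    """Two stages: write as-is, then read back and consolidate.

    repartition() triggers a full shuffle, so this reads and writes
    the data a second time.
    """
    # Stage 1: write with whatever partitioning df already has.
    df.write.mode("overwrite").parquet(tmp_path)
    # Stage 2: read back and rewrite into num_files partitions.
    (spark.read.parquet(tmp_path)
          .repartition(num_files)
          .write
          .mode("overwrite")
          .parquet(final_path))


# Usage, given an existing SparkSession `spark` and DataFrame `df`:
#   write_with_cap(df, "/tmp/out_capped", max_records=500_000)
#   write_then_consolidate(spark, df, "/tmp/out_raw", "/tmp/out_final", 8)
```

The first variant avoids the extra read and shuffle but leaves the file count dependent on how the data is already partitioned; the second gives direct control over the file count at the cost of a second pass.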