Re: Merge multiple different s3 logs using pyspark 2.4.3

2020-01-09 Thread Shraddha Shah
Unless I am reading this wrong, can't this be achieved with aws s3 sync?

aws s3 sync s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12

Thanks,
-Shraddha

On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta wrote:
> why s3a?
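For comparison, a minimal sketch of doing the same copy in PySpark rather than via the CLI. The file format (parquet) is an assumption, and the s3a:// scheme assumes the Hadoop S3A connector is configured (possibly what the "why s3a?" question above refers to):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-s3-logs").getOrCreate()

    # Hypothetical paths mirroring the aws s3 sync command above.
    src = "s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12"
    dst = "s3a://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12"

    # Read the source partition and append it under the processed prefix.
    df = spark.read.parquet(src)
    df.write.mode("append").parquet(dst)

Unlike aws s3 sync, this rewrites the data through Spark, so it can also merge or re-shape records along the way.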

Re: [pyspark 2.4] maxrecordsperfile option

2019-11-30 Thread Shraddha Shah
After digging in a bit more, it looks like maxRecordsPerFile does not provide full parallelism as expected. Any thoughts on this would be really helpful.

On Sat, Nov 23, 2019 at 11:36 PM Rishi Shah wrote:
> Hi All,
>
> Version 2.2 introduced the maxRecordsPerFile option while writing data, could s
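For reference, the option is set per write; paths and the record threshold below are placeholders. Note that maxRecordsPerFile only caps how many records land in each output file within a task; write parallelism is still governed by the number of partitions, which matches the lack of "full parallelism" observed above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://my-bucket/input/")  # placeholder path

    # Each task splits its output into files of at most 1M records,
    # but the number of concurrent writers equals the partition count,
    # so control parallelism with repartition() explicitly.
    (df.repartition(200)
       .write
       .option("maxRecordsPerFile", 1000000)
       .parquet("s3a://my-bucket/output/"))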

Re: [pyspark 2.3+] repartition followed by window function

2019-05-22 Thread Shraddha Shah
Any suggestions?

On Wed, May 22, 2019 at 6:32 AM Rishi Shah wrote:
> Hi All,
>
> If the dataframe is repartitioned in memory by the (date, id) columns, and I
> then use multiple window functions whose partition-by clause uses the same
> (date, id) columns, we can avoid the shuffle/sort again, I believe. Ca
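A sketch of the pattern being asked about, with column names taken from the question and the data source, "ts", and "amount" columns assumed for illustration. repartition("date", "id") produces hash partitioning on both columns, which can satisfy a window's partitionBy on the same columns and avoid a second exchange, though the window still sorts within each partition:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://my-bucket/events/")  # placeholder path

    # Repartition once on the same keys all the windows use.
    df = df.repartition("date", "id")

    w = Window.partitionBy("date", "id").orderBy("ts")

    out = (df
           .withColumn("rn", F.row_number().over(w))
           .withColumn("prev_amount", F.lag("amount").over(w)))

    # Inspect the physical plan to confirm there is a single Exchange.
    out.explain()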

Re: Use derived column for other derived column in the same statement

2019-04-21 Thread Shraddha Shah
Also, the same question for a groupBy/agg operation: how can we use one aggregated result (say min(amount)) to derive another aggregated column?

On Sun, Apr 21, 2019 at 11:24 PM Rishi Shah wrote:
> Hello All,
>
> How can we use a derived column1 for deriving another column in the same
> dataframe oper
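A sketch of the usual workarounds, with hypothetical column names and data. A column alias is not visible to sibling expressions in the same select() or agg(), so either chain withColumn calls so the first derived column exists before the second references it, or alias the aggregate and derive from it in a follow-up select:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 10.0), ("a", 4.0), ("b", 7.0)], ["id", "amount"]
    )

    # Derived column used by another derived column: chain withColumn calls.
    df2 = (df
           .withColumn("amount_x2", F.col("amount") * 2)
           .withColumn("amount_x2_plus1", F.col("amount_x2") + 1))

    # groupBy/agg: alias the aggregates, then derive from them in a
    # second select, since aliases are not visible within the same agg().
    agg = (df.groupBy("id")
             .agg(F.min("amount").alias("min_amount"),
                  F.max("amount").alias("max_amount"))
             .select("id", "min_amount",
                     (F.col("max_amount") - F.col("min_amount")).alias("range")))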