Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Hollis
From: Gourav Sengupta Date: 12/25/2021 03:46 To: Sean Owen Cc: Andrew Davidson, Nicholas Gustafson, User Subject: Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns Hi, maybe I am getting confused as

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, maybe I am getting confused as always :), but the requirement looked pretty simple to me to be implemented in SQL, or it is just the euphoria of Christmas Eve. Anyway, in case the above can be implemented in SQL, then I can have a look at it. Yes, indeed there are bespoke scenarios where

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
This is simply not generally true, no, and not in this case. The programmatic and SQL APIs overlap a lot, and where they do, they're essentially aliases. Use whatever is more natural. What I wouldn't recommend doing is emulating SQL-like behavior in custom code, UDFs, etc. The native operators will

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
union them all together. Each “part” will still need to iterate 16000 times. In general I assume we want to avoid for loops. I assume Spark is unable to optimize them. It would be nice if Spark provided some sort of join all function even if it used a for loop to hide this

OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Andrew Davidson
holidays Andy From: Sean Owen Date: Friday, December 24, 2021 at 8:30 AM To: Gourav Sengupta Cc: Andrew Davidson, Nicholas Gustafson, User Subject: Re: AnalysisException: Trouble using select() to append multiple columns (that's not the situation below we are commenting on) On Fri

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
Thanks Nicholas Andy From: Nicholas Gustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Davidson wrote: Thanks Nicholas Andy From: Nicholas Gustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
ustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: AnalysisException: Trouble using select() to append multiple columns Since df1 and df2 are dif

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
las Andy From: Nicholas Gustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: AnalysisException: Trouble using select() to append multiple columns

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-18 Thread Andrew Davidson
Thanks Nicholas Andy From: Nicholas Gustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: AnalysisException: Trouble using select() to append multiple columns Since df1 and df2 are different DataFrames, you will need to

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-17 Thread Nicholas Gustafson
Since df1 and df2 are different DataFrames, you will need to use a join. For example: df1.join(df2.selectExpr("Name", "NumReads as ctrl_2"), on=["Name"]) On Dec 17, 2021, at 16:25, Andrew Davidson wrote: Hi I am a newbie I have 16,000 data files, all files have the same number o
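
The two-frame join suggested above can be folded over a list of DataFrames. The following is a minimal sketch, not code from the thread: the helper name join_all and the (label, DataFrame) pairing are hypothetical, and it assumes every frame has the Name and NumReads columns used in the example. It relies only on the PySpark DataFrame methods selectExpr and join.

```python
from functools import reduce


def join_all(frames, key="Name", value="NumReads"):
    """Join many PySpark DataFrames on `key`, one renamed `value` column per frame.

    `frames` is a non-empty list of (label, DataFrame) pairs (a hypothetical
    convention for this sketch); each label becomes the name of one output column.
    """
    # Keep only the key and the value column, renamed to the frame's label.
    slim = [df.selectExpr(key, f"{value} as {label}") for label, df in frames]
    # Fold the list into one DataFrame: ((f0 join f1) join f2) join ...
    return reduce(lambda left, right: left.join(right, on=[key]), slim)
```

With df1 and df2 as in the reply, join_all([("ctrl_1", df1), ("ctrl_2", df2)]) would produce the columns Name, ctrl_1 and ctrl_2. Each additional frame adds one more join to the plan, and the follow-up thread in this listing reports running out of memory when joining thousands of DataFrames, so this is probably only suitable for a modest number of frames.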

AnalysisException: Trouble using select() to append multiple columns

2021-12-17 Thread Andrew Davidson
Hi, I am a newbie. I have 16,000 data files; all files have the same number of rows and columns. The row IDs are identical and are in the same order. I want to create a new data frame that contains the 3rd column from each data file. I wrote a test program that uses a for loop and join. It works w
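
To make the intended result concrete, here is a small plain-Python sketch of the table described above. It is not a Spark solution and not the poster's program. The function name, the tab delimiter, the header line, and the row id sitting in the first column are all assumptions made for illustration.

```python
import csv


def append_third_columns(files, delimiter="\t"):
    """Build rows of [row_id, col3_from_file_1, col3_from_file_2, ...].

    `files` maps a label to an open text file. Every file is assumed to have a
    header line, the row id in column 1, and the same rows in the same order.
    """
    header, table = ["row_id"], None
    for label, handle in files.items():
        rows = list(csv.reader(handle, delimiter=delimiter))[1:]  # skip header
        if table is None:
            table = [[row[0]] for row in rows]  # start from the row ids
        if [row[0] for row in rows] != [out[0] for out in table]:
            raise ValueError(f"row ids in {label} do not match")
        header.append(label)
        for out, row in zip(table, rows):
            out.append(row[2])  # 3rd column is zero-based index 2
    return [header] + (table or [])
```

Because the row ids are checked rather than trusted, a file whose rows are out of order raises an error instead of silently misaligning values.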