Hi,

In Python, or in Spark generally, you can just "read" the files and select the column. I am assuming you are reading each file individually into a separate DataFrame and joining them. Instead, you can read all the files into a single DataFrame and select one column.
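For example, a minimal sketch in PySpark, assuming the inputs are CSV files under a path like /data/samples/ and that the column of interest is named "value" (the path, file format, and column name are placeholders for your setup):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# One read call pulls every matching file into a single DataFrame.
# Spark unions the files rather than joining them, so there is no
# join of thousands of inputs to blow up memory.
df = spark.read.csv("/data/samples/*.csv", header=True, inferSchema=True)

# Keep just the column you care about, tagging each row with the file
# it came from so the per-file columns can be told apart later.
long_df = long_df = df.select(input_file_name().alias("source_file"), "value")
```

If you then need the cbind-style wide layout, and the files carry a shared row-name column, a groupBy on the row name plus pivot("source_file") should give you the wide table without chaining thousands of joins.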
On Wed, Feb 9, 2022 at 2:55 AM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

> I need to create a single table by selecting one column from thousands of
> files. The columns are all of the same type, have the same number of rows,
> and share the same row names. I am currently using join. I get OOM on a
> mega-mem cluster with 2.8 TB.
>
> Does Spark have something like cbind()? "Take a sequence of vector, matrix
> or data-frame arguments and combine by *c*olumns or *r*ows, respectively."
>
> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind
>
> Digging through the Spark documentation I found a UDF example:
>
> https://spark.apache.org/docs/latest/sparkr.html#dapply
>
> ```
> # Convert waiting time from hours to seconds.
> # Note that we can apply UDF to DataFrame.
> schema <- structType(structField("eruptions", "double"),
>                      structField("waiting", "double"),
>                      structField("waiting_secs", "double"))
> df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
> head(collect(df1))
> ##   eruptions waiting waiting_secs
> ## 1     3.600      79         4740
> ## 2     1.800      54         3240
> ## 3     3.333      74         4440
> ## 4     2.283      62         3720
> ## 5     4.533      85         5100
> ## 6     2.883      55         3300
> ```
>
> I wonder if this is just a wrapper around join? If so, it is probably not
> going to help me out.
>
> Also, I would prefer to work in Python.
>
> Any thoughts?
>
> Kind regards
>
> Andy

--
Best Regards,
Ayan Guha