Hi

In Python, or in Spark generally, you can simply read the files and select
the column. I am assuming you are reading each file into a separate
DataFrame and joining them. Instead, you can read all the files into a
single DataFrame and select one column.

On Wed, Feb 9, 2022 at 2:55 AM Andrew Davidson <aedav...@ucsc.edu.invalid>
wrote:

> I need to create a single table by selecting one column from thousands of
> files. The columns are all of the same type and have the same number of
> rows and the same row names. I am currently using join, and I get OOM on a
> mega-mem cluster with 2.8 TB.
>
>
>
> Does Spark have something like cbind()? "Take a sequence of vector, matrix
> or data-frame arguments and combine by columns or rows,
> respectively."
>
>
>
> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind
>
>
>
> Digging through the Spark documentation, I found a UDF example:
>
> https://spark.apache.org/docs/latest/sparkr.html#dapply
>
>
>
> ```
> # Convert waiting time from hours to seconds.
> # Note that we can apply UDF to DataFrame.
> schema <- structType(structField("eruptions", "double"),
>                      structField("waiting", "double"),
>                      structField("waiting_secs", "double"))
> df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
> head(collect(df1))
> ##  eruptions waiting waiting_secs
> ##1     3.600      79         4740
> ##2     1.800      54         3240
> ##3     3.333      74         4440
> ##4     2.283      62         3720
> ##5     4.533      85         5100
> ##6     2.883      55         3300
> ```
>
>
>
> I wonder whether this is just a wrapper around join? If so, it is probably
> not going to help me.
>
>
>
> Also, I would prefer to work in Python.
>
>
>
> Any thoughts?
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
>
>


-- 
Best Regards,
Ayan Guha
