Something like this?
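// Key each row of the big file by name: (name, id)
val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
// Key each lookup row by name: (name, full lookup line)
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))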
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)
It's Scala, by the way; the Python API should be similar.
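For the pyspark side, a rough sketch along the same lines might look like the following. The helper names (name_id, name_gender, format_row, first_rows_with_gender) are made up for illustration, and it assumes an existing SparkContext `sc`, files without a header row, and that the first 1000 lines can be taken before the join. Paths are placeholders.

```python
def name_id(line):
    """Turn 'id<TAB>name<TAB>...' into (name, id)."""
    fields = line.split("\t")
    return (fields[1], fields[0])


def name_gender(line):
    """Turn 'name<TAB>gender' into (name, gender)."""
    fields = line.split("\t")
    return (fields[0], fields[1])


def format_row(pair):
    """Turn (name, (id, gender)) into 'id<TAB>name<TAB>gender'."""
    name, (row_id, gender) = pair
    return "\t".join([row_id, name, gender])


def first_rows_with_gender(sc, csv_path, lookup_path, n=1000):
    # Assumption: `sc` is an already-created SparkContext.
    # take(n) pulls only the first n lines of the file to the driver,
    # so the full 10GB file does not have to go through the join.
    people = sc.parallelize(sc.textFile(csv_path).take(n)).map(name_id)
    genders = sc.textFile(lookup_path).map(name_gender)
    # Inner join on name: rows whose name is missing from the lookup
    # table are dropped, and the original row order is not preserved.
    return people.join(genders).map(format_row).collect()
```

Calling first_rows_with_gender(sc, "/path/to/first.csv", "/path/to/second.csv") once per csv file would give the id, name, gender rows for that file.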
Thanks
Best Regards
On Sat, Jun 13, 2015 at 12:16 AM, Rex X <[email protected]> wrote:
> To be concrete, say we have a folder with thousands of tab-delimited csv
> files with the following attribute format (each csv file is about 10GB):
>
> id name address city...
> 1 Matt add1 LA...
> 2 Will add2 LA...
> 3 Lucy add3 SF...
> ...
>
> And we have a lookup table based on "name" above
>
> name gender
> Matt M
> Lucy F
> ...
>
> Now we would like to output the top 1000 rows of each csv file in the
> following format:
>
> id name gender
> 1 Matt M
> ...
>
> Can we use pyspark to efficiently handle this?