To be concrete, say we have a folder with thousands of tab-delimited CSV files (each about 10 GB) with the following columns:
id name address city...
1 Matt add1 LA...
2 Will add2 LA...
3 Lucy add3 SF...
...
And we have a lookup table keyed on the "name" column above:
name gender
Matt M
Lucy F
...
Now we want to take the top 1000 rows of each CSV file and output them in the following format:
id name gender
1 Matt M
...
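To make the intended result unambiguous, here is a minimal single-machine sketch in plain Python (not pyspark) of what should happen for one file. The function names `load_lookup` and `top_rows_with_gender`, the `limit` parameter, and the choice to emit an empty gender for names missing from the lookup table are my own assumptions for illustration, not requirements.

```python
import csv
from itertools import islice


def load_lookup(lookup_path):
    """Read the tab-delimited lookup file into a {name: gender} dict."""
    with open(lookup_path, newline="") as f:
        return {row["name"]: row["gender"]
                for row in csv.DictReader(f, delimiter="\t")}


def top_rows_with_gender(data_path, lookup, limit=1000):
    """Yield (id, name, gender) for the first `limit` data rows of one file."""
    with open(data_path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        # islice stops after `limit` rows, so the rest of the file is never read.
        for row in islice(reader, limit):
            # Assumption: names absent from the lookup get an empty gender.
            yield row["id"], row["name"], lookup.get(row["name"], "")
```

Running this over every file in the folder gives the desired output, but only on one machine and one file at a time, which is the part we hope to parallelize.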
Can we use PySpark to handle this efficiently?
