How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

Rex X Fri, 12 Jun 2015 11:47:02 -0700

To be concrete, say we have a folder with thousands of tab-delimited CSV
files (each about 10GB) with the following column layout:


    id    name    address    city...
    1    Matt    add1    LA...
    2    Will    add2    LA...
    3    Lucy    add3    SF...
    ...

And we have a lookup table keyed on the "name" column above:

    name    gender
    Matt    M
    Lucy    F
    ...

Now we would like to take the top 1000 rows of each CSV file and output
them in the following format:

    id    name    gender
    1    Matt    M
    ...
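
As a reference for the desired semantics only, a single-machine sketch in
plain Python might look like the following. It is not a Spark solution. The
function name top_rows_with_gender, the assumption that each file starts
with a header row, and the choice to emit an empty gender for names missing
from the lookup table are all illustrative assumptions, not part of the
original data description.

```python
import csv
import io
from itertools import islice


def top_rows_with_gender(lines, gender_by_name, m=1000):
    """Yield (id, name, gender) for the first m data rows of one file.

    `lines` is any iterable of text lines from a tab-delimited file whose
    first line is a header containing "id" and "name" (an assumption).
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    id_col = header.index("id")
    name_col = header.index("name")
    # islice stops after m rows, so the rest of a 10GB file is never read.
    for row in islice(reader, m):
        name = row[name_col]
        # Names absent from the lookup get an empty gender (an assumption).
        yield row[id_col], name, gender_by_name.get(name, "")


# Tiny in-memory stand-in for one file and for the lookup table.
sample = io.StringIO(
    "id\tname\taddress\tcity\n"
    "1\tMatt\tadd1\tLA\n"
    "2\tWill\tadd2\tLA\n"
    "3\tLucy\tadd3\tSF\n"
)
lookup = {"Matt": "M", "Lucy": "F"}
result = list(top_rows_with_gender(sample, lookup, m=2))
```

Here the lookup table is held as an in-memory dict, which presumes it is
small compared with the 10GB data files; that too is an assumption.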

Can we use PySpark to handle this efficiently?
