Hello, I'm trying to use Spark to process a large number of files in S3.
I'm running into an issue that I believe is related to the sheer number of
files and the resources required to build the file listing within the
driver program. If anyone in the Spark community can provide insight or
guidance, it would be greatly appreciated.

The task at hand is to read ~100 million files stored in S3, and
repartition the data into a sensible number of files (perhaps 1,000). The
files are organized in a directory structure like so:


s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name

(Note that each file is very small, containing only 1-10 records.
Unfortunately, this is an artifact of the upstream systems that put data
into S3.)

My Spark program is simple, and works when I target a relatively specific
subdirectory. For example:


sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)

This targets one hour's worth of purchase records, containing about 10,000
files. The driver program blocks (I assume it is making S3 calls to
traverse the directories), and during this time no activity is visible in
the driver UI. After about a minute, the stages and tasks appear in the
UI, and then everything progresses and completes within a few minutes.
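
For reference, the job is essentially the following (a minimal sketch; the
object name, app name, output path, and the coalesce target of 1,000 are
placeholder values I've put in for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object S3Repartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-repartition"))

    // One hour of purchase records: roughly 10,000 small files.
    sc.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*")
      .coalesce(1000)  // placeholder target file count
      .saveAsTextFile("s3n://output-bucket/purchase/2014/01/01/00")  // placeholder output path

    sc.stop()
  }
}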

I need to process all the data (several years' worth). Something like:


sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)

This blocks "forever" (I have only run the program for as long as
overnight). The stages and tasks never appear in the UI. I assume Spark is
building the file listing, which will either take too long and/or cause the
driver to eventually run out of memory.
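
To put rough numbers on that assumption: S3's LIST operation returns at
most 1,000 keys per request, and every key has to be held on the driver. A
back-of-envelope sketch (the average per-key cost here is a guess, not a
measurement):

// Rough estimate only; bytesPerKey is a guess at the in-memory cost per path.
val totalFiles = 100000000L   // ~100 million objects
val bytesPerKey = 150L        // guessed average cost of one key string on the JVM heap
val keysPerList = 1000L       // S3 LIST returns at most 1,000 keys per request

println(f"~${totalFiles * bytesPerKey / math.pow(2, 30)}%.0f GB of key strings on the driver")
println(s"~${totalFiles / keysPerList} LIST requests just to enumerate the files")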

I would appreciate any comments or suggestions. I'm happy to provide more
information if that would be helpful.

Thanks

Landon
