Hello, I'm trying to use Spark to process a large number of files in S3. I'm running into an issue that I believe is related to the high number of files, and the resources required to build the listing within the driver program. If anyone in the Spark community can provide insight or guidance, it would be greatly appreciated.
The task at hand is to read ~100 million files stored in S3 and repartition the data into a sensible number of files (perhaps 1,000). The files are organized in a directory structure like so:

s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name

(Note that each file is very small, containing 1-10 records. Unfortunately this is an artifact of the upstream systems that put data in S3.)

My Spark program is simple, and it works when I target a relatively specific subdirectory. For example:

sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)

This targets one hour's worth of purchase records, about 10,000 files. The driver blocks (I assume it is making S3 calls to traverse the directories), and during this time no activity is visible in the driver UI. After about a minute the stages and tasks appear in the UI, and then everything progresses and completes within a few minutes.

I need to process all the data (several years' worth). Something like:

sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)

This blocks "forever" (I have only run the program for as long as overnight), and the stages and tasks never appear in the UI. I assume Spark is building the file listing, which will either take too long or cause the driver to eventually run out of memory.

I would appreciate any comments or suggestions, and I'm happy to provide more information if that would be helpful.

Thanks,
Landon
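
P.S. In case it's useful, here is a minimal sketch of the program (Scala). The bucket name, output paths, app name, and the choice of 1,000 output partitions are placeholders, and I've omitted the S3 credential configuration; the real job is essentially the same shape.

import org.apache.spark.{SparkConf, SparkContext}

object S3Repartition {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-repartition"))

    // One hour of purchase records (~10,000 files). This works: the driver
    // spends about a minute listing S3, then the job finishes in a few minutes.
    sc.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*")
      .coalesce(1000)
      .saveAsTextFile("s3n://bucket/output/purchase-2014-01-01-00")

    // All event types, all time (~100 million files). This is the version that
    // never gets past the listing phase; no stages or tasks ever appear.
    // sc.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*")
    //   .coalesce(1000)
    //   .saveAsTextFile("s3n://bucket/output/all")

    sc.stop()
  }
}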