Unfortunately not. Again, I wonder if adding support targeted at this "small files problem" would make sense for Spark core, as it is a common problem in our space.
Right now, I don't know of any other options.

Nick

On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn <[email protected]> wrote:

> Nicholas, thanks for the tip. Your suggestion certainly seemed like the
> right approach, but after a few days of fiddling I've come to the
> conclusion that s3distcp will not work for my use case. It is unable to
> flatten directory hierarchies, which I need because my source directories
> contain hour/minute/second parts.
>
> See https://forums.aws.amazon.com/message.jspa?messageID=478960. It seems
> that s3distcp can only combine files in the same path.
>
> Thanks again. That gave me a lot to go on. Any further suggestions?
>
> L
>
> On Thu, Oct 2, 2014 at 4:15 PM, Nicholas Chammas <[email protected]> wrote:
>
>> I believe this is known as the "Hadoop Small Files Problem", and it
>> affects Spark as well. The best approach I've seen to merging small files
>> like this is by using s3distcp, as suggested here
>> <http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/>,
>> as a pre-processing step.
>>
>> It would be great if Spark could somehow handle this common situation out
>> of the box, but for now I don't think it does.
>>
>> Nick
>>
>> On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn <[email protected]> wrote:
>>
>>> Hello, I'm trying to use Spark to process a large number of files in S3.
>>> I'm running into an issue that I believe is related to the high number of
>>> files, and the resources required to build the listing within the driver
>>> program. If anyone in the Spark community can provide insight or guidance,
>>> it would be greatly appreciated.
>>>
>>> The task at hand is to read ~100 million files stored in S3, and
>>> repartition the data into a sensible number of files (perhaps 1,000). The
>>> files are organized in a directory structure like so:
>>>
>>>   s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>>>
>>> (Note that each file is very small, containing 1-10 records each.
>>> Unfortunately this is an artifact of the upstream systems that put data
>>> in S3.)
>>>
>>> My Spark program is simple, and works when I target a relatively
>>> specific subdirectory. For example:
>>>
>>>   sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>>>
>>> This targets 1 hour's worth of purchase records, containing about 10,000
>>> files. The driver program blocks (I assume it is making S3 calls to
>>> traverse the directories), and during this time no activity is visible in
>>> the driver UI. After about a minute, the stages and tasks appear in the
>>> UI, and then everything progresses and completes within a few minutes.
>>>
>>> I need to process all the data (several years' worth). Something like:
>>>
>>>   sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>>>
>>> This blocks "forever" (I have only run the program for as long as
>>> overnight). The stages and tasks never appear in the UI. I assume Spark is
>>> building the file listing, which will take too long and/or cause the
>>> driver to eventually run out of memory.
>>>
>>> I would appreciate any comments or suggestions. I'm happy to provide
>>> more information if that would be helpful.
>>>
>>> Thanks
>>>
>>> Landon
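For reference, the s3distcp pre-processing step Nicholas points to (via the Snowplow post) is normally run as a Hadoop job on EMR along the following lines. This is a sketch rather than a tested command: the jar path, prefixes, and --groupBy regex are illustrative assumptions, and, as Landon found, --groupBy only concatenates objects that match the same capture group under the source prefix, so it does not flatten the hour/minute/second hierarchy in this layout.

    # Hypothetical s3distcp invocation; the jar location varies by EMR AMI.
    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src s3n://bucket/purchase/2014/01/01/00/ \
      --dest s3n://bucket/merged/purchase/2014/01/01/00/ \
      --groupBy '.*(purchase/2014/01/01/00).*' \
      --targetSize 128

Files whose keys yield the same --groupBy capture group are concatenated into output objects of roughly --targetSize MiB.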
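Not something raised in this thread, but for readers hitting the same wall: since the expensive step is the driver-side listing, one direction is to skip textFile()'s glob expansion entirely and fetch the objects from a pre-built key manifest inside the executors. The Scala sketch below is a rough, untested illustration; the bucket name, manifest path, and output path are placeholders, and it assumes the AWS SDK for Java (v1) is available on the executor classpath.

    import scala.io.Source
    import com.amazonaws.services.s3.AmazonS3Client

    // Placeholder names; a real manifest (one S3 key per line) would come from
    // an S3 inventory, the upstream producers, or a separate listing job.
    val bucket = "bucket"
    val keys = sparkContext.textFile("s3n://bucket/manifests/purchase-keys.txt")

    val records = keys.repartition(1000).mapPartitions { part =>
      // One client per partition; credentials come from the default provider chain.
      val s3 = new AmazonS3Client()
      part.flatMap { key =>
        val obj = s3.getObject(bucket, key)
        val in = obj.getObjectContent
        // Materialize the lines so the stream can be closed before moving on.
        val lines = Source.fromInputStream(in).getLines().toList
        in.close()
        lines
      }
    }

    records.saveAsTextFile("s3n://bucket/merged/purchase/")

This only moves the problem, of course: something still has to produce the manifest, but that can be done incrementally and outside the Spark job, so the driver never has to walk ~100 million keys at submit time.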
