Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Gourav Sengupta
cutionException e) {
> >                 logger.error("", e);
> >             }
> >         }
> >     }
> >
> >     static class SaveData {
> >         private DataFrame df;
> >         private String path;
> >
> >         SaveData(DataFrame df, String path) {
> >             this.df = df;
> >             this.path

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Andy Davidson
te().json(data.path);
            }
        }
    }
}

From: Pedro Rodriguez
Date: Wednesday, July 27, 2016 at 8:40 PM
To: Andrew Davidson
Cc: "user @spark"
Subject: Re: performance problem when reading lots of small files created by spark streaming.

> There are a few blog posts that detail one p

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
There are a few blog posts that detail one possible (and likely) issue, for example: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 TL;DR: The Hadoop libraries that Spark uses assume that their input comes from a file system (which works with HDFS); however, S3 is a key-value store, not a