Hi Patrick
I should probably explain my use case in a bit more detail. I have hundreds of
thousands to millions of clients uploading events to my pipeline; these are
batched periodically (every 60 seconds at the moment) into logs, which are
dumped into S3 (and loaded into a data warehouse). I need to po
Thank you Chris. I am familiar with s3distcp; I'm trying to replicate some of
that functionality and combine it with my log post-processing in one step
instead of adding yet another step.
On Saturday, May 3, 2014 4:15 PM, Chris Fregly wrote:
not sure if this directly addresses your issue, Peter, but
Chris,
To use s3distcp in this case, are you suggesting saving the RDD to
local/ephemeral HDFS and then copying it up to S3 using this tool?
On Sat, May 3, 2014 at 7:14 PM, Chris Fregly wrote:
> not sure if this directly addresses your issue, Peter, but it's worth
> mentioning a handy AWS EMR u
Hi Peter,
If your dataset is large (3 GB), then why not just chunk it into
multiple files? You'll need to do this anyway if you want to
read/write it to S3 quickly, because S3's throughput per file is
limited (as you've seen).
This is exactly why the Hadoop API lets you save datasets with many
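In code, the multi-file approach is just a matter of not coalescing before the save. A minimal sketch (the paths, the transformation, and the partition count here are all made up):

// Keep (or increase) the number of partitions so each one is written
// to S3 in parallel as its own part-NNNNN object.
val events = sc.textFile("s3n://bucket/2014-04-28/")
val processed = events.filter(_.nonEmpty)   // placeholder transformation
processed.repartition(64).saveAsTextFile("s3n://bucket/processed/2014-04-28/")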
not sure if this directly addresses your issue, Peter, but it's worth
mentioning a handy AWS EMR utility called s3distcp that can upload a single
HDFS file - in parallel - to a single, concatenated S3 file once all the
partitions are uploaded. kinda cool.
here's some info:
http://docs.aws.amazon.
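For reference, a roll-up with s3distcp looks roughly like this (the paths are made up and the jar location assumes the stock EMR AMI; --groupBy concatenates all files matching the regex into one output named after the capture group, if I'm reading the docs right):

hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src hdfs:///output/2014-04-28/ \
  --dest s3n://rolled-up-logs/ \
  --groupBy '.*/(2014-04-28)/part.*' \
  --targetSize 2048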
The fastest way to save to S3 should be to leave the RDD with many
partitions, because all partitions will be written out in parallel.
Then, once the various parts are in S3, somehow concatenate the files
together into one file.
If this can be done within S3 (I don't know if this is possible), th
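For what it's worth, it is possible within S3: the multipart upload API accepts existing objects as parts (a server-side copy), so the part files can be stitched into one object without ever downloading them. A rough sketch with the AWS Java SDK - the bucket names, prefix and output key are made up:

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model._

val s3 = new AmazonS3Client()
val destBucket = "rolled-up-logs"
val destKey = "2014-04-28.csv"

// Start a multipart upload for the concatenated object.
val init = s3.initiateMultipartUpload(
  new InitiateMultipartUploadRequest(destBucket, destKey))

// Server-side copy each existing part object into the new object.
// Caveat: every part except the last must be at least 5 MB.
val parts = s3.listObjects("upload", "output/2014-04-28/")
  .getObjectSummaries.asScala.sortBy(_.getKey)

val partETags = parts.zipWithIndex.map { case (summary, i) =>
  val copy = new CopyPartRequest()
    .withUploadId(init.getUploadId)
    .withDestinationBucketName(destBucket)
    .withDestinationKey(destKey)
    .withSourceBucketName(summary.getBucketName)
    .withSourceKey(summary.getKey)
    .withPartNumber(i + 1)
  val result = s3.copyPart(copy)
  new PartETag(result.getPartNumber, result.getETag)
}

// Finish the upload; S3 stitches the parts into a single object.
s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
  destBucket, destKey, init.getUploadId, partETags.asJava))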
Thank you Patrick.
I took a quick stab at it:
import com.amazonaws.services.s3.AmazonS3Client

val s3Client = new AmazonS3Client(...)
// Copy the single part file to its final, nicely named S3 object.
val copyObjectResult = s3Client.copyObject("upload", outputPrefix + "/part-00000",
  "rolled-up-logs", "2014-04-28.csv")
// List the original output objects under the prefix (for cleanup).
val objectListing = s3Client.listObjects("upload", outputPrefix)
s3Client.d
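If the intent of that listing is to clean up the individual part files after the copy, the rest might look something like this (my assumption; this is not from Peter's original message):

import scala.collection.JavaConverters._

// Remove the individual part files once the rolled-up object exists.
objectListing.getObjectSummaries.asScala.foreach { summary =>
  s3Client.deleteObject(summary.getBucketName, summary.getKey)
}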
This is a consequence of the way the Hadoop files API works. However,
you can (fairly easily) add code to just rename the file because it
will always produce the same filename.
(heavy use of pseudo code)
dir = "/some/dir"
rdd.coalesce(1).saveAsTextFile(dir)
f = new File(dir + "/part-00000")
f.move
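A concrete version of that pseudocode, using the Hadoop FileSystem API that Spark already ships with (the paths are placeholders; on s3n a rename is a copy plus delete under the hood):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = "s3n://bucket/output/2014-04-28"
rdd.coalesce(1).saveAsTextFile(dir)

// coalesce(1) guarantees a single part file, always named part-00000,
// so it can simply be renamed to the final object name.
val fs = FileSystem.get(new URI(dir), sc.hadoopConfiguration)
fs.rename(new Path(dir + "/part-00000"), new Path("s3n://bucket/2014-04-28.csv"))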
Thanks Nicholas. This is a bit of a shame; it's not very practical for log roll-up,
for example, when every output needs to be in its own "directory".
On Wednesday, April 30, 2014 12:15 PM, Nicholas Chammas wrote:
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
coalesce(1),
Yes, saveAsTextFile() will give you 1 part per RDD partition. When you
coalesce(1), you move everything in the RDD to a single partition, which
then gives you 1 output file.
It will still be called part-00000 or something like that because that’s
defined by the Hadoop API that Spark uses for readi
Ah, looks like RDD.coalesce(1) solves one part of the problem.
On Wednesday, April 30, 2014 11:15 AM, Peter wrote:
Hi
Playing around with Spark & S3, I'm opening multiple objects (CSV files) with:
val hfile = sc.textFile("s3n://bucket/2014-04-28/")
so hfile is a RDD representing 10 object
Hi
Playing around with Spark & S3, I'm opening multiple objects (CSV files) with:
val hfile = sc.textFile("s3n://bucket/2014-04-28/")
so hfile is an RDD representing 10 objects that were "underneath" 2014-04-28.
After I've sorted and otherwise transformed the content, I'm trying to write it
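Spelled out, the behaviour in question (the transformation and output path are placeholders): each object under the prefix becomes at least one partition, and saveAsTextFile writes one part-NNNNN file per partition.

val hfile = sc.textFile("s3n://bucket/2014-04-28/")
println(hfile.partitions.size)   // at least 10 for 10 input objects
hfile.map(_.toUpperCase).saveAsTextFile("s3n://bucket/out/2014-04-28/")
// The output "directory" now holds part-00000, part-00001, ... one per partition.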