Hello Vitalii,

The 5TB limit only applies when you run your jobs as a job flow on the EMR framework. I don't think I can use that in my case, since I have a CDH4 cluster on EC2, but thanks for the tip.

Reference: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html
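For the archives, the native-filesystem variant that did work for me looks roughly like this. It is a minimal sketch: the credentials and paths are placeholders, and the fs.s3n.* property names are the ones I believe Hadoop's native S3 filesystem (s3n) reads, following David's fs.s3n.aws... hint below:

set fs.s3n.awsAccessKeyId 'XXXXXXXXXXXXXXXXXX';
set fs.s3n.awsSecretAccessKey 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
--load from the native S3 filesystem (s3n) instead of the block store (s3)
data = load 's3n://steamdata/nysedata/NYSE_daily.txt' as
    (exchange:chararray, symbol:chararray, date:chararray, open:float,
    high:float, low:float, close:float, volume:int, adj_close:float);
symbolgrp = group data by symbol;
symcount = foreach symbolgrp generate group, COUNT(data);
symcountordered = order symcount by $1;
--write the result back through the same native scheme
store symcountordered into 's3n://steamdata/nyseoutput/daily';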
On Tue, Apr 9, 2013 at 9:09 AM, Vitalii Tymchyshyn <[email protected]> wrote:

> Have you tried it with native? AFAIR the limitation was raised to 5TB a
> few years ago.
>
> On Apr 8, 2013 at 18:30, "Panshul Whisper" <[email protected]> wrote:
>
> > Thank you for the advice, David.
> >
> > I tried this and it works with the native system. But my problem is not
> > solved yet, because I have to work with files much bigger than 5GB. My
> > test data file is 9GB. How do I make it read from s3://?
> >
> > Thanking You,
> >
> > Regards,
> >
> > On Mon, Apr 8, 2013 at 3:27 PM, David LaBarbera
> > <[email protected]> wrote:
> >
> > > Try
> > > fs.s3n.aws...
> > >
> > > and also load from s3:
> > > data = load 's3n://...'
> > >
> > > The "n" stands for native. I believe S3 also supports block device
> > > storage (s3://), which allows bigger files to be stored. I don't know
> > > how (if at all) the two types interact.
> > >
> > > David
> > >
> > > On Apr 7, 2013, at 1:11 PM, Panshul Whisper <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am trying to run a Pig script which is supposed to read input from
> > > > s3 and write back to s3. The cluster scenario is as follows:
> > > > * Cluster is installed on EC2 using Cloudera Manager 4.5 Automatic
> > > >   Installation
> > > > * Installed version: CDH4
> > > > * Script location: on one of the nodes of the cluster
> > > > * Running as: $ pig countGroups_daily.pig
> > > >
> > > > *The Pig Script:*
> > > > set fs.s3.awsAccessKeyId xxxxxxxxxxxxxxxxxx
> > > > set fs.s3.awsSecretAccessKey xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > > > --load the sample input file
> > > > data = load 's3://steamdata/nysedata/NYSE_daily.txt' as
> > > >     (exchange:chararray, symbol:chararray, date:chararray, open:float,
> > > >     high:float, low:float, close:float, volume:int, adj_close:float);
> > > > --group data by symbols
> > > > symbolgrp = group data by symbol;
> > > > --count data in every group
> > > > symcount = foreach symbolgrp generate group, COUNT(data);
> > > > --order the counted list by count
> > > > symcountordered = order symcount by $1;
> > > > store symcountordered into 's3://steamdata/nyseoutput/daily';
> > > >
> > > > *Error:*
> > > >
> > > > Message: org.apache.pig.backend.executionengine.ExecException: ERROR
> > > > 2118: Input path does not exist: s3://steamdata/nysedata/NYSE_daily.txt
> > > >
> > > > Input(s):
> > > > Failed to read data from "s3://steamdata/nysedata/NYSE_daily.txt"
> > > >
> > > > Please help me figure out what I am doing wrong. I can assure you
> > > > that the input path/file exists on s3 and that the AWS access key and
> > > > secret key entered are correct.
> > > >
> > > > Thanking You,
> > > >
> > > > --
> > > > Regards,
> > > > Ouch Whisper
> > > > 010101010101
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101

--
Regards,
Ouch Whisper
010101010101
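P.S. Before running the full job, a quick way to check that the cluster can see the bucket at all is a minimal load-and-dump in Pig. Again just a sketch with placeholder credentials; the schema is reduced to a single line of text because only connectivity is being tested:

set fs.s3n.awsAccessKeyId 'XXXXXXXXXXXXXXXXXX';
set fs.s3n.awsSecretAccessKey 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
--read the file as raw lines and show a handful of records; if the scheme
--or the credentials are wrong, this surfaces the error without waiting
--for the full job to run
data = load 's3n://steamdata/nysedata/NYSE_daily.txt' using TextLoader() as (line:chararray);
preview = limit data 5;
dump preview;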
