The simplest way would be to merge the output files at the end of your job with getmerge, which concatenates all the part files in an HDFS directory into a single local file:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

If you want to do it programmatically, you can use the FileUtil.copyMerge API, like:

FileUtil.copyMerge(srcFS /* FileSystem of the source (HDFS) */,
    new Path("/output-location"), dstFS /* FileSystem of the destination */,
    new Path("/merged-output") /* path for the merged file */,
    true /* delete the original dir */, conf, null);
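
For reference, here is a minimal self-contained sketch of how that call could look in a Java driver. It is an illustration rather than drop-in code: it assumes the Hadoop 2.x API (where FileUtil.copyMerge is still available), that the source and destination are on the same HDFS, and that /output-location and /merged-output are placeholder paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Concatenate every part file under /output-location into a single
// /merged-output file. The boolean deletes the source directory once
// the merge succeeds; the trailing null means no separator string is
// inserted between the merged files.
FileUtil.copyMerge(fs, new Path("/output-location"),
    fs, new Path("/merged-output"),
    true, conf, null);

You would call this from the driver once the streaming job has finished writing, e.g. after stopping the StreamingContext.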

Thanks
Best Regards

On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:

> Thanks Akhil for the suggestion, it is now only giving me one part-xxxx.
> Is there any way I can just create a file rather than a directory? It
> doesn't seem like there is a saveAsTextFile option for
> JavaPairReceiverInputDStream.
>
> Also, for the copy/merge API, how would I add that to my Spark job?
>
> Thanks Akhil!
>
> Best,
>
> Su
>
> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> For a streaming application, every batch will create a new directory
>> and put the data in it. If you don't want multiple part-xxxx files
>> inside each directory, you can do a repartition before the saveAs*
>> call.
>>
>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> String.class, (Class) TextOutputFormat.class);
>>
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am writing simple word counts to HDFS using
>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>> String.class, (Class) TextOutputFormat.class);
>>>
>>> 1) However, every 2 seconds I'm getting a new *directory* that is titled
>>> as a csv. So I'll have test.csv, which will be a directory that has two
>>> files inside of it called part-00000 and part-00001 (something like
>>> that). This obviously makes it very hard for me to read the data stored
>>> in the csv files. I am wondering if there is a better way to store the
>>> JavaPairReceiverInputDStream and JavaPairDStream?
>>>
>>> 2) I know there is a copy/merge Hadoop API for merging files... can this
>>> be done inside Java? I am not sure of the logic behind this API if I am
>>> using Spark Streaming, which is constantly making new files.
>>>
>>> Thanks a lot for the help!
>>>
>>
>>
>
