On Fri, Nov 14, 2014 at 12:14 AM, Oleg Ruchovets <oruchov...@gmail.com> wrote:
> Hi Davies.
> Thank you for the quick answer.
>
> I have code like this:
>
> ....
>
> sc = SparkContext(appName="TAD")
> lines = sc.textFile(sys.argv[1], 1)
> result = lines.map(doSplit).groupByKey().map(lambda (k, vc): traffic_process_model(k, vc))
> result.saveAsTextFile(sys.argv[2])
>
> Can you please give a short example of what I should do?
>
> Also, I found only saveAsTextFile. Does PySpark have a saveAsBinary option, or what is the way to change the format of the text output files?
You can use saveAsPickleFile() [1]. To rename the output afterwards, you could use the following line (it's slow):

>>> os.system("hadoop fs -mv URI [URI …] <dest>")

I also just found that there is a pure Python client for HDFS [2] (not verified). A short end-to-end sketch combining these is at the bottom of this mail.

[1] http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsPickleFile
[2] https://labs.spotify.com/2013/05/07/snakebite/

> Thanks
> Oleg.
>
> On Fri, Nov 14, 2014 at 3:26 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> One option may be to call HDFS tools or a client to rename them after
>> saveAsXXXFile().
>>
>> On Thu, Nov 13, 2014 at 9:39 PM, Oleg Ruchovets <oruchov...@gmail.com>
>> wrote:
>> > Hi,
>> > I am running a pyspark job.
>> > I need to serialize the final result to HDFS in binary files and have
>> > the ability to give a name to the output files.
>> >
>> > I found this post:
>> >
>> > http://stackoverflow.com/questions/25293962/specifying-the-output-file-name-in-apache-spark
>> >
>> > but it explains how to do it using Scala.
>> >
>> > Question:
>> > How to do it using pyspark?
>> >
>> > Thanks
>> > Oleg.
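Putting the two suggestions together, here is a minimal sketch. It assumes a Spark 1.x-era PySpark setup and that the "hadoop" CLI is on the driver's PATH; the output path, the traffic_model_* file-name pattern, and the sample data are placeholders, not taken from the original job.

import subprocess
import sys

from pyspark import SparkContext

sc = SparkContext(appName="TAD")

# Stand-in for the real pipeline (doSplit / traffic_process_model are the
# poster's own functions); any RDD of records works the same way here.
result = sc.parallelize([("key_a", [1, 2, 3]), ("key_b", [4, 5])])

out_dir = sys.argv[1]

# saveAsPickleFile() stores the records as pickled objects inside binary
# SequenceFiles instead of plain text.
result.saveAsPickleFile(out_dir)
sc.stop()

# Spark always names its outputs part-xxxxx under out_dir; to control the
# final file names, rename the parts afterwards with the HDFS client
# (slow, as noted above).
listing = subprocess.check_output(["hadoop", "fs", "-ls", out_dir]).decode()
parts = sorted(line.split()[-1] for line in listing.splitlines() if "/part-" in line)
for i, part in enumerate(parts):
    subprocess.call(["hadoop", "fs", "-mv", part,
                     "%s/traffic_model_%05d.pickle" % (out_dir, i)])

The saved data can be read back later with sc.pickleFile(out_dir), since the renamed files still live under the same directory.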