Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

Igor Berman Thu, 05 Nov 2015 12:50:18 -0800

Java or Scala? I think the DataFrames tutorial covers everything you need.
E.g. if you have a DataFrame and are working from Java, call toJavaRDD():
<https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>
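A minimal sketch of the Parquet-to-RDD conversion in Java, assuming Spark 1.4+
(the DataFrame API linked above; in Spark 2.x and later DataFrame became
Dataset<Row>). The class name, method name and HDFS path are made up for
illustration, and the job is assumed to be launched with spark-submit so the
master is set externally:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class ParquetToRdd {

    // Reads a Parquet file (or directory of part files) and exposes it as an RDD of Rows.
    public static JavaRDD<Row> readParquetAsRdd(SQLContext sqlContext, String path) {
        DataFrame df = sqlContext.read().parquet(path);
        return df.toJavaRDD();
    }

    public static void main(String[] args) {
        // Hypothetical location; replace with your own HDFS path.
        String path = "hdfs:///tmp/example.parquet";

        SparkConf conf = new SparkConf().setAppName("ParquetToRdd");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<Row> rows = readParquetAsRdd(sqlContext, path);
        System.out.println("Row count: " + rows.count());

        sc.stop();
    }
}
```

The resulting RDD holds generic Row objects, so you would still map each Row
to your own type if you need one.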


On 5 November 2015 at 21:13, swetha kasireddy <[email protected]>
wrote:

> How do I convert a Parquet file saved in HDFS to an RDD after
> reading it from HDFS?
>
> On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <[email protected]>
> wrote:
>
>> Hi,
>> We are using Avro with Snappy compression. As long as you have enough
>> partitions, saving won't be a problem, IMHO.
>> In general HDFS is pretty fast; S3 is less so.
>> The issue with storing data is that you will lose your partitioner at
>> load time, even if the RDD had one when it was saved. There is a PR that
>> tries to solve this.
>>
>>
>> On 5 November 2015 at 01:09, swetha <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> What is an efficient approach to save an RDD as a file in HDFS and
>>> retrieve it back? I was deciding between Avro, Parquet and the
>>> SequenceFile format. We currently use the SequenceFile format for one
>>> of our use cases.
>>>
>>> Any example of how to store and retrieve an RDD in the Avro and Parquet
>>> file formats would be of great help.
>>>
>>> Thanks,
>>> Swetha
>>
>
