Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

swetha kasireddy Thu, 05 Nov 2015 09:12:08 -0800

I am not looking for Spark SQL specifically. My use case is that I need to
save an RDD as a Parquet file in HDFS at the end of a batch, then load it
back and convert it into an RDD in the next batch. The RDD consists of
key/value pairs with a String key and a Long value.
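
A minimal sketch of that round trip, assuming Spark 1.4 or later with an
existing SQLContext. Parquet support in Spark lives in the Spark SQL module,
so the sketch goes through a DataFrame only for the write and the read. The
object name ParquetRoundTrip, the column names "key" and "value", and the
path argument are illustrative assumptions, not anything from an existing
codebase.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Hypothetical helper: the object name and column names are illustrative.
object ParquetRoundTrip {

  // End of batch: write the pair RDD as Parquet with two named columns.
  def save(sqlContext: SQLContext, rdd: RDD[(String, Long)], path: String): Unit = {
    import sqlContext.implicits._ // brings rdd.toDF into scope
    rdd.toDF("key", "value").write.parquet(path)
  }

  // Next batch: read the Parquet files and map each Row back to a tuple.
  def load(sqlContext: SQLContext, path: String): RDD[(String, Long)] =
    sqlContext.read.parquet(path).rdd.map { row =>
      (row.getAs[String]("key"), row.getAs[Long]("value"))
    }
}
```

By default the write fails if the target path already exists, so a job that
reuses the same path each batch would probably need
.write.mode("overwrite").parquet(path) or a fresh path per batch.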


On Wed, Nov 4, 2015 at 11:52 PM, Stefano Baghino <
[email protected]> wrote:

> What scenario would you like to optimize for? If you have something more
> specific regarding your use case, the mailing list can surely provide you
> with some very good advice.
>
> If you just want to save an RDD as Avro, you can use a module from
> Databricks (the README on GitHub
> <https://github.com/databricks/spark-avro> also gives some examples).
> Otherwise, Parquet is natively supported by Spark SQL, and the official
> documentation contains useful examples
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files>.
>
> On Thu, Nov 5, 2015 at 12:09 AM, swetha <[email protected]> wrote:
>
>> Hi,
>>
>> What is an efficient approach to save an RDD as a file in HDFS and
>> retrieve it back? I am deciding between Avro, Parquet and
>> SequenceFileFormat. We currently use SequenceFileFormat for one of our
>> use cases.
>>
>> Any examples of how to store and retrieve an RDD in the Avro and Parquet
>> file formats would be of great help.
>>
>> Thanks,
>> Swetha
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>
