bq. there is a varying number of items for that record

If the combination of items is very large, using a case class would be tedious.
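One way around that: in the map stage, emit one JSON string per record and let read.json infer the schema, so the varying items become a JSON array instead of fixed case-class fields. A minimal sketch against the Spark 1.x API (the (id, items) tuples below are hypothetical stand-ins for whatever the custom parser produces):

import org.apache.spark.sql.SQLContext

// Assumes a spark-shell style `sc` SparkContext is already in scope.
val sqlContext = new SQLContext(sc)

// Hypothetical parsed records: an id plus a varying number of items.
val parsed = sc.parallelize(Seq(
  (1L, Seq("a", "b")),
  (2L, Seq("x", "y", "z"))
))

// One JSON string per record; the item list becomes a JSON array.
val jsonRDD = parsed.map { case (id, items) =>
  val arr = items.map(i => "\"" + i + "\"").mkString("[", ",", "]")
  s"""{"id": $id, "items": $arr}"""
}

val df = sqlContext.read.json(jsonRDD)  // schema inferred, no case class needed
df.printSchema()
df.registerTempTable("records")         // queryable from Spark SQL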
On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj <bajaj.onl...@gmail.com> wrote:

> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case class representing the data. In the map stage
> you could also map the input string into an RDD of JSON values and use the
> following function to convert it into a DataFrame:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
>
> On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
>
>> We have a huge binary file in a custom serialization format (e.g. the header
>> tells the length of the record, then there is a varying number of items for
>> that record). This is produced by an old C++ application.
>> What would be the best approach to deserialize it into a Hive table or a
>> Spark RDD?
>> The format is known and well documented.
>>
>> --
>> Ruslan Dautkhanov
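For the parsing stage itself: since record length varies, sc.binaryRecords (which requires fixed-length records) won't fit, but sc.binaryFiles plus a sequential parse per file can work. A rough sketch, where the record layout (4-byte length header, 4-byte item count, 8-byte items) and the input path are assumptions to be replaced with the documented format:

import java.io.{DataInputStream, EOFException}
import org.apache.spark.input.PortableDataStream

// Assumed layout per record: 4-byte length header, 4-byte item count,
// then that many 8-byte items. Substitute the real, documented offsets.
def parseRecords(stream: PortableDataStream): Iterator[(Int, Seq[Long])] = {
  val in = new DataInputStream(stream.open())
  Iterator.continually {
    try {
      val recLen = in.readInt()                 // header: record length
      val nItems = in.readInt()                 // varying number of items
      Some((recLen, Seq.fill(nItems)(in.readLong())))
    } catch {
      case _: EOFException => in.close(); None  // end of file
    }
  }.takeWhile(_.isDefined).map(_.get)
}

// binaryFiles yields one (path, stream) pair per file; each file is
// parsed sequentially, which is fine while no single file is huge.
val records = sc.binaryFiles("/data/legacy/*.bin")   // path is a placeholder
                .flatMap { case (_, stream) => parseRecords(stream) }

From there, the JSON trick above (or toDF with an explicit schema) gets the records into a table you can query from Spark SQL or Hive.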