bq. there is a varying number of items for that record

If the combination of items is very large, using a case class would be tedious.
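One way around that: in the map stage, emit one JSON string per record and let read.json infer the schema, so the varying items become a JSON array instead of fixed case-class fields. A minimal sketch against the Spark 1.x API (the (id, items) tuples below are hypothetical stand-ins for whatever the custom parser produces):

import org.apache.spark.sql.SQLContext

// Assumes a spark-shell style `sc` SparkContext is already in scope.
val sqlContext = new SQLContext(sc)

// Hypothetical parsed records: an id plus a varying number of items.
val parsed = sc.parallelize(Seq(
  (1L, Seq("a", "b")),
  (2L, Seq("x", "y", "z"))
))

// One JSON string per record; the item list becomes a JSON array.
val jsonRDD = parsed.map { case (id, items) =>
  val arr = items.map(i => "\"" + i + "\"").mkString("[", ",", "]")
  s"""{"id": $id, "items": $arr}"""
}

val df = sqlContext.read.json(jsonRDD)  // schema inferred, no case class needed
df.printSchema()
df.registerTempTable("records")         // queryable from Spark SQL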
On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj <bajaj.onl...@gmail.com> wrote:

> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case class representing the data. In the map stage
> you could also map the input string into an RDD of JSON values and use the
> following function to convert it into a DataFrame:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
>
> On Wed, Mar 9, 2016 at 9:15 AM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
>
>> We have a huge binary file in a custom serialization format (e.g. the header
>> tells the length of the record, then there is a varying number of items for
>> that record). This is produced by an old C++ application.
>> What would be the best approach to deserialize it into a Hive table or a
>> Spark RDD?
>> The format is known and well documented.
>>
>> --
>> Ruslan Dautkhanov
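For the parsing stage itself: since record length varies, sc.binaryRecords (which requires fixed-length records) won't fit, but sc.binaryFiles plus a sequential parse per file can work. A rough sketch, where the record layout (4-byte length header, 4-byte item count, 8-byte items) and the input path are assumptions to be replaced with the documented format:

import java.io.{DataInputStream, EOFException}
import org.apache.spark.input.PortableDataStream

// Assumed layout per record: 4-byte length header, 4-byte item count,
// then that many 8-byte items. Substitute the real, documented offsets.
def parseRecords(stream: PortableDataStream): Iterator[(Int, Seq[Long])] = {
  val in = new DataInputStream(stream.open())
  Iterator.continually {
    try {
      val recLen = in.readInt()                 // header: record length
      val nItems = in.readInt()                 // varying number of items
      Some((recLen, Seq.fill(nItems)(in.readLong())))
    } catch {
      case _: EOFException => in.close(); None  // end of file
    }
  }.takeWhile(_.isDefined).map(_.get)
}

// binaryFiles yields one (path, stream) pair per file; each file is
// parsed sequentially, which is fine while no single file is huge.
val records = sc.binaryFiles("/data/legacy/*.bin")   // path is a placeholder
                .flatMap { case (_, stream) => parseRecords(stream) }

From there, the JSON trick above (or toDF with an explicit schema) gets the records into a table you can query from Spark SQL or Hive.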