Dear List,
I want to set up a DW based on Hive. However, my data does not come as handy CSV files but as binary files in a proprietary format. Each binary file consists of:

- one header of a dynamic number of bytes; its length can be determined from the header's own contents. The header also tells me how to parse the rows and how many bytes each row has.
- n rows of k bytes each, where k is defined within the header.

The solution I have in mind looks as follows:

- Write a custom InputFormat which chunks the data into blobs of length k but skips the bytes of the header. So I'd have two parameters for the InputFormat: bytes to skip and bytes per row. Do I really have to build this myself, or does something like this already exist? Worst case, I could also strip the header before pushing the data into HDFS.
- Write a custom SerDe to parse the blobs. In theory this is easy, and the coding part does not look too complicated. However, I'm struggling with how to compile and install such a SerDe. I installed Hive from source and imported it into Eclipse, so I guess I now have to set up my own project, but I'm still a little lost. Is there a tutorial which describes the process?

Is my general idea OK? I've appended rough sketches of both classes and the DDL I imagine using, in case that makes the plan clearer.

thanks in advance
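
P.S.: To make the idea more concrete, here is an untested sketch of the InputFormat I have in mind, written against the old org.apache.hadoop.mapred API since that seems to be what Hive expects. Note that the package name and the property names binary.header.bytes and binary.record.bytes are just made up by me:

package com.example;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedWidthBinaryInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Keep the first version simple: one split per file, so the
    // header only has to be skipped once per file.
    return false;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FixedWidthRecordReader((FileSplit) split, job);
  }

  static class FixedWidthRecordReader
      implements RecordReader<LongWritable, BytesWritable> {

    private final FSDataInputStream in;
    private final int recordBytes;
    private final long end;
    private long pos;

    FixedWidthRecordReader(FileSplit split, JobConf job) throws IOException {
      // My two parameters: bytes to skip, bytes per row (property names invented).
      int headerBytes = job.getInt("binary.header.bytes", 0);
      recordBytes = job.getInt("binary.record.bytes", -1);
      FileSystem fs = split.getPath().getFileSystem(job);
      in = fs.open(split.getPath());
      in.seek(headerBytes);                  // skip the header
      pos = headerBytes;
      end = split.getStart() + split.getLength();
    }

    @Override
    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (pos + recordBytes > end) {
        return false;                        // no complete row left
      }
      byte[] row = new byte[recordBytes];
      in.readFully(row);
      key.set(pos);
      value.set(row, 0, recordBytes);
      pos += recordBytes;
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return pos; }
    @Override public void close() throws IOException { in.close(); }
    @Override public float getProgress() {
      return end == 0 ? 1.0f : (float) pos / end;
    }
  }
}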
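
And here is my first stab at the SerDe, also untested; the exact method set of the org.apache.hadoop.hive.serde2 interfaces seems to vary a bit between Hive versions, so this may need adjusting. Purely for illustration I assume every row contains nothing but big-endian 4-byte ints, one per table column; the real parsing would follow whatever the header specifies:

package com.example;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;

public class FixedWidthBinarySerDe implements SerDe {

  private ObjectInspector inspector;
  private int numColumns;
  private List<Object> row;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Hive passes the table's column names in the "columns" property.
    List<String> columnNames = Arrays.asList(tbl.getProperty("columns").split(","));
    numColumns = columnNames.size();
    List<ObjectInspector> columnOIs = new ArrayList<ObjectInspector>(numColumns);
    for (int i = 0; i < numColumns; i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
    }
    inspector = ObjectInspectorFactory
        .getStandardStructObjectInspector(columnNames, columnOIs);
    row = new ArrayList<Object>(numColumns);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // One fixed-width record per call, as handed over by the InputFormat above.
    BytesWritable bw = (BytesWritable) blob;
    ByteBuffer buf = ByteBuffer.wrap(bw.getBytes(), 0, bw.getLength());
    row.clear();
    for (int i = 0; i < numColumns; i++) {
      row.add(buf.getInt());    // assumption: big-endian 4-byte int columns
    }
    return row;
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return inspector;
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return BytesWritable.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector objInspector)
      throws SerDeException {
    // Read-only table for now; writing back out is not supported in this sketch.
    throw new SerDeException("serialize() not implemented");
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;    // no stats in this sketch
  }
}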
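
Finally, from what I could piece together from the wiki, I imagine wiring it all up would look roughly like this, once both classes are compiled against the Hadoop and hive-serde jars and packaged into a jar (jar name, path, and table schema below are made up). Please correct me if this is wrong:

-- hypothetical deployment of the two classes above
ADD JAR /path/to/my-binary-serde.jar;

CREATE TABLE measurements (a INT, b INT)
ROW FORMAT SERDE 'com.example.FixedWidthBinarySerDe'
STORED AS
  INPUTFORMAT 'com.example.FixedWidthBinaryInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';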