Dear List,
I want to set up a DW based on Hive. However, my data does not come as handy CSV files but as binary files in a proprietary format. Each binary file consists of:

- one header of a dynamic number of bytes; its length can be determined from the header's own contents. The header also tells me how to parse the rows and how many bytes each row has.
- n rows of k bytes each, where k is defined within the header.

The solution I have in mind looks as follows:

- Write a custom InputFormat which chunks the data into blobs of length k but skips the bytes of the header. So I'd have two parameters for the InputFormat: bytes to skip and bytes per row. Do I really have to build this myself, or does something like this already exist? Worst case, I could also strip the header before pushing the data into HDFS.
- Write a custom SerDe to parse the blobs. In theory this is easy, and the coding part does not look too complicated. However, I'm struggling with how to compile and install such a SerDe. I installed Hive from source and imported it into Eclipse, so I guess I now have to set up my own project, but I'm still a little lost. Is there a tutorial which describes the process?

Is my general idea OK? I've appended rough sketches of both classes and the DDL I imagine using, in case that makes the plan clearer.

thanks in advance
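
P.S.: To make the idea more concrete, here is an untested sketch of the InputFormat I have in mind, written against the old org.apache.hadoop.mapred API since that seems to be what Hive expects. Note that the package name and the property names binary.header.bytes and binary.record.bytes are just made up by me:

package com.example;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedWidthBinaryInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Keep the first version simple: one split per file, so the
    // header only has to be skipped once per file.
    return false;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FixedWidthRecordReader((FileSplit) split, job);
  }

  static class FixedWidthRecordReader
      implements RecordReader<LongWritable, BytesWritable> {

    private final FSDataInputStream in;
    private final int recordBytes;
    private final long end;
    private long pos;

    FixedWidthRecordReader(FileSplit split, JobConf job) throws IOException {
      // My two parameters: bytes to skip, bytes per row (property names invented).
      int headerBytes = job.getInt("binary.header.bytes", 0);
      recordBytes = job.getInt("binary.record.bytes", -1);
      FileSystem fs = split.getPath().getFileSystem(job);
      in = fs.open(split.getPath());
      in.seek(headerBytes);                  // skip the header
      pos = headerBytes;
      end = split.getStart() + split.getLength();
    }

    @Override
    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (pos + recordBytes > end) {
        return false;                        // no complete row left
      }
      byte[] row = new byte[recordBytes];
      in.readFully(row);
      key.set(pos);
      value.set(row, 0, recordBytes);
      pos += recordBytes;
      return true;
    }

    @Override public LongWritable createKey() { return new LongWritable(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return pos; }
    @Override public void close() throws IOException { in.close(); }
    @Override public float getProgress() {
      return end == 0 ? 1.0f : (float) pos / end;
    }
  }
}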
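
And here is my first stab at the SerDe, also untested; the exact method set of the org.apache.hadoop.hive.serde2 interfaces seems to vary a bit between Hive versions, so this may need adjusting. Purely for illustration I assume every row contains nothing but big-endian 4-byte ints, one per table column; the real parsing would follow whatever the header specifies:

package com.example;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;

public class FixedWidthBinarySerDe implements SerDe {

  private ObjectInspector inspector;
  private int numColumns;
  private List<Object> row;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Hive passes the table's column names in the "columns" property.
    List<String> columnNames = Arrays.asList(tbl.getProperty("columns").split(","));
    numColumns = columnNames.size();
    List<ObjectInspector> columnOIs = new ArrayList<ObjectInspector>(numColumns);
    for (int i = 0; i < numColumns; i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
    }
    inspector = ObjectInspectorFactory
        .getStandardStructObjectInspector(columnNames, columnOIs);
    row = new ArrayList<Object>(numColumns);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // One fixed-width record per call, as handed over by the InputFormat above.
    BytesWritable bw = (BytesWritable) blob;
    ByteBuffer buf = ByteBuffer.wrap(bw.getBytes(), 0, bw.getLength());
    row.clear();
    for (int i = 0; i < numColumns; i++) {
      row.add(buf.getInt());    // assumption: big-endian 4-byte int columns
    }
    return row;
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return inspector;
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return BytesWritable.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector objInspector)
      throws SerDeException {
    // Read-only table for now; writing back out is not supported in this sketch.
    throw new SerDeException("serialize() not implemented");
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;    // no stats in this sketch
  }
}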
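
Finally, from what I could piece together from the wiki, I imagine wiring it all up would look roughly like this, once both classes are compiled against the Hadoop and hive-serde jars and packaged into a jar (jar name, path, and table schema below are made up). Please correct me if this is wrong:

-- hypothetical deployment of the two classes above
ADD JAR /path/to/my-binary-serde.jar;

CREATE TABLE measurements (a INT, b INT)
ROW FORMAT SERDE 'com.example.FixedWidthBinarySerDe'
STORED AS
  INPUTFORMAT 'com.example.FixedWidthBinaryInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';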