Hello Andrew, this does indeed look like a good idea. However, there is another problem here: this InputFormat expects that conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, recordLength); has been called, and I haven't found any way to pass such a parameter to an InputFormat from Hive. Do you have any hints on how to do it?
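In the meantime, two workarounds I'm considering (neither tested yet): I could run SET fixedlengthinputformat.record.length=9; in the Hive session before querying the table -- if I read the Hadoop javadoc correctly, that is the property behind the FIXED_RECORD_LENGTH constant, and I believe Hive copies session settings into the job configuration. Or I could put a thin subclass in front of it that hard-codes the length, so nothing has to be configured on the Hive side at all. Roughly (class name is just a placeholder):

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FixedLengthInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Rough, untested sketch: hard-code the 9-byte record length so no
    // parameter has to reach the InputFormat from the Hive side.
    public class NineByteFixedLengthInputFormat extends FixedLengthInputFormat {

        private static final int RECORD_LENGTH = 9; // 1-byte int + 8-byte int

        @Override
        public RecordReader<LongWritable, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // Set the record length in the job conf before delegating to the
            // parent class, which should pick it up from there.
            job.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, RECORD_LENGTH);
            return super.getRecordReader(split, job, reporter);
        }
    }

That still leaves the header bytes; worst case I'd keep stripping the header before pushing the files into HDFS, as mentioned below.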
Ingo

On 18 Dec 2014, at 23:40, Andrew Mains <andrew.ma...@kontagent.com> wrote:

> Hi Ingo,
>
> Take a look at
> https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FixedLengthInputFormat.html
> -- it seems to be designed for use cases very similar to yours. You may need to
> subclass it to make things work precisely the way you need (in particular, to
> deal with the header properly), but I think it ought to be a good place to
> start.
>
> Andrew
>
> On 12/18/14, 2:25 PM, Ingo Thon wrote:
>> Hi, thanks for the answer so far; however, I still think there must be an
>> easy way.
>> The file format I'm looking at is pretty simple.
>> First there is a header of n bytes, which can be ignored. After that comes
>> the data. The data consists of rows, where each row has 9 bytes:
>> first a one-byte int (0..256), then an 8-byte int (0…).
>>
>> If I understand correctly, lazy.LazySimpleSerDe should do the SerDe part.
>> Is that right? So if I say the schema is TinyInt, Int64, a row consisting
>> of 9 bytes will be correctly parsed?
>>
>> The only thing missing would then be a proper input format.
>> Ignoring the header,
>> org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat would actually do the
>> output part.
>> Any hints on how to do the input part?
>>
>> Thanks in advance!
>>
>>
>>
>> On 12 Dec 2014, at 17:02, Moore, Douglas
>> <douglas.mo...@thinkbiganalytics.com> wrote:
>>
>>> You want to look into ADD JAR and CREATE FUNCTION (for UDFs) and STORED AS
>>> 'full.class.name' for the SerDe.
>>>
>>> For tutorials, google for "adding custom serde"; I found one from
>>> Cloudera:
>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>
>>> Depending on your numbers (rows / file, bytes / file, files per time
>>> interval, #containers || map slots, mem size per slot or container),
>>> creating a split of your file may not be necessary to obtain good
>>> performance.
>>>
>>> - Douglas
>>>
>>>
>>>
>>>
>>> On 12/12/14 2:17 AM, "Ingo Thon" <ist...@gmx.de> wrote:
>>>
>>>> Dear List,
>>>>
>>>>
>>>> I want to set up a DW based on Hive. However, my data does not come as
>>>> handy csv files but as binary files in a proprietary format.
>>>>
>>>> The binary file consists of
>>>> - 1 header of a dynamic number of bytes; the length can be read from the
>>>> contents of the header. The header tells me how to parse the rows and
>>>> how many bytes each row has.
>>>> - n rows of k bytes, where k is defined within the header.
>>>>
>>>>
>>>> The solution I have in mind looks as follows:
>>>> - Write a custom InputFormat which chunks the data into blobs of length k
>>>> but skips the bytes of the header. So I'd have two parameters for the
>>>> InputFormat (bytes to skip, bytes per row).
>>>> Do I really have to build this myself, or does sth. like this already
>>>> exist? Worst case, I could also remove the header prior to pushing the
>>>> data into HDFS.
>>>> - Write a custom SerDe to parse the blobs. At least in theory easy.
>>>>
>>>> The coding part does not look too complicated; however, I'm kind of
>>>> struggling with how to compile and install such a SerDe. I installed Hive
>>>> from source and imported it into Eclipse.
>>>> I guess I now have to build my own project… Still, I'm a little bit lost.
>>>> Is there any tutorial which describes the process?
>>>> And also, is my general idea ok?
>>>>
>>>> thanks in advance
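PS, mostly thinking out loud about the SerDe side: the per-row decoding itself really is trivial. Something along these lines is what I'd expect my deserializer to boil down to (untested, class name is a placeholder, and the byte order is an assumption until I've checked the header spec):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Untested sketch of decoding one 9-byte row.
    public final class NineByteRowParser {

        // First field: a single byte, kept unsigned (0..255).
        public static int firstField(byte[] row) {
            return row[0] & 0xFF;
        }

        // Second field: the following eight bytes as a 64-bit integer
        // (big-endian assumed).
        public static long secondField(byte[] row) {
            return ByteBuffer.wrap(row, 1, 8).order(ByteOrder.BIG_ENDIAN).getLong();
        }
    }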