Hello Andrew, this does indeed look like a good idea. However, there is another problem here: this InputFormat expects that conf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, recordLength); has been called, and I haven't found any way to pass such a parameter to an InputFormat from Hive. Do you have any hints on how to do it?
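In the meantime, two workarounds I'm considering (neither tested yet): I could run SET fixedlengthinputformat.record.length=9; in the Hive session before querying the table -- if I read the Hadoop javadoc correctly, that is the property behind the FIXED_RECORD_LENGTH constant, and I believe Hive copies session settings into the job configuration. Or I could put a thin subclass in front of it that hard-codes the length, so nothing has to be configured on the Hive side at all. Roughly (class name is just a placeholder):

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FixedLengthInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Rough, untested sketch: hard-code the 9-byte record length so no
    // parameter has to reach the InputFormat from the Hive side.
    public class NineByteFixedLengthInputFormat extends FixedLengthInputFormat {

        private static final int RECORD_LENGTH = 9; // 1-byte int + 8-byte int

        @Override
        public RecordReader<LongWritable, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // Set the record length in the job conf before delegating to the
            // parent class, which should pick it up from there.
            job.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, RECORD_LENGTH);
            return super.getRecordReader(split, job, reporter);
        }
    }

That still leaves the header bytes; worst case I'd keep stripping the header before pushing the files into HDFS, as mentioned below.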
Ingo

On 18 Dec 2014, at 23:40, Andrew Mains <andrew.ma...@kontagent.com> wrote:

> Hi Ingo,
>
> Take a look at
> https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/FixedLengthInputFormat.html
> -- it seems to be designed for use cases very similar to yours. You may need to
> subclass it to make things work precisely the way you need (in particular, to
> deal with the header properly), but I think it ought to be a good place to
> start.
>
> Andrew
>
> On 12/18/14, 2:25 PM, Ingo Thon wrote:
>> Hi, thanks for the answer so far; however, I still think there must be an
>> easy way.
>> The file format I'm looking at is pretty simple.
>> First there is a header of n bytes, which can be ignored. After that comes
>> the data. The data consists of rows, where each row has 9 bytes:
>> first a one-byte int (0..256), then an 8-byte int (0…).
>>
>> If I understand correctly, lazy.LazySimpleSerDe should do the SerDe part.
>> Is that right? So if I say the schema is TinyInt, Int64, a row consisting
>> of 9 bytes will be correctly parsed?
>>
>> The only thing missing would then be a proper input format.
>> Ignoring the header,
>> org.apache.hadoop.hive.ql.io.HiveBinaryOutputFormat would actually do the
>> output part.
>> Any hints on how to do the input part?
>>
>> Thanks in advance!
>>
>>
>>
>> On 12 Dec 2014, at 17:02, Moore, Douglas
>> <douglas.mo...@thinkbiganalytics.com> wrote:
>>
>>> You want to look into ADD JAR and CREATE FUNCTION (for UDFs) and STORED AS
>>> 'full.class.name' for the SerDe.
>>>
>>> For tutorials, google for "adding custom serde"; I found one from
>>> Cloudera:
>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>
>>> Depending on your numbers (rows / file, bytes / file, files per time
>>> interval, #containers || map slots, mem size per slot or container),
>>> creating a split of your file may not be necessary to obtain good
>>> performance.
>>>
>>> - Douglas
>>>
>>>
>>>
>>>
>>> On 12/12/14 2:17 AM, "Ingo Thon" <ist...@gmx.de> wrote:
>>>
>>>> Dear List,
>>>>
>>>>
>>>> I want to set up a DW based on Hive. However, my data does not come as
>>>> handy csv files but as binary files in a proprietary format.
>>>>
>>>> The binary file consists of
>>>> - 1 header of a dynamic number of bytes; the length can be read from the
>>>> contents of the header. The header tells me how to parse the rows and
>>>> how many bytes each row has.
>>>> - n rows of k bytes, where k is defined within the header.
>>>>
>>>>
>>>> The solution I have in mind looks as follows:
>>>> - Write a custom InputFormat which chunks the data into blobs of length k
>>>> but skips the bytes of the header. So I'd have two parameters for the
>>>> InputFormat (bytes to skip, bytes per row).
>>>> Do I really have to build this myself, or does sth. like this already
>>>> exist? Worst case, I could also remove the header prior to pushing the
>>>> data into HDFS.
>>>> - Write a custom SerDe to parse the blobs. At least in theory easy.
>>>>
>>>> The coding part does not look too complicated; however, I'm kind of
>>>> struggling with how to compile and install such a SerDe. I installed Hive
>>>> from source and imported it into Eclipse.
>>>> I guess I now have to build my own project… Still, I'm a little bit lost.
>>>> Is there any tutorial which describes the process?
>>>> And also, is my general idea ok?
>>>>
>>>> thanks in advance
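PS, mostly thinking out loud about the SerDe side: the per-row decoding itself really is trivial. Something along these lines is what I'd expect my deserializer to boil down to (untested, class name is a placeholder, and the byte order is an assumption until I've checked the header spec):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Untested sketch of decoding one 9-byte row.
    public final class NineByteRowParser {

        // First field: a single byte, kept unsigned (0..255).
        public static int firstField(byte[] row) {
            return row[0] & 0xFF;
        }

        // Second field: the following eight bytes as a 64-bit integer
        // (big-endian assumed).
        public static long secondField(byte[] row) {
            return ByteBuffer.wrap(row, 1, 8).order(ByteOrder.BIG_ENDIAN).getLong();
        }
    }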