Hi Josh,

OK, I'm committed to looking into this problem as and when I get time.
I will do my best to try and raise JIRAs and submit some unit tests to 
reproduce.
Hopefully I'll be able to work on some fixes as well.

However, in the short term I would really like to know how the 
AccumuloStorageHandler would/should/will actually store ARRAYs.
You mention the HBaseStorageHandler, so that might be a guide, but I haven't 
played much with HBase, so that doesn't help me much right now.

I would assume that if we are storing an ARRAY of a fixed length type (I'm 
limiting myself to the binary representation here)
then we would end up with just the binary values stored sequentially.
So an array of INT with values 3, 23, 10 would be stored as 
\x00\x00\x00\x03\x00\x00\x00\x17\x00\x00\x00\x0a
and so on for all the other types.
This seems obvious enough, but I would like to check.
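
To make the layout I have in mind concrete, here is a small Python sketch. 
It assumes 4-byte big-endian signed integers packed back to back; that is 
my guess at the format, not something taken from the Hive code, and the 
function names are made up for illustration.

```python
import struct

def encode_int_array(values):
    """Pack each value as a 4-byte big-endian signed int, back to back.

    Assumed layout only; not confirmed against AccumuloStorageHandler.
    """
    return struct.pack(">%di" % len(values), *values)

def decode_int_array(data):
    """Split the buffer into 4-byte chunks and unpack each as an int."""
    count = len(data) // 4
    return list(struct.unpack(">%di" % count, data))

encoded = encode_int_array([3, 23, 10])
print(encoded.hex())  # 00000003000000170000000a (23 is 0x17)
print(decode_int_array(encoded))  # [3, 23, 10]
```

Because every element has the same width, no lengths or delimiters are 
needed; the element count is simply the buffer length divided by 4.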

But the real question is how is an ARRAY<STRING> stored? I can't really see a 
delimiter being used as you can't
be sure that the delimiter doesn't occur in the data. So I would assume that it 
would be something like this:

<length1><string1><length2><string2> ... <lastLength><lastString>

Is this correct?

If it is, how would the lengths be stored? As 4-byte integers, 
or with some variable-length encoding scheme?
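
To make the question concrete, here is a minimal Python sketch of the 
length-prefixed layout. It assumes 4-byte big-endian lengths and UTF-8 
strings; both are my assumptions rather than anything I have confirmed in 
the storage handler, and the function names are hypothetical.

```python
import struct

def encode_strings(strings):
    """Write <length><bytes> for each string, with a 4-byte big-endian length.

    Assumed layout only; the real handler may use something different.
    """
    out = bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        out += struct.pack(">i", len(raw))  # length counts bytes, not characters
        out += raw
    return bytes(out)

def decode_strings(data):
    """Walk the buffer, reading a length and then that many bytes each time."""
    strings = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from(">i", data, pos)
        pos += 4
        strings.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return strings

encoded = encode_strings(["a,b", ""])
print(encoded)  # b'\x00\x00\x00\x03a,b\x00\x00\x00\x00'
print(decode_strings(encoded))  # ['a,b', '']
```

With this scheme a string may contain any byte, including whatever would 
otherwise be a delimiter. A variable-length integer for the lengths would 
save a few bytes per short string at the cost of a slightly more involved 
decoder.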

I know that I'm asking a lot here, as I should probably just look at the code 
and work it out for myself,
but if you do know and could let me know I'd be grateful.

Thanks,

Z

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 13 September 2015 03:19
To: user@hive.apache.org
Subject: Re: Accumulo Storage Manager

So the binary parsing definitely seems wrong. Maybe two issues there:
one being the inline #binary not being recognized with the '*' map modifier and 
the second being the row failing to parse.

I'd have to write a test to see how the HBaseStorageHandler works and see if I 
missed something in handling all the types correctly. The 
AccumuloStorageHandler should be able to handle the same kind of types that a 
native table can handle. So, I would call ARRAYs not being serialized a bug as 
well.

Sorry you're running into this. If you could capture these in JIRA issues, that 
would make it much easier to start working through them and get them fixed.

If you have the time and desire, trying to reproduce these failures in unit 
tests would also be great :). The type handling can be a little difficult but 
there are likely some places to start in the accumulo or hbase handler tests. 
At worst, we can start by writing a qtest that will reproduce your errors using 
a full environment (Accumulo minicluster, etc).

peter.mar...@baesystems.com wrote:
> Hi Josh,
>
> At this stage I don't know whether there's anything wrong with Hive or it's 
> just user error.
> Perhaps if I go through what I have done you can see where the error lies.
> Unfortunately this is going to be wordy. Apologies in advance for the long 
> email.
>
<snip>