Hi Josh, OK, I'm committed to looking into this problem as and when I get time. I will do my best to try and raise JIRAs and submit some unit tests to reproduce. Hopefully I'll be able to work on some fixes as well.
However, in the short term I would really like to know how the AccumuloStorageHandler would/should/will actually store ARRAYs. You mention the HBaseStorageHandler, so that might be a guide, but I haven't played much with HBase, so that doesn't help me much right now.

I would assume that if we are storing an ARRAY of a fixed-length type (I'm limiting myself to the binary representation here) then we would end up with just the binary values stored sequentially. So an array of INT with values 3, 23, 10 would be stored as \x00\x00\x00\x03\x00\x00\x00\x17\x00\x00\x00\x0a, and so on for all the other types. This seems obvious enough, but I would like to check.

But the real question is: how is an ARRAY<STRING> stored? I can't really see a delimiter being used, as you can't be sure that the delimiter doesn't occur in the data. So I would assume that it would be something like this:

<length1><string1><length2><string2> ... <lastLength><lastString>

Is this correct? If it is, how would the lengths be stored? As 4-byte integers, or with some variable-length encoding scheme?

I know that I'm asking a lot here, as I should probably just look at the code and work it out for myself, but if you do know and could let me know I'd be grateful.

Thanks,

Z

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 13 September 2015 03:19
To: user@hive.apache.org
Subject: Re: Accumulo Storage Manager

So the binary parsing definitely seems wrong. Maybe two issues there: one being the inline #binary not being recognized with the '*' map modifier and the second being the row failing to parse.

I'd have to write a test to see how the HBaseStorageHandler works and see if I missed something in handling all the types correctly. The AccumuloStorageHandler should be able to handle the same kinds of types that a native table can handle. So, I would call ARRAYs not being serialized a bug as well. Sorry you're running into this.
If you could capture these in JIRA issues, that would make it much easier to start working through them and get them fixed. If you have the time and desire, trying to reproduce these failures in unit tests would also be great :). The type handling can be a little difficult, but there are likely some places to start in the accumulo or hbase handler tests. At worst, we can start by writing a qtest that will reproduce your errors using a full environment (Accumulo minicluster, etc).

peter.mar...@baesystems.com wrote:
> Hi Josh,
>
> At this stage I don't know whether there's anything wrong with Hive or it's
> just user error.
> Perhaps if I go through what I have done you can see where the error lies.
> Unfortunately this is going to be wordy. Apologies in advance for the long
> email.
> <snip>
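
For reference, the two layouts asked about in the ARRAY question at the top of this message can be sketched in a few lines of Python. This is only an illustration of the hypothesised encodings: fixed-width big-endian INTs stored back to back, and strings each prefixed by a 4-byte big-endian length. It is not taken from the AccumuloStorageHandler or HBaseStorageHandler source, and whether either handler actually uses these layouts is exactly the open question. The function names are made up for this sketch, and the 4-byte length prefix and UTF-8 encoding are assumptions.

```python
import struct


def encode_int_array(values):
    """Pack signed 32-bit ints back to back, big-endian, with no delimiter."""
    return b"".join(struct.pack(">i", v) for v in values)


def decode_int_array(data):
    """Inverse of encode_int_array; the input must be a multiple of 4 bytes."""
    if len(data) % 4 != 0:
        raise ValueError("INT array payload is not a multiple of 4 bytes")
    return [v for (v,) in struct.iter_unpack(">i", data)]


def encode_string_array(strings):
    """Hypothetical layout: <length><bytes> per element.

    The length is assumed to be a 4-byte big-endian int counting UTF-8 bytes.
    """
    out = bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        out += struct.pack(">i", len(raw))
        out += raw
    return bytes(out)


def decode_string_array(data):
    """Inverse of encode_string_array; no delimiter is needed."""
    strings = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from(">i", data, pos)
        pos += 4
        if length < 0 or pos + length > len(data):
            raise ValueError("truncated or corrupt string array payload")
        strings.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return strings
```

With this sketch, `encode_int_array([3, 23, 10])` yields the twelve bytes \x00\x00\x00\x03\x00\x00\x00\x17\x00\x00\x00\x0a, and a string containing a would-be delimiter such as "a,b" round-trips unchanged because element boundaries come from the length prefix rather than from the content.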