Accumulo Storage Manager
Hi,

I have been trying out the Hive Accumulo Storage Manager as described here https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration and it seems to work as advertised. Thanks.

However, I don't seem to be able to get any sensible results when I have a Hive column of an ARRAY type, such as ARRAY<INT> or ARRAY<STRING>. Are Hive columns of ARRAY types not supported? If they are supported, then how are they stored in Accumulo?

Also, although I can get a Hive MAP to work by using the approach where the map key is mapped to the column qualifier ("Using an asterisk in the column mapping string"), I don't seem to be able to get a Hive MAP stored successfully in any other way. Is this the only supported way to store Hive MAPs?

I know that I could customise the storage manager to achieve what I want but, in the short term at least, I would like to know what I can achieve with the Hive Accumulo Storage Manager as is.

Regards,
Z

Please consider the environment before printing this email. This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies under the control of BAE Systems Applied Intelligence Limited, details of which can be found at http://www.baesystems.com/Businesses/index.htm.
RE: Accumulo Storage Manager
Hi Josh,

At this stage I don't know whether there's anything wrong with Hive or it's just user error. Perhaps if I go through what I have done you can see where the error lies. Unfortunately this is going to be wordy. Apologies in advance for the long email.

So I created a "normal" table in HDFS with a variety of column types, like this:

CREATE TABLE employees4 (
  rowid STRING,
  flag BOOLEAN,
  number INT,
  bignum BIGINT,
  name STRING,
  salary FLOAT,
  bigsalary DOUBLE,
  numbers ARRAY<INT>,
  floats ARRAY<FLOAT>,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING,FLOAT>,
  namedNumbers MAP<STRING,INT>,
  address STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>);

And I put some data into it, and I can see the data:

hive> SELECT * FROM employees4;
OK
row1  true   100     7       John Doe    10.0  10.0  [13,23,-1,1001]  [3.14159,2.71828,-1.1,1001.0]  ["Mary Smith","Todd Jones"]  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}  {"nameOne":123,"Name Two":49,"The Third Man":-1}  {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
row2  false  7       100     Mary Smith  10.0  8.0   [13,23,-1,1001]  [3.14159,2.71828,-1.1,1001.0,1001.0]  ["Bill King"]  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}  {"nameOne":123,"Name Two":49,"The Third Man":-1}  {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
row3  false  3245    877878  Todd Jones  10.0  7.0   [13,23,-1,1001]  [3.14159,2.71828,-1.1,1001.0,2.0]  []  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}  {"nameOne":123,"Name Two":49,"The Third Man":-1}  {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
row4  true   877878  3245    Bill King   10.0  6.0   [13,23,-1,1001]  [3.14159,2.71828,-1.1,1001.0,1001.0,1001.0,1001.0]  []  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}  {"nameOne":123,"Name Two":49,"The Third Man":-1}  {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
Time taken: 0.535 seconds, Fetched: 4 row(s)

Everything looks fine.
Now I create a Hive table stored in Accumulo:

DROP TABLE IF EXISTS accumulo_table4;
CREATE TABLE accumulo_table4 (
  rowid STRING,
  flag BOOLEAN,
  number INT,
  bignum BIGINT,
  name STRING,
  salary FLOAT,
  bigsalary DOUBLE,
  numbers ARRAY<INT>,
  floats ARRAY<FLOAT>,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING,FLOAT>,
  namednumbers MAP<STRING,INT>,
  address STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES('accumulo.columns.mapping' = ':rowid,person:flag#binary,person:number#binary,person:bignum#binary,person:name,person:salary#binary,person:bigsalary#binary,person:numbers#binary,person:floats,person:subordinates,deductions:*,namednumbers:*,person:address');

(Note that I am only really interested in storing the values in "binary".)

Now I can load the Accumulo table from the normal table:

INSERT OVERWRITE TABLE accumulo_table4 SELECT * FROM employees4;

And I can query the data from the Accumulo table:

hive> SELECT * FROM accumulo_table4;
OK
row1  true   100     7       John Doe    10.0  10.0  [null]  [null]  ["Mary Smith\u0003Todd Jones"]  {"Federal Taxes":0.2,"Insurance":0.1,"State Taxes":0.05}  {"Name Two":49,"The Third Man":-1,"nameOne":123}  {"street":"1 Michigan Ave.\u0003Chicago\u0003IL\u000360600","city":null,"state":null,"zip":null}
row2  false  7       100     Mary Smith  10.0  8.0   [null]  [null]  ["Bill King"]  {"Federal Taxes":0.2,"Insurance":0.1,"State Taxes":0.05}  {"Name Two":49,"The Third Man":-1,"nameOne":123}  {"street":"100 Ontario St.\u0003Chicago\u0003IL\u000360601","city":null,"state":null,"zip":null}
row3  false  3245    877878  Todd Jones  10.0  7.0   [null]  [null]  []  {"Federal Taxes":0.15,"Insurance":0.1,"State Taxes":0.03}  {"Name Two":49,"The Third Man":-1,"nameOne":123}  {"street":"200 Chicago Ave.\u0003Oak Park\u0003IL\u000360700","city":null,"state":null,"zip":null}
row4  true   877878  3245    Bill King   10.0  6.0   [null]  [null]  []  {"Federal Taxes":0.15,"Insurance":0.1,"State Taxes":0.03}  {"Name Two":49,"The Third Man":-1,"nameOne":123}  {"street":"300 Obscure Dr.\u0003Obscuria\u0003IL\u000360100","city":null,"state":null,"zip":null}
Time taken: 0.109 seconds, Fetched: 4 row(s)

Notice that the columns with type ARRAY<INT> and ARRAY<FLOAT> come back as [null]. I assume that this means that something is wrong and the Hive storage handler is returning a null?

When I use the accumulo shell to look at the data stored in Accumulo:

root@accumulo> scan -t accumulo_table4
row1 deductions:Federal Taxes []    0.2
row1 deductions:Insurance []    0.1
row1 deducti
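The stray \u0003 characters in the output above look like Hive's default LazySimpleSerDe-style nested separators leaking through (\x01 between fields, \x02 between collection items, \x03 one level deeper, e.g. between a map key and its value). As a rough illustration only, with a hypothetical helper that is not part of Hive:

```python
def split_nested(raw, level):
    """Split serialized text on the default Hive separator for a nesting level.

    Level 1 = fields (\x01), level 2 = collection items (\x02),
    level 3 = the next nesting level down (\x03).
    """
    return raw.split(chr(level))

# The broken rows above show array items joined with \x03 (one level
# deeper than the \x02 you would expect for collection items):
print(split_nested("Mary Smith\x03Todd Jones", 3))  # ['Mary Smith', 'Todd Jones']
```

This is consistent with the items being serialized at the wrong nesting level rather than being lost outright.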
RE: Accumulo Storage Manager
Hi Josh,

OK, I'm committed to looking into this problem as and when I get time. I will do my best to raise JIRAs and submit some unit tests to reproduce the problems. Hopefully I'll be able to work on some fixes as well.

However, in the short term I would really like to know how the AccumuloStorageHandler would/should/will actually store ARRAYs. You mention the HBaseStorageHandler, so that might be a guide, but I haven't played much with HBase, so that doesn't help me much in the short term.

I would assume that if we are storing an ARRAY of a fixed-length type (I'm limiting myself to the binary representation here) then we would end up with just the binary values stored sequentially. So an ARRAY<INT> with values 3, 23, 10 would be stored as \x00\x00\x00\x03\x00\x00\x00\x17\x00\x00\x00\x0a, and so on for all the other types. This seems obvious enough, but I would like to check.

But the real question is: how is an ARRAY of a variable-length type stored? I can't really see a delimiter being used, as you can't be sure that the delimiter doesn't occur in the data. So I would assume that it would be something like this: ... Is this correct? If it is, how would the lengths be stored? As 4-byte integers? Or some variable-length encoding scheme?

I know that I'm asking a lot here, as I should probably just look at the code and work it out for myself, but if you do know and could let me know I'd be grateful.

Thanks,
Z

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 13 September 2015 03:19
To: user@hive.apache.org
Subject: Re: Accumulo Storage Manager

So the binary parsing definitely seems wrong. Maybe two issues there: one being the inline #binary not being recognized with the '*' map modifier, and the second being the row failing to parse. I'd have to write a test to see how the HBaseStorageHandler works and see if I missed something in handling all the types correctly. The AccumuloStorageHandler should be able to handle the same kinds of types that a native table can handle.
So, I would call ARRAYs not being serialized a bug as well. Sorry you're running into this.

If you could capture these in JIRA issues, that would make it really easy to start working through them and get them fixed. If you have the time and desire, trying to reproduce these failures in unit tests would also be great :). The type handling can be a little difficult, but there are likely some places to start in the Accumulo or HBase handler tests. At worst, we can start by writing a qtest that will reproduce your errors using a full environment (Accumulo minicluster, etc.).

peter.mar...@baesystems.com wrote:
> Hi Josh,
>
> At this stage I don't know whether there's anything wrong with Hive or it's just user error.
> Perhaps if I go through what I have done you can see where the error lies.
> Unfortunately this is going to be wordy. Apologies in advance for the long email.
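For reference, the two layouts speculated about in the message above can be sketched as follows. This is only an illustration of the guessed-at schemes, not what the AccumuloStorageHandler actually does; both helper functions are hypothetical:

```python
import struct

def encode_fixed_int_array(values):
    # Fixed-width layout guessed above: concatenate each value's
    # 4-byte big-endian encoding, with no delimiters.
    return b"".join(struct.pack(">i", v) for v in values)

def encode_length_prefixed(items):
    # One possible variable-length layout: prefix each item with its
    # byte length as a 4-byte big-endian integer, avoiding any in-band
    # delimiter that could collide with the data itself.
    out = bytearray()
    for item in items:
        data = item.encode("utf-8")
        out += struct.pack(">i", len(data))
        out += data
    return bytes(out)

# 3, 23, 10 -> \x00\x00\x00\x03 \x00\x00\x00\x17 \x00\x00\x00\x0a
print(encode_fixed_int_array([3, 23, 10]).hex())  # 00000003000000170000000a
```

The length prefix could equally be a variable-length encoding (such as the vint scheme Hadoop uses elsewhere); which one Hive actually uses is exactly the question posed above.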
Table statistics
Hi,

I was wondering if there is any "recognized" way to obtain table statistics. Ideally, given a Key range, I would like to know the number of distinct rowids, the number of entries, and the amount of data (in bytes) in that range.

I assume that Accumulo holds at least some of this information internally, partly because I can see some of it through the monitor, and partly because it must know something about the quantity of data held in order to be able to implement the table split threshold. In my case the tables are very static, so the "estimates" that the monitor has are likely to be sufficiently accurate for my purposes.

I have found this link http://apache-accumulo.1065345.n5.nabble.com/Determining-tablets-assigned-to-table-splits-and-the-number-of-rows-in-each-tablet-td11546.html which describes a process (which I haven't tried yet) to get the number of entries in a range. That would probably be sufficient for me and would certainly be a good start. However, it seems to use internal data structures and unpublished APIs, which is less than ideal, and it seems to be written against Accumulo version 1.6, whereas I'm using Accumulo 1.7. Is there anything better that I can do, or is this the recommended way to go?

Regards,
Z
FW: Table statistics
Sorry, wrong list.

Z

From: peter.mar...@baesystems.com [mailto:peter.mar...@baesystems.com]
Sent: 15 December 2015 09:39
To: user@hive.apache.org
Subject: Table statistics
Stored By
Hi,

So I am using the AccumuloStorageHandler to access Accumulo tables from Hive, and it works fine. Typically I would use something like this:

CREATE EXTERNAL TABLE test_text (
  rowid STRING,
  testint INT,
  testbig BIGINT,
  testfloat FLOAT,
  testdouble DOUBLE,
  teststring STRING,
  testbool BOOLEAN)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES(
  'accumulo.table.name' = 'test_table_text',
  'accumulo.columns.mapping' = ':rowid,testint:v,testbig:v,testfloat:v,testdouble:v,teststring:v,testbool:v');

Now, for various reasons, I am planning to have my own InputFormat. I don't want to start from scratch, so I plan to derive my class from the existing HiveAccumuloTableInputFormat and pick up a lot of functionality for free.

It was my understanding that "STORED BY" is a sort of shorthand that saves the user from having to specify the input format, output format, and so on explicitly. Given that I eventually want to use my own input format class, in the short term I just want to ensure that I can create a Hive table that uses Accumulo while specifying the input format explicitly. I've looked at the source of AccumuloStorageHandler and I can see which input format and output format it returns.
So my best guess at creating the same table as above, but without using "STORED BY", is as follows:

CREATE EXTERNAL TABLE test_text2 (
  rowid STRING,
  testint INT,
  testbig BIGINT,
  testfloat FLOAT,
  testdouble DOUBLE,
  teststring STRING,
  testbool BOOLEAN)
ROW FORMAT SERDE 'org.apache.hadoop.hive.accumulo.serde.AccumuloSerDe'
WITH SERDEPROPERTIES(
  'accumulo.table.name' = 'test_table_text',
  'accumulo.columns.mapping' = ':rowid,testint:v,testbig:v,testfloat:v,testdouble:v,teststring:v,testbool:v')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat';

This fails with:

FAILED: SemanticException [Error 10055]: Output Format must implement HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or SequenceFileOutputFormat

Which seems plausible, because 'org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat' really doesn't seem to implement HiveOutputFormat. However, this raises the question: how can the storage handler get away with it if I can't?

So, before I go off and implement my own storage handler class as well as my own input format class, can anyone tell me whether I am doing something silly, or is there some other way around this problem?

Regards,
Z
RE: Stored By
Hi Gabriel,

Yep, that's a good suggestion. That is what I ended up doing, and it seemed to work fine. Many thanks for replying, and apologies for not responding earlier.

Z

From: Gabriel Balan [mailto:gabriel.ba...@oracle.com]
Sent: 28 January 2016 23:27
To: user@hive.apache.org
Subject: Re: Stored By

Hi,

Why not write your own storage handler extending AccumuloStorageHandler and overriding getInputFormatClass() to return your HiveAccumuloTableInputFormat subclass?

hth
Gabriel Balan