Accumulo Storage Manager

2015-09-08 Thread peter.mar...@baesystems.com
Hi,

I have been trying out the Hive Accumulo Manager as described here 
https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration
and it seems to work as advertised. Thanks.

However I don't seem to be able to get any sensible results when I have a Hive 
column of an ARRAY type, like ARRAY<INT> or ARRAY<STRING>.
Are Hive columns of ARRAY types not supported? If they are, how are they stored 
in Accumulo?

Also, although I can get a Hive MAP to work by using the approach 
where the map key is mapped to the column qualifier
("Using an asterisk in the column mapping string") I don't seem to be able to 
get a Hive MAP stored successfully in any other way.
Is this the only supported way to store Hive MAPs?
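To make concrete what the asterisk mapping does, here is a small sketch of my mental model of it: each entry of the Hive MAP becomes its own Accumulo cell, with the map key used as the column qualifier. The function and tuple representation below are my own illustration, not code from the AccumuloStorageHandler:

```python
# Illustrative model of a "deductions:*" column mapping: each entry of a
# Hive MAP becomes one Accumulo (row, family, qualifier, value) cell, with
# the map key as the column qualifier. A sketch of the idea only, not the
# storage handler's actual implementation.
def map_to_accumulo_cells(row_id, family, hive_map):
    """Turn a Hive MAP value into Accumulo cells under one column family."""
    return [(row_id, family, key, str(value)) for key, value in hive_map.items()]

cells = map_to_accumulo_cells("row1", "deductions",
                              {"Federal Taxes": 0.2, "State Taxes": 0.05})
for row, fam, qual, val in cells:
    print(f"{row} {fam}:{qual} -> {val}")
```

This matches the shape of the `scan` output further down the thread, where each deduction appears as its own `deductions:<key>` entry.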

I know that I could customise the storage manager to achieve what I want but, 
in the short-term at least, I would like
to know what I can achieve with the Hive Accumulo Storage Manager as is.

Regards,

Z
Please consider the environment before printing this email. This message should 
be regarded as confidential. If you have received this email in error please 
notify the sender and destroy it immediately. Statements of intent shall only 
become binding when confirmed in hard copy by an authorised signatory. The 
contents of this email may relate to dealings with other companies under the 
control of BAE Systems Applied Intelligence Limited, details of which can be 
found at http://www.baesystems.com/Businesses/index.htm.


RE: Accumulo Storage Manager

2015-09-10 Thread peter.mar...@baesystems.com
Hi Josh,

At this stage I don't know whether there's anything wrong with Hive or it's 
just user error.
Perhaps if I go through what I have done you can see where the error lies.
Unfortunately this is going to be wordy. Apologies in advance for the long 
email.

So I created a "normal" table in HDFS with a variety of column types like this:

CREATE TABLE employees4 (
 rowid STRING,
 flag BOOLEAN,
 number INT,
 bignum BIGINT,
 name STRING,
 salary FLOAT,
 bigsalary DOUBLE,
 numbers ARRAY<INT>,
 floats ARRAY<FLOAT>,
 subordinates ARRAY<STRING>,
 deductions MAP<STRING, FLOAT>,
 namedNumbers MAP<STRING, INT>,
 address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);

And I put some data into it and I can see the data:

hive> SELECT * FROM employees4;
OK
row1	true	100	7	John Doe	10.0	10.0	[13,23,-1,1001]	[3.14159,2.71828,-1.1,1001.0]	["Mary Smith","Todd Jones"]	{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}	{"nameOne":123,"Name Two":49,"The Third Man":-1}	{"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
row2	false	7	100	Mary Smith	10.0	8.0	[13,23,-1,1001]	[3.14159,2.71828,-1.1,1001.0,1001.0]	["Bill King"]	{"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}	{"nameOne":123,"Name Two":49,"The Third Man":-1}	{"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
row3	false	3245	877878	Todd Jones	10.0	7.0	[13,23,-1,1001]	[3.14159,2.71828,-1.1,1001.0,2.0]	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"nameOne":123,"Name Two":49,"The Third Man":-1}	{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
row4	true	877878	3245	Bill King	10.0	6.0	[13,23,-1,1001]	[3.14159,2.71828,-1.1,1001.0,1001.0,1001.0,1001.0]	[]	{"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}	{"nameOne":123,"Name Two":49,"The Third Man":-1}	{"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
Time taken: 0.535 seconds, Fetched: 4 row(s)

Everything looks fine.
Now I create a Hive table stored in Accumulo:

DROP TABLE IF EXISTS accumulo_table4;
CREATE TABLE accumulo_table4 (
 rowid STRING,
 flag BOOLEAN,
 number INT,
 bignum BIGINT,
 name STRING,
 salary FLOAT,
 bigsalary DOUBLE,
 numbers ARRAY<INT>,
 floats ARRAY<FLOAT>,
 subordinates ARRAY<STRING>,
 deductions MAP<STRING, FLOAT>,
 namednumbers MAP<STRING, INT>,
 address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES('accumulo.columns.mapping' = 
':rowid,person:flag#binary,person:number#binary,person:bignum#binary,person:name,person:salary#binary,person:bigsalary#binary,person:numbers#binary,person:floats,person:subordinates,deductions:*,namednumbers:*,person:address');

(Note that I am only really interested in storing the values in "binary".)
Now I can load the Accumulo table from the normal table:

INSERT OVERWRITE TABLE accumulo_table4 SELECT * FROM employees4;

And I can query the data from the Accumulo table.

hive> SELECT * FROM accumulo_table4;
OK
row1	true	100	7	John Doe	10.0	10.0	[null]	[null]	["Mary Smith\u0003Todd Jones"]	{"Federal Taxes":0.2,"Insurance":0.1,"State Taxes":0.05}	{"Name Two":49,"The Third Man":-1,"nameOne":123}	{"street":"1 Michigan Ave.\u0003Chicago\u0003IL\u000360600","city":null,"state":null,"zip":null}
row2	false	7	100	Mary Smith	10.0	8.0	[null]	[null]	["Bill King"]	{"Federal Taxes":0.2,"Insurance":0.1,"State Taxes":0.05}	{"Name Two":49,"The Third Man":-1,"nameOne":123}	{"street":"100 Ontario St.\u0003Chicago\u0003IL\u000360601","city":null,"state":null,"zip":null}
row3	false	3245	877878	Todd Jones	10.0	7.0	[null]	[null]	[]	{"Federal Taxes":0.15,"Insurance":0.1,"State Taxes":0.03}	{"Name Two":49,"The Third Man":-1,"nameOne":123}	{"street":"200 Chicago Ave.\u0003Oak Park\u0003IL\u000360700","city":null,"state":null,"zip":null}
row4	true	877878	3245	Bill King	10.0	6.0	[null]	[null]	[]	{"Federal Taxes":0.15,"Insurance":0.1,"State Taxes":0.03}	{"Name Two":49,"The Third Man":-1,"nameOne":123}	{"street":"300 Obscure Dr.\u0003Obscuria\u0003IL\u000360100","city":null,"state":null,"zip":null}
Time taken: 0.109 seconds, Fetched: 4 row(s)

Notice that the columns with type ARRAY<INT> and ARRAY<FLOAT> come back empty 
(just [null]).
I assume that this means that there is something wrong and the Hive Storage 
Handler is returning a null?
When I use the accumulo shell to look at the data stored in Accumulo

root@accumulo> scan -t accumulo_table4
row1 deductions:Federal Taxes []0.2
row1 deductions:Insurance []0.1
row1 deducti
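As an aside, the \u0003 characters in the struct and array values above look like one of Hive's default nested-collection delimiters leaking through as plain text; splitting on that character recovers the individual fields. This is a guess from the output, not a statement about how the serde is meant to behave:

```python
# The query output above shows struct fields and array elements joined by
# \u0003. Splitting on that character recovers the fields. This decoding is
# inferred from the visible output, not taken from the serde's code.
SEP = "\u0003"

def split_struct(serialized, field_names):
    """Split a \u0003-joined struct string back into named fields."""
    return dict(zip(field_names, serialized.split(SEP)))

record = split_struct("1 Michigan Ave.\u0003Chicago\u0003IL\u000360600",
                      ["street", "city", "state", "zip"])
print(record)
# {'street': '1 Michigan Ave.', 'city': 'Chicago', 'state': 'IL', 'zip': '60600'}
```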

RE: Accumulo Storage Manager

2015-09-21 Thread peter.mar...@baesystems.com
Hi Josh,

OK, I'm committed to looking into this problem as and when I get time.
I will do my best to try and raise JIRAs and submit some unit tests to 
reproduce.
Hopefully I'll be able to work on some fixes as well.

However, in the short-term I would really like to know how the 
AccumuloStorageHandler would/should/will actually store ARRAYs.
You mention the HBaseStorageHandler, so that might be a guide, but I haven't 
played much with HBase and so that doesn't help
me much in the short term.

I would assume that if we are storing an ARRAY of a fixed-length type (I'm 
limiting myself to the binary representation here)
then we would end up with just the binary values stored sequentially.
So an array of INT with values 3, 23, 10 would be stored as 
\x00\x00\x00\x03\x00\x00\x00\x17\x00\x00\x00\x0a,
and so on for all the other types.
This seems obvious enough, but I would like to check.
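The assumption above can be written down as a few lines of code, just to pin down the byte layout I am expecting (4-byte big-endian integers, concatenated; purely my assumption, not the handler's confirmed behaviour):

```python
import struct

# Assumed encoding for an ARRAY<INT> stored in binary: each element as a
# 4-byte big-endian integer, concatenated with no separators. This is the
# layout I am guessing at, written out so it can be checked.
def encode_int_array(values):
    return b"".join(struct.pack(">i", v) for v in values)

encoded = encode_int_array([3, 23, 10])
print(encoded.hex())  # 00000003000000170000000a
```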

But the real question is how is an ARRAY<STRING> stored? I can't really see a 
delimiter being used, as you can't
be sure that the delimiter doesn't occur in the data. So I would assume that it 
would be something like this:

<length><value><length><value> ...

Is this correct?

If it is, how would the lengths be stored? As 4-byte integers?
Or some variable-length encoding scheme?
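To make the question concrete, here is a sketch of the length-prefixed layout I am imagining, with 4-byte big-endian lengths. It is purely hypothetical; the real AccumuloStorageHandler may well do something different:

```python
import struct

# Hypothetical length-prefixed layout for an ARRAY<STRING>:
# <4-byte length><bytes><4-byte length><bytes>...
# No delimiter is needed, so element bytes may contain anything.
def encode_string_array(strings):
    out = b""
    for s in strings:
        data = s.encode("utf-8")
        out += struct.pack(">i", len(data)) + data
    return out

def decode_string_array(blob):
    items, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">i", blob, pos)
        pos += 4
        items.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return items

roundtrip = decode_string_array(encode_string_array(["Mary Smith", "Todd Jones"]))
print(roundtrip)  # ['Mary Smith', 'Todd Jones']
```

The appeal of this scheme over delimiters is exactly the point raised above: the element bytes never need escaping.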

I know that I'm asking a lot here, as I should probably just look at the code 
and work it out for myself,
but if you do know and could let me know I'd be grateful.

Thanks,

Z

-Original Message-
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 13 September 2015 03:19
To: user@hive.apache.org
Subject: Re: Accumulo Storage Manager

So the binary parsing definitely seems wrong. Maybe two issues there:
one being the inline #binary not being recognized with the '*' map modifier and 
the second being the row failing to parse.

I'd have to write a test to see how the HBaseStorageHandler works and see if I 
missed something in handling all the types correctly. The 
AccumuloStorageHandler should be able to handle the same kind of types that a 
native table can handle. So, I would call ARRAYs not being serialized a bug as 
well.

Sorry you're running into this. If you could capture these in JIRA issues, that 
would make it really easy to start working through them and get them fixed.

If you have the time and desire, trying to reproduce these failures in unit 
tests would also be great :). The type handling can be a little difficult but 
there are likely some places to start in the accumulo or hbase handler tests. 
At worst, we can start by writing a qtest that will reproduce your errors using 
a full environment (Accumulo minicluster, etc).

peter.mar...@baesystems.com wrote:
> Hi Josh,
>
> At this stage I don't know whether there's anything wrong with Hive or it's 
> just user error.
> Perhaps if I go through what I have done you can see where the error lies.
> Unfortunately this is going to be wordy. Apologies in advance for the long 
> email.
>



Table statistics

2015-12-15 Thread peter.mar...@baesystems.com
Hi,

I was wondering if there is any "recognized" way to obtain table statistics.
Ideally, given a Key range I would like to know the number of distinct rowids, 
entries and amount of data (in bytes) in that key range.
I assume that Accumulo holds at least some of this information internally, 
partly because I can see some of this
through the monitor, and partly because it must know something about the 
quantity of data held in order to be able
to implement the table threshold.

In my case the tables are very static and so the "estimates" that the monitor 
has are likely to be sufficiently accurate for my purposes.

I have found this link
http://apache-accumulo.1065345.n5.nabble.com/Determining-tablets-assigned-to-table-splits-and-the-number-of-rows-in-each-tablet-td11546.html
which describes a process (which I haven't tried yet) to get the number of 
entries in a range.
Which would probably be sufficient for me and would certainly be a good start.
However it seems to be using internal data structures and non-published APIs, 
which is less than ideal.
And it seems to be written against Accumulo version 1.6.

I'm using Accumulo 1.7. Is there anything better that I can do, or is it 
recommended that this is the way to go?
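In the meantime, the client-side fallback is simply to scan the key range and count. The sketch below shows the arithmetic using stand-in (row, column, value) tuples rather than real Accumulo Key/Value objects; with the Java client the same loop would run over a Scanner with a Range set:

```python
# Sketch of the statistics wanted, computed client-side from a scan.
# Entries here are stand-in (row, column, value) tuples, not real Accumulo
# Key/Value objects; the counting logic is the point, not the client API.
def range_stats(entries):
    rows = set()
    entry_count = 0
    byte_count = 0
    for row, column, value in entries:
        rows.add(row)             # distinct rowids
        entry_count += 1          # total entries in the range
        byte_count += len(row) + len(column) + len(value)  # approximate data size
    return {"distinct_rows": len(rows), "entries": entry_count, "bytes": byte_count}

sample = [("row1", "deductions:Insurance", "0.1"),
          ("row1", "deductions:State Taxes", "0.05"),
          ("row2", "deductions:Insurance", "0.1")]
print(range_stats(sample))
```

For large, static tables this full scan is exactly the cost the monitor's internal estimates would avoid, which is the trade-off behind the question.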

Regards,

Z


FW: Table statistics

2015-12-15 Thread peter.mar...@baesystems.com
Sorry, wrong list.
Z

From: peter.mar...@baesystems.com [mailto:peter.mar...@baesystems.com]
Sent: 15 December 2015 09:39
To: user@hive.apache.org
Subject: Table statistics



Stored By

2016-01-21 Thread peter.mar...@baesystems.com
Hi,

So I am using the AccumuloStorageHandler to allow me to access Accumulo tables 
from Hive.
This works fine. So typically I would use something like this:

CREATE EXTERNAL TABLE test_text (rowid STRING, testint INT, testbig BIGINT, 
testfloat FLOAT, testdouble DOUBLE, teststring STRING, testbool BOOLEAN)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH 
SERDEPROPERTIES('accumulo.table.name'='test_table_text','accumulo.columns.mapping'
 = 
':rowid,testint:v,testbig:v,testfloat:v,testdouble:v,teststring:v,testbool:v');

Now for many reasons I am planning to have my own InputFormat.
I don't want to start from scratch so I plan to have my class derive from the 
existing class HiveAccumuloTableInputFormat and pick up a lot of functionality 
for free.

Now it was my understanding that "STORED BY" was a sort of shorthand that 
saved the user having to specify the input format, output format and so on 
explicitly.
Given that I want, eventually, to use my own input format class, in the 
short-term I just want to check that I can create a Hive table that uses 
Accumulo while specifying the input format explicitly.
I've looked at the source of AccumuloStorageHandler and I can see what 
inputformat and outputformat it returns.
So my best guess at creating the same table as above, but without using "STORED 
BY" is as follows:

CREATE EXTERNAL TABLE test_text2 (rowid STRING, testint INT, testbig BIGINT, 
testfloat FLOAT, testdouble DOUBLE, teststring STRING, testbool BOOLEAN)
ROW FORMAT SERDE 'org.apache.hadoop.hive.accumulo.serde.AccumuloSerDe'
WITH 
SERDEPROPERTIES('accumulo.table.name'='test_table_text','accumulo.columns.mapping'
 = 
':rowid,testint:v,testbig:v,testfloat:v,testdouble:v,teststring:v,testbool:v')
STORED AS INPUTFORMAT 
'org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat';

This fails with:

FAILED: SemanticException [Error 10055]: Output Format must implement 
HiveOutputFormat, otherwise it should be either IgnoreKeyTextOutputFormat or 
SequenceFileOutputFormat

Which seems plausible, because 
'org.apache.hadoop.hive.accumulo.mr.HiveAccumuloTableOutputFormat' really 
doesn't seem to implement HiveOutputFormat.
However this raises the question: how can the storage handler get away with it 
if I can't?

So, before I go off and implement my own storage handler class as well as my 
own inputformat class, can anyone tell me if I am doing something silly
or is there some other way around this problem?

Regards,

Z


RE: Stored By

2016-02-16 Thread peter.mar...@baesystems.com
Hi Gabriel,

Yep, that's a good suggestion.
That is what I ended up doing and it seemed to work fine.
Many thanks for replying.
Apologies for not responding earlier.

Z

From: Gabriel Balan [mailto:gabriel.ba...@oracle.com]
Sent: 28 January 2016 23:27
To: user@hive.apache.org
Subject: Re: Stored By

Hi

Why not write your own storage handler extending AccumuloStorageHandler and 
overriding getInputFormatClass() to return your HiveAccumuloTableInputFormat 
subclass?

hth
Gabriel Balan
On 1/21/2016 10:46 AM, peter.mar...@baesystems.com wrote:



--

The statements and opinions expressed here are my own and do not necessarily 
represent those of Oracle Corporation.
