Re: RowID design and Hive push down

Josh Elser Mon, 14 Sep 2015 13:09:18 -0700

I'm not positive what you mean by the "in-built RowID push down mechanism won't work with unsigned bytes". Are you saying that you're trying to change your current rowID structure to a unixTime+logicalSplit+hash structure? And you're trying to evaluate the 3 listed requirements against the new form?

First off, Java's integer primitives (byte, short, int, long) are signed, so you're going to be limited by that. Don't forget that.
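To make the signedness point concrete, here is a minimal sketch of carrying a 4-byte unsigned timestamp in Java. It assumes the row bytes are compared lexicographically as unsigned values, which is how Accumulo orders keys as far as I know. The class and method names are made up for illustration and are not part of Hive or Accumulo.

```java
// Sketch only: class and method names are hypothetical, not part of Hive or Accumulo.
final class UnsignedTimestamp {

    /** Encodes 0..2^32-1 as 4 big-endian bytes, so byte order follows numeric order. */
    static byte[] encode(long unixSeconds) {
        if (unixSeconds < 0 || unixSeconds > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("out of unsigned 32-bit range: " + unixSeconds);
        }
        return new byte[] {
            (byte) (unixSeconds >>> 24),
            (byte) (unixSeconds >>> 16),
            (byte) (unixSeconds >>> 8),
            (byte) unixSeconds
        };
    }

    /** Decodes 4 big-endian bytes into a long; the masks undo Java's sign extension. */
    static long decode(byte[] b) {
        return ((b[0] & 0xFFL) << 24) | ((b[1] & 0xFFL) << 16)
             | ((b[2] & 0xFFL) << 8) | (b[3] & 0xFFL);
    }

    /** Lexicographic comparison that treats every byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] low = encode(0x7FFFFFFFL);
        byte[] high = encode(0x80000000L);
        System.out.println(low[0] < high[0]);               // false: as signed bytes, 127 > -128
        System.out.println(compareUnsigned(low, high) < 0); // true: unsigned order is correct
        System.out.println(decode(high));                   // 2147483648
    }
}
```

The value itself has to live in a long (or be masked on every read), because an int would turn anything at or above 2^31 negative.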

Have you seen accumulo.composite.rowid from https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration. Hypothetically, you can provide some logic which will do custom parsing on your row and generate a struct from the components in your row ID.


Of interest might be:

https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/java/org/apache/hadoop/hive/accumulo/serde/AccumuloRowSerializer.java

https://github.com/apache/hive/blob/release-1.2.1/accumulo-handler/src/test/org/apache/hadoop/hive/accumulo/serde/TestAccumuloRowSerializer.java

You could extend the AccumuloRowSerializer to parse the bytes of the rowId according to your own spec. I haven't explicitly tried this myself, but in theory, I think your problems are meant to be solved by this support. It will take a little bit of effort. Hive's LazyObject type system is not my favorite framework to work with. Referencing some of the HBaseStorageHandler code might also be worthwhile (as the two are very similar).
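As a rough idea of the parsing such custom logic would need, here is a plain-Java sketch that splits a 10-byte rowId into unixTime, logicalSplit and hash. It deliberately does not touch the Hive classes: CompactRowId and its fields are hypothetical names, big-endian byte order is an assumption, and wiring this into AccumuloRowSerializer or a composite rowid is left out because I haven't tried it.

```java
// Sketch only: CompactRowId is a hypothetical holder, not a Hive or Accumulo class.
final class CompactRowId {
    static final int LENGTH = 10;

    final long unixTime;     // unsigned 32-bit value, held in a long
    final int logicalSplit;  // unsigned 16-bit value, held in an int
    final int hash;          // raw 32 bits; the sign is irrelevant for a hash

    CompactRowId(long unixTime, int logicalSplit, int hash) {
        this.unixTime = unixTime;
        this.logicalSplit = logicalSplit;
        this.hash = hash;
    }

    /** Parses {unixTime:4}{logicalSplit:2}{hash:4}, assuming big-endian fields. */
    static CompactRowId parse(byte[] row) {
        if (row.length != LENGTH) {
            throw new IllegalArgumentException("expected " + LENGTH + " bytes, got " + row.length);
        }
        long unixTime = ((row[0] & 0xFFL) << 24) | ((row[1] & 0xFFL) << 16)
                      | ((row[2] & 0xFFL) << 8) | (row[3] & 0xFFL);
        int logicalSplit = ((row[4] & 0xFF) << 8) | (row[5] & 0xFF);
        int hash = ((row[6] & 0xFF) << 24) | ((row[7] & 0xFF) << 16)
                 | ((row[8] & 0xFF) << 8) | (row[9] & 0xFF);
        return new CompactRowId(unixTime, logicalSplit, hash);
    }

    public static void main(String[] args) {
        byte[] row = {0, 0, 0, 1, 0, 2, 0, 0, 0, 3};
        CompactRowId id = parse(row);
        System.out.println(id.unixTime + " " + id.logicalSplit + " " + id.hash); // 1 2 3
    }
}
```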


- Josh

[email protected] wrote:

Hi there,

Our current rowid format is yyyyMMdd_payload_sha256(raw data). It works
nicely as we have a date and uniqueness guaranteed by hash, however
unfortunately, rowid is around 50-60 bytes per record.

Requirements are the following:

1) Support Hive on top of Accumulo for ad-hoc queries

2) Query original table by date range (e.g. rowID >= '20060101' AND rowID
 < '20060103') both in code and Hive

3) Additional queries by ~20 different fields

Requirement 3) requires secondary indexes and of course because each
RowID is 50-60 bytes, they become super massive (99% of overall space)
and really expensive to store.

What we are looking to do is to reduce the RowID to a fixed size:
{unixTime}{logicalSplit}{hash}, where unixTime is a 4-byte unsigned
integer, logicalSplit is a 2-byte unsigned integer, and hash is 4 bytes,
for 10 bytes overall.
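For illustration, the 10-byte layout and the row bounds for a date range could be built roughly as follows. This is a sketch under assumptions: fields are big-endian, days start at midnight UTC, and CompactRowIdBuilder with its methods is a made-up name. How the two bounds are handed to a scan or to Hive is not shown.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch only: CompactRowIdBuilder is a hypothetical helper, not an existing API.
final class CompactRowIdBuilder {

    /** Builds {unixTime:4}{logicalSplit:2}{hash:4} with big-endian fields. */
    static byte[] rowId(long unixTime, int logicalSplit, int hash) {
        if (unixTime < 0 || unixTime > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("unixTime out of unsigned 32-bit range");
        }
        if (logicalSplit < 0 || logicalSplit > 0xFFFF) {
            throw new IllegalArgumentException("logicalSplit out of unsigned 16-bit range");
        }
        return new byte[] {
            (byte) (unixTime >>> 24), (byte) (unixTime >>> 16),
            (byte) (unixTime >>> 8), (byte) unixTime,
            (byte) (logicalSplit >>> 8), (byte) logicalSplit,
            (byte) (hash >>> 24), (byte) (hash >>> 16), (byte) (hash >>> 8), (byte) hash
        };
    }

    /** 4-byte row prefix for midnight UTC of a yyyyMMdd date (UTC is an assumption). */
    static byte[] dayPrefix(String yyyyMMdd) {
        long t = LocalDate.parse(yyyyMMdd, DateTimeFormatter.BASIC_ISO_DATE)
                          .atStartOfDay(ZoneOffset.UTC)
                          .toEpochSecond();
        if (t < 0 || t > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("date outside unsigned 32-bit epoch range");
        }
        return new byte[] {(byte) (t >>> 24), (byte) (t >>> 16), (byte) (t >>> 8), (byte) t};
    }

    public static void main(String[] args) {
        // Bounds for rowID >= 20060101 AND rowID < 20060103:
        byte[] startInclusive = dayPrefix("20060101");
        byte[] endExclusive = dayPrefix("20060103");
        System.out.println(startInclusive.length + " " + endExclusive.length); // 4 4
    }
}
```

Because a shorter byte string sorts before any longer string that starts with it, scanning from the start prefix (inclusive) to the end prefix (exclusive) under unsigned lexicographic ordering should cover exactly the rows whose timestamp falls in the range.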

What is unclear to me is how the second requirement can be met in Hive, as
to my understanding the in-built RowID push down mechanism won't work with
unsigned bytes.

Regards,

Roman

