Yes! Very astute, Jamie :)

For the wikisearch schemas, the general idea is that the inverted index tables can prune your row space for some terms. This way, you can know the exact rows you have to search in the sharded table to get good parallelism without a full-table scan.
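
As a rough illustration of that pruning step, here is a minimal in-memory sketch. It does not use the Accumulo API: a plain map stands in for the inverted index table, and the class name, method names, and row names such as "bin_07" are all made up for the example. The point is only that intersecting the index entries for each query term leaves the set of shard rows worth scanning.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class IndexPruningSketch {
    // Hypothetical stand-in for an inverted index table:
    // term -> shard rows (bins) that contain at least one document with that term.
    static final Map<String, TreeSet<String>> INDEX = new HashMap<>();

    // Record that a term occurs in the given shard rows.
    static void index(String term, String... shardRows) {
        INDEX.computeIfAbsent(term, t -> new TreeSet<>()).addAll(Arrays.asList(shardRows));
    }

    // Shard rows that contain every query term. Only these rows need to be
    // scanned in the sharded table; all other rows are pruned up front.
    static TreeSet<String> candidateRows(List<String> terms) {
        TreeSet<String> result = null;
        for (String term : terms) {
            TreeSet<String> rows = INDEX.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(rows);   // first term seeds the candidate set
            } else {
                result.retainAll(rows);         // later terms narrow it (AND semantics)
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        index("accumulo", "bin_03", "bin_07", "bin_11");
        index("iterator", "bin_07", "bin_11", "bin_12");
        // Prints [bin_07, bin_11]
        System.out.println(candidateRows(Arrays.asList("accumulo", "iterator")));
    }
}
```

In a real deployment the surviving rows would become the scan ranges for the sharded table, so each one can be searched in parallel.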

Jamie Johnson wrote:
Thanks guys. I was also looking at some of the examples and saw the
event store. I like the idea of including time as a prefix to the
binning to limit the number of servers that need to be hit for
time-bound queries. Without something like this, queries end up having
to hit all tablets, right? It's not always a full table scan, since the
iterators can bail on a row partway through, but it still needs to hit
every row to some extent, right?
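
A minimal sketch of the time-prefixed binning idea, under several assumptions: the row ID layout "yyyyMMdd_bin", the bin count of 4, and all class and method names are invented for illustration and are not taken from the event store example. The bin is a hash of the document ID mod the number of bins, as discussed further down the thread. CRC32 from the JDK is used here only to keep the sketch dependency-free; it is a stand-in for the murmur3 hash suggested in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

class TimeBinnedRowSketch {
    // Hypothetical bin count; in practice this is tuned to the cluster size.
    static final int NUM_BINS = 4;

    // Day-granularity prefix in UTC (an assumption; any sortable time bucket works).
    static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    // Bin = hash(document ID) mod number of bins.
    // CRC32 is a stand-in for murmur3 so the sketch needs no extra libraries.
    static int bin(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_BINS); // getValue() is non-negative
    }

    // Row ID with the time bucket as a prefix, e.g. "20160206_2".
    static String rowId(Instant eventTime, String docId) {
        return DAY.format(eventTime) + "_" + bin(docId);
    }

    // A query bounded to one day only has to visit that day's NUM_BINS rows,
    // rather than every bin in the table.
    static List<String> rowsForDay(Instant day) {
        List<String> rows = new ArrayList<>();
        for (int b = 0; b < NUM_BINS; b++) {
            rows.add(DAY.format(day) + "_" + b);
        }
        return rows;
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2016-02-06T14:20:00Z");
        System.out.println(rowId(t, "doc-42"));
        System.out.println(rowsForDay(t));
    }
}
```

The trade-off in this layout is that writes for the current time bucket concentrate on NUM_BINS rows, so the bin count still has to be large enough to spread ingest across servers.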

I was also looking at the wiki example but wasn't able to find a good
description of how all the tables are used. Does anything more exist?

On Feb 6, 2016 2:20 PM, "Josh Elser" wrote:

    You can get *really* fancy if you have lots of ingesters and lots of
    servers: include some attribute in the data you're hashing to
    control how many servers a given client will need to write to for
    some batch of documents. This is probably overkill for most setups,
    though.

    Guava provides a decent murmur3 implementation which will be much
    faster than your run-of-the-mill MD5 for generating the hash (which
    you'll mod by the max number of bins).

    William Slacum wrote:

        Often it'll be a hash of the document mod the number of bins you're
        using. The hash should be "good" in the sense that it uniquely
        identifies the document. It can be as simple as some unique field in
        the document or just a hash (like murmur) of the whole document.

        On Saturday, February 6, 2016, Jamie Johnson wrote:

             Just found this excellent write-up that explains a bit:

             https://www.slideshare.net/mobile/acordova00/text-indexing-in-accumulo

             On Feb 6, 2016 8:52 AM, "Jamie Johnson" wrote:

                 Reading the examples for table design, I've come across a
                 question associated with the document-partitioned index:
                 specifically, what is typically chosen as the BinId, or maybe
                 more appropriately, what factors should influence what is
                 chosen as the BinId, and what impact do they have?
