Hi Robert,

wow! Thanks for the great advice! I'll digest all of this over the
coming few days and will read through your slagg library. Fast
numerical aggregation is another problem that I know some projects
would benefit from solving. I used to do a lot of MDX work, so it will
be interesting to see how it compares.

The tip about keeping names short and using integer ids rather than
string key names where possible definitely seems wise:

from google.appengine.ext import db

print db.Key.from_path('Article', 12)
print db.Key.from_path('Article', '12')
print db.Key.from_path('Article', 4294967296)
print db.Key.from_path('Article', '4294967296')
print db.Key.from_path('I', 1)
print db.Key.from_path('I', 1 << 32)

agZjaHVybnJyDQsSB0FydGljbGUYDAw
agZjaHVybnJyDwsSB0FydGljbGUiAjEyDA
agZjaHVybnJyEQsSB0FydGljbGUYgICAgBAM
agZjaHVybnJyFwsSB0FydGljbGUiCjQyOTQ5NjcyOTYM
agZjaHVybnJyBwsSAUkYAQw
agZjaHVybnJyCwsSAUkYgICAgBAM


proves the point! It's a shame there is a six-character minimum limit
on the app name, as I'm sure that would shave a few gigabytes off the
index size. I wonder if anyone knows how to minimise these keys any
further?
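Measuring the encoded keys printed above makes the saving concrete (this is plain Python over the strings themselves, nothing App Engine specific):

```python
# Encoded keys as printed above, in the same order as the from_path calls.
keys = [
    'agZjaHVybnJyDQsSB0FydGljbGUYDAw',                # Article, id 12
    'agZjaHVybnJyDwsSB0FydGljbGUiAjEyDA',             # Article, name '12'
    'agZjaHVybnJyEQsSB0FydGljbGUYgICAgBAM',           # Article, id 2**32
    'agZjaHVybnJyFwsSB0FydGljbGUiCjQyOTQ5NjcyOTYM',   # Article, name '4294967296'
    'agZjaHVybnJyBwsSAUkYAQw',                        # I, id 1
    'agZjaHVybnJyCwsSAUkYgICAgBAM',                   # I, id 2**32
]
for k in keys:
    print(len(k), k)
```

The short kind with a small integer id ('I', 1) comes out eight characters shorter than the same id under 'Article', and each string name is longer than its equivalent integer id.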

If I go with this schema:

class I(db.Model):
    v=db.UIntListProperty(required=True)

which makes use of this recipe:

http://appengine-cookbook.appspot.com/attachment/?id=ahJhcHBlbmdpbmUtY29va2Jvb2tyEQsSCkF0dGFjaG1lbnQY2x0M

It's quite possible that the key size will be the biggest user of space!
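I haven't dug into the recipe yet, so this is only my guess at what a UIntListProperty does under the hood: delta-encode the sorted ids, then pack each delta as a base-128 varint. The function names here are mine, not the recipe's:

```python
def encode_varint(n):
    """Pack one unsigned int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)  # High bit set: more bytes follow.
        else:
            out.append(byte)
            return bytes(out)

def compress_ints(values):
    """Delta-encode a sorted list of ints, then varint-pack the deltas."""
    out = bytearray()
    prev = 0
    for v in sorted(values):
        out.extend(encode_varint(v - prev))
        prev = v
    return bytes(out)

def decompress_ints(data):
    """Inverse of compress_ints: unpack varints, then undo the deltas."""
    values, cur, shift, prev = [], 0, 0, 0
    for byte in bytearray(data):
        cur |= (byte & 0x7f) << shift
        shift += 7
        if not byte & 0x80:  # High bit clear: last byte of this varint.
            prev += cur
            values.append(prev)
            cur, shift = 0, 0
    return values
```

Round-tripping a list like [1, 5, 12, 99999] gets it back sorted and intact, and near-consecutive ids pack into a byte or two each.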

Well, I'll give it a go, have a look at Appstats, and will report
back with an answer...

Cheers,
Donovan

On 18 December 2010 22:55, Robert Kluin <[email protected]> wrote:
> Hey Donovan,
>   Some inline comments follow.
>
>
> On Sat, Dec 18, 2010 at 15:06, Donovan <[email protected]> wrote:
>> Hi Robert,
>>
>> thank you very much for the well considered and thorough reply! Much
>> appreciated!
>>
>> The likely number of hashes per document is roughly the length of the
>> document, minus duplicate hashes and minus a fixed number (typically
>> 15), so roughly 3500. There will very likely be a combination of the
>> same hashes in a batch import of documents, and there will also be
>> probabilistic hash collisions due to the nature of the hash algorithm.
>> These are fine, however.
>
> You'll probably need to add an additional step or two to handle
> processing 3,500 hashes per document.  You might need to try a few
> different solutions to figure out what is best for your use case.
> Perhaps you can take advantage of the new long-running (10 minute)
> tasks and just loop a few times, inserting batches of a couple hundred
> AddToIndex entities for each document.  Another option is to simply
> use a per-document hash list as a queue, and keep processing it until
> the 'queue' is empty.  You could probably run creation tasks in
> parallel with some sharding and speed up document processing a bit.
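The per-document 'hash list as a queue' option could be sketched like this; process_document_hashes and handle_batch are hypothetical names, and the actual datastore/taskqueue calls are left as comments:

```python
BATCH_SIZE = 200  # Hashes written per batch; tune against RPC limits.

def process_document_hashes(pending, handle_batch):
    """Drain a per-document hash 'queue' in fixed-size batches.

    pending: list of hashes still to be indexed for one document.
    handle_batch: callback that writes one batch, e.g. a db.put of a
    couple hundred AddToIndex entities.
    Returns the number of batches processed.
    """
    batches = 0
    while pending:
        batch, pending = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
        handle_batch(batch)
        batches += 1
        # In a real task you'd checkpoint `pending` back onto the
        # document (or re-enqueue a continuation task) here, so a retry
        # resumes where it left off instead of starting over.
    return batches
```

At 3,500 hashes per document that is 18 batches of at most 200, which fits comfortably in a long-running task.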
>
>
>>
>> The documents will arrive in batches of 1000. Your tips on task queues
>> describe exactly what I'm working on this minute!!! Your example code
>> is very useful.
>>
> The solution I suggested handles multiple documents with the same
> hash at once quite well.  In fact, that would (potentially) make it
> more efficient, since it can add multiple documents to an index at
> once.
>
>
>> I think the two most useful tips I can glean from your proposed
>> solution are to use the hash as the key name and the use of
>> Model.get_by_key_name. I did not know that this could accept a list of
>> keys, thus only using one query. This is fantastic! Do you know if
>> there are any limits on the length of the list? I thought perhaps the
>> limit of 30 would cause problems when using an IN query, but maybe
>> this does not apply to key names.
>
> I do not remember there being an explicit limit on the number you can
> fetch by name.  I've been able to fetch a few hundred without any
> major issues.  You'll probably want to explore ways of breaking the
> fetch up into smaller batches of a few hundred using async datastore
> calls though.  Just do some tests and benchmarks to find out what will
> be best.  You're not going to want to use IN for this; use keys
> wherever possible -- it will be significantly faster.  Also, make sure
> you use Appstats to help you minimize the number of (serial) RPC calls
> you are making.
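Breaking the fetch into batches might look roughly like this; `chunks` and `fetch_indexes` are my names for the sketch, and the per-batch call shown is the ordinary synchronous one (swap in whichever async mechanism you settle on):

```python
def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_indexes(hashes, fetch_batch, size=250):
    """Fetch Index entities for all hashes, `size` key names at a time.

    fetch_batch would be e.g. Index.get_by_key_name; batching keeps any
    single RPC down to a few hundred keys.
    """
    results = []
    for batch in chunks(hashes, size):
        results.extend(fetch_batch(batch))
    return results
```

With async calls you would issue the per-batch fetches concurrently instead of in this serial loop; the chunking stays the same.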
>
>
>>
>> I don't need to preserve the document hashes, as all the queries will
>> come from a separate corpus, and they will be hashed at query time.
>>
>> One question I still have is how to choose a suitable hash width. 4.5
>> billion hashes presumes a hash width of 32 bits, but the overhead of
>> the metadata and indexes for that many entities could be unhelpful. I
>> can reduce the hash width to anything between 24 and 32 bits to
>> minimize the overhead. Do you have any thoughts on this?
>
> To reduce the metadata size, mark as many things as indexed=False as
> possible, particularly on ListProperties.  Also, use short names for
> everything -- entities and properties -- names like 'D' instead of
> 'Document'.  In some of the examples I simply used the encoded key;
> instead you should use the key name / id, which will be shorter and
> save you some more space.
>
> As far as the hash size goes, I guess it depends on what you're doing
> and the relative cost of a false match.  Will the increase in false
> positives outweigh the cost of an extra couple gigs of data per month?
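Back-of-envelope, with the numbers from the original post (4.5 billion keys at ~8 document ids each) and assuming uniformly distributed hashes, the width trade-off looks like this:

```python
TOTAL_REFS = 4.5e9 * 8  # ~36 billion (hash, doc id) pairs, per the original post.

for bits in (24, 28, 32):
    buckets = 2 ** bits
    avg_ids_per_bucket = TOTAL_REFS / buckets
    print('%d-bit hash: %d buckets, ~%.0f ids per bucket' %
          (bits, buckets, avg_ids_per_bucket))
```

Every 4 bits shaved off the hash multiplies the average posting-list length -- and with it the false-match rate -- by 16, so the storage saved on entity count gets traded for longer value blobs and more spurious matches.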
>
>
> In my implementation, I used the Aggregator from my slagg
> (https://bitbucket.org/thebobert/slagg/src) lib; I'll probably upload
> an example of a similar use-case sometime this weekend.  For what
> you're doing you would probably want to customize the 'first steps' in
> the process, but I suspect it would work for you with relatively few
> modifications.
>
> You might also want to check out the new Pipeline API; it looks like a
> robust tool for building workflows.  I am not sure about the overhead
> / setup time as I have not played with it yet.
>
>
> Robert
>
>
>>
>> Thanks again for your great answer!
>>
>> Cheers,
>> Donovan.
>>
>>
>> On Dec 18, 7:39 pm, Robert Kluin <[email protected]> wrote:
>>> Hi Donovan,
>>>   It sounds to me like the hash should be used as the key.  I would
>>> probably use a structure something like this:
>>>
>>> class Document(db.Model):
>>>     data = db.BlobProperty()  # Original 'raw' document data.
>>>     hashes = db.StringListProperty(indexed=False)  # Hash values for this document.
>>>
>>> class AddToIndex(db.Model):
>>>     """Set key_name to document_id"""
>>>     index = db.StringProperty(required=True) # hash key
>>>
>>> class Index(db.Model):
>>>     """Set key_name to your hash value."""
>>>     index = db.StringListProperty(required=True)
>>>
>>> I am assuming the documents are small enough to fit into a
>>> BlobProperty; if not, you can use the blobstore to achieve a similar
>>> result.  You also do not tell us how many hashes there will typically
>>> be per document, and that number is important.  Are you likely to get
>>> multiple documents at one time that have overlapping hashes?  That
>>> might impact the design too.  Also, you don't really tell us how you
>>> get these documents, or exactly what you will be doing with them (as
>>> in, how is the data presented?).
>>>
>>> So I'll outline one possible general process -- I'm sure it could be
>>> significantly improved for your use case.
>>>
>>> 1) A document *somehow* appears for the system to process.  Compute
>>> the hash values of the content, and store the doc entity.
>>>
>>>     # If computing the hashes is a slow process, split it into more steps.
>>>     hashes = compute_hashes(doc_data)
>>>     doc_key = str(Document(data=doc_data, hashes=hashes).put())
>>>     # One AddToIndex entity per (document, hash) pair.  A composite
>>>     # key_name keeps them unique and makes retries idempotent.  (They
>>>     # can't share a transaction with the document, since each root
>>>     # entity is its own entity group; batch this put if the hash
>>>     # count is in the thousands.)
>>>     db.put([AddToIndex(key_name='%s|%s' % (doc_key, h), index=h)
>>>             for h in hashes])
>>>
>>> 2) In a processing task, find documents that need to be added to indexes.
>>>
>>>     items_to_add = AddToIndex.all().fetch(50)
>>>     seen_indexes = set()
>>>     for item in items_to_add:
>>>         if item.index not in seen_indexes:
>>>             insert_index_builder_tasks(item.index)
>>>             seen_indexes.add(item.index)
>>>
>>>    Your index_builder_task should insert named tasks to prevent fork
>>> bombs.  You might use some modulus of the current time in
>>> conjunction with a (memcache-based) count for the hash; that way, if
>>> you do hit the same hash five minutes apart it can still be
>>> processed, but if the same initial processing task runs three times
>>> it won't fork.
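The named-task idea boils down to giving every builder task a deterministic name per (hash, time window). A minimal sketch -- the name scheme and window length here are illustrative, not a prescription:

```python
import time

WINDOW = 300  # Seconds; tasks for the same hash within one window collapse.

def builder_task_name(index_hash, now=None):
    """Deterministic task name: same hash + same time window -> same name.

    Adding a task whose name was already used raises the task queue's
    TaskAlreadyExistsError, which the caller simply swallows -- that is
    what stops the fork bomb.
    """
    if now is None:
        now = time.time()
    return 'idx-%s-%d' % (index_hash, int(now) // WINDOW)
```

So two processing tasks that see the same hash seconds apart produce the same name and only one builder runs, while the same hash five-plus minutes later rolls into a new window and can be processed again.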
>>>
>>> 3) The index builder only needs the hash to do its thing:
>>>
>>>     index_hash = self.request.get('index_hash')
>>>     work = AddToIndex.all(keys_only=True).filter(
>>>         'index =', index_hash).fetch(200)
>>>     # Each AddToIndex key_name starts with the document key.
>>>     doc_keys = [k.name().split('|')[0] for k in work]
>>>     def txn():
>>>         index = Index.get_by_key_name(index_hash)
>>>         if index is None:
>>>             index = Index(key_name=index_hash, index=doc_keys)
>>>         else:
>>>             index.index.extend(doc_keys)  # Add new docs to index
>>>         index.index = list(set(index.index))  # Remove any dupes
>>>         index.put()
>>>     db.run_in_transaction(txn)
>>>     db.delete(work)
>>>
>>> 4) Given a document, finding other documents with similar content is
>>> pretty easy, and only needs one _fetch_!
>>>
>>>    document = Document.get(document_key)
>>>    indexes = Index.get_by_key_name(document.hashes)
>>>
>>>    Now you can do whatever you need with the list of matching
>>> documents in the index lists!
>>>
>>> I implemented something very similar to this process a couple weeks
>>> ago; so far it seems to be working quite well for me.  It uses a lot
>>> of tasks, but they are small and very fast.
>>>
>>> Robert
>>>
>>> On Sat, Dec 18, 2010 at 08:20, Donovan Hide <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > I have a custom index of a large amount of content that works by
>>> > creating a 32-bit hash for sections of text. Each document id is
>>> > stored against this hash, and lookups involve hashing the input and
>>> > retrieving the matching ids. Currently I use node.js to serve the
>>> > index and hadoop to generate it. However, this is an expensive
>>> > operation in terms of processing, and requires an SSD drive for
>>> > decent serving performance. The scale of the index is as follows:
>>> >
>>> > - Up to 4.5 billion keys
>>> > - An average of 8 document ids per key, delta-encoded and then
>>> >   variable-integer encoded
>>> > - Lookups on average involve retrieving values for 3500 keys
>>> >
>>> > Having read the datastore docs it seems like this could be a
>>> > possible schema:
>>> > from google.appengine.ext import db
>>> > class Index(db.Model):
>>> >     hash=db.IntegerProperty(required=True)
>>> >     values=db.BlobProperty(required=True)
>>> > I would be grateful if anyone could give me some advice or tips on
>>> > how this might perform on AppEngine in terms of query performance,
>>> > cost and minimizing metadata/index overhead. It sounds like
>>> > 4.5 billion * metadata storage could be the killer.
>>> >
>>> > Cheers,
>>> > Donovan
>>>
>>> > --
>>> > You received this message because you are subscribed to the Google Groups
>>> > "Google App Engine" group.
>>> > To post to this group, send email to [email protected].
>>> > To unsubscribe from this group, send email to
>>> > [email protected].
>>> > For more options, visit this group at
>>> > http://groups.google.com/group/google-appengine?hl=en.
>>
>>
>>
>
>
>

