Hey Donovan,
   Some inline comments follow.

On Sat, Dec 18, 2010 at 15:06, Donovan <[email protected]> wrote:
> Hi Robert,
>
> thank you very much for the well considered and thorough reply! Much
> appreciated!
>
> The likely number of hashes per document is roughly the length of the
> document, minus duplicate hashes and minus a fixed number (typically
> 15), so roughly 3500. There will be very likely a combination of the
> same hashes in a batch import of documents and there will also be
> probabilistic hash collisions due to the nature of the hash algorithm.
> These are fine, however.

You'll probably need to add an additional step or two to handle
processing 3,500 hashes per document.  You might need to just try a few
different solutions to figure out what is best for your use-case.
Perhaps you can take advantage of the new long-running (10 minute)
tasks and just loop a few times, inserting batches of a couple hundred
AddToIndex entities for each document.  Another option is to simply
use a per-document hash-list as a queue, and keep processing it until
the 'queue' is empty.  You could probably run creation tasks in
parallel with some sharding and speed up document processing a bit.
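The per-document 'queue' loop can be sketched independent of the
datastore; here `process` stands in for whatever inserts one batch of
AddToIndex entities, and the names and batch size are illustrative:

```python
def drain_hash_queue(hashes, batch_size=200, process=lambda batch: None):
    """Consume a per-document hash list in fixed-size batches until empty.

    `process` would insert the AddToIndex entities for one batch; looping
    like this inside a long-running (10 minute) task covers ~3,500 hashes.
    """
    queue = list(hashes)  # Copy so the caller's list is untouched.
    batches = 0
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        process(batch)
        batches += 1
    return batches
```

With 3,500 hashes and batches of 200 that is 18 iterations, which should
fit comfortably inside a single long-running task.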


>
> The documents will arrive in batches of 1000. Your tips on task queues
> describe exactly what I'm working on this minute!!! Your example code
> is very useful.
>
The solution I suggested handles multiple documents with the same
hash at once well.  In fact, that can make it more efficient, since it
can add several documents to an index in a single write.
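Grouping a batch of incoming documents by hash before writing shows why
that helps; a pure-Python sketch (the function and names are mine, not
from the thread):

```python
from collections import defaultdict

def group_docs_by_hash(docs):
    """Invert {doc_key: [hashes]} into {hash: [doc_keys]}.

    Each resulting entry can be applied to its Index entity in one
    write, instead of one write per document per hash.
    """
    updates = defaultdict(list)
    for doc_key, hashes in docs.items():
        for h in hashes:
            updates[h].append(doc_key)
    return dict(updates)
```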


> I think the two most useful tips I can glean from your proposed
> solution are to use the hash as the key name and the use of
> Model.get_by_key_name. I did not know that this could accept a list of
> keys, thus only using one query. This is fantastic! Do you know if
> there are any limits on the length of the list? I thought perhaps the
> limit of 30 would cause problems when using an IN query, but maybe
> this does not apply to key names.

I do not remember there being an explicit limit on the number of
entities you can fetch by key name; I've been able to fetch a few
hundred without any major issues.  You'll probably want to explore ways
of breaking the fetch up into smaller batches of a few hundred using
async datastore calls, though.  Just run some tests and benchmarks to
find out what works best.  You're not going to want to use IN for this
-- use keys anywhere possible, it will be significantly faster.  Also,
make sure you use Appstats to help you minimize the number of (serial)
RPC calls you are making.
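Splitting the key list into batches is straightforward; a sketch, where
the batch size of 500 is a guess you would want to benchmark:

```python
def chunk_keys(keys, size=500):
    """Split a long list of key names into fixed-size batches.

    Each batch can then go to Model.get_by_key_name() -- or, on SDKs
    with the async datastore API, be kicked off in parallel.
    """
    return [keys[i:i + size] for i in range(0, len(keys), size)]
```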


>
> I don't need to preserve the document hashes as all the queries will
> come from a separate corpora, and they will be hashed at query time.
>
> One question I still have is to choose a suitable hash width. 4.5
> billion hashes presumes a hash width of 32 bits, but the overhead of
> the metadata and indexes for that many entities could be unhelpful. I
> can reduce the hash width to anything between 24 and 32 bits to
> minimize the overhead. Do you have any thoughts on this?

To reduce the metadata size, mark as many things as indexed=False as
possible, particularly on ListProperties.  Also, use short names for
everything -- entities and properties -- names like 'D' instead of
'Document'.  And in some of the examples I simply used the encoded
key; instead you should use the key name / id, which is shorter and
will save you some more space.

As far as the hash size goes, I guess it depends on what you're doing
and the relative cost of a false match.  Will the cost of more false
positives outweigh the savings from storing a couple fewer gigs of
data per month?
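One way to put numbers on that trade-off: assuming the stored hashes are
uniform over the 2**bits space, the chance that a probe hash matches at
least one of the n stored hashes is roughly 1 - e**(-n / 2**bits).  A
quick illustrative calculation (my numbers, not measurements from your
corpus):

```python
import math

def false_hit_rate(n_hashes, bits):
    """Approximate probability that a random probe hash collides with
    at least one of n_hashes stored hashes, assuming uniform hashing."""
    return 1.0 - math.exp(-n_hashes / 2.0 ** bits)

# With 4.5 billion keys, 32 bits already gives a substantial false-hit
# rate, and 24 bits makes a false hit all but certain.
rate_32 = false_hit_rate(4.5e9, 32)
rate_24 = false_hit_rate(4.5e9, 24)
```

So narrowing the hash below 32 bits mostly trades storage for precision;
whether that is acceptable depends on how cheaply false matches can be
filtered downstream.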


In my implementation, I used the Aggregator from my slagg lib
(https://bitbucket.org/thebobert/slagg/src); I'll probably upload an
example of a similar use-case sometime this weekend.  For what you're
doing you would probably want to customize the 'first steps' in the
process, but I suspect it would work for you with relatively few
modifications.

You might also want to check out the new Pipeline API; it looks like a
robust tool for building workflows.  I am not sure about the overhead
/ setup time, as I have not played with it yet.


Robert


>
> Thanks again for your great answer!
>
> Cheers,
> Donovan.
>
>
> On Dec 18, 7:39 pm, Robert Kluin <[email protected]> wrote:
>> Hi Donovan,
>>   It sounds to me like the hash should be used as the key.  I would
>> probably use a structure something like this:
>>
>> class Document(db.Model):
>>     data = db.BlobProperty()  # Original 'raw' document data.
>>     # List of hash values for this document.
>>     hashes = db.StringListProperty(indexed=False)
>>
>> class AddToIndex(db.Model):
>>     """Child of a Document; one entity per hash awaiting indexing."""
>>     index = db.StringProperty(required=True)  # hash key
>>
>> class Index(db.Model):
>>     """Set key_name to your hash value."""
>>     index = db.StringListProperty(required=True)
>>
>> I am assuming the documents are small enough to fit into a
>> BlobProperty; if not, you can use the blobstore to achieve a similar
>> result.  You also do not tell us how many hashes there will typically
>> be per document, and that number matters.  Are you likely to get
>> multiple documents at one time that have overlapping hashes?  That
>> might impact the design too.  Also, you don't really tell us how you
>> get these documents, or exactly what you will be doing with them (as
>> in how the data is presented).
>>
>> So I'll outline one possible general process -- I'm sure it could be
>> significantly improved for your use case.
>>
>> 1) A document *somehow* appears for the system to process.  Compute
>> the hash values of the content, and store the doc entity.
>>
>>     # If computing the hashes is a slow process, split it into more steps.
>>     hashes = compute_hashes(doc_data)
>>     def txn():
>>         doc = Document(data=doc_data, hashes=hashes)
>>         doc.put()
>>         # Make the AddToIndex entities children of the doc so the
>>         # whole batch stays in one entity group for the transaction,
>>         # and so each (document, hash) pair gets a unique key.
>>         db.put([AddToIndex(parent=doc, key_name=h, index=h)
>>                 for h in hashes])
>>     db.run_in_transaction(txn)
>>
>> 2) In a processing task, find documents that need to be added to
>> indexes.
>>
>>     items_to_add = AddToIndex.all().fetch(50)
>>     seen_indexes = set()
>>     for item in items_to_add:
>>         if item.index not in seen_indexes:
>>             insert_index_builder_tasks(item.index)
>>             seen_indexes.add(item.index)
>>
>>    Your index_builder_task should insert named tasks to prevent fork
>> bombs.  So you might use some modulus of the current time in
>> conjunction with a (memcache-based) count for the hash; that way, if
>> you do hit the same hash five minutes apart it can still process, but
>> if the same initial processing task runs three times it won't fork.
>>
>> 3) The index builder only needs the hash to do its thing:
>>
>>     index_hash = self.request.get('index_hash')
>>     work = AddToIndex.all(keys_only=True).filter('index', index_hash).fetch(200)
>>     def txn():
>>         index = Index.get_by_key_name(index_hash)
>>         if index is None:  # First time this hash has been seen.
>>             index = Index(key_name=index_hash, index=[])
>>         index.index.extend([str(doc) for doc in work])  # Add new docs.
>>         index.index = list(set(index.index))  # Remove any dupes.
>>         index.put()
>>     db.run_in_transaction(txn)
>>     db.delete(work)
>>
>> 4) Given a document, finding other documents with similar content is
>> pretty easy, and only needs one _fetch_!
>>
>>     document = Document.get(document_key)
>>     # Entries may be None for hashes with no Index entity yet.
>>     indexes = Index.get_by_key_name(document.hashes)
>>
>>    Now you can do whatever you need with the list of matching
>> documents in the index lists!
>>
>> I implemented something very similar to this process a couple weeks
>> ago; so far it seems to be working quite well for me.  It uses a lot
>> of tasks, but they are small and very fast.
>>
>> Robert
>>
>> On Sat, Dec 18, 2010 at 08:20, Donovan Hide <[email protected]> wrote:
>> > Hi,
>> > I have a custom index of a large amount of content that works by creating a
>> > 32 bit hash for sections of text. Each document id is stored against this
>> > hash and lookups involve hashing the input and retrieving the matching ids.
>> > Currently I use node.js to serve the index and hadoop to generate it.
>> > However this is an expensive operation in terms of processing and requires
>> > an SSD drive for decent serving performance. The scale of the index is as
>> > follows:
>> > Up to 4.5 billion keys
>> > An average of 8 document ids per key, delta-encoded and then variable
>> > integer encoded.
>> > Lookups on average involve retrieving values for 3500 keys
>> > Having read the datastore docs it seems like this could be a possible
>> > schema:
>> > from google.appengine.ext import db
>> > class Index(db.Model):
>> >     hash=db.IntegerProperty(required=True)
>> >     values=db.BlobProperty(required=True)
>> > I would be grateful if anyone could give me some advice or tips on how this
>> > might perform on AppEngine in terms of query performance, cost and
>> > minimizing metadata/index overhead. It sounds like 4.5 billion*metadata
>> > storage could be the killer.
>> > Cheers,
>> > Donovan
>>
>> > --
>> > You received this message because you are subscribed to the Google Groups
>> > "Google App Engine" group.
>> > To post to this group, send email to [email protected].
>> > To unsubscribe from this group, send email to
>> > [email protected].
>> > For more options, visit this group at
>> >http://groups.google.com/group/google-appengine?hl=en.
>

