Hi Robert,

Thank you very much for the well-considered and thorough reply! Much
appreciated!

The likely number of hashes per document is roughly the length of the
document, minus duplicate hashes and minus a fixed number (typically
15), so roughly 3,500. A batch import of documents will very likely
contain repeated hashes, and there will also be probabilistic hash
collisions due to the nature of the hash algorithm. These are fine,
however.
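To make that concrete, the shingle-style hashing I have in mind looks roughly like this (a sketch only: the 16-character window, which accounts for the fixed offset of 15, and CRC32 stand in for my actual window size and hash function):

```python
import zlib

def shingle_hashes(text, window=16):
    """Return the deduplicated 32-bit hashes of every overlapping
    window of the text: len(text) - window + 1 hashes before dedup."""
    hashes = set()
    for i in range(len(text) - window + 1):
        shingle = text[i:i + window]
        hashes.add(zlib.crc32(shingle.encode("utf-8")) & 0xFFFFFFFF)
    return hashes
```

For a 3,500-character document this yields roughly 3,485 hashes before deduplication, which matches the numbers above.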

The documents will arrive in batches of 1000. Your tips on task queues
describe exactly what I'm working on this minute!!! Your example code
is very useful.

I think the two most useful tips I can glean from your proposed
solution are to use the hash as the key name and to use
Model.get_by_key_name. I did not know that it could accept a list of
keys and thus needs only one fetch. This is fantastic! Do you know
whether there are any limits on the length of the list? I thought the
limit of 30 values in an IN query might cause problems, but maybe that
does not apply to key names.
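If there does turn out to be a cap, I can always chunk the key list myself and issue one get per chunk; something like this (the batch size of 500 is just a guess on my part, not a documented limit):

```python
def chunked(keys, size=500):
    """Split a list of key names into fixed-size batches."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]

# Usage against the datastore would then look roughly like:
#   results = []
#   for batch in chunked(hash_key_names):
#       results.extend(Index.get_by_key_name(batch))
```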

I don't need to preserve the document hashes, as all the queries will
come from a separate corpus and will be hashed at query time.

One question I still have is how to choose a suitable hash width. 4.5
billion hashes presumes a hash width of 32 bits, but the overhead of
the metadata and indexes for that many entities could be unhelpful. I
can reduce the hash width to anything between 24 and 32 bits to
minimize the overhead. Do you have any thoughts on this?
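To put rough numbers on that trade-off, a standard occupancy estimate gives the expected number of distinct key entities at each width, assuming truncation scatters the 4.5 billion distinct full-width hashes roughly uniformly (the 4.5 billion figure is from my original post; everything else is illustrative):

```python
import math

def expected_distinct_keys(n, bits):
    """Expected number of occupied buckets when n distinct full-width
    hashes are truncated to `bits` bits, assuming uniform scatter."""
    space = 2.0 ** bits
    return space * (1.0 - math.exp(-n / space))

n = 4.5e9  # distinct 32-bit hashes, from the original post
for bits in (24, 26, 28, 30):
    keys = expected_distinct_keys(n, bits)
    print("%d bits -> ~%.2e entities (~%.0f old keys merged into each)"
          % (bits, keys, n / keys))
```

Each merged key multiplies the average list length (currently ~8 document ids), so the entity count and its per-entity metadata shrink dramatically at the cost of much longer value lists per key.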

Thanks again for your great answer!

Cheers,
Donovan.


On Dec 18, 7:39 pm, Robert Kluin <[email protected]> wrote:
> Hi Donovan,
>   It sounds to me like the hash should be used as the key.  I would
> probably use a structure something like this:
>
> class Document(db.Model):
>     data = db.BlobProperty()  # Original 'raw' document data.
>     hashes = db.StringListProperty(indexed=False)  # Hash values for this document.
>
> class AddToIndex(db.Model):
>     """Set key_name to '<document_id>:<hash>' so each pending
>     (document, hash) pair is a distinct entity."""
>     index = db.StringProperty(required=True)  # hash key
>
> class Index(db.Model):
>     """Set key_name to your hash value."""
>     index = db.StringListProperty(required=True)
>
> I am assuming the documents are small enough to fit into a
> BlobProperty; if not, you can use the Blobstore to achieve a similar
> result.  You also do not tell us how many hashes there will typically
> be per document; the hashes per document is important.  Are you likely
> to get multiple documents at one time that have overlapping hashes?
> That might impact the design too.  Also, you don't really tell us
> how you get these documents, or exactly what you will be doing with
> them (as in how the data is presented).
>
> So I'll outline one possible general process -- I'm sure it could be
> significantly improved for your use case.
>
> 1) A document *somehow* appears for the system to process.  Compute
> the hash values of the content, and store the doc entity.
>
>     # If computing the hashes is a slow process, split it into more steps.
>     hashes = compute_hashes(doc_data)
>     def txn():
>         doc_key = str(Document(data=doc_data, hashes=hashes).put())
>         # One AddToIndex entity per (document, hash) pair; keying on the
>         # doc_key alone would make the entities overwrite each other.
>         db.put([AddToIndex(key_name='%s:%s' % (doc_key, h), index=h)
>                 for h in hashes])
>     db.run_in_transaction(txn)
>
> 2) In a processing task, find documents that need to be added to indexes.
>
>     items_to_add = AddToIndex.all().fetch(50)
>     seen_indexes = set()
>     for item in items_to_add:
>         if item.index not in seen_indexes:
>             insert_index_builder_tasks(item.index)
>             seen_indexes.add(item.index)
>
>    Your index_builder_task should insert named tasks to prevent fork
> bombs.  So you might use some modulus of the current time in
> conjunction with a (memcache based) count for the hash, that way if
> you do hit the same hash five minutes apart it can still process but
> if the same initial processing task runs three times it won't fork.
>
> 3) The index builder only needs the hash to do its thing:
>
>     index_hash = self.request.get('index_hash')
>     work = AddToIndex.all(keys_only=True).filter('index', index_hash).fetch(200)
>     def txn():
>         index = Index.get_by_key_name(index_hash)
>         if index is None:  # First time this hash has been seen.
>             index = Index(key_name=index_hash, index=[])
>         index.index.extend([str(doc) for doc in work])  # Add new docs to index
>         index.index = list(set(index.index))  # Remove any dupes
>         index.put()
>     db.run_in_transaction(txn)
>     db.delete(work)
>
> 4) Given a document, finding other documents with similar content is
> pretty easy, and only needs one _fetch_!
>
>    document = Document.get(document_key)
>    indexes = Index.get_by_key_name(document.hashes)
>
>    Now you can do whatever you need with the list of matching
> documents in the index lists!
>
> I implemented something very similar to this process a couple of
> weeks ago; so far it seems to be working quite well for me.  It uses
> a lot of tasks, but they are small and very fast.
>
> Robert
>
> On Sat, Dec 18, 2010 at 08:20, Donovan Hide <[email protected]> wrote:
> > Hi,
> > I have a custom index of a large amount of content that works by creating a
> > 32 bit hash for sections of text. Each document id is stored against this
> > hash and lookups involve hashing the input and retrieving the matching ids.
> > Currently I use node.js to serve the index and hadoop to generate it.
> > However this is an expensive operation in terms of processing and requires
> > an SSD drive for decent serving performance. The scale of the index is as
> > follows:
> > Up to 4.5 billion keys
> > An average of 8 document ids per key, delta-encoded and then variable
> > integer encoded.
> > Lookups on average involve retrieving values for 3500 keys
> > Having read the datastore docs, it seems like this could be a possible
> > schema:
> >
> > from google.appengine.ext import db
> >
> > class Index(db.Model):
> >     hash = db.IntegerProperty(required=True)
> >     values = db.BlobProperty(required=True)
> > I would be grateful if anyone could give me some advice or tips on how
> > this might perform on App Engine in terms of query performance, cost and
> > minimizing metadata/index overhead. It sounds like 4.5 billion times the
> > per-entity metadata storage could be the killer.
> > Cheers,
> > Donovan
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
