Hi Robert,

wow! Thanks for the great advice! I'll digest all of this over the
coming few days and will read through your slagg library. Fast
numerical aggregation is another problem that I know some projects
would benefit from solving. I used to do a lot of MDX work, so it will
be interesting to see how it compares.

The tip about keeping names short and using integer ids rather than
string key names where possible definitely seems wise:

from google.appengine.ext import db

print db.Key.from_path('Article', 12)
print db.Key.from_path('Article', '12')
print db.Key.from_path('Article', 4294967296)
print db.Key.from_path('Article', '4294967296')
print db.Key.from_path('I', 1)
print db.Key.from_path('I', 1 << 32)

agZjaHVybnJyDQsSB0FydGljbGUYDAw
agZjaHVybnJyDwsSB0FydGljbGUiAjEyDA
agZjaHVybnJyEQsSB0FydGljbGUYgICAgBAM
agZjaHVybnJyFwsSB0FydGljbGUiCjQyOTQ5NjcyOTYM
agZjaHVybnJyBwsSAUkYAQw
agZjaHVybnJyCwsSAUkYgICAgBAM


proves the point! It's a shame there is a six-character minimum limit
on the app name, as I'm sure that would shave a few gigabytes off the
index size. I wonder if anyone knows how to minimise these keys any
further?
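Measuring the encoded keys printed above makes the saving concrete (this is plain Python over the strings themselves, nothing App Engine specific):

```python
# Encoded keys as printed above, in the same order as the from_path calls.
keys = [
    'agZjaHVybnJyDQsSB0FydGljbGUYDAw',                # Article, id 12
    'agZjaHVybnJyDwsSB0FydGljbGUiAjEyDA',             # Article, name '12'
    'agZjaHVybnJyEQsSB0FydGljbGUYgICAgBAM',           # Article, id 2**32
    'agZjaHVybnJyFwsSB0FydGljbGUiCjQyOTQ5NjcyOTYM',   # Article, name '4294967296'
    'agZjaHVybnJyBwsSAUkYAQw',                        # I, id 1
    'agZjaHVybnJyCwsSAUkYgICAgBAM',                   # I, id 2**32
]
for k in keys:
    print(len(k), k)
```

The short kind with a small integer id ('I', 1) comes out eight characters shorter than the same id under 'Article', and each string name is longer than its equivalent integer id.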

If I go with this schema:

class I(db.Model):
    v=db.UIntListProperty(required=True)

which makes use of this recipe:

http://appengine-cookbook.appspot.com/attachment/?id=ahJhcHBlbmdpbmUtY29va2Jvb2tyEQsSCkF0dGFjaG1lbnQY2x0M

It's quite possible that the key size will be the biggest user of space!
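I haven't dug into the recipe yet, so this is only my guess at what a UIntListProperty does under the hood: delta-encode the sorted ids, then pack each delta as a base-128 varint. The function names here are mine, not the recipe's:

```python
def encode_varint(n):
    """Pack one unsigned int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)  # High bit set: more bytes follow.
        else:
            out.append(byte)
            return bytes(out)

def compress_ints(values):
    """Delta-encode a sorted list of ints, then varint-pack the deltas."""
    out = bytearray()
    prev = 0
    for v in sorted(values):
        out.extend(encode_varint(v - prev))
        prev = v
    return bytes(out)

def decompress_ints(data):
    """Inverse of compress_ints: unpack varints, then undo the deltas."""
    values, cur, shift, prev = [], 0, 0, 0
    for byte in bytearray(data):
        cur |= (byte & 0x7f) << shift
        shift += 7
        if not byte & 0x80:  # High bit clear: last byte of this varint.
            prev += cur
            values.append(prev)
            cur, shift = 0, 0
    return values
```

Round-tripping a list like [1, 5, 12, 99999] gets it back sorted and intact, and near-consecutive ids pack into a byte or two each.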

Well, I'll give it a go, have a look at Appstats, and will report
back with an answer...

Cheers,
Donovan

On 18 December 2010 22:55, Robert Kluin <[email protected]> wrote:
> Hey Donovan,
>   Some inline comments follow.
>
>
> On Sat, Dec 18, 2010 at 15:06, Donovan <[email protected]> wrote:
>> Hi Robert,
>>
>> thank you very much for the well considered and thorough reply! Much
>> appreciated!
>>
>> The likely number of hashes per document is roughly the length of the
>> document, minus duplicate hashes and minus a fixed number (typically
>> 15), so roughly 3500. There will very likely be a combination of the
>> same hashes in a batch import of documents, and there will also be
>> probabilistic hash collisions due to the nature of the hash algorithm.
>> These are fine, however.
>
> You'll probably need to add an additional step or two to handle
> processing 3,500 hashes per document.  You might need to try a few
> different solutions to figure out what is best for your use case.
> Perhaps you can take advantage of the new long-running (10 minute)
> tasks and just loop a few times, inserting batches of a couple hundred
> AddToIndex entities for each document.  Another option is to simply
> use a per-document hash list as a queue, and keep processing it until
> the 'queue' is empty.  You could probably run creation tasks in
> parallel with some sharding and speed up document processing a bit.
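The per-document 'hash list as a queue' option could be sketched like this; process_document_hashes and handle_batch are hypothetical names, and the actual datastore/taskqueue calls are left as comments:

```python
BATCH_SIZE = 200  # Hashes written per batch; tune against RPC limits.

def process_document_hashes(pending, handle_batch):
    """Drain a per-document hash 'queue' in fixed-size batches.

    pending: list of hashes still to be indexed for one document.
    handle_batch: callback that writes one batch, e.g. a db.put of a
    couple hundred AddToIndex entities.
    Returns the number of batches processed.
    """
    batches = 0
    while pending:
        batch, pending = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
        handle_batch(batch)
        batches += 1
        # In a real task you'd checkpoint `pending` back onto the
        # document (or re-enqueue a continuation task) here, so a retry
        # resumes where it left off instead of starting over.
    return batches
```

At 3,500 hashes per document that is 18 batches of at most 200, which fits comfortably in a long-running task.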
>
>
>>
>> The documents will arrive in batches of 1000. Your tips on task queues
>> describe exactly what I'm working on this minute!!! Your example code
>> is very useful.
>>
> The solution I suggested handles multiple documents with the same
> hash at once quite well.  In fact, that would (potentially) make it
> more efficient, since it can add multiple documents to an index at
> once.
>
>
>> I think the two most useful tips I can glean from your proposed
>> solution are to use the hash as the key name and the use of
>> Model.get_by_key_name. I did not know that this could accept a list of
>> keys, thus only using one query. This is fantastic! Do you know if
>> there are any limits on the length of the list? I thought perhaps the
>> limit of 30 would cause problems when using an IN query, but maybe
>> this does not apply to key names.
>
> I do not remember there being an explicit limit on the number you can
> fetch by name.  I've been able to fetch a few hundred without any
> major issues.  You'll probably want to explore ways of breaking the
> fetch up into smaller batches of a few hundred using async datastore
> calls though.  Just do some tests and benchmarks to find out what will
> be best.  You're not going to want to use IN for this; use keys
> wherever possible -- it will be significantly faster.  Also, make sure
> you use Appstats to help you minimize the number of (serial) RPC calls
> you are making.
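Breaking the fetch into batches might look roughly like this; `chunks` and `fetch_indexes` are my names for the sketch, and the per-batch call shown is the ordinary synchronous one (swap in whichever async mechanism you settle on):

```python
def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_indexes(hashes, fetch_batch, size=250):
    """Fetch Index entities for all hashes, `size` key names at a time.

    fetch_batch would be e.g. Index.get_by_key_name; batching keeps any
    single RPC down to a few hundred keys.
    """
    results = []
    for batch in chunks(hashes, size):
        results.extend(fetch_batch(batch))
    return results
```

With async calls you would issue the per-batch fetches concurrently instead of in this serial loop; the chunking stays the same.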
>
>
>>
>> I don't need to preserve the document hashes, as all the queries will
>> come from a separate corpus, and they will be hashed at query time.
>>
>> One question I still have is how to choose a suitable hash width. 4.5
>> billion hashes presumes a hash width of 32 bits, but the overhead of
>> the metadata and indexes for that many entities could be unhelpful. I
>> can reduce the hash width to anything between 24 and 32 bits to
>> minimize the overhead. Do you have any thoughts on this?
>
> To reduce the metadata size, mark as many things as indexed=False as
> possible, particularly on ListProperties.  Also, use short names for
> everything -- entities and properties -- names like 'D' instead of
> 'Document'.  In some of the examples I simply used the encoded key;
> instead you should use the key name / id, which will be shorter and
> save you some more space.
>
> As far as the hash size goes, I guess it depends on what you're doing
> and the relative cost of a false match.  Will the increase in false
> positives outweigh the cost of an extra couple gigs of data per month?
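Back-of-envelope, with the numbers from the original post (4.5 billion keys at ~8 document ids each) and assuming uniformly distributed hashes, the width trade-off looks like this:

```python
TOTAL_REFS = 4.5e9 * 8  # ~36 billion (hash, doc id) pairs, per the original post.

for bits in (24, 28, 32):
    buckets = 2 ** bits
    avg_ids_per_bucket = TOTAL_REFS / buckets
    print('%d-bit hash: %d buckets, ~%.0f ids per bucket' %
          (bits, buckets, avg_ids_per_bucket))
```

Every 4 bits shaved off the hash multiplies the average posting-list length -- and with it the false-match rate -- by 16, so the storage saved on entity count gets traded for longer value blobs and more spurious matches.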
>
>
> In my implementation, I used the Aggregator from my slagg
> (https://bitbucket.org/thebobert/slagg/src) lib; I'll probably upload
> an example of a similar use-case sometime this weekend.  For what
> you're doing you would probably want to customize the 'first steps' in
> the process, but I suspect it would work for you with relatively few
> modifications.
>
> You might also want to check out the new Pipeline API; it looks like a
> robust tool for building workflows.  I am not sure about the overhead
> / setup time as I have not played with it yet.
>
>
> Robert
>
>
>>
>> Thanks again for your great answer!
>>
>> Cheers,
>> Donovan.
>>
>>
>> On Dec 18, 7:39 pm, Robert Kluin <[email protected]> wrote:
>>> Hi Donovan,
>>>   It sounds to me like the hash should be used as the key.  I would
>>> probably use a structure something like this:
>>>
>>> class Document(db.Model):
>>>     data = db.BlobProperty()  # Original 'raw' document data.
>>>     hashes = db.StringListProperty(indexed=False)  # Hash values for this document.
>>>
>>> class AddToIndex(db.Model):
>>>     """Set key_name to document_id"""
>>>     index = db.StringProperty(required=True) # hash key
>>>
>>> class Index(db.Model):
>>>     """Set key_name to your hash value."""
>>>     index = db.StringListProperty(required=True)
>>>
>>> I am assuming the documents are small enough to fit into a
>>> BlobProperty; if not, you can use the blobstore to achieve a similar
>>> result.  You also do not tell us how many hashes there will typically
>>> be per document, and that number is important.  Are you likely to get
>>> multiple documents at one time that have overlapping hashes?  That
>>> might impact the design too.  Also, you don't really tell us how you
>>> get these documents, or exactly what you will be doing with them (as
>>> in, how is the data presented?).
>>>
>>> So I'll outline one possible general process -- I'm sure it could be
>>> significantly improved for your use case.
>>>
>>> 1) A document *somehow* appears for the system to process.  Compute
>>> the hash values of the content, and store the doc entity.
>>>
>>>     # If computing the hashes is a slow process, split it into more steps.
>>>     hashes = compute_hashes(doc_data)
>>>     doc_key = str(Document(data=doc_data, hashes=hashes).put())
>>>     # One AddToIndex entity per (document, hash) pair.  A composite
>>>     # key_name keeps them unique and makes retries idempotent.  (They
>>>     # can't share a transaction with the document, since each root
>>>     # entity is its own entity group; batch this put if the hash
>>>     # count is in the thousands.)
>>>     db.put([AddToIndex(key_name='%s|%s' % (doc_key, h), index=h)
>>>             for h in hashes])
>>>
>>> 2) In a processing task, find documents that need to be added to indexes.
>>>
>>>     items_to_add = AddToIndex.all().fetch(50)
>>>     seen_indexes = set()
>>>     for item in items_to_add:
>>>         if item.index not in seen_indexes:
>>>             insert_index_builder_tasks(item.index)
>>>             seen_indexes.add(item.index)
>>>
>>>    Your index_builder_task should insert named tasks to prevent fork
>>> bombs.  You might use some modulus of the current time in
>>> conjunction with a (memcache-based) count for the hash; that way, if
>>> you do hit the same hash five minutes apart it can still be
>>> processed, but if the same initial processing task runs three times
>>> it won't fork.
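The named-task idea boils down to giving every builder task a deterministic name per (hash, time window). A minimal sketch -- the name scheme and window length here are illustrative, not a prescription:

```python
import time

WINDOW = 300  # Seconds; tasks for the same hash within one window collapse.

def builder_task_name(index_hash, now=None):
    """Deterministic task name: same hash + same time window -> same name.

    Adding a task whose name was already used raises the task queue's
    TaskAlreadyExistsError, which the caller simply swallows -- that is
    what stops the fork bomb.
    """
    if now is None:
        now = time.time()
    return 'idx-%s-%d' % (index_hash, int(now) // WINDOW)
```

So two processing tasks that see the same hash seconds apart produce the same name and only one builder runs, while the same hash five-plus minutes later rolls into a new window and can be processed again.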
>>>
>>> 3) The index builder only needs the hash to do its thing:
>>>
>>>     index_hash = self.request.get('index_hash')
>>>     work = AddToIndex.all(keys_only=True).filter(
>>>         'index =', index_hash).fetch(200)
>>>     # Each AddToIndex key_name starts with the document key.
>>>     doc_keys = [k.name().split('|')[0] for k in work]
>>>     def txn():
>>>         index = Index.get_by_key_name(index_hash)
>>>         if index is None:
>>>             index = Index(key_name=index_hash, index=doc_keys)
>>>         else:
>>>             index.index.extend(doc_keys)  # Add new docs to index
>>>         index.index = list(set(index.index))  # Remove any dupes
>>>         index.put()
>>>     db.run_in_transaction(txn)
>>>     db.delete(work)
>>>
>>> 4) Given a document, finding other documents with similar content is
>>> pretty easy, and only needs one _fetch_!
>>>
>>>    document = Document.get(document_key)
>>>    indexes = Index.get_by_key_name(document.hashes)
>>>
>>>    Now you can do whatever you need with the list of matching
>>> documents in the index lists!
>>>
>>> I implemented something very similar to this process a couple weeks
>>> ago; so far it seems to be working quite well for me.  It uses a lot
>>> of tasks, but they are small and very fast.
>>>
>>> Robert
>>>
>>> On Sat, Dec 18, 2010 at 08:20, Donovan Hide <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > I have a custom index of a large amount of content that works by
>>> > creating a 32-bit hash for sections of text. Each document id is
>>> > stored against this hash, and lookups involve hashing the input and
>>> > retrieving the matching ids. Currently I use node.js to serve the
>>> > index and hadoop to generate it. However, this is an expensive
>>> > operation in terms of processing, and requires an SSD drive for
>>> > decent serving performance. The scale of the index is as follows:
>>> >
>>> > - Up to 4.5 billion keys
>>> > - An average of 8 document ids per key, delta-encoded and then
>>> >   variable-integer encoded
>>> > - Lookups on average involve retrieving values for 3500 keys
>>> >
>>> > Having read the datastore docs it seems like this could be a
>>> > possible schema:
>>> > from google.appengine.ext import db
>>> > class Index(db.Model):
>>> >     hash=db.IntegerProperty(required=True)
>>> >     values=db.BlobProperty(required=True)
>>> > I would be grateful if anyone could give me some advice or tips on
>>> > how this might perform on AppEngine in terms of query performance,
>>> > cost and minimizing metadata/index overhead. It sounds like
>>> > 4.5 billion * metadata storage could be the killer.
>>> >
>>> > Cheers,
>>> > Donovan
>>>
>>> > --
>>> > You received this message because you are subscribed to the Google Groups
>>> > "Google App Engine" group.
>>> > To post to this group, send email to [email protected].
>>> > To unsubscribe from this group, send email to
>>> > [email protected].
>>> > For more options, visit this group at
>>> > http://groups.google.com/group/google-appengine?hl=en.
>>
>>
>>
>
>
>

