Hmm, that's odd; the grouping (map/reduce/filter/lambda) is extremely quick
for me (even on a heavy data set).

My guess is that grouping would need to be done on a combination of field
name + value, and would need to let the user specify what batch size to use
(to prevent a MemoryError exception - or find some way to reduce the batch
size when a MemoryError is encountered).
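Something along these lines is what I have in mind -- a rough sketch only; the `Post` model, the sample data, and the batch size are all hypothetical, and the batch size would be user-configurable:

```python
from collections import defaultdict
from itertools import islice

def chunked(ids, size):
    """Yield lists of at most `size` IDs, so a single UPDATE never
    carries an unbounded IN (...) clause."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# group IDs on field name AND value, not just field name
updates = defaultdict(list)
changes = [(10, 'is_spam', 1), (11, 'is_spam', 1), (12, 'is_spam', 0)]
for _id, _field, _value in changes:
    updates[(_field, _value)].append(_id)

for (_field, _value), ids in updates.items():
    for batch in chunked(ids, 1000):
        # with Django this would be something like:
        # Post.objects.filter(id__in=batch).update(**{_field: _value})
        pass
```

This way one query handles every row getting the same value, and the chunking caps memory per query.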

If you end up introducing it into 3.0, I'll definitely be interested in
taking a look at the code :)

Cal

On Wed, Jun 22, 2011 at 3:17 PM, Thomas Weholt <thomas.weh...@gmail.com> wrote:

> On Wed, Jun 22, 2011 at 3:52 PM, Cal Leeming [Simplicity Media Ltd]
> <cal.leem...@simplicitymedialtd.co.uk> wrote:
> > Sorry, let me explain a little better.
> > (51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range:
> > 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked',
> > 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
> > map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked',
> > 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]
> > In the above example, it has found 49659 rows which need 'is_checked'
> > changing to the value '1' (same principle applied to the other 3), giving
> > a total of 51,130 database writes, split into 4 queries.
> > Those 4 fields have the IDs assigned to them:
> >
> >     if _f == 'block_images':
> >         _obj_incs.get('is_image_blocked').append(_hit_id)
> >         if _parent_id:
> >             _obj_incs.get('is_image_blocked').append(_parent_id)
> > Then I loop through those fields, and do an update() using the necessary
> > IDs:
> >
> >     # now apply the obj changes in bulk (massive speed improvements)
> >     for _key, _value in _obj_incs.iteritems():
> >         # update the child object
> >         Post.objects.filter(
> >             id__in=_value
> >         ).update(
> >             **{_key: 1}
> >         )
> > So in simple terms, we're not doing 51 thousand update queries; instead
> > we're grouping them into bulk queries based on the field to be updated. It
> > doesn't yet do grouping based on key AND value, simply because we didn't
> > need it at the time, but if we release the code for public use,
> > we'd definitely add this in.
> > Hope this makes sense, let me know if I didn't explain it very well lol.
> > Cal
>
> Actually, I started working on something similar, but tried to find
> sets of fields instead of just updating one field per update, but
> didn't finish it because the actual grouping of the fields seemed to
> take a lot of time/cpu/memory. Perhaps if I focused on updating one
> field at a time it would be simpler. Might look at it again for DSE
> 3.0 ;-)
>
> Thomas
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To post to this group, send email to django-users@googlegroups.com.
> To unsubscribe from this group, send email to
> django-users+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/django-users?hl=en.
>
>
