On Tue, Feb 9, 2010 at 12:07 AM, Jonathan Lange <[email protected]> wrote:
On Fri, Jan 29, 2010 at 12:22 PM, Abel Deuring
<[email protected]> wrote:
On 29.01.2010 10:54, Jonathan Lange wrote:
On Fri, Jan 29, 2010 at 9:53 AM, Jonathan Lange <[email protected]> wrote:
On Thu, Jan 28, 2010 at 7:30 PM, Abel Deuring
<[email protected]> wrote:


In other words, a deliberate denormalisation. What do you think, should
we start with a plain query without such a denormalsation, or do you
think that it is better to add the "cache column"
date_created_youngest_patch to the Bug table?

This is really Stuart's province. I don't have a strong opinion on this.

However, it seems to me that we're getting increasingly frequent
requests to denormalize data in order to make querying faster. Perhaps
we need a general answer, or some other technology.

Perhaps, but finding a general answer is not trivial, I think. "Think
harder to find a better query" is one answer, but won't give useful
results in each case; "We don't really need such a query" is another
(though Bryce had good reasons to ask for the sort order we're
discussing); "Let's accept denormalisation after thoughtful
consideration if we find no better way" seems most reasonable to me.

Regarding other technology, I'm quite curious about suggestions.

Mostly I mean a data warehouse.

In any case, this is Stuart's call. It's about the database and not
really about the nature of the product.

Bug searching and sorting has sucked for a long time. Our data model seems to 
represent things, but we cannot query it efficiently the ways we need to. I 
think this first became apparent when we added the 'private' flag to Bug 
several years ago.

I'm not happy about the cache columns we have added so far - its probably a 
mistake to mix the denormalized data with the normalized data. Its unclear and 
messy and can causes concurrency issues when cache data is being refreshed in 
bulk (not really an issue with bugs, but we see it with translations and code 
hosting).

I think the way forward is to define a bug search schema, which might just be a 
single super flat table (or perhaps two - a big one with public bug information 
and a small one with private ones), that supports efficient querying. The 
closest we have to this that I can think of is the Karma cache tables. I 
suspect this will also vastly simplify the hideous bug searching code as the 
queries that need to be generated will be much, much simpler.

The bug search schema can be updated in a number of ways - zope events, 
database triggers, cron jobs, messaging system. This is a secondary, separate 
problem.

Steps would be:

1) identify how we search, filter and sort our 500k+ bugs into reports useful 
to our end users.

2) define a suitable schema.

3) define database queries on the new schema to match our needs determined in 
step one. If the queries are not simple, go back to step 2.

4) benchmark on real data. If searches suck, go back to step 2.

5) Add code to keep the bug search schema in sync with reality with acceptable 
lag. Acceptable needs to be defined.

6) Land work

7+) Update bugs.launchpad.net pages to use the new search schema.


An alternative approach would be to not use the relational database for 
searching at all. Perhaps some other technology would be more helpful, such as 
a bugs search service in front of an in memory bugs database.


--
Stuart Bishop <[email protected]>
http://www.stuartbishop.net/

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

Reply via email to