Re: Lucas Nussbaum 2011-10-29 <20111029045114.ga25...@xanadu.blop.info> > Clearly the problem here is the Seq scans on packages, that require reading > the > data from disk since the table usually doesn't stay in RAM. (the packages > table > takes 2 GB). Apparently, when the machine is idle, the runtime is reasonable > (I > could get a runtime of 14s). But if the query runs at the same time as an > import of data, it can take more than 200s. > > I tried to add an INDEX: > CREATE INDEX packages_pkgverdescr_idx on packages(package, > version,description);
Including description in the index doesn't really help as that's not an column that is searched for, but that should be retrieved. (packages_pkgverdescr_idx is still there and should be dropped again.) > QUERY PLAN > > ----------------------------------------------------------------------------------------------------------------------------------------- > HashAggregate (cost=499026.31..499722.83 rows=69652 width=58) (actual > time=14708.197..14751.840 rows=43791 loops=1) > -> Hash Join (cost=471046.39..497285.02 rows=348258 width=58) (actual > time=12295.665..14135.913 rows=444577 loops=1) > Hash Cond: (public.packages.package = public.packages.package) > Join Filter: (public.packages.version = > (max(public.packages.version))) > -> HashAggregate (cost=228999.76..229306.82 rows=24565 width=24) > (actual time=6314.553..6382.598 rows=43787 loops=1) > -> Seq Scan on packages (cost=0.00..225517.17 rows=696517 > width=24) (actual time=163.114..4332.862 rows=696049 loops=1) > -> Hash (cost=225517.17..225517.17 rows=696517 width=68) (actual > time=5980.429..5980.429 rows=696049 loops=1) > Buckets: 65536 Batches: 2 Memory Usage: 30568kB > -> Seq Scan on packages (cost=0.00..225517.17 rows=696517 > width=68) (actual time=167.867..4632.245 rows=696049 loops=1) > Total runtime: 14776.545 ms > (10 rows) The best I could get now was to use a window function to get rid of the self-join: EXPLAIN ANALYZE SELECT package, description, version FROM (SELECT packages.package, packages.description, packages.version, row_number() OVER (PARTITION BY packages.package ORDER BY version DESC) AS row FROM packages) AS sub WHERE row = 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Subquery Scan on sub (cost=433124.90..478921.49 rows=7046 width=96) (actual time=18954.689..22608.978 rows=44375 loops=1) Filter: (sub."row" = 1) -> WindowAgg (cost=433124.90..461307.42 rows=1409126 width=68) (actual time=18954.650..22210.341 rows=700467 loops=1) -> Sort (cost=433124.90..436647.71 rows=1409126 width=68) (actual time=18954.612..20714.929 rows=700467 loops=1) Sort Key: packages.package, packages.version Sort Method: external merge Disk: 53984kB -> Seq Scan on packages (cost=0.00..233817.26 rows=1409126 width=68) (actual time=0.376..6184.397 rows=700467 loops=1) Total runtime: 22832.212 ms (8 Zeilen) This is a better plan, but doesn't run faster than your version. (Ideally, Postgres would realize it could use one of the (package, version, ...) indexes instead of sorting the whole table.) > I also tried to rewrite the query into: > explain analyze select distinct packages.package, packages.description > from packages where (package, version) in (select package, max(version) from > packages_summary group by package); > > (packages_summary is the table with a restricted set of field from packages) That's still a pretty big table with several times more rows than should be retrieved (because of multiple archs or something like that). In the end, there's probably no way to optimize the query because effectively the query asks for (almost) all packages to be considered in the computation. Christoph -- c...@df7cb.de | http://www.df7cb.de/ -- To UNSUBSCRIBE, email to debian-qa-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/20111118150424.ga26...@msgid.df7cb.de