Can't it just scan the index to get that? I assumed the index has an
entry for every fileid in the table. In my over-simplified mental
picture, the table looks like this:
ctid|fileid|column|column|column|column
ctid|fileid|column|column|column|column
ctid|fileid|column|column|column|column
ctid|fileid|column|column|column|column
etc.
While the index looks like this:
fileid|ctid
fileid|ctid
fileid|ctid
fileid|ctid
...
So I expected scanning the index to be faster, since it still has
everything needed to do the count. Or is it because I said COUNT(*), so
it needs to look at the other columns in the table? I really just want
the number of "hits", not the number of records with distinct values or
anything like that. My understanding was that COUNT(*) does exactly
that and doesn't actually look at the columns themselves.
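For reference, here is a quick way to see which plan the planner
actually picks (same table and column names as in the query quoted
below; this is just a sketch I have not run against the real data):

-- Show the plan chosen for the duplicate-finding query
EXPLAIN
SELECT fileid, COUNT(*)
FROM file
GROUP BY fileid
HAVING COUNT(*) > 1;

-- Temporarily discourage sequential scans in this session only,
-- to see what an index-based plan would cost by comparison
SET enable_seqscan = off;
EXPLAIN
SELECT fileid, COUNT(*)
FROM file
GROUP BY fileid
HAVING COUNT(*) > 1;
RESET enable_seqscan;

Comparing the estimated costs of the two plans should show whether the
planner even considers an index scan here, or whether it really does
think the sequential scan is cheaper.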
Adrian Klaver wrote:
-------------- Original message ----------------------
From: William Garrison <[EMAIL PROTECTED]>
I am looking for records with duplicate keys, so I am running this query:
SELECT fileid, COUNT(*)
FROM file
GROUP BY fileid
HAVING COUNT(*) > 1;
The table has a (non-unique) index on fileid, so I am surprised that
Postgres is doing a table scan. The database is >15GB, and there are a
number of fairly large string columns in the table, so I am very
surprised that scanning the index is not faster than scanning the
table. Any thoughts on that? Is scanning the table really faster than
scanning the index? Is there a reason it needs anything other than
the index?
I may be missing something, but it would have to scan the entire table to get
all the occurrences of each fileid in order to do the count(*).
--
Adrian Klaver
[EMAIL PROTECTED]