> On Sep 23, 2021, at 10:36 AM, Geoff Winkless <pgsqlad...@geoff.dj> wrote: > > On Wed, 22 Sept 2021 at 21:05, Israel Brewster <ijbrews...@alaska.edu > <mailto:ijbrews...@alaska.edu>> wrote: > I was wondering if there was any way to improve the performance of this query: > > SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY > station; > > If you have tables of possible stations and channels (and if not, why not?), > then an EXISTS query, something like > > SELECT stations.name <http://stations.name/>, ARRAY_AGG(channels.name > <http://channels.name/>) > FROM stations, channels > WHERE EXISTS > (SELECT FROM data WHERE data.channels=channels.name > <http://channels.name/> AND data.station=stations.name > <http://stations.name/>) > GROUP BY stations.name <http://stations.name/> > > will usually be much faster, because it can stop scanning after the first > match in the index.
So that one ran in about 5 minutes as well - apparently the time it takes to scan the index, given the similarity of run times for each of the different queries: https://explain.depesz.com/s/w46h <https://explain.depesz.com/s/w46h> It’s making me think though, because this similar (though incomplete, and therefore incorrect result) query runs in only 19ms (https://explain.depesz.com/s/iZnN <https://explain.depesz.com/s/iZnN>): SELECT stations.name, array_agg(channels.channel) FROM stations,channels WHERE EXISTS (SELECT FROM data WHERE data.station=stations.id) GROUP BY stations.name It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes? --- Israel Brewster Software Engineer Alaska Volcano Observatory Geophysical Institute - UAF 2156 Koyukuk Drive Fairbanks AK 99775-7320 Work: 907-474-5172 cell: 907-328-9145 > > Geoff