Hi Pavel,

Here are the explains you asked for:
explain analyze select string_agg(id::text, ',') from things group by guid;

                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=41434.485..53195.185 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=41434.433..44992.736 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.027..21429.179 rows=2378626 loops=1)
 Total runtime: 56295.362 ms
(6 rows)

explain analyze select array_agg(id::text) from things group by guid;

                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=23953.922..38157.059 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=23953.847..27527.316 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.007..4941.752 rows=2378626 loops=1)
 Total runtime: 41280.897 ms
(6 rows)

Both of these now run on the machine without the memory inflating. However, I just ran this one, and it blew up:

explain with t as (
  select id, guid,
         md5(concat_ws(':', fields -> 'a', fields -> 'b', fields -> 'c',
                       fields -> 'd', fields -> 'e', foo_id::text))
  from things
)
select md5, count(id), array_agg(id)
from t
group by 1
having count(id) > 1;

-Robert
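Given that string_agg completes where array_agg blows up, one workaround sketch for queries that really need an array result is to aggregate as text and split back into an array afterwards. This is untested here, and the string_to_array step is an assumption, not something tried in the thread:

-- aggregate with string_agg, then convert the text list back to an integer array
select guid,
       string_to_array(string_agg(id::text, ','), ',')::int[] as ids
from things
group by guid;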
On Tue, Aug 20, 2013 at 1:53 PM, Pavel Stehule <pavel.steh...@gmail.com> wrote:

> Can you send an EXPLAIN result for both use cases?
>
> Pavel
>
>
> 2013/8/19 Robert Sosinski <rsosin...@ticketevolution.com>
>
>> At the moment, all guids are distinct; however, before I zapped the
>> duplicates, there were 280 duplicates.
>>
>> Currently, there are over 2 million distinct guids.
>>
>> -Robert
>>
>>
>> On Mon, Aug 19, 2013 at 11:12 AM, Pavel Stehule
>> <pavel.steh...@gmail.com> wrote:
>>
>>> 2013/8/19 Robert Sosinski <rsosin...@ticketevolution.com>
>>>
>>>> Hi Pavel,
>>>>
>>>> What kind of example do you need? I can't give you the actual data I
>>>> have in the table, but I can give you an example query and the schema
>>>> attached below. From there, I would just put in 2 million rows worth
>>>> 1.2 GB of data. Average sizes of the extended columns (using the
>>>> pg_column_size function), in bytes, are:
>>>>
>>>> guid: 33
>>>> name: 2.41
>>>> currency: 4
>>>> fields: 120.32
>>>>
>>>> example query:
>>>>
>>>> -- find duplicate records using a guid
>>>> select guid, array_agg(id) from orders group by guid;
>>>>
>>>
>>> How many distinct guids are there, and how many duplicates?
>>>
>>> regards
>>>
>>> Pavel
>>>
>>>
>>>> example schema:
>>>>
>>>>                                              Table "public.things"
>>>>    Column   |            Type             |                      Modifiers                       | Storage  | Stats target | Description
>>>> ------------+-----------------------------+------------------------------------------------------+----------+--------------+-------------
>>>>  id         | integer                     | not null default nextval('things_id_seq'::regclass)  | plain    |              |
>>>>  version    | integer                     | not null                                              | plain    |              |
>>>>  created_at | timestamp without time zone | not null                                              | plain    |              |
>>>>  updated_at | timestamp without time zone | not null                                              | plain    |              |
>>>>  foo_id     | integer                     | not null                                              | plain    |              |
>>>>  bar_id     | integer                     | not null                                              | plain    |              |
>>>>  baz_id     | integer                     | not null                                              | plain    |              |
>>>>  guid       | character varying           | not null                                              | extended |              |
>>>>  name       | character varying           | not null                                              | extended |              |
>>>>  price      | numeric(12,2)               | not null                                              | main     |              |
>>>>  currency   | character varying           | not null                                              | extended |              |
>>>>  amount     | integer                     | not null                                              | plain    |              |
>>>>  the_date   | date                        | not null                                              | plain    |              |
>>>>  fields     | hstore                      |                                                       | extended |              |
>>>> Indexes:
>>>>     "things_pkey" PRIMARY KEY, btree (id)
>>>>     "things_foo_id_idx" btree (foo_id)
>>>>     "things_bar_id_idx" btree (bar_id)
>>>>     "things_baz_id_idx" btree (baz_id)
>>>>     "things_guid_uidx" UNIQUE, btree (guid)
>>>>     "things_lpad_lower_name_eidx" btree (lpad(lower(name::text), 10, '0'::text))
>>>>     "things_price_idx" btree (price)
>>>> Foreign-key constraints:
>>>>     "things_foo_id_fkey" FOREIGN KEY (foo_id) REFERENCES foos(id)
>>>>     "things_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES bars(id)
>>>>     "things_baz_id_fkey" FOREIGN KEY (baz_id) REFERENCES bazs(id)
>>>> Triggers:
>>>>     timestamps_trig BEFORE INSERT OR UPDATE ON things FOR EACH ROW
>>>>     EXECUTE PROCEDURE timestamps_tfun()
>>>>
>>>> Let me know if you need anything else.
>>>>
>>>> Thanks,
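As an aside, the per-column averages quoted earlier can be measured with pg_column_size; a sketch of that kind of measurement query (an assumption — the exact query used was not shown):

-- pg_column_size returns the number of bytes used to store each value
select avg(pg_column_size(guid))     as avg_guid_bytes,
       avg(pg_column_size(name))     as avg_name_bytes,
       avg(pg_column_size(currency)) as avg_currency_bytes,
       avg(pg_column_size(fields))   as avg_fields_bytes
from things;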
>>>>
>>>> On Mon, Aug 19, 2013 at 3:29 AM, Pavel Stehule
>>>> <pavel.steh...@gmail.com> wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> please, can you send some example or test?
>>>>>
>>>>> Regards
>>>>>
>>>>> Pavel Stehule
>>>>>
>>>>>
>>>>> 2013/8/19 Robert Sosinski <rsosin...@ticketevolution.com>
>>>>>
>>>>>> When using array_agg on a large table, memory usage seems to spike
>>>>>> up until Postgres crashes with the following error:
>>>>>>
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [2] WARNING:  terminating connection
>>>>>> because of crash of another server process
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [3] DETAIL:  The postmaster has
>>>>>> commanded this server process to roll back the current transaction
>>>>>> and exit, because another server process exited abnormally and
>>>>>> possibly corrupted shared memory.
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [4] HINT:  In a moment you should be
>>>>>> able to reconnect to the database and repeat your command.
>>>>>>
>>>>>> I've definitely isolated it down to array_agg: when I changed the
>>>>>> query to use string_agg, it worked fine. I also tried array_agg in a
>>>>>> few different queries, all of which yielded the same issue, and
>>>>>> swapping in string_agg fixed it each time.
>>>>>>
>>>>>> This particular table has over 2 million rows and is 1.2 GB, and
>>>>>> when I ran the query while watching htop, the virtual size of the
>>>>>> Postgres process ballooned to 13.9G before crashing.
>>>>>>
>>>>>> The version of Postgres I am using is: PostgreSQL 9.2.4 on
>>>>>> x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro
>>>>>> 4.7.2-2ubuntu1) 4.7.2, 64-bit
>>>>>>
>>>>>> Any help would be much appreciated, thanks!
>>>>>>
>>>>>> -Robert
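For anyone who wants to reproduce the failure, a minimal synthetic setup in the same shape as the queries above (a sketch only; the test table and row count are invented, not Robert's data):

-- ~2 million rows with mostly-distinct group keys, matching the shape of the report
create table things_test as
select g as id, md5(g::text) as guid
from generate_series(1, 2000000) g;

-- reported to complete:
select guid, string_agg(id::text, ',') from things_test group by guid;

-- reported to exhaust memory and crash on PostgreSQL 9.2.4:
select guid, array_agg(id) from things_test group by guid;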