Hi Pavel,

Here are the EXPLAIN ANALYZE results you asked for:

explain analyze select string_agg(id::text,',') from things group by guid;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=41434.485..53195.185 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=41434.433..44992.736 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.027..21429.179 rows=2378626 loops=1)
 Total runtime: 56295.362 ms
(6 rows)


explain analyze select array_agg(id::text) from things group by guid;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=23953.922..38157.059 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=23953.847..27527.316 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.007..4941.752 rows=2378626 loops=1)
 Total runtime: 41280.897 ms
(6 rows)



These queries seem to run fine on the machine now, and memory is not
inflating. However, I just ran this one, and it blew up:

explain with t as (
  select id, guid,
         md5(concat_ws(':', fields -> 'a', fields -> 'b', fields -> 'c',
                       fields -> 'd', fields -> 'e', foo_id::text))
  from things
)
select md5, count(id), array_agg(id)
from t
group by 1
having count(id) > 1;
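Since string_agg has not triggered the blow-up on the other queries, the same duplicate check might be worth trying with it here too. A sketch of that variant (same placeholder field names as above):

```sql
-- Same duplicate check, but accumulating ids as a comma-separated text
-- string instead of an array; string_agg has worked fine where
-- array_agg exhausted memory.
with t as (
  select id,
         md5(concat_ws(':', fields -> 'a', fields -> 'b', fields -> 'c',
                       fields -> 'd', fields -> 'e', foo_id::text)) as md5
  from things
)
select md5, count(id), string_agg(id::text, ',') as ids
from t
group by 1
having count(id) > 1;
```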

-Robert


On Tue, Aug 20, 2013 at 1:53 PM, Pavel Stehule <pavel.steh...@gmail.com> wrote:

> Can you send an EXPLAIN result for both use cases?
>
> Pavel
>
>
> 2013/8/19 Robert Sosinski <rsosin...@ticketevolution.com>
>
>> At the moment, all guids are distinct, however before I zapped the
>> duplicates, there were 280 duplicates.
>>
>> Currently, there are over 2 million distinct guids.
>>
>> -Robert
>>
>>
>> On Mon, Aug 19, 2013 at 11:12 AM, Pavel Stehule
>> <pavel.steh...@gmail.com> wrote:
>>
>>>
>>>
>>>
>>> 2013/8/19 Robert Sosinski <rsosin...@ticketevolution.com>
>>>
>>>> Hi Pavel,
>>>>
>>>> What kind of example do you need?  I can't give you the actual data I
>>>> have in the table, but I can give you an example query and the schema
>>>> attached below.  From there, I would just put in 2 million rows worth 1.2
>>>> GB of data.  Average sizes of the extended columns (using the
>>>> pg_column_size function) in bytes are:
>>>>
>>>> guid: 33
>>>> name: 2.41
>>>> currency: 4
>>>> fields: 120.32
>>>>
>>>> example query:
>>>>
>>>> -- find duplicate records using a guid
>>>> select guid, array_agg(id) from orders group by guid;
>>>>
>>>
>>> how many distinct guids are there, and how many duplicates?
>>>
>>> regards
>>>
>>> Pavel
>>>
>>>
>>>
>>>>
>>>> example schema:
>>>>                                      Table "public.things"
>>>>    Column   |            Type             |                      Modifiers                      | Storage  | Stats target | Description
>>>> ------------+-----------------------------+-----------------------------------------------------+----------+--------------+-------------
>>>>  id         | integer                     | not null default nextval('things_id_seq'::regclass) | plain    |              |
>>>>  version    | integer                     | not null                                            | plain    |              |
>>>>  created_at | timestamp without time zone | not null                                            | plain    |              |
>>>>  updated_at | timestamp without time zone | not null                                            | plain    |              |
>>>>  foo_id     | integer                     | not null                                            | plain    |              |
>>>>  bar_id     | integer                     | not null                                            | plain    |              |
>>>>  baz_id     | integer                     | not null                                            | plain    |              |
>>>>  guid       | character varying           | not null                                            | extended |              |
>>>>  name       | character varying           | not null                                            | extended |              |
>>>>  price      | numeric(12,2)               | not null                                            | main     |              |
>>>>  currency   | character varying           | not null                                            | extended |              |
>>>>  amount     | integer                     | not null                                            | plain    |              |
>>>>  the_date   | date                        | not null                                            | plain    |              |
>>>>  fields     | hstore                      |                                                     | extended |              |
>>>> Indexes:
>>>>     "things_pkey" PRIMARY KEY, btree (id)
>>>>     "things_foo_id_idx" btree (foo_id)
>>>>     "things_bar_id_idx" btree (bar_id)
>>>>     "things_baz_id_idx" btree (baz_id)
>>>>     "things_guid_uidx" UNIQUE, btree (guid)
>>>>     "things_lpad_lower_name_eidx" btree (lpad(lower(name::text), 10, '0'::text))
>>>>     "things_price_idx" btree (price)
>>>> Foreign-key constraints:
>>>>     "things_foo_id_fkey" FOREIGN KEY (foo_id) REFERENCES foos(id)
>>>>     "things_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES bars(id)
>>>>     "things_baz_id_fkey" FOREIGN KEY (baz_id) REFERENCES bazs(id)
>>>> Triggers:
>>>>     timestamps_trig BEFORE INSERT OR UPDATE ON things FOR EACH ROW EXECUTE PROCEDURE timestamps_tfun()
>>>>
>>>> Let me know if you need anything else.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> On Mon, Aug 19, 2013 at 3:29 AM, Pavel Stehule
>>>> <pavel.steh...@gmail.com> wrote:
>>>>
>>>>> Hello
>>>>>
>>>>> please, can you send some example or test?
>>>>>
>>>>> Regards
>>>>>
>>>>> Pavel Stehule
>>>>>
>>>>>
>>>>> 2013/8/19 Robert Sosinski <rsosin...@ticketevolution.com>
>>>>>
>>>>>> When using array_agg on a large table, memory usage seems to spike up
>>>>>> until Postgres crashes with the following error:
>>>>>>
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [2] WARNING: terminating connection
>>>>>> because of crash of another server process
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [3] DETAIL: The postmaster has
>>>>>> commanded this server process to roll back the current transaction and
>>>>>> exit, because another server process exited abnormally and possibly
>>>>>> corrupted shared memory.
>>>>>> 2013-08-17 18:41:02 UTC [2716]: [4] HINT: In a moment you should be
>>>>>> able to reconnect to the database and repeat your command.
>>>>>>
>>>>>> I've definitely isolated it down to using array_agg, as when I
>>>>>> changed the query to use string_agg, it worked fine.  I also tried using
>>>>>> array_agg on a few different queries, all yielding the same issue.
>>>>>>  Swapping in string_agg fixed the issue once more.
>>>>>>
>>>>>> This particular table has over 2 million rows and is 1.2 GB, and
>>>>>> when I ran the query while viewing htop, the virtual size of the
>>>>>> Postgres process ballooned to 13.9 GB before crashing.
>>>>>>
>>>>>> The version of Postgres I am using is: PostgreSQL 9.2.4 on
>>>>>> x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1)
>>>>>> 4.7.2, 64-bit
>>>>>>
>>>>>> Any help would be much appreciated, thanks!
>>>>>>
>>>>>> -Robert
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
