[GENERAL] Moving Specific Data Across Schemas Including FKs

2015-04-23 Thread Cory Tucker
I have the need to move a specific set of data from one schema to another.
These schemas are on the same database instance and have all of the same
relations defined.  The SQL to copy data from one table is relatively
straightforward:

INSERT INTO schema_b.my_table
SELECT * FROM schema_a.my_table WHERE ...

What I am trying to figure out is this: if other relations have foreign keys
into the data I am moving, how do I also move the rows from those relations
and maintain FK integrity?

The tables are set up to use BIGSERIAL values for the id column, which is the
primary key, and the foreign keys reference these id columns.  Ideally each
schema would use its own serial for the ID values, but I'm open to clever
alternatives.
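
A minimal sketch of one approach, assuming a single child table and
illustrative column names (none of the identifiers below are from the thread):
copy the parent rows with their original ids, copy the child rows by joining
through the parent, then bump the target schema's sequences past the copied
ids so future inserts don't collide.

    BEGIN;

    INSERT INTO schema_b.my_table (id, payload)
    SELECT id, payload
    FROM   schema_a.my_table
    WHERE  account_id = 42;

    INSERT INTO schema_b.child_table (id, my_table_id, detail)
    SELECT c.id, c.my_table_id, c.detail
    FROM   schema_a.child_table c
    JOIN   schema_a.my_table   p ON p.id = c.my_table_id
    WHERE  p.account_id = 42;

    -- keep schema_b's BIGSERIAL sequences ahead of the copied ids
    SELECT setval('schema_b.my_table_id_seq',
                  (SELECT max(id) FROM schema_b.my_table));
    SELECT setval('schema_b.child_table_id_seq',
                  (SELECT max(id) FROM schema_b.child_table));

    COMMIT;

If each schema really must generate its own id values, the usual alternative
is an old-id-to-new-id mapping table (populated via INSERT ... RETURNING) that
the child-table copies join through.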

I am using PG 9.3.5 (Amazon RDS), but with an option to move to 9.4 should
the situation call for it.

thanks
--Cory


Re: [GENERAL] Moving Specific Data Across Schemas Including FKs

2015-04-23 Thread Cory Tucker
On Thu, Apr 23, 2015 at 10:27 AM Steve Atkins  wrote:

>
> On Apr 23, 2015, at 10:09 AM, Cory Tucker  wrote:
>
> > I have the need to move a specific set of data from one schema to
> another.  These schemas are on the same database instance and have all of
> the same relations defined.  The SQL to copy data from one table is
> relatively straightforward:
> >
> > INSERT INTO schema_b.my_table
> > SELECT * FROM schema_a.my_table WHERE ...
>
> Would ALTER TABLE ... SET SCHEMA do what you need? A schema is mostly just
> a name space, so there's no need to create new tables or copy data around.
>
> Cheers,
>   Steve
>

If I were moving all of the contents of these table(s) it might work, but I
only want to move a specific subset based on a WHERE clause.
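
For a partial move, one pattern is a data-modifying CTE that deletes from the
source and inserts into the target in a single statement.  This is only a
sketch: the column names are illustrative, the id values are carried over
unchanged, and it assumes referencing rows are moved first in the same
transaction or the foreign keys cascade/defer.

    WITH moved AS (
        DELETE FROM schema_a.my_table
        WHERE  account_id = 42        -- whatever the WHERE clause is
        RETURNING *
    )
    INSERT INTO schema_b.my_table
    SELECT * FROM moved;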

--Cory




[GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread Cory Tucker
[pg version 9.3 or 9.4]

Suppose I have a simple table:

create table data (
  my_value  TEXT NOT NULL
);
CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);


Now I would like to essentially do a GROUP BY to get a count of all the
values that are sufficiently similar.  I can do it using something like a
CROSS JOIN of the table with itself, but then I still get all the rows with
duplicate counts.

Is there a way to write a GROUP BY query that returns a single "my_value"
column and a count of how many other values are similar to it, without also
returning those similar values in the output?


Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread Cory Tucker
That produces pretty much the same results as the CROSS JOIN I was using
before.  Because each "my_value" in the table is different, if I group on
just that value then I will always get the full result set and a bunch of
essentially duplicated results.

Any other ideas/options?

On Thu, May 14, 2015 at 12:08 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

>
> On Thu, May 14, 2015 at 11:58 AM, Cory Tucker 
> wrote:
>
>> [pg version 9.3 or 9.4]
>>
>> Suppose I have a simple table:
>>
>> create table data (
>>   my_value  TEXT NOT NULL
>> );
>> CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);
>>
>>
>> Now I would like to essentially do group by to get a count of all the
>> values that are sufficiently similar.  I can do it using something like a
>> CROSS JOIN to join the table on itself, but then I still am getting all the
>> rows with duplicate counts.
>>
>> Is there a way to do a group by query and only return a single "my_value"
>> column and a count of the number of times other values are similar while
>> also not returning the included similar values in the output, too?
>>
>>
> ​Concept below - not bothering to lookup the functions/operators for
> pg_trgm:
>
> SELECT my_value_src, count(*)
> FROM (SELECT my_value AS my_value_src FROM data) src
> JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
> ON ( func(my_value_src, my_value_compareto) < # )
> GROUP BY my_value_src
>
> ​David J.
>
>
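
Filling in the pg_trgm pieces David left abstract, a sketch (not from the
thread) that assumes CREATE EXTENSION pg_trgm and uses the % similarity
operator so the GIN index can help:

    SELECT set_limit(0.6);   -- similarity threshold used by % (9.3/9.4 API)

    SELECT a.my_value,
           count(*) AS similar_count
    FROM   data a
    JOIN   data b
      ON   a.my_value % b.my_value      -- trigram-similar above the threshold
     AND   a.my_value <> b.my_value     -- don't count a row against itself
    GROUP  BY a.my_value
    ORDER  BY similar_count DESC;

Note that this still yields one row per distinct my_value; collapsing clusters
of mutually similar values down to a single representative is a different
problem that pg_trgm alone does not solve.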


[GENERAL] Most effective settings for deleting lots of data?

2015-08-26 Thread Cory Tucker
Hi, I am using postgres 9.3 and am preparing to migrate to 9.4.  As part of
the migration, I was hoping to also delete a bunch of data that is no
longer needed (100M+ rows across several tables).

I can fairly trivially delete the data by doing a simple statement like
this:

DELETE FROM account WHERE id = 1234;

All the foreign keys are set up with "ON DELETE CASCADE", so this will
effectively drop the entire graph, which is what I want.  My desire is to
do this as quickly as possible.  The first thing I have done is to disable
autovacuum on all of the largest tables so that the entire system doesn't
lock up, like so:

ALTER TABLE my_data SET (autovacuum_enabled = false,
toast.autovacuum_enabled = false);

I was wondering what other settings might best be applied?  For instance,
I was thinking that "work_mem" could be bumped up to some ridiculously large
value if needed.  I have the benefit of exclusive access to the database at
the time this operation needs to happen, so most options are on the table.

What settings would you recommend?  Also, it just occurred to me that I
should drop or disable all indexes (they will be recreated later anyway) so
that they are not updated during the deletes.
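
A rough sketch of what such a maintenance session could look like, assuming
exclusive access; the index and table names are illustrative, and the work_mem
bump mainly helps sorts/hashes in the cascaded deletes rather than the DELETE
itself:

    BEGIN;
    SET LOCAL work_mem = '1GB';

    -- drop non-essential indexes, but keep the indexes on the child tables'
    -- FK columns, or every cascaded delete turns into a sequential scan
    DROP INDEX IF EXISTS idx_my_data_record;

    DELETE FROM account WHERE id = 1234;
    COMMIT;

    -- rebuild and clean up afterwards
    SET maintenance_work_mem = '2GB';          -- speeds up the index rebuilds
    CREATE INDEX idx_my_data_record ON my_data (record_id);
    VACUUM ANALYZE account;
    VACUUM ANALYZE my_data;
    ALTER TABLE my_data SET (autovacuum_enabled = true,
                             toast.autovacuum_enabled = true);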

thanks in advance,
--Cory


[GENERAL] Efficiently Triggering Autovacuum Analyze?

2015-12-30 Thread Cory Tucker
We have a performance problem accessing one of our tables, I think because
the statistics are out of date.  The table is fairly large, on the order of
100M rows or so.
The general structure of the table is as follows:

   Column   |  Type  |                    Modifiers
------------+--------+---------------------------------------------------
 id         | bigint | not null default nextval('foo_id_seq'::regclass)
 record_id  | text   |
 account_id | bigint | not null

With indexes:
    "foo_pkey" PRIMARY KEY, btree (id)
    "uq_account_id_record_id" UNIQUE CONSTRAINT, btree (account_id, record_id)


This table is almost always queried using a combination of (account_id,
record_id) and is generally pretty fast.  However, under certain loads, the
query becomes slower and slower as time goes on.  The workload that causes
this to happen is when data for a new account_id is being inserted into the
table.  This will happen in rapid succession and may insert millions of
rows over the course of several hours.

The pattern I notice when this happens is that the CPU on the DB will be
pegged much higher than usual, and the query to look up records for the
(account_id, record_id) combo will steadily rise from <1 ms to more than 2
or 3 seconds over time.

The fix I have employed to restore the speed of the query after I notice it
is happening is to manually issue a VACUUM ANALYZE on the table.  After the
analyze is done, the query returns to its normal speed.

I am looking for suggestions for how to tune, or perhaps automatically
detect this pattern, so that I don't have to manually intervene whenever
this happens.

Here are my autovacuum settings:

                name                 | setting | unit
-------------------------------------+---------+------
 autovacuum                          | on      |
 autovacuum_analyze_scale_factor     | 0.05    |
 autovacuum_analyze_threshold        | 50      |
 autovacuum_freeze_max_age           | 2       |
 autovacuum_max_workers              | 3       |
 autovacuum_multixact_freeze_max_age | 4       |
 autovacuum_naptime                  | 30      | s
 autovacuum_vacuum_cost_delay        | 20      | ms
 autovacuum_vacuum_cost_limit        | -1      |
 autovacuum_vacuum_scale_factor      | 0.1     |
 autovacuum_vacuum_threshold         | 50      |
 autovacuum_work_mem                 | -1      | kB

We're using 9.4.4 (RDS)
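
If the slowdown really is stale statistics after bulk inserts, one common
lever (a sketch only; the threshold is illustrative, and the table name "foo"
is assumed from the foo_pkey/foo_id_seq names above) is a per-table override
so autoanalyze triggers on an absolute row count instead of a fraction of a
~100M-row table:

    ALTER TABLE foo SET (
        autovacuum_analyze_scale_factor = 0,       -- ignore table size
        autovacuum_analyze_threshold    = 100000   -- analyze after ~100k changed rows
    );

Inserted rows count toward the analyze threshold, so this lets the statistics
catch up during the multi-hour load rather than only after a manual ANALYZE.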


Re: [GENERAL] Efficiently Triggering Autovacuum Analyze?

2015-12-30 Thread Cory Tucker
On Wed, Dec 30, 2015 at 11:20 AM Tom Lane  wrote:

> Cory Tucker  writes:
> > This table is almost always queried using a combination of (account_id,
> > record_id) and is generally pretty fast.  However, under certain loads,
> the
> > query becomes slower and slower as time goes on.  The workload that
> causes
> > this to happen is when data for a new account_id is being inserted into
> the
> > table.  This will happen in rapid succession and may insert millions of
> > rows over the course of several hours.
>
> Are those insertions happening in one enormous transaction, or even just
> a few very large ones?
>

No, one transaction per row insert.


>
> > The pattern that I notice when this happens is that the CPU on DB will be
> > pegged much higher than usual, and the query to lookup records for the
> > (account_id, record_id) combo will steadily rise from <1ms to more then 2
> > or 3 seconds over time.
>
> I'm suspicious that this is not autovacuum's fault but reflects the cost
> of checking uncommitted tuples to see if they've become committed yet.
> If so, there may be little you can do about it except break the insertion
> into smaller transactions ... which might or might not be all right from
> a data consistency standpoint.
>
> regards, tom lane
>


Re: [GENERAL] Blocked updates and background writer performance

2016-01-13 Thread Cory Tucker
On Wed, Jan 13, 2016 at 9:48 AM Vick Khera  wrote:

> That was my intuition too. Not enough I/O available from the hardware for
> the workload requested.
>
> As recommended, log your checkpoints and try tuning them to spread the
> load.
>

Thanks guys, will turn on checkpoint logging and try to sniff this out
further.
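
For reference, the usual knobs here (a sketch; values are illustrative, and on
RDS these are set through the DB parameter group rather than postgresql.conf):

    # postgresql.conf (or RDS parameter group)
    log_checkpoints = on                   # log each checkpoint: buffers written, write/sync time
    checkpoint_completion_target = 0.9     # spread checkpoint writes over more of the interval
    checkpoint_timeout = 15min             # fewer, smoother checkpoints (default 5min)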


[GENERAL] Query Using Massive Temp Space

2017-11-19 Thread Cory Tucker
Hello,

I have a query that is using a tremendous amount of temp disk space given
the overall size of the dataset.  I'd love for someone to try to explain
what PG is doing and why it's using so much space for the query.

First off, the system is PG 9.6 on Ubuntu with 4 cores and 28 GB of RAM.
The query in question is a fairly large join of several tables (6)
including some aggregations.  The overall dataset size of the 6 tables in
question is about 20GB and the largest table is about 15M rows.  The query
is essentially a dump of *most* of the data from these tables joined
together to be used in another system.

When the query runs, it uses an aggressive amount of temp space on the volume
over the course of many hours, until it reaches about 95% capacity and then
tapers off.  It never completes, though.  The temp space it uses is around
1.5TB out of a 2TB volume.  Again, the *total* size of the relations in
question is only 20GB.

Can anyone explain how the query could possibly use so much temp space?

Query and query plan are attached.  Please let me know if any more info is
needed.
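
Two generic ways to see where the temp space goes (a diagnostic sketch, not
from the attached plan):

    -- log every temp file created, with its size (superuser / parameter group)
    SET log_temp_files = 0;

    -- cumulative temp-file usage per database since the stats were last reset
    SELECT datname,
           temp_files,
           pg_size_pretty(temp_bytes) AS temp_written
    FROM   pg_stat_database
    ORDER  BY temp_bytes DESC;

Temp blowups of this size usually show up in EXPLAIN ANALYZE as a Sort or
Hash node whose actual row count is orders of magnitude above the estimate,
often because a join key is far less selective than the planner thinks.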

thanks
--Cory


[Attachments: query, query_plan]



[GENERAL] Determine all listeners subscribed to notifications and what channels

2015-02-17 Thread Cory Tucker
I'm interested in trying to figure out which channels have been subscribed
to (using LISTEN).  From what I could tell via a little Googling, there
used to be a table named pg_catalog.pg_listener that contained all this
information, but that seems to have disappeared somewhere in the 9.x
release (I'm using 9.3.x).

Is there a way to find out which channels have listeners?
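
For what it's worth, pg_listener went away in 9.0 when LISTEN/NOTIFY moved to
an in-memory queue.  What remains is per-session only: pg_listening_channels()
reports the channels the current backend is listening on, e.g.

    LISTEN my_channel;                 -- channel name is illustrative
    SELECT pg_listening_channels();    -- channels for the *current* session only

There is no built-in catalog or view that shows which channels other sessions
are listening on.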

thanks
--Cory