[GENERAL] Difference between array column type and separate table

2009-05-02 Thread Mike Christensen
Let's say you have a table called Threads, and each thread can have zero or
more "tags" associated with it.  A tag is just a byte which maps to some
enum somewhere.
There's two ways I can think of to do this.  The first would be to have:

create table Threads (
  Id uuid not null,
  Posted timestamp not null,
  Subject varchar(255) not null,
  Replies int4 not null,
  PosterId uuid not null,
  primary key (Id)
);

and a separate table for each tag on a thread:

create table ThreadTags (
  Id uuid not null,
  Tag int2 not null,
  ThreadId uuid not null,
  primary key (Id)
);

I can query for all threads with a certain tag using a join:

select * from Threads T
inner join ThreadTags tag ON tag.ThreadId = T.Id AND tag.Tag = 5;

This should work fine.  However, recently I was digging through Postgres
manuals and found that you can store arrays of stuff in a column.  Using
this, I could completely get rid of ThreadTags and have a table like this:

create table Threads (
  Id uuid not null,
  Posted timestamp not null,
  Subject varchar(255) not null,
  Replies int4 not null,
  PosterId uuid not null,
  Tags int2[],
  primary key (Id)
);

and then find threads using the ANY function:

select * from Threads where 5 = ANY (Tags);

To me this seems cleaner, but I'm wondering about performance.  If I had
millions of threads, is a JOIN going to be faster?  I guess what I'm asking
about is the underlying implementation of ANY.  Is it doing
a sequential search?  Can I index Tags and will ANY() then use that index?
 Any other opinions on what option is better?

One thing about the array approach is I'm using NHibernate which doesn't
really seem to want to support Postgres arrays without a whole bunch of
custom driver code and IUserTypes and junk, so I'd like to make sure this
architecture is best before I commit to it.  Thanks!!

Mike


Re: [GENERAL] Re: Mapping output from a SEQUENCE into something non-repeating/colliding but random-looking?

2009-05-02 Thread Daniel Verite

Craig Ringer wrote:


What I'm looking for is a function that, given an input within a
constrained range (say, a 32 bit integer) produces a different
output within the same range. For any given input, the output
should be the same each time, and for any given output
there should only be one input that results in that output.


That's a permutation, as used in symmetric ciphering. A proven way to 
build one is to use a Feistel network:

http://en.wikipedia.org/wiki/Feistel_cipher
In principle, the function used to encode the blocks uses a cipher key, 
but a pseudo-randomizing of the input is good enough when you're not 
interested in making it crypto-secure.

Here is a plpgsql implementation:

CREATE OR REPLACE FUNCTION pseudo_encrypt(value int) returns bigint AS 
$$

DECLARE
l1 int;
l2 int;
r1 int;
r2 int;
i int:=0;
BEGIN
 l1:= (value >> 16) & 65535;
 r1:= value&65535;
 WHILE i<3 LOOP
   l2:=r1;
   r2:=l1 # ((((1366.0*r1+150889)%714025)/714025.0)*32767)::int;
   l1:=l2;
   r1:=r2;
   i:=i+1;
 END LOOP;
 return ((l1::bigint<<16) + r1);
END;
$$ LANGUAGE plpgsql strict immutable;

Note that it returns a bigint because we don't have unsigned integers 
in PG. If you're OK with getting negative values, the return type can 
be changed to int.
Otherwise, if you need a positive result that fits in 32 bits, it's 
possible to tweak the code to use 15-bit blocks instead of 16, but 
then the input will have to be less than 2^30.
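If it helps to sanity-check the function, here is a Python sketch of the same three-round Feistel network (my translation, not part of the original; note that Python's int() truncates where PostgreSQL's ::int cast rounds, though the network is a bijection either way):

```python
def pseudo_encrypt(value: int) -> int:
    """Python mirror of the plpgsql pseudo_encrypt above: a 3-round
    Feistel network over the two 16-bit halves of a 32-bit input."""
    l1 = (value >> 16) & 65535
    r1 = value & 65535
    for _ in range(3):
        # Round function: a cheap pseudo-random mix of the right half.
        f = int((((1366.0 * r1 + 150889) % 714025) / 714025.0) * 32767)
        l1, r1 = r1, l1 ^ f
    return (l1 << 16) + r1

# A Feistel network is invertible by construction, so distinct inputs
# always map to distinct outputs:
assert len({pseudo_encrypt(i) for i in range(10000)}) == 10000
```

The equivalent check against the plpgsql version, SELECT count(DISTINCT pseudo_encrypt(i)) FROM generate_series(1, 10000) i; should likewise return 10000.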


Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] Difference between array column type and separate table

2009-05-02 Thread Alban Hertroys

On May 2, 2009, at 9:33 AM, Mike Christensen wrote:

Using this, I could completely get rid of ThreadTags and have a  
table like this:


create table Threads (
  Id uuid not null,
   Posted timestamp not null,
   Subject varchar(255) not null,
   Replies int4 not null,
   PosterId uuid not null,
   Tags int2[],
   primary key (Id)
);

and then find threads using the ANY function:

select * from Threads where 5 = ANY (Tags);

To me this seems cleaner, but I'm wondering about performance.  If I  
had millions of threads, is a JOIN going to be faster?  I guess what  
I'm asking about is the underlying implementation of ANY.  Is it  
doing a sequential search?  Can I index Tags and will ANY() then use  
that index?  Any other opinions on what option is better?


If you modify the array the entire array needs to be rewritten. I  
don't think you'd want that with millions of threads in it. I don't  
think array values are indexable either. So while they're probably  
faster to query for small amounts of threads, the join is likely  
faster to query for large amounts (provided they're indexed properly,  
of course).


If you want to be sure, play around with explain analyse with both  
implementations.


Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.







Re: [GENERAL] How to begin to debug FATAL: invalid frontend message type 77 error messages?

2009-05-02 Thread Thomas Pundt

Keaton Adams schrieb:

Any ideas on how to debug these types of error messages?

Apr 30 01:36:02 mxlqa401 postgres[23600]: [3-1] FATAL:  invalid frontend 
message type 77
Apr 30 01:36:02 mxlqa401 postgres[23601]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:02 mxlqa401 postgres[23602]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:13 mxlqa401 postgres[23631]: [3-1] FATAL:  invalid frontend 
message type 77
Apr 30 01:36:13 mxlqa401 postgres[23632]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:13 mxlqa401 postgres[23633]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:24 mxlqa401 postgres[23664]: [3-1] FATAL:  invalid frontend 
message type 77
Apr 30 01:36:24 mxlqa401 postgres[23666]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:24 mxlqa401 postgres[23665]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:35 mxlqa401 postgres[23696]: [3-1] FATAL:  invalid frontend 
message type 77
Apr 30 01:36:35 mxlqa401 postgres[23698]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:35 mxlqa401 postgres[23697]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:46 mxlqa401 postgres[23728]: [3-1] FATAL:  invalid frontend 
message type 77
Apr 30 01:36:46 mxlqa401 postgres[23730]: [3-1] LOG:  unexpected EOF on client 
connection
Apr 30 01:36:46 mxlqa401 postgres[23729]: [3-1] LOG:  unexpected EOF on client 
connection


Maybe some monitoring software that doesn't speak
the Postgres protocol? The connections seem to occur
quite regularly, every 11 seconds...

You might want to configure the log output format to include
the IP address/host name of the creator of these messages
(log_line_prefix in postgresql.conf, "%h"). Or use a packet
sniffer like wireshark.
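For example, something along these lines in postgresql.conf (a sketch; %t is the log time, %p the backend PID, %h the client host name/IP):

```
log_line_prefix = '%t [%p] %h '
```

After a reload (pg_ctl reload), each log line will show which host the offending connections come from.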

Ciao,
Thomas



Re: [GENERAL] 08P01: unexpected EOF on client connection

2009-05-02 Thread Jasen Betts
On 2009-05-01, Tomas Vondra  wrote:
> Hi,
>
> I've run into some strange troubles with a quite simple web application 
> that stores data in a PostgreSQL database. It is a simple image gallery, 
> displaying multiple thumbnails (stored in a bytea column) on a single 
> page generated by PHP. The problem is some of the images are not displayed.
>
> $conn = pg_connect(...);
> $res = pg_query("SELECT mime, thumbnail_data FROM images WHERE filename 
>= ");
> $row = pg_fetch_assoc($row);
> header('Content-Type: ' . $row['mime']);
> echo pg_unescape_bytea($row['thumbnail_data']);

This is cut-n-paste from a working site (pg 7.4, PHP 4.3.9)

 $res=pg_query("select * from images where id='$id' ;");
 $row=pg_fetch_assoc($res);
 header("content-type: $row[type]");
 echo pg_unescape_bytea($row['data']);

the date on the file is NOV 19 2006. 
wasn't I a crap programmer back then :)





Re: [GENERAL] Possible to prevent transaction abort?

2009-05-02 Thread Thomas Kellerer

Adam B wrote on 02.05.2009 00:48:
Strange indeed.  Perhaps there's some background stuff happening that 
messes with the results (auto VACUUM?).


In my mind, however, it makes sense that it would take longer: 2 extra 
operations against the server (save&release).


Typical case of "problem sits between keyboard and chair".

My import program had an additional parameter which was needed to enable the 
savepoint. If that is set, the results are comparable to your test program (and 
reproducible). So my fast results were not using savepoints.


Sorry for the confusion.

Thomas




Re: [GENERAL] Tracking down a deadlock

2009-05-02 Thread Bill Moseley
Not getting any nibbles, so allow me to try a short question:

If I have a deadlock situation (that will be reported as such by
Postgresql once the deadlock_timeout passes), does pg_stat_activity
show the queries that are blocking each other?

I'm wondering if I'm misinterpreting what I'm seeing below.




On Thu, Apr 30, 2009 at 10:30:26AM -0700, Bill Moseley wrote:
> 
> I need a bit of help understanding what might be causing a deadlock.
> 
> To duplicate the problem I'm running a test script that forks two
> child processes.  Each child runs the same transaction and thus the
> order of execution is exactly the same.  (i.e. not like the typical
> deadlock where the order of updates might be reversed between two
> sessions.)
> 
> The transaction inserts a new document into a document management
> system.  The transaction does a number of selects and inserts.  At the
> end of the transaction they both try and update the same row in the
> "account" table.
> 
> 
> It does not happen every time I run my test script -- but if I run it
> enough I get a deadlock.  If I fork more child process I can make it
> happen more often.  So, it does seem like a timing issue.
> 
> 
> No explicit LOCK or SELECT FOR UPDATE is used in the transaction.
> I'm running in the default "read committed" isolation level.
> The initial problem was reported on PostgreSQL 8.3.5, but
> I'm now testing on PostgreSQL 8.2.9.
> 
> 
> 
> I've set my deadlock_timeout high so I can review the locks.
> I see these entires:
> 
> 
> select * from pg_locks where not granted;
>    locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction |  pid  |     mode      | granted
> ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+---------------+---------
>  transactionid |          |          |      |       |      18410123 |         |       |          |    18410135 | 13420 | ShareLock     | f
>  tuple         |  2474484 |  2474485 |   30 |    11 |               |         |       |          |    18410123 | 13419 | ExclusiveLock | f
> (2 rows)
> 
> select * from pg_locks where locktype='tuple';
>  locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction |  pid  |     mode      | granted
> ----------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+-------+---------------+---------
>  tuple    |  2474484 |  2474485 |   30 |    11 |               |         |       |          |    18410135 | 13420 | ExclusiveLock | t
>  tuple    |  2474484 |  2474485 |   30 |    11 |               |         |       |          |    18410123 | 13419 | ExclusiveLock | f
> (2 rows)
> 
> 
> And pg_stat_activity shows two of the exact same queries in "waiting"
> state.  The "current_query" is just:
> 
> UPDATE account set foo = 123 where id = $1
> 
> and $1 is indeed the same for both.
> 
> 
> If I comment out that update to the "account" table from the
> transaction I never get a deadlock.
> 
> 
> 
> Maybe I'm missing something, but that by itself doesn't seem like a
> deadlock situation.
> 
> The "account" table does have a number of constraints, and one looks
> like:
> 
> CHECK( ( foo + bar ) <= 0 );
> 
> Could those be responsible?  For a test I dropped all the constraints
> (except foreign keys) and I'm still getting a deadlock.
> 
> In general, do the constraints need to be deferrable and then defer
> constraints at the start of the transaction?
> 
> What else can I do to debug?
> 

-- 
Bill Moseley.
mose...@hank.org
Sent from my iMutt



Re: [GENERAL] Tracking down a deadlock

2009-05-02 Thread Tom Lane
Bill Moseley  writes:
> Not getting any nibbles, so allow me to try a short question:
> If I have a deadlock situation (that will be reported as such by
> Postgresql once the deadlock_timeout passes), does pg_stat_activity
> show the queries that are blocking each other?

In 8.2 or later it should do so; in prior versions there could be some
lag involved.  Another thing to keep in mind is that backends like to
cache copies of the pg_stat_activity view --- if you are watching it
to see what is happening, your view is only current as of the start
of your current transaction.  Or you can do pg_stat_clear_snapshot()
to force collection of new info.

regards, tom lane



Re: [GENERAL] Two Questions Re: Warm Backup

2009-05-02 Thread Daniel Verite

 Terry Lee Tucker writes


Q1: Can we set up a scenario where there is more that one
warm standby?


Yes. But you'll have to consider what you want to happen when one 
standby is correctly receiving the WAL files and another is not, 
because the archive_command has to either fail or succeed, it can't 
return "half-done".
You may need to provide your own layer that retries pushing local copies 
of WAL files to the remote nodes and deletes them only when they've 
been received by every standby server.



Q2: Am I correct in assuming that ALL changes to any of the
production schema will be written to the warm standby?
For example, if I drop a constraint in production I assume
the same will occur on the warm standby. If I create and
drop a table in production, I assume it will occur on the warm
standby.


It will, all DDL is replicated.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org 





Re: [GENERAL] Difference between array column type and separate table

2009-05-02 Thread Mike Christensen
Yeah I talked with some other SQL guru friends of mine and they all agree
the separate table is the way to go for a number of reasons, so that's what
I'll stick with.  It was just one of those things where you see a new
feature and try to find an excuse to try it out 
Thanks!
Mike

On Sat, May 2, 2009 at 3:14 AM, Alban Hertroys <
dal...@solfertje.student.utwente.nl> wrote:

> On May 2, 2009, at 9:33 AM, Mike Christensen wrote:
>
>  Using this, I could completely get rid of ThreadTags and have a table like
>> this:
>>
>> create table Threads (
>>  Id uuid not null,
>>   Posted timestamp not null,
>>   Subject varchar(255) not null,
>>   Replies int4 not null,
>>   PosterId uuid not null,
>>   Tags int2[],
>>   primary key (Id)
>> );
>>
>> and then find threads using the ANY function:
>>
>> select * from Threads where 5 = ANY (Tags);
>>
>> To me this seems cleaner, but I'm wondering about performance.  If I had
>> millions of threads, is a JOIN going to be faster?  I guess what I'm asking
>> about is the underlying implementation of ANY.  Is it doing a sequential
>> search?  Can I index Tags and will ANY() then use that index?  Any other
>> opinions on what option is better?
>>
>
> If you modify the array the entire array needs to be rewritten. I don't
> think you'd want that with millions of threads in it. I don't think array
> values are indexable either. So while they're probably faster to query for
> small amounts of threads, the join is likely faster to query for large
> amounts (provided they're indexed properly, of course).
>
> If you want to be sure, play around with explain analyse with both
> implementations.
>
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.
>
>
>
>
>


Re: [GENERAL] Two Questions Re: Warm Backup

2009-05-02 Thread Terry Lee Tucker
On Saturday 02 May 2009 13:08, Daniel Verite wrote:
>Terry Lee Tucker writes
>
> > Q1: Can we set up a scenario where there is more that one
> > warm standby?
>
> Yes. But you'll have to consider what you want to happen when one
> standby is correctly receiving the WAL files and another is not,
> because the archive_command has to either fail or succeed, it can't
> return "half-done".
> You may need to provide your own layer that retries pushing local copies
> of WAL files to the remote nodes and deletes them only when they've
> been received by every standby server.
>
> > Q2: Am I correct in assuming that ALL changes to any of the
> > production schema will be written to the warm standby?
> > For example, if I drop a constraint in production I assume
> > the same will occur on the warm standby. If I create and
> > drop a table in production, I assume it will occur on the warm
> > standby.
>
> It will, all DDL is replicated.
>
> Best regards,

Daniel:

Thanks for the reply. I later read the answer to the first question in the 
docs. I expected the answer to number 2 to be affirmative, but just wanted to 
make sure.

Thanks for the help...
-- 
Terry Lee Tucker
Turbo's IT Manager
Turbo, division of OHL
2251 Jesse Jewell Pkwy
Gainesville, GA 30501
tel: (336) 372-6812 cell: (336) 404-6987
te...@turbocorp.com
www.turbocorp.com



Re: [GENERAL] Re: Mapping output from a SEQUENCE into something non-repeating/colliding but random-looking?

2009-05-02 Thread Erik Jones


On May 1, 2009, at 6:06 AM, Craig Ringer wrote:




What I'm looking for is a function that, given an input within a
constrained range (say, a 32 bit integer) produces a different output
within the same range. For any given input, the output should be the
same each time, and for any given output there should only be one  
input

that results in that output.


I think if you drop the idea of a repeatable mapping you may have some 
success with the Knuth (aka Fisher-Yates) shuffle algorithm:
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
Why does anything need to be repeatable when you only need to make 
sure that each number is only generated once?
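For reference, a short Python sketch of the shuffle Erik is referring to (illustrative only; the function name is mine):

```python
import random

def fisher_yates(seq):
    """Knuth / Fisher-Yates shuffle: walk the list from the end and
    swap each element with a uniformly chosen slot at or before it,
    which makes every permutation equally likely."""
    a = list(seq)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)  # randint is inclusive on both ends
        a[i], a[j] = a[j], a[i]
    return a
```

Handing out the shuffled values one by one then yields each number exactly once, at the cost of materializing and storing the whole range up front.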


Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k








[GENERAL] Re: Mapping output from a SEQUENCE into something non-repeating/colliding but random-looking?

2009-05-02 Thread Jasen Betts
On 2009-05-03, Erik Jones  wrote:

>> What I'm looking for is a function that, given an input within a
>> constrained range (say, a 32 bit integer) produces a different output
>> within the same range. For any given input, the output should be the
>> same each time, and for any given output there should only be one  
>> input
>> that results in that output.
>
> I think if you drop the idea of a repeatable mapping you may have some
> success with the Knuth (aka Fisher-Yates) shuffle algorithm:
> http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
> Why does anything need to be repeatable when you only need to make
> sure that each number is only generated once?

That means storing a long list of numbers and doing queries similar to
the following to get the next value for the sequence:

select id from idtable
  order by id
  limit 1
  offset floor(random() * (select count(*) from idtable))::int;

A random-looking 1:1 mapping is potentially much more efficient.




Re: [GENERAL] Re: Mapping output from a SEQUENCE into something non-repeating/colliding but random-looking?

2009-05-02 Thread Craig Ringer
Jasen Betts wrote:

> That means storing a long list of numbers and doing queries similar to
> the following to get the next value for the sequence:
> 
> select id from idtable
>   order by id
>   limit 1
>   offset floor(random() * (select count(*) from idtable))::int;
> 
> A random-looking 1:1 mapping is potentially much more efficient.

You'd probably be better off generating it with something like:

CREATE TABLE shuffled (n integer, s integer);
INSERT INTO shuffled SELECT n, NULL FROM generate_series(0, max_value) AS n;

SELECT shuffle(); -- sets `s' for each `n'

... then querying it with:

SELECT s FROM shuffled WHERE n = ;

... but you still have to generate, shuffle, and store a huge collection
of values.

--
Craig Ringer



Re: [GENERAL] Re: Mapping output from a SEQUENCE into something non-repeating/colliding but random-looking?

2009-05-02 Thread Craig Ringer
Daniel Verite wrote:

> That's a permutation, as used in symmetric ciphering. A proven way to
> build one is to use a Feistel network:

Thanks. That looks to be pretty much what I'm after.

> Here is a plpgqsl implementation:

Wow. I really appreciate that. I'll have to test it out and chuck it on
the wiki (if that's OK with you) for future use, alongside whatever I
end up using for my check digit generator and verifier.

Come to think of it, check digits are almost unnecessary; after all, in
a large numeric space the chances of any misheard, typo'd, etc value
happening to be another valid value is pretty minimal. A simple
mod-10-of-sum-of-digits would probably be quite sufficient just to help
the UI differentiate between "Whoops, that number was incorrectly
entered or misheard" and "Customer not found".
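That scheme might look something like this (a Python sketch; the names are mine, and it's deliberately not a standard scheme like Luhn):

```python
def check_digit(n: int) -> int:
    """Mod-10 of the sum of the decimal digits, as described above."""
    return sum(int(d) for d in str(abs(n))) % 10

def with_check_digit(n: int) -> int:
    """Append the check digit as a trailing decimal digit."""
    return n * 10 + check_digit(n)

def verify(candidate: int) -> bool:
    """True when the trailing digit matches the body's digit sum."""
    body, digit = divmod(candidate, 10)
    return check_digit(body) == digit
```

This catches any single mistyped digit (the digit sum changes by less than 10), but unlike Luhn it misses adjacent transpositions, since swapping two digits leaves the sum unchanged -- arguably fine for just telling "mistyped" apart from "customer not found".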

--
Craig Ringer
