[PERFORM] variable length - user defined types/storage place
Hi all,

In a large table (millions of rows) I am using a variable-length user-defined type to store relatively short field entries, i.e. the length of certain fields could be restricted to 2^16 or even 2^8 characters. Now I wonder whether it would be possible to save storage space by using a smaller length field, i.e. 2 bytes or, even better, 1 byte per entry instead of 4 bytes. Is there a way to customize the size of the length field, or does PostgreSQL already optimize this internally?

Thanks in advance,
Martin.
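For context, the length field in question is the per-datum length word that every type declared with a variable internal length carries. A minimal sketch of such a type declaration, assuming hypothetical input/output functions shorttext_in and shorttext_out have already been created:

    -- Hypothetical variable-length base type; shorttext_in / shorttext_out
    -- are placeholder C-level I/O functions that would have to exist first.
    CREATE TYPE shorttext (
        INPUT          = shorttext_in,
        OUTPUT         = shorttext_out,
        INTERNALLENGTH = VARIABLE   -- this is what makes each datum carry a length word
    );

The type definition itself offers no knob for shrinking that length word; it is fixed by the on-disk datum format rather than by the CREATE TYPE options.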
Re: [PERFORM] Working on huge RAM based datasets
Sorry for the late reply - I've been away. Merlin, I'd like to come back with a few more points!

> That's the whole point: memory is a limited resource. If pg is
> crawling, then the problem is simple: you need more memory.

My posting only relates to the scenario where RAM is not a limiting factor, a scenario which will become increasingly common over the next few years as 64-bit processors and OSs allow the exploitation of ever larger, ever cheaper RAM.

> The question is: is it postgresql's responsibility to manage that resource?

I think you are confusing the issue of RAM and address space. Any application can acquire a piece of address space for its own use; it is the responsibility of the application to do what it needs with that address space. I'm interested in how PG could do something better in its address space when it knows that it can fit all the data it operates on within that address space.

Though the OS is responsible for deciding whether that address space is RAM resident or not, in my scenario this is irrelevant, because there *will* be enough RAM for everything, and the OS will, in that scenario, allow all the address space to become RAM resident. I am not advocating undermining the OS in any way. It would be stupid to make PG take over the running of the hardware.

> Pg is a data management tool, not a memory management tool.

I'm not criticising PG. PG is actually a 'disk/memory' data management tool: it manages data which lives on disk, but it can only operate on that data in memory, and it goes to some lengths to fit bits of disk data into a defined piece of memory and push them back out again. At the moment, this model assumes that RAM is a scarce resource. The model still 'sort of' works when RAM is actually not scarce, because the OS effectively uses the extra RAM to make I/O *appear* quicker, and indeed a hint has been added to PG to tell it how much the OS is likely to be caching.

But the question is this: "If you wrote a DB from scratch with the assumption that *all* the data could fit in the address space allocated by the postmaster, and you were confident that the OS had enough RAM that you never suffered virtual-memory page misses, couldn't you make things go much faster?"

A more pertinent question is: "Could PG be extended with a flag which, when enabled, told it to operate on the assumption that it could fit all the disk data in RAM, and to implement the data-organisation optimisations that rely on the persistence of data in address space?"

> The same 'let's manage everything' argument also frequently gets brought up
> wrt file i/o, because people assume the o/s sucks at file management.

Well, I'm not saying this. I have substantial experience with high-performance file I/O through a filesystem, but if you are interested in high-speed I/O, a naive 'let the OS do everything' approach often isn't good enough. You, the application, have to be aware that the order and timing of I/O requests, along with the size of the I/O blocks you trigger, have a dramatic impact on the speed with which the data reaches your app, OS or no OS. Most high-speed storage still relies on spinning things containing data that can only be accessed in a certain way, and data movement is page-boundary sensitive. The OS may hide these details from you, but you, the application writer, have to understand the underlying reality if you want to optimise performance. I want to stress that at no point am I advocating *not* using the OS.

PG should do ALL I/O and memory allocation through the OS; otherwise you end up with a platform-specific product that is of little use. That given, there is still the opportunity for PG to operate far more efficiently in my high-memory scenario.

Wouldn't your backend processes like to have the entire database sitting ready in address space (RAM resident, of course!), indexes all fully built? No tuple more than a few machine instructions away? Imagine the postmaster not having to frantically decide which bits of data to kick out of the workspace in order to keep the backends happy. Imagine the postmaster not having to build structures to keep track of the newly read-in blocks of data from 'disk' (or the OS cache). Is this not a compelling scenario?

> At some point, hard disks will be replaced by solid state memory
> technologies...

This is irrelevant to my scenario, though solid-state disks would allow write speeds to improve, which would add to the gains I am fishing for here.

> do you really want to recode your memory manager when
> this happens because all your old assumptions are no longer correct?

My scenario assumes nothing about how the data is stored, but you are right to flag the problems that arise when original assumptions about hardware become incorrect. For example, PG assumes that RAM is a rare resource, and it assumes the postmaster cannot fit the entire database in a single address space. *Th
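The caching hint referred to above is presumably the effective_cache_size planner setting; a minimal illustration, with a made-up value (historically this setting was expressed in 8 kB disk pages):

    -- Planner hint for how much of the database the OS is likely to be caching.
    -- Illustrative value only: 100000 pages * 8 kB is roughly 800 MB of OS cache.
    SET effective_cache_size = 100000;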
[PERFORM] Index type
Hello,

I have a query like:

    SELECT ... WHERE x <= A <= y AND t <= B <= u AND z <= C <= w AND ...

Five columns are in BETWEEN clauses. What is the best index I could use? If I create a btree index on all the columns (A, B, C, ...), here is what EXPLAIN ANALYZE gives me:

    Index Scan using all_ind on test2  (cost=0.00..4.51 rows=1 width=24) (actual time=0.000..0.000 rows=5 loops=1)
      Index Cond: ((a >= '2004-07-20 23:50:50'::timestamp without time zone) AND (a <= '2004-07-21 23:50:50'::timestamp without time zone) AND (b >= '2004-07-20 23:50:50'::timestamp without time zone) AND (b <= '2004-07-21 23:50:50'::timestamp without time zone) AND (c >= '2004-07-20 23:50:50'::timestamp without time zone) AND (c <= '2004-07-21 23:50:50'::timestamp without time zone))

Is such a search really optimal? I remember we used k-d trees for geometric data with independent coordinates. Is a btree on multiple columns the same kind of thing, I wonder?

-- 
Best regards,
Ilia  mailto:[EMAIL PROTECTED]
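For reference, the plan above implies a multicolumn btree roughly like this (table and column names taken from the EXPLAIN output; the remaining columns would simply be appended):

    -- Multicolumn btree matching the plan shown above.
    CREATE INDEX all_ind ON test2 (a, b, c);

Roughly speaking, only the range on the leading column (a) bounds how much of the btree is scanned; the conditions on the later columns are checked within that range, so a multicolumn btree is not equivalent to a k-d tree over independent coordinates.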
[PERFORM] arrays and indexes
Hi all -

I've got a schema I'm working on modifying, and I need some help getting the best performance out of it. The original schema has a many-to-many linkage between a couple of tables, using a two-column linkage table. This is used to represent groups of people and their relationship to an object (authors, copyright holders, maintainers). This worked fine and, with the right indexes, is quite zippy. Approximate schema:

    table content (
        contentid serial,
        name text,
        <...>
        authorgroupid int,
        cpholdergroupid int,
        maintgroupid int)

    table groups (
        personid text,
        groupid int)

Note that neither groupid nor personid is unique.

Now the users want not just groups, but ordered lists. Well, that's just fine: we could do it with another column in the groups linkage table, and some additional logic in the middleware for detecting identical groups, but it occurred to me that PG's array types are just the ticket for ordered lists like this. So, by dropping arrays of personids (authors, copyright holders, maintainers, ...) into the content table, I can do everything I need.

Only one problem. Retrieving all the content for a particular person/role is fairly common. Queries of the form:

    SELECT * FROM content c JOIN groups g ON c.authorgroupid = g.groupid WHERE g.personid = 'ross';

work fine and use the index on groups.personid. In the new schema, the same thing is:

    SELECT * FROM content WHERE 42 = ANY (authors);

Works fine, but for the life of me I can't find nor figure out how to build an index that will be used to speed this along. Any ideas?

I'm using 7.4.3, BTW.

Ross

-- 
Ross Reedstrom, Ph.D.                       [EMAIL PROTECTED]
Research Scientist                          phone: 713-348-6166
The Connexions Project  http://cnx.rice.edu  fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
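For what it's worth, the "another column in the groups linkage table" route mentioned above might look something like this (the seq column and index name are made up for illustration):

    -- Normalized alternative: keep the linkage table, add an ordering column.
    CREATE TABLE groups (
        personid text,
        groupid  int,
        seq      int    -- position of this person within the group
    );
    CREATE INDEX groups_personid_idx ON groups (personid);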
Re: [PERFORM] arrays and indexes
"Ross J. Reedstrom" <[EMAIL PROTECTED]> writes: > In the new schema, the same thing is: > > SELECT * from content where 42 = ANY (authors); > > Works fine, but for the life of me I can't find nor figure out how to > build an index that will be used to speed this along. Any ideas? Well that's basically the problem with denormalized data like this. Have you resolved what you're going to do if two sessions try to add a user to the same group at the same time? Or how you'll go about removing a user from all his groups in one shot? Basically, if you denormalize in this fashion it becomes hard to use the groups as anything but single monolithic objects. Whereas normalized data can be queried and updated from other points of view like in the case you name above. Postgres does have a way to do what you ask, though. It involves GiST indexes and the operators from the contrib/intarray directory from the Postgres source. However I warn you in advance that this is fairly esoteric stuff and will take some time to get used to. And at least in my case I found the indexes didn't actually help much for my data sets, probably because they just weren't big enough to benefit. -- greg ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org