Re: [PERFORM] Any advantage to integer vs stored date w. timestamp

2007-03-07 Thread Richard Huxton

Zoolin Lin wrote:
> Hi,
> 
> I have a database with a huge amount of data, so I'm trying to make it
> as fast as possible and minimize space.
> 
> One thing I've done is join on a prepopulated date lookup table to
> prevent a bunch of rows with duplicate date columns. Without this I'd
> have about 2500 rows per hour with the exact same date w. timestamp
> in them.
> 
> My question is, with postgres do I really gain anything by this, or
> should I just use the date w. timestamp column on the primary table
> and ditch the join on the date_id table.
> 
> Primary table is all integers like:
> 
> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
> ---------------------------------------------------------------
> primary key is on date to num->6 columns


What types are num1->8?


> date_id lookup table:
> 
> This table is prepopulated with the date values that will be used.
> 
> date_id | date w timestamp
> 1       | 2007-2-15 Midnight
> 2       | 2007-2-15 1 am
> 3       | 2007-2-15 2 am
> ... etc for 24 hours each day


If you only want things accurate to an hour, you could lose the join and
just store it as an int: 2007021500, 2007021501 etc.

That should see you good to year 2100 or so.
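
For instance (just a sketch; both directions are one-liners with
to_char/to_timestamp):

    -- timestamp -> int of the form YYYYMMDDHH
    SELECT to_char(now(), 'YYYYMMDDHH24')::int AS date_key;

    -- int -> timestamp, for display or date arithmetic
    SELECT to_timestamp(2007021501::text, 'YYYYMMDDHH24');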

--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] compact flash disks?

2007-03-07 Thread Florian Weimer
* James Mansion:

> If I were to use a filesystem with noatime etc and little non-sql traffic,
> does the physical update pattern tend to have hot sectors that will tend to
> wear out CF?

Thanks to the FAT file system and its update pattern, most (if not
all) CF disks implement wear leveling nowadays.  I wouldn't worry
about the issue, unless your write rates are pretty high.

-- 
Florian Weimer<[EMAIL PROTECTED]>
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Any advantage to integer vs stored date w. timestamp

2007-03-07 Thread Zoolin Lin
thanks for your reply

> Primary table is all integers like:
> 
> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
> -
>  primary key is on date to num->6 columns

>>What types are num1->8?

They are all integer


> date_id | date w timestamp
> 1       | 2007-2-15 Midnight
> 2       | 2007-2-15 1 am
> 3       | 2007-2-15 2 am
> ... etc for 24 hours each day

>>If you only want things accurate to an hour, you could lose the join and
>>just store it as an int: 2007021500, 2007021501 etc.

Hmm, yeah, I could. I think with the amount of data in the db, though, it
behooves me to use one of the date types, even if via a lookup table.

So I guess I'm just not sure if I'm really gaining anything by using an
integer date_id column and doing a join on a date lookup table, vs. just
making it a date w. timestamp column and having duplicate dates in that
column.

I would imagine internally that the date w. timestamp is stored as perhaps
a time_t type plus some timezone information. I don't know if it takes that
much more space, or whether there's a significant performance penalty in
using it.
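
Maybe something like this would tell me; pg_column_size(), available since
8.1 I believe, reports the stored size directly (a quick sketch):

    -- on-disk bytes of a timestamp w. timezone vs. a plain int
    SELECT pg_column_size(now()) AS timestamptz_bytes,  -- 8
           pg_column_size(1::int) AS int4_bytes;        -- 4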

2,500 rows per hour, with duplicate date columns, seems like it could add up 
though.

thanks

Richard Huxton wrote:
Zoolin Lin wrote:
> Hi,
> 
> I have a database with a huge amount of data, so I'm trying to make it
> as fast as possible and minimize space.
> 
> One thing I've done is join on a prepopulated date lookup table to
> prevent a bunch of rows with duplicate date columns. Without this I'd
> have about 2500 rows per hour with the exact same date w. timestamp
> in them.
> 
> My question is, with postgres do I really gain anything by this, or
> should I just use the date w. timestamp column on the primary table
> and ditch the join on the date_id table.
> 
> Primary table is all integers like:
> 
> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
> -
>  primary key is on date to num->6 columns

What types are num1->8?

> date_id lookup table:
> 
> This table is prepopulated with the date values that will be used.
> 
> date_id | date w timestamp
> 1       | 2007-2-15 Midnight
> 2       | 2007-2-15 1 am
> 3       | 2007-2-15 2 am
> ... etc for 24 hours each day

If you only want things accurate to an hour, you could lose the join and
just store it as an int: 2007021500, 2007021501 etc.

That should see you good to year 2100 or so.

-- 
   Richard Huxton
   Archonet Ltd

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


 

Re: [PERFORM] Any advantage to integer vs stored date w. timestamp

2007-03-07 Thread Richard Huxton

Zoolin Lin wrote:
> thanks for your reply
>
>> Primary table is all integers like:
>>
>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
>> ---------------------------------------------------------------
>> primary key is on date to num->6 columns
>
>>> What types are num1->8?
>
> They are all integer


Hmm - not sure if you'd get any better packing if you could make some
int2 and put them next to each other. Need to test.
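
A quick way to test it (a sketch; throwaway tables, arbitrary row count):

    -- compare on-disk size of int4 vs int2 columns
    CREATE TABLE t_int4 (a int4, b int4, c int4, d int4);
    CREATE TABLE t_int2 (a int2, b int2, c int2, d int2);
    INSERT INTO t_int4 SELECT g, g, g, g FROM generate_series(1, 10000) g;
    INSERT INTO t_int2 SELECT g, g, g, g FROM generate_series(1, 10000) g;
    SELECT pg_relation_size('t_int4') AS int4_bytes,
           pg_relation_size('t_int2') AS int2_bytes;

Bear in mind each row carries a ~28-byte header, so the saving is real but
proportionally smaller than the column arithmetic alone suggests.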


>> date_id | date w timestamp
>> 1       | 2007-2-15 Midnight
>> 2       | 2007-2-15 1 am
>> 3       | 2007-2-15 2 am
>> ... etc for 24 hours each day
>
>>> If you only want things accurate to an hour, you could lose the
>>> join and just store it as an int: 2007021500, 2007021501 etc.
>
> Hmm, yeah, I could. I think with the amount of data in the db, though, it
> behooves me to use one of the date types, even if via a lookup table.


You can always create it as a custom ZLDate type. All it really needs to 
be is an int with a few casts.
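
A true custom type needs C input/output functions, but a pair of SQL helper
functions over a plain int gets much of the effect (a sketch; the names are
invented, and the conversion depends on your timezone setting, hence STABLE
rather than IMMUTABLE):

    -- view an int like 2007021501 as a timestamp, and back
    CREATE FUNCTION zldate_to_ts(int) RETURNS timestamp AS $$
      SELECT to_timestamp($1::text, 'YYYYMMDDHH24')::timestamp
    $$ LANGUAGE SQL STABLE;

    CREATE FUNCTION ts_to_zldate(timestamp) RETURNS int AS $$
      SELECT to_char($1, 'YYYYMMDDHH24')::int
    $$ LANGUAGE SQL STABLE;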



> So I guess I'm just not sure if I'm really gaining anything by using an
> integer date_id column and doing a join on a date lookup table, vs. just
> making it a date w. timestamp column and having duplicate dates in that
> column.
>
> I would imagine internally that the date w. timestamp is stored as perhaps
> a time_t type plus some timezone information. I don't know if it takes that
> much more space, or whether there's a significant performance penalty in
> using it.


It's a double or int64 I believe, so allow 8 bytes instead of 4 for your 
int.



> 2,500 rows per hour, with duplicate date columns, seems like it could
> add up though.


Well, let's see 2500*24*365 = 21,900,000 * 4 bytes extra = 83MB
additional storage over a year. Not sure it's worth worrying about.


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] compact flash disks?

2007-03-07 Thread Ron

At 05:18 PM 3/6/2007, James Mansion wrote:
> I see that one can now get compact flash to SATA connectors.
>
> If I were to use a filesystem with noatime etc and little non-sql traffic,
> does the physical update pattern tend to have hot sectors that will tend to
> wear out CF?

Most flash RAMs have controllers that make sure the pattern of writes
over time is uniform across the entire device.




> I'm wondering about a RAID5 with data on CF drives and RAID1 for the WAL on
> a fast SATA or SAS drive pair.  I'm thinking that this would tend to have
> good performance because the seek time for the data is very low, even if the
> actual write speed can be slower than state of the art.


WARNING: modern TOtL (top-of-the-line) flash RAMs are only good for ~1.2M
writes per memory cell, and that's the =good= ones.
Using flash RAM for write heavy applications like OLTP, or for WAL,
etc. can be very dangerous.

Flash write speeds also stink, being ~1/2 of flash's already low read speed.

Much better to use flash RAM for read heavy applications.
Even there you have to be careful that seek performance, not
throughput, is what is gating your day to day performance with those tables.


Got tables or indexes that are
a= too big to fit in RAM and
b= are write few, read many times and
c= whose pattern of access is large enough that it does not cache well?
=Those= are what should be put into flash RAMs
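
With 8.x tablespaces that placement is easy enough to express (a sketch; the
path and relation names are invented, and the directory must already exist
and be owned by the postgres user):

    -- put big, read-mostly relations on the flash device
    CREATE TABLESPACE flash LOCATION '/mnt/flash/pgdata';
    ALTER TABLE big_readmostly_table SET TABLESPACE flash;
    ALTER INDEX big_readmostly_idx SET TABLESPACE flash;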


Side Note:
In the long run, we are going to have to seriously rethink pg's use
of WAL as the way we implement MVCC as it becomes more and more of a
performance bottleneck.
We have WAL because Stonebraker made an assumption about the future
dominance of optical media that has turned out to be false.

...and it's been one of pg's big issues ever since.



> 2GB CF isn't so pricey any more.

Heck, =16= GB of flash only costs ~$300 US, and 128GB SSDs based on flash
RAM are due out this year.



Cheers,
Ron



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] compact flash disks?

2007-03-07 Thread Carlos Moreno




> Much better to use flash RAM for read heavy applications.
> Even there you have to be careful that seek performance, not
> throughput, is what is gating your day to day performance with those
> tables.



Isn't that precisely where Flash disks would have *the* big advantage?

I mean, access time is severely held down by the *mechanical* movement of
the heads to the right cylinder on the disk --- that's a brutally large
amount of time compared to anything else happening on the computer (well,
floppy disks aside, and things like trying-to-break-128-bit-encryption
aside :-)).

Or are these Flash disks so slow that they compare to the HD's latency
figures?


Carlos
--


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [PERFORM] compact flash disks?

2007-03-07 Thread Csaba Nagy
> Or are these Flash disks so slow that they compare to the HD's latency 
> figures?

On sequential read speed HDs outperform flash disks... only on random
access are the flash disks better. So if your application is a DW one,
you're very likely better off using HDs.

Cheers,
Csaba.



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] compact flash disks?

2007-03-07 Thread cedric
On Tuesday 6 March 2007 at 23:18, James Mansion wrote:
> I see that one can now get compact flash to SATA connectors.
I suggest you have a look at the Gigabyte i-RAM.
We use it on a website with high traffic, with a lot of success. Unfortunately,
I cannot provide any benchmarks...
>
> If I were to use a filesystem with noatime etc and little non-sql traffic,
> does the physical update pattern tend to have hot sectors that will tend to
> wear out CF?
>
> I'm wondering about a RAID5 with data on CF drives and RAID1 for the WAL on
> a fast SATA or SAS drive pair.  I'm thinking that this would tend to have
> good performance because the seek time for the data is very low, even if
> the actual write speed can be slower than state of the art.  2GB CF isn't
> so pricey any more.
>
> Just wondering.
>
> James

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM]

2007-03-07 Thread Tom Lane
Jeff Cole <[EMAIL PROTECTED]> writes:
> Hi Tom, you are correct, the distribution is uneven... In the 13k  
> symptom_reports rows, there are 105 distinct symptom_ids.   But the  
> first 8k symptom_reports rows only have 10 distinct symptom_ids.
> Could this cause the problem and would there be anything I could do  
> to address it?

Ah-hah, yeah, that explains it.  Is it worth your time to deliberately
randomize the order of the rows in symptom_reports?  It wasn't clear
whether this query is actually something you need to optimize.  You
might have other queries that benefit from the rows being in nonrandom
order, so I'm not entirely sure that this is a good thing to do ...

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] [EMAIL PROTECTED]: Progress on scaling of FreeBSD on 8 CPU systems]

2007-03-07 Thread Stefan Kaltenbrunner
Tom Lane wrote:
> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>> Arjen van der Meijden wrote:
>>> Stefan Kaltenbrunner wrote:
>>>> ouch - do I read that right that even after Tom's fixes for the
>>>> "regressions" in 8.2.0 we are still 30% slower than the -HEAD checkout
>>>> from the middle of the 8.2 development cycle?
>>> Yes, and although I tested about 17 different cvs-checkouts, Tom and I
>>> weren't really able to figure out where "it" happened. So it's a bit of a
>>> mystery why the performance is so much worse.
> 
>> double ouch - losing that much in performance without an idea WHY it
>> happened is really unfortunate :-(
> 
> Keep in mind that Arjen's test exercises some rather narrow scenarios;
> IIRC its performance is mostly determined by some complicated
> bitmap-indexscan cases.  So that "30% slower" bit certainly doesn't
> represent an across-the-board figure.  As best I can tell, the decisions
> the planner happened to be making in late June were peculiarly nicely
> suited to his test, but not so much for other cases.

understood - I was not trying to imply that we suffer a 30% performance
drop overall.
But still it means we know about a set of queries that we once could
handle faster than we can now ...

Stefan

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Any advantage to integer vs stored date w. timestamp

2007-03-07 Thread Zoolin Lin
Thank you for the reply
 
>> Primary table is all integers like:
>> 
>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
>> -
>>  primary key is on date to num->6 columns
> 
>>> What types are num1->8?
> 
> They are all integer

>>Hmm - not sure if you'd get any better packing if you could make some
>>int2 and put them next to each other. Need to test.

Thanks. I find virtually nothing on the int2 column type beyond a brief
mention here:
http://www.postgresql.org/docs/8.2/interactive/datatype-numeric.html#DATATYPE-INT

Could I prevail on you to expand on packing with int2 a bit more, or point me
in the right direction for documentation?

If there's some way I can pack multiple columns into one to save space, yet
still effectively query on them, even if it's a lot slower, that would be great.

My current scheme, though as normalized and summarized as I can make it, really
chews up a ton of space. It might even be chewing up more than the data files
I'm summarizing, I assume due to the indexing.

Regarding saving disk space, I saw someone mention doing a custom build and
changing

TOAST_TUPLE_THRESHOLD/TOAST_TUPLE_TARGET

so data is compressed sooner; it seems like that might be a viable option as
well.

http://www.thescripts.com/forum/thread422854.html

> 2,500 rows per hour, with duplicate date columns, seems like it could
> add up though.

>>>Well, let's see 2500*24*365 = 21,900,000 * 4 bytes extra = 83MB 
>>>additional storage over a year. Not sure it's worth worrying about.

Ahh yes probably better to make it a date w. timestamp column then.

Z






Richard Huxton wrote:
Zoolin Lin wrote:
> thanks for your reply
> 
>> Primary table is all integers like:
>> 
>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
>> -
>>  primary key is on date to num->6 columns
> 
>>> What types are num1->8?
> 
> They are all integer

Hmm - not sure if you'd get any better packing if you could make some
int2 and put them next to each other. Need to test.

>> date_id | date w timestamp
>> 1       | 2007-2-15 Midnight
>> 2       | 2007-2-15 1 am
>> 3       | 2007-2-15 2 am
>> ... etc for 24 hours each day
> 
>>> If you only want things accurate to an hour, you could lose the
>>> join and just store it as an int: 2007021500, 2007021501 etc.
> 
> Hmm, yeah, I could. I think with the amount of data in the db, though, it
> behooves me to use one of the date types, even if via a lookup table.

You can always create it as a custom ZLDate type. All it really needs to 
be is an int with a few casts.

> So I guess I'm just not sure if I'm really gaining anything by using an
> integer date_id column and doing a join on a date lookup table, vs. just
> making it a date w. timestamp column and having duplicate dates in that
> column.
> 
> I would imagine internally that the date w. timestamp is stored as perhaps
> a time_t type plus some timezone information. I don't know if it takes that
> much more space, or whether there's a significant performance penalty in
> using it.

It's a double or int64 I believe, so allow 8 bytes instead of 4 for your 
int.

> 2,500 rows per hour, with duplicate date columns, seems like it could
> add up though.

Well, let's see 2500*24*365 = 21,900,000 * 4 bytes extra = 83MB 
additional storage over a year. Not sure it's worth worrying about.

-- 
   Richard Huxton
   Archonet Ltd


 

Re: [PERFORM] Any advantage to integer vs stored date w. timestamp

2007-03-07 Thread Shane Ambler

Zoolin Lin wrote:
> Thank you for the reply
>
>>> Primary table is all integers like:
>>>
>>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num8
>>> ---------------------------------------------------------------
>>> primary key is on date to num->6 columns
>>
>>>> What types are num1->8?
>>
>> They are all integer
>
>>> Hmm - not sure if you'd get any better packing if you could make some
>>> int2 and put them next to each other. Need to test.
>
> Thanks. I find virtually nothing on the int2 column type beyond a brief
> mention here:
> http://www.postgresql.org/docs/8.2/interactive/datatype-numeric.html#DATATYPE-INT
>
> Could I prevail on you to expand on packing with int2 a bit more, or point
> me in the right direction for documentation?


int4 is the internal name for integer (4 bytes)
int2 is the internal name for smallint (2 bytes)

Try
SELECT format_type(oid, NULL) AS friendly, typname AS internal,
typlen AS length FROM pg_type WHERE typlen>0;

to see them all (a negative typlen means a variable size, usually an array
or bytea etc.)


I think by "packing" he was referring simply to fitting more values in the
same disk space by using int2 instead of int4 (half the storage space).




> If there's some way I can pack multiple columns into one to save space, yet
> still effectively query on them, even if it's a lot slower, that would be great.


Depending on the size of data you need to store you may be able to get
some benefit from "packing" multiple values into one column. But I'm not
sure you need to go that far. What range of numbers do you need to
store? If you don't need the full int4 range of values, then try a
smaller data type. If int2 is sufficient, then just change the columns
from integer to int2 and cut your storage in half. Easy gain.
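
For example (a sketch; substitute your own table and column names):

    -- halve per-column storage where the int2 range (-32768..32767) suffices
    ALTER TABLE primary_table ALTER COLUMN num1 TYPE int2;
    ALTER TABLE primary_table ALTER COLUMN num2 TYPE int2;
    -- ...and so on; this rewrites the table, so expect it to take a while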


The "packing" theory would fall under general programming algorithms not 
postgres specific.


Basically, let's say you have 4 values in the range 0-255 (1 byte each);
you can do something like

col1 = ((val1<<0) | (val2<<8) | (val3<<16) | (val4<<24))

This will put the four values into one 4 byte int.

Searching on, say, val1 and val3 only would then be something like

WHERE (col1 & ((255<<0) | (255<<16))) = ((val1<<0) | (val3<<16))

masking out the byte positions you aren't matching on. Guess you can see
what your queries will be looking like.

That's a simple example that should give you the general idea. In
practice you would only get gains if you have unusual length values, so
if you had value ranges from 0 to 1023 (10 bits each) then you could
pack 3 values into an int4 instead of using 3 int2 cols (that's 32 bits
for the int4 against 48 bits for the 3 int2 cols), and you would use <<10
and <<20 in the above example.
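
In SQL that looks something like this (a sketch with literal values; the
bitwise operators |, & and << all work on integers):

    -- pack three 10-bit values (0-1023 each) into one int4
    SELECT (5 | (700 << 10) | (1023 << 20)) AS packed;   -- 1073410053

    -- unpack the middle value again
    SELECT (1073410053 >> 10) & 1023 AS middle;          -- 700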


You may find it easier to define a function or two to automate this
instead of repeating it for each query. But with disks and RAM as cheap
as they are these days, this sort of packing is getting rarer (except
maybe on embedded systems with limited resources).
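
Something like this, say (a sketch; function names invented, 10-bit slots
as in the example above):

    -- pack/unpack helpers for three 10-bit slots in an int4
    CREATE FUNCTION pack10(int, int, int) RETURNS int AS $$
      SELECT $1 | ($2 << 10) | ($3 << 20)
    $$ LANGUAGE SQL IMMUTABLE;

    CREATE FUNCTION unpack10(int, int) RETURNS int AS $$
      SELECT ($1 >> (10 * $2)) & 1023   -- second argument is the slot: 0, 1 or 2
    $$ LANGUAGE SQL IMMUTABLE;

    -- e.g.  SELECT pack10(5, 700, 1023);  ...  WHERE unpack10(col1, 1) = 700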



> My current scheme, though as normalized and summarized as I can make it,
> really chews up a ton of space. It might even be chewing up more than the
> data files I'm summarizing, I assume due to the indexing.




--

Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] compact flash disks?

2007-03-07 Thread James Mansion
>WARNING:  modern TOtL flash RAMs are only good for ~1.2M writes per
>memory cell, and that's the =good= ones.
>Using flash RAM for write heavy applications like OLTP, or for WAL,
>etc. can be very dangerous.

Well, that's why I suggested that the WAL would stream to a hard disk
array, where the large IO sequential write speed will be helpful.

Whether OLTP is a problem will presumably depend on the frequency of updates
and vacuums to each physical cluster of rows in a disk block.

Most rows in a trading application will have quite a long lifetime, and be
updated relatively few times (even where we're writing fixings info into
trades).

>Flash write speeds also stink, being ~1/2 of flash's already low read speed.

Sure - but it may still be an effective tradeoff where the limiting factor
would otherwise be seek time.

>Much better to use flash RAM for read heavy applications.

Why?  I can get a 'PC' server with 128GB of RAM quite easily now,
and that will mean I can cache most if not all hot data for any trading
app I've worked on.  Settled trades that matured in prior periods can
be moved to tables on real disks - they are hardly ever accessed
anyway.


> In the long run, we are going to have to seriously rethink pg's use
> of WAL as the way we implement MVCC as it becomes more and more of a
> performance bottleneck.
> We have WAL because Stonebraker made an assumption about the future
> dominance of optical media that has turned out to be false.
> ...and it's been one of pg's big issues ever since.


>>   2GB CF isn't so
>>pricey any more.
>Heck =16= GB Flash only costs ~$300 US and 128GB SSDs based on flash
>RAM are due out this year.

Quite.  Suppose I have a RAID with double redundancy; then I get enough
capacity for quite a lot of raw data, and can swap a card out every weekend
and let the RAID rebuild it in rotation to keep them within conservative
wear limits.

So long as the wear levelling works moderately well (and without needing FAT
on the disk or whatever) then I should be fine.

I think.  Maybe.

James



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org