Re: [PERFORM] Any advantage to integer vs stored date w. timestamp
Zoolin Lin wrote:
> Hi,
>
> I have a database with a huge amount of data, so I'm trying to make it
> as fast as possible and minimize space.
>
> One thing I've done is join on a prepopulated date lookup table to
> prevent a bunch of rows with duplicate date columns. Without this I'd
> have about 2500 rows per hour with the exact same date w. timestamp in
> them.
>
> My question is, with postgres do I really gain anything by this, or
> should I just use the date w. timestamp column on the primary table and
> ditch the join on the date_id table.
>
> Primary table is all integers like:
>
> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num 8
> ----------------------------------------------------------------
> primary key is on date to num->6 columns

What types are num1->8?

> date_id lookup table:
>
> This table is prepopulated with the date values that will be used.
>
> date_id | date w timestamp
> 1       | 2007-2-15 Midnight
> 2       | 2007-2-15 1 am
> 3       | 2007-2-15 2 am
> etc for 24 hours each day

If you only want things accurate to an hour, you could lose the join and
just store it as an int: 2007021500, 2007021501 etc. That should see you
good to year 2100 or so.

-- 
  Richard Huxton
  Archonet Ltd
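A rough sketch of that int encoding (my own illustration, not part of the
original exchange), using the standard formatting functions to go in both
directions:

    -- store the hour as an integer such as 2007021501
    SELECT to_char(now(), 'YYYYMMDDHH24')::int;

    -- and turn it back into a timestamp when needed
    SELECT to_timestamp(2007021501::text, 'YYYYMMDDHH24');

An index on such a column is just an ordinary integer index, which is part
of the appeal.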
Re: [PERFORM] compact flash disks?
* James Mansion:

> If I were to use a filesystem with noatime etc and little non-sql
> traffic, does the physical update pattern tend to have hot sectors that
> will tend to wear out CF?

Thanks to the FAT file system and its update pattern, most (if not all)
CF disks implement wear leveling nowadays. I wouldn't worry about the
issue, unless your write rates are pretty high.

-- 
Florian Weimer <[EMAIL PROTECTED]>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
Re: [PERFORM] Any advantage to integer vs stored date w. timestamp
thanks for your reply

> Primary table is all integers like:
>
> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num 8
> ----------------------------------------------------------------
> primary key is on date to num->6 columns

>> What types are num1->8?

They are all integer.

> date_id | date w timestamp
> 1 | 2007-2-15 Midnight
> 2 | 2007-2-15 1 am
> 3 | 2007-2-15 2 am
> etc for 24 hours each day

>> If you only want things accurate to an hour, you could lose the join
>> and just store it as an int: 2007021500, 2007021501 etc.

Hmm, yeah, I could. I think with the amount of data in the db, though, it
behooves me to use one of the date types, even if via a lookup table.

So I guess I'm just not sure if I'm really gaining anything by using an
integer date_id column and doing a join on a date lookup table, vs just
making it a date w. timestamp column and having duplicate dates in that
column.

I would imagine internally that the date w. timestamp is stored as
perhaps a time_t type plus some timezone information. I don't know if it
takes that much more space, or if there's a significant performance
penalty in using it.

2,500 rows per hour, with duplicate date columns, seems like it could add
up though.

thanks
Re: [PERFORM] Any advantage to integer vs stored date w. timestamp
Zoolin Lin wrote:
> thanks for your reply
>
>> Primary table is all integers like:
>>
>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num 8
>> ----------------------------------------------------------------
>> primary key is on date to num->6 columns
>
>>> What types are num1->8?
>
> They are all integer

Hmm - not sure if you'd get any better packing if you could make some
int2 and put them next to each other. Need to test.

>> date_id | date w timestamp
>> 1 | 2007-2-15 Midnight
>> 2 | 2007-2-15 1 am
>> 3 | 2007-2-15 2 am
>> etc for 24 hours each day
>
>>> If you only want things accurate to an hour, you could lose the join
>>> and just store it as an int: 2007021500, 2007021501 etc.
>
> Hmm yeah I could. I think with the amount of data in the db, though, it
> behooves me to use one of the date types, even if via a lookup table.

You can always create it as a custom ZLDate type. All it really needs to
be is an int with a few casts.

> So I guess I'm just not sure if I'm really gaining anything by using an
> integer date_id column and doing a join on a date lookup table, vs just
> making it a date w. timestamp column and having duplicate dates in that
> column.
>
> I would imagine internally that the date w. timestamp is stored as
> perhaps a time_t type plus some timezone information. I don't know if
> it takes that much more space, or if there's a significant performance
> penalty in using it.

It's a double or int64 I believe, so allow 8 bytes instead of 4 for your
int.

> 2,500 rows per hour, with duplicate date columns, seems like it could
> add up though.

Well, let's see: 2500*24*365 = 21,900,000 rows * 4 bytes extra = 83MB
additional storage over a year. Not sure it's worth worrying about.

-- 
  Richard Huxton
  Archonet Ltd
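For what it's worth, the per-value sizes Richard mentions can be checked
directly from SQL. This is my own illustration (pg_column_size() is
available in 8.1 and later), and the real on-disk cost also includes
per-row overhead and alignment padding:

    -- integer is 4 bytes; timestamp / timestamp with time zone are 8,
    -- whichever internal representation (float8 or int64) the build uses
    SELECT pg_column_size(2007021501::integer) AS int_bytes,
           pg_column_size(now()::timestamptz)  AS timestamptz_bytes;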
Re: [PERFORM] compact flash disks?
At 05:18 PM 3/6/2007, James Mansion wrote:
> I see that one can now get compact flash to SATA connectors. If I were
> to use a filesystem with noatime etc and little non-sql traffic, does
> the physical update pattern tend to have hot sectors that will tend to
> wear out CF?

Most flash RAMs have drivers that make sure the pattern of writes over
time is uniform across the entire device.

> I'm wondering about a RAID5 with data on CF drives and RAID1 for the
> WAL on a fast SATA or SAS drive pair. I'm thinking that this would tend
> to have good performance because the seek time for the data is very
> low, even if the actual write speed can be slower than state of the
> art.

WARNING: modern TOtL flash RAMs are only good for ~1.2M writes per memory
cell, and that's the =good= ones. Using flash RAM for write heavy
applications like OLTP, or for WAL, etc can be very dangerous.

Flash write speeds also stink, being ~1/2 flash's already low read speed.

Much better to use flash RAM for read heavy applications. Even there you
have to be careful that seek performance, not throughput, is what is
gating your day to day performance with those tables.

Got tables or indexes that are a= too big to fit in RAM and b= are write
few, read many times and c= whose pattern of access is large enough that
it does not cache well? =Those= are what should be put into flash RAMs.

Side Note: In the long run, we are going to have to seriously rethink
pg's use of WAL as the way we implement MVCC as it becomes more and more
of a performance bottleneck. We have WAL because Stonebraker made an
assumption about the future dominance of optical media that has turned
out to be false. ...and it's been one of pg's big issues ever since.

> 2GB CF isn't so pricey any more.

Heck, =16= GB Flash only costs ~$300 US and 128GB SSDs based on flash RAM
are due out this year.

Cheers, Ron
Re: [PERFORM] compact flash disks?
> Much better to use flash RAM for read heavy applications. Even there
> you have to be careful that seek performance, not throughput, is what
> is gating your day to day performance with those tables.

Isn't that precisely where Flash disks would have *the* big advantage?

I mean, access time is severely held down by the *mechanical* movement of
the heads to the right cylinder on the disk --- that's a brutally large
amount of time compared to anything else happening on the computer (well,
floppy disks aside, and things like trying-to-break-128-bit-encryption
aside :-)).

Or are these Flash disks so slow that they compare to the HD's latency
figures?

Carlos
Re: [PERFORM] compact flash disks?
> Or are these Flash disks so slow that they compare to the HD's latency
> figures?

On sequential read speed HDs outperform flash disks... only on random
access are the flash disks better. So if your application is a DW one,
you're very likely better off using HDs.

Cheers,
Csaba.
Re: [PERFORM] compact flash disks?
On Tuesday 6 March 2007 23:18, James Mansion wrote:
> I see that one can now get compact flash to SATA connectors.

I can suggest you have a look at the Gigabyte i-RAM. We use it on a
website with high traffic with a lot of success. Unfortunately, I cannot
provide any benchmark...

> If I were to use a filesystem with noatime etc and little non-sql
> traffic, does the physical update pattern tend to have hot sectors that
> will tend to wear out CF?
>
> I'm wondering about a RAID5 with data on CF drives and RAID1 for the
> WAL on a fast SATA or SAS drive pair. I'm thinking that this would tend
> to have good performance because the seek time for the data is very
> low, even if the actual write speed can be slower than state of the
> art. 2GB CF isn't so pricey any more.
>
> Just wondering.
>
> James
Re: [PERFORM]
Jeff Cole <[EMAIL PROTECTED]> writes:
> Hi Tom, you are correct, the distribution is uneven... In the 13k
> symptom_reports rows, there are 105 distinct symptom_ids. But the
> first 8k symptom_reports rows only have 10 distinct symptom_ids.
> Could this cause the problem and would there be anything I could do
> to address it?

Ah-hah, yeah, that explains it. Is it worth your time to deliberately
randomize the order of the rows in symptom_reports? It wasn't clear
whether this query is actually something you need to optimize. You might
have other queries that benefit from the rows being in nonrandom order,
so I'm not entirely sure that this is a good thing to do ...

			regards, tom lane
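If the shuffle does look worthwhile, one way to do it is sketched below.
This is my own illustration rather than anything Tom prescribed; it
assumes there are no dependent views or foreign keys, and that indexes
and constraints get recreated afterwards:

    BEGIN;
    -- rebuild the table in random physical order, then swap it in
    CREATE TABLE symptom_reports_shuffled AS
        SELECT * FROM symptom_reports ORDER BY random();
    DROP TABLE symptom_reports;
    ALTER TABLE symptom_reports_shuffled RENAME TO symptom_reports;
    -- recreate indexes, constraints and grants on the new table here
    COMMIT;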
Re: [PERFORM] [EMAIL PROTECTED]: Progress on scaling of FreeBSD on 8 CPU systems]
Tom Lane wrote:
> Stefan Kaltenbrunner <[EMAIL PROTECTED]> writes:
>> Arjen van der Meijden wrote:
>>> Stefan Kaltenbrunner wrote:
>>>> ouch - do I read that right that even after tom's fixes for the
>>>> "regressions" in 8.2.0 we are still 30% slower than the -HEAD
>>>> checkout from the middle of the 8.2 development cycle ?
>>>
>>> Yes, and although I tested about 17 different cvs-checkouts, Tom and
>>> I weren't really able to figure out where "it" happened. So it's a
>>> bit of a mystery why the performance is so much worse.
>>
>> double ouch - losing that much in performance without an idea WHY it
>> happened is really unfortunate :-(
>
> Keep in mind that Arjen's test exercises some rather narrow scenarios;
> IIRC its performance is mostly determined by some complicated
> bitmap-indexscan cases. So that "30% slower" bit certainly doesn't
> represent an across-the-board figure. As best I can tell, the decisions
> the planner happened to be making in late June were peculiarly nicely
> suited to his test, but not so much for other cases.

understood - I was not trying to imply that we suffer a 30% performance
drop overall. But still it means we know about a set of queries that we
once could handle faster than we can now ...

Stefan
Re: [PERFORM] Any advantage to integer vs stored date w. timestamp
Thank you for the reply

>> Primary table is all integers like:
>>
>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num 8
>> ----------------------------------------------------------------
>> primary key is on date to num->6 columns
>
>>> What types are num1->8?
>
> They are all integer

>> Hmm - not sure if you'd get any better packing if you could make some
>> int2 and put them next to each other. Need to test.

Thanks, I find virtually nothing on the int2 column type beyond a brief
mention here:
http://www.postgresql.org/docs/8.2/interactive/datatype-numeric.html#DATATYPE-INT

Could I prevail on you to expand on packing with int2 a bit more, or
point me in the right direction for documentation?

If there's some way I can pack multiple columns into one to save space,
yet still effectively query on them, even if it's a lot slower, that
would be great. My current scheme, though as normalized and summarized as
I can make it, really chews up a ton of space. It might even be chewing
up more than the data files I'm summarizing, I assume due to the
indexing.

Regarding saving disk space, I saw someone mention doing a custom build
and changing TOAST_TUPLE_THRESHOLD/TOAST_TUPLE_TARGET so data is
compressed sooner. It seems like that might be a viable option as well.
http://www.thescripts.com/forum/thread422854.html

> 2,500 rows per hour, with duplicate date columns, seems like it could
> add up though.

>>> Well, let's see: 2500*24*365 = 21,900,000 rows * 4 bytes extra = 83MB
>>> additional storage over a year. Not sure it's worth worrying about.

Ahh yes, probably better to make it a date w. timestamp column then.

Z
Re: [PERFORM] Any advantage to integer vs stored date w. timestamp
Zoolin Lin wrote:
> Thank you for the reply
>
>>> Primary table is all integers like:
>>>
>>> date id | num1 | num2 | num3 | num4 | num5 | num6 | num7 | num 8
>>> ----------------------------------------------------------------
>>> primary key is on date to num->6 columns
>>
>>>> What types are num1->8?
>>
>> They are all integer
>
>> Hmm - not sure if you'd get any better packing if you could make some
>> int2 and put them next to each other. Need to test.
>
> Thanks, I find virtually nothing on the int2 column type beyond a brief
> mention here:
> http://www.postgresql.org/docs/8.2/interactive/datatype-numeric.html#DATATYPE-INT
>
> Could I prevail on you to expand on packing with int2 a bit more, or
> point me in the right direction for documentation?

int4 is the internal name for integer (4 bytes)
int2 is the internal name for smallint (2 bytes)

Try

    SELECT format_type(oid, NULL) AS friendly,
           typname AS internal,
           typlen AS length
    FROM pg_type
    WHERE typlen > 0;

to see them all (a negative typlen means a variable size, usually an
array or bytea etc).

I think with packing he was referring to simply having more values in the
same disk space by using int2 instead of int4 (half the storage space).

> If there's some way I can pack multiple columns into one to save space,
> yet still effectively query on them, even if it's a lot slower, that
> would be great.

Depending on the size of the data you need to store you may be able to
get some benefit from "packing" multiple values into one column. But I'm
not sure if you need to go that far.

What range of numbers do you need to store? If you don't need the full
int4 range of values then try a smaller data type. If int2 is sufficient
then just change the columns from integer to int2 and cut your storage in
half. Easy gain.

The "packing" theory would fall under general programming algorithms, not
anything postgres specific. Basically, let's say you have 4 values that
are in the range of 1-254 (1 byte each); you can do something like

    col1 = (val1<<0) | (val2<<8) | (val3<<16) | (val4<<24)

This will put the four values into one 4 byte int.

Searching on one of the packed values would then be something like

    WHERE (col1>>8) & 255 = val2

or, if you needed to search on more than one value at a time, mask and
compare:

    WHERE (col1 & ((255<<0) | (255<<16))) = ((val1<<0) | (val3<<16))

Guess you can see what your queries will be looking like.

That's a simple example that should give you the general idea. In
practice you would only get gains if you have unusual length values, so
if you had value ranges from 0 to 1023 (10 bits each) then you could pack
3 values into an int4 instead of using 3 int2 cols (that's 32 bits for
the int4 against 48 bits for the 3 int2 cols), and you would use <<10 and
<<20 in the above example.

You may find it easier to define a function or two to automate this
instead of repeating it for each query. But with disks and ram as cheap
as they are these days this sort of packing is getting rarer (except
maybe embedded systems with limited resources).

> My current scheme, though as normalized and summarized as I can make
> it, really chews up a ton of space. It might even be chewing up more
> than the data files I'm summarizing, I assume due to the indexing.

-- 
Shane Ambler
[EMAIL PROTECTED]

Get Sheeky @ http://Sheeky.Biz
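To make the packing arithmetic above concrete in SQL, here is a minimal
sketch. The helper names are invented for illustration (they are not
built-in functions), and it assumes four values in the 0-255 range, with
the top value kept under 128 so the packed int4 stays positive:

    -- pack four small values into one int4
    CREATE FUNCTION pack4(int, int, int, int) RETURNS int AS $$
        SELECT ($1 << 0) | ($2 << 8) | ($3 << 16) | ($4 << 24)
    $$ LANGUAGE sql IMMUTABLE;

    -- extract field n (0-3) back out of the packed value
    CREATE FUNCTION unpack4(int, int) RETURNS int AS $$
        SELECT ($1 >> ($2 * 8)) & 255
    $$ LANGUAGE sql IMMUTABLE;

    -- example: returns 30
    SELECT unpack4(pack4(10, 20, 30, 40), 2);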
Re: [PERFORM] compact flash disks?
> WARNING: modern TOtL flash RAMs are only good for ~1.2M writes per
> memory cell, and that's the =good= ones. Using flash RAM for write
> heavy applications like OLTP, or for WAL, etc can be very dangerous.

Well, that's why I suggested that the WAL would stream to a hard disk
array, where the large sequential IO write speed will be helpful.

Whether OLTP is a problem will presumably depend on the frequency of
updates and vacuum to each physical cluster of rows in a disk block.
Most rows in a trading application will have quite a long lifetime, and
be updated relatively few times (even where we're writing fixings info
into trades).

> Flash write speeds also stink, being ~1/2 flash's already low read
> speed.

Sure - but it may still be an effective tradeoff where the limiting
factor would otherwise be seek time.

> Much better to use flash RAM for read heavy applications.

Why? I can get a 'PC' server with 128GB of RAM quite easily now, and that
will mean I can cache most if not all hot data for any trading app I've
worked on. Settled trades that matured in prior periods can be moved to
tables on real disks - they are hardly ever accessed anyway.

> In the long run, we are going to have to seriously rethink pg's use of
> WAL as the way we implement MVCC as it becomes more and more of a
> performance bottleneck. We have WAL because Stonebraker made an
> assumption about the future dominance of optical media that has turned
> out to be false. ...and it's been one of pg's big issues ever since.

>> 2GB CF isn't so pricey any more.
>
> Heck, =16= GB Flash only costs ~$300 US and 128GB SSDs based on flash
> RAM are due out this year.

Quite. Suppose I have a RAID with double redundancy; then I get enough
capacity for quite a lot of raw data, and can swap a card out every
weekend and let the RAID rebuild it in rotation to keep them within
conservative wear limits. So long as the wear levelling works moderately
well (and without needing FAT on the disk or whatever) then I should be
fine. I think. Maybe.

James