On 10/4/21 11:09 AM, Israel Brewster wrote:
On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsarg...@gmail.com> wrote:
On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrews...@alaska.edu> wrote:
Guessing the “sd” is “standard deviation”? Any chance those stddevs
are easily calculable from the base data? Could cut your table size in
half (and put those 20 cores to work on the reporting).
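(If the “sd” really is a standard deviation over raw samples, it may be cheaper to compute it at report time. Purely a sketch - the raw_samples table and its columns are hypothetical here, since the actual schema isn’t shown in this thread:

-- hypothetical base table; adjust names to the real schema
select station, channel,
       avg(value)         as mean_value,
       stddev_samp(value) as sd_value  -- sample stddev, computed on the fly
from raw_samples
where station = 'XYZ'
group by station, channel;

With 20 cores, parallel aggregation should handle this reasonably well.)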
Possible - I’d have to dig into that with the script author. I was
just handed an R script (I don’t work with R…) and told here’s the
data it needs, here’s the output we need stored in the DB. I then
spent just enough time with the script to figure out how to hook up
the I/O. The schema is pretty much just a raw dump of the output - I
haven’t really spent any resources figuring out what, exactly, the
data is. Maybe I should :-)
And I wonder if the last three indices are strictly necessary? They
take disc space too.
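Easy enough to check how much each one costs, assuming the table name data as in the stats below:

select indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_all_indexes
where relname = 'data';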
Not sure. Here’s the output from pg_stat_all_indexes:
volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
 relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
-------+------------+------------+---------+---------------------------+----------+--------------+---------------
 19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
 19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
 19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
 19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
 19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
 19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
(6 rows)
so they *have* been used (although not station_data_idx so much), but
this doesn’t tell me *when* they were last used, so some of those scans
may be from queries I was experimenting with to see what was fastest
but that are no longer in use. Maybe I should keep an eye on this for a
while and see which values are increasing.
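One cheap way to do that, sketched here (the snapshot table name is just illustrative): snapshot the counters now, then diff after a stretch of normal load.

-- take a snapshot of the current counters
create table idx_usage_snapshot as
select now() as taken_at, indexrelname, idx_scan
from pg_stat_all_indexes
where relname = 'data';

-- ...days later: which indexes actually moved?
select c.indexrelname,
       c.idx_scan - s.idx_scan as scans_since_snapshot
from pg_stat_all_indexes c
join idx_usage_snapshot s using (indexrelname)
where c.relname = 'data'
order by scans_since_snapshot desc;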
But my bet is you’re headed for partitioning on datetime or perhaps
station.
While datetime partitioning seems to be the most common, I’m not clear
on how that would help here, as the most intensive queries need *all*
the datetimes for a given station, and even the smaller queries would
be getting an arbitrary time range potentially spanning several, if
not all, partitions. Partitioning on station, though, seems to make
sense - there are over 100 of those, and pretty much any query will
only deal with a single station at a time. Perhaps, if more partitions
would be better, partition by both station and channel? The queries
that need to be fastest will only be looking at a single channel of a
single station.
I’ll look into this a bit more, maybe try some experimenting while I
still have *relatively* little data. My main hesitation here is that
in the brief look I’ve given partitioning so far, it looks to be a
royal pain to get set up. Any tips for making that easier?
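For what it’s worth, the declarative syntax (PostgreSQL 10+) is mostly boilerplate. A minimal sketch of list-partitioning by station - column names are guesses, since the real schema isn’t shown in this thread:

-- the parent table holds no data itself, just the partitioning rule
create table data_part (
    station  text        not null,
    channel  text        not null,
    datetime timestamptz not null,
    value    double precision
    -- ...remaining columns from the real schema...
) partition by list (station);

-- one partition per station; a DO block or shell loop can generate the ~100 statements
create table data_part_stn1 partition of data_part for values in ('STN1');
create table data_part_stn2 partition of data_part for values in ('STN2');

-- on PostgreSQL 11+, an index on the parent cascades to every partition
create index on data_part (station, channel, datetime);

Inserts into data_part are routed to the right partition automatically, so the application side shouldn’t need to change much; the tedious part is mostly generating the per-station create table statements.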
If no queries address multiple stations you could do a table per
station. Doesn’t smell good, but you have a lot of data and, well,
speed kills.
I think date_station_channel_idx could “take over” for
station_date_idx. Naturally the latter is chosen if you give just the
two fields, but I would be curious to see how well the former performs
given just its first two fields (when station_date_idx doesn’t exist).
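A cheap way to test that without committing to anything, assuming a quiet test session is acceptable (drop index holds an exclusive lock on the table until rollback), and with column names guessed from this thread:

begin;
drop index station_date_idx;        -- only visible inside this transaction
explain analyze
select * from data
where station = 'STN1'
  and datetime >= '2021-09-01'
  and datetime <  '2021-10-01';
rollback;                           -- the index comes right back

If the plan still shows an index scan on date_station_channel_idx with comparable timings, station_date_idx is probably redundant.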