> On Oct 4, 2021, at 9:22 AM, Rob Sargent <robjsarg...@gmail.com> wrote: > > On 10/4/21 11:09 AM, Israel Brewster wrote: >>> On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsarg...@gmail.com >>> <mailto:robjsarg...@gmail.com>> wrote: >>> >>>> On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrews...@alaska.edu >>>> <mailto:ijbrews...@alaska.edu>> wrote: >>> Guessing the “sd” is "standard deviation”? Any chance those stddevs are >>> easily calculable from base data? Could cut your table size in half (and >>> put those 20 cores to work on the reporting). >> >> Possible - I’d have to dig into that with the script author. I was just >> handed an R script (I don’t work with R…) and told here’s the data it needs, >> here’s the output we need stored in the DB. I then spent just enough time >> with the script to figure out how to hook up the I/O. The schema is pretty >> much just a raw dump of the output - I haven’t really spent any resources >> figuring out what, exactly, the data is. Maybe I should :-) >> >>> And I wonder if the last three indices are strictly necessary? They take >>> disc space too. >> >> Not sure. Here’s the output from pg_stat_all_indexes: >> >> volcano_seismology=# select * from pg_stat_all_indexes where relname='data'; >> relid | indexrelid | schemaname | relname | indexrelname | >> idx_scan | idx_tup_read | idx_tup_fetch >> -------+------------+------------+---------+---------------------------+----------+--------------+--------------- >> 19847 | 19869 | public | data | data_pkey | >> 0 | 0 | 0 >> 19847 | 19873 | public | data | date_station_channel_idx | >> 811884 | 12031143199 | 1192412952 >> 19847 | 19875 | public | data | station_channel_epoch_idx | >> 8 | 318506 | 318044 >> 19847 | 19876 | public | data | station_data_idx | >> 9072 | 9734 | 1235 >> 19847 | 19877 | public | data | station_date_idx | >> 721616 | 10927533403 | 10908912092 >> 19847 | 20479 | public | data | data_station_channel_idx | >> 47293 | 194422257262 | 6338753379 >> (6 rows) >> >> so they *have* been used (although not the station_data_idx so much), but >> this doesn’t tell me when it was last used, so some of those may be queries >> I was experimenting with to see what was fastest, but are no longer in use. >> Maybe I should keep an eye on this for a while, see which values are >> increasing. >> >>> >>> But my bet is you’re headed for partitioning on datetime or perhaps station. >> >> While datetime partitioning seems to be the most common, I’m not clear on >> how that would help here, as the most intensive queries need *all* the >> datetimes for a given station, and even the smaller queries would be getting >> an arbitrary time range potentially spanning several, if not all, >> partitions. Now portioning on station seems to make sense - there are over >> 100 of those, and pretty much any query will only deal with a single station >> at a time. Perhaps if more partitioning would be better, portion by both >> station and channel? The queries that need to be fastest will only be >> looking at a single channel of a single station. >> >> I’ll look into this a bit more, maybe try some experimenting while I still >> have *relatively* little data. My main hesitation here is that in the brief >> look I’ve given partitioning so far, it looks to be a royal pain to get set >> up. Any tips for making that easier? >> >> > If no queries address multiple stations you could do a table per station. > Doesn't smell good but you have a lot of data and well, speed kills.
Indeed. Table per station as opposed to partitioning? The *most* I can reasonably envision needing is to query two stations, i.e. I could see potentially wanting to compare station a to some “baseline” station b. In general, though, the stations are independent, and it seems unlikely that we will need any multi-station queries. Perhaps query one station, then a second query for a second to display graphs for both side-by-side to look for correlations or something, but nothing like that has been suggested at the moment. > > I think the date-station-channel could "take over" for the station-date. > Naturally the latter is chosen if you give just the two fields, but I would > be curious to see how well the former performs given just its first two > fields(when station-date doesn't exist). Ah, that makes sense. I’ll try to run some benchmarks later today/tomorrow. Thanks again! --- Israel Brewster Software Engineer Alaska Volcano Observatory Geophysical Institute - UAF 2156 Koyukuk Drive Fairbanks AK 99775-7320 Work: 907-474-5172 cell: 907-328-9145