> On Oct 4, 2021, at 9:22 AM, Rob Sargent <robjsarg...@gmail.com> wrote:
> 
> On 10/4/21 11:09 AM, Israel Brewster wrote:
>>> On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsarg...@gmail.com 
>>> <mailto:robjsarg...@gmail.com>> wrote:
>>> 
>>>> On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrews...@alaska.edu 
>>>> <mailto:ijbrews...@alaska.edu>> wrote:
>>> Guessing the “sd” is "standard deviation”?  Any chance those stddevs are 
>>> easily calculable from base data?  Could cut your table size in half (and 
>>> put those 20 cores to work on the reporting).
>> 
>> Possible - I’d have to dig into that with the script author. I was just 
>> handed an R script (I don’t work with R…) and told here’s the data it needs, 
>> here’s the output we need stored in the DB. I then spent just enough time 
>> with the script to figure out how to hook up the I/O. The schema is pretty 
>> much just a raw dump of the output - I haven’t really spent any resources 
>> figuring out what, exactly, the data is. Maybe I should :-)
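>> 
>> If the “sd” columns do turn out to be derivable, PostgreSQL could compute 
>> them at query time instead of storing them - something along these lines 
>> (column and table names are guesses on my part, since I haven’t dug into 
>> the schema):
>> 
>> ```sql
>> -- Hypothetical sketch: assumes a raw samples table with
>> -- (station, channel, datetime, value) columns.
>> -- stddev_samp() is PostgreSQL's built-in sample standard deviation aggregate.
>> SELECT station,
>>        channel,
>>        date_trunc('hour', datetime) AS hour,
>>        avg(value)                   AS mean_value,
>>        stddev_samp(value)           AS sd_value
>> FROM raw_samples
>> GROUP BY station, channel, date_trunc('hour', datetime);
>> ```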
>> 
>>>  And I wonder if the last three indices are strictly necessary? They take 
>>> disc space too.
>> 
>> Not sure. Here’s the output from pg_stat_all_indexes:
>> 
>> volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
>>  relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
>> -------+------------+------------+---------+---------------------------+----------+--------------+---------------
>>  19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
>>  19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
>>  19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
>>  19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
>>  19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
>>  19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
>> (6 rows)
>> 
>> so they *have* all been used (though station_data_idx not so much). This 
>> doesn’t tell me *when* each was last used, however, so some of them may be 
>> left over from queries I was experimenting with to find the fastest 
>> approach, but that are no longer in use. Maybe I should keep an eye on 
>> this for a while and see which values are increasing.
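>> 
>> Probably the simplest way to do that is to snapshot the counters 
>> periodically (from cron, say) and diff them later. A rough, untested 
>> sketch:
>> 
>> ```sql
>> -- Keep timestamped copies of the per-index scan counters;
>> -- an index whose idx_scan never changes between snapshots
>> -- is a candidate for dropping.
>> CREATE TABLE IF NOT EXISTS index_usage_snapshots (
>>     taken_at     timestamptz NOT NULL DEFAULT now(),
>>     indexrelname name        NOT NULL,
>>     idx_scan     bigint      NOT NULL
>> );
>> 
>> INSERT INTO index_usage_snapshots (indexrelname, idx_scan)
>> SELECT indexrelname, idx_scan
>> FROM pg_stat_all_indexes
>> WHERE relname = 'data';
>> ```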
>> 
>>> 
>>> But my bet is you’re headed for partitioning on datetime or perhaps station.
>> 
>> While datetime partitioning seems to be the most common, I’m not clear on 
>> how that would help here: the most intensive queries need *all* the 
>> datetimes for a given station, and even the smaller queries request an 
>> arbitrary time range, potentially spanning several, if not all, 
>> partitions. Partitioning on station, though, seems to make sense - there 
>> are over 100 of those, and pretty much any query will only deal with a 
>> single station at a time. Perhaps, if more partitions would be better, 
>> partition by both station and channel? The queries that need to be fastest 
>> will only be looking at a single channel of a single station.
>> 
>> I’ll look into this a bit more, maybe try some experimenting while I still 
>> have *relatively* little data. My main hesitation here is that in the brief 
>> look I’ve given partitioning so far, it looks to be a royal pain to get set 
>> up. Any tips for making that easier?
>> 
>> 
> If no queries address multiple stations you could do a table per station.  
> Doesn't smell good but you have a lot of data and well, speed kills.

Indeed. Table per station as opposed to partitioning? The *most* I can 
reasonably envision needing is to query two stations, i.e. I could see 
potentially wanting to compare station A against some “baseline” station B. In 
general, though, the stations are independent, and it seems unlikely that we 
will need any multi-station queries. Perhaps query one station, then run a 
second query for another station, to display graphs for both side-by-side and 
look for correlations or something - but nothing like that has been suggested 
at the moment.
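
For concreteness, the declarative partitioning setup I’d be experimenting with 
would look roughly like this - column names and the station code are 
illustrative placeholders, not the actual schema:

```sql
-- Sketch of LIST partitioning by station (declarative partitioning,
-- available since PostgreSQL 10). Column names are illustrative.
CREATE TABLE data_part (
    station  text             NOT NULL,
    channel  text             NOT NULL,
    datetime timestamptz      NOT NULL,
    value    double precision
) PARTITION BY LIST (station);

-- One partition per station; with 100+ stations this would be scripted.
CREATE TABLE data_part_stn1 PARTITION OF data_part
    FOR VALUES IN ('STN1');  -- 'STN1' is a placeholder station code

-- If partitioning by both station *and* channel turns out to be better,
-- each station partition can itself be declared
-- PARTITION BY LIST (channel) and given per-channel sub-partitions.
```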

> 
> I think the date-station-channel could "take over" for the station-date.  
> Naturally the latter is chosen if you give just the two fields, but I would 
> be curious to see how well the former performs given just its first two 
> fields (when station-date doesn't exist).

Ah, that makes sense. I’ll try to run some benchmarks later today/tomorrow.
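
Presumably the quickest check is to drop station_date_idx inside a transaction 
(DROP INDEX is transactional in PostgreSQL) and see whether the planner falls 
back to date_station_channel_idx for the same query. An untested sketch, with 
a placeholder station code:

```sql
BEGIN;
DROP INDEX station_date_idx;  -- takes an ACCESS EXCLUSIVE lock until ROLLBACK
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM data
WHERE station = 'STN1'        -- placeholder station code
  AND datetime >= now() - interval '1 day';
ROLLBACK;                     -- the index comes back untouched
```

The lock means this should be run against a test copy or during a quiet 
period, since it blocks other access to the table for the duration.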

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
