Also if we are speaking about indexes - a bit tangential but I know we were planning to replace some of the primary keys (mainly because of mysql limitations) with synthetic keys for DAG versioning casse where we planned to use UUIDs).
We should be very, very careful when doing it because I've learned recently that due to UUID randomness the algorithm to rebalance binary tree do not work well with random UUIDs which means that when you do a lot of inserts, rebalancing of the trees is happening very often and performance of inserts is heavily affected. Postgres has a dedicated UUID field type for that that produces an optimal index https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres - however it might or might not be affected. Similarly MariaDB, but MySQL does not seem to have proper UUID support, so we should really use UUID7 rather that UUID4 for such UUIDs in case we do not want to affect insert performance on MySQL. WAT???? Yeah. That is. - I think - rather non-obvious side-effect. This learning I had adds much more strength to two related discussions: 1) This only adds more fire to "let's only stick to Postgres". We can optimize the experience for our users way better and move faster. 2) Let's REALLY not put the power of deciding what indexes to put on the database of ours in the hands of our users. There are many unobvious consequences of adding indexes and while we can aim to learn, foresee and control them (mostly when we have a working performance test suite), getting a random user to put a random index on a random set of keys in Airflow DB might have a number of unobvious consequences which might give our users (and us) a LOT of headache. J. On Fri, May 31, 2024 at 8:13 AM Pankaj Koti <pankaj.k...@astronomer.io.invalid> wrote: > Addressing one of Pierre's questions: > > Should I index foreign keys? Is that done by default or should I explicitly > do it? > > The answer varies depending on the database engine. > PostgreSQL and SQLite do not add indexes on foreign keys > by default, while MySQL does. Developers should keep this > in mind. Many of us (myself included) may have assumed > from our university days, indexes are always created for > foreign keys, but that's not always the case. :) > > Best regards, > > *Pankaj Koti* > Senior Software Engineer (Airflow OSS Engineering team) > Location: Pune, Maharashtra, India > Timezone: Indian Standard Time (IST) > > > On Fri, May 31, 2024 at 11:37 AM Daniel Standish > <daniel.stand...@astronomer.io.invalid> wrote: > > > I would be in favor of this for sure. Let's see what others think :) > > > > On Thu, May 30, 2024 at 10:55 PM Jarek Potiuk <ja...@potiuk.com> wrote: > > > > > Simply speaking - let's make "lack of optimisation for these and that" > > part > > > of the API specification. > > > > > > On Fri, May 31, 2024 at 7:54 AM Jarek Potiuk <ja...@potiuk.com> wrote: > > > > > > > So let's document as part of the API which queries are not performant > > and > > > > suggest users that want to use them to make their analytics queries > > > > elsewhere. I'd very much prefer that it's slow "by design" for > everyone > > > > rather than add option for the user to speed them up where we decided > > > > not to do it ourselves because we know the consequences. > > > > > > > > I think the root cause of the problem is `US` adding new filtering > > > options > > > > to public API without thinking about and documenting consequences. > > > > > > > > J. > > > > > > > > > > > > On Fri, May 31, 2024 at 7:38 AM Daniel Standish > > > > <daniel.stand...@astronomer.io.invalid> wrote: > > > > > > > >> So I think the notion that *all possibly expensive queries* should > > have > > > an > > > >> index to support them is not a tenable one. E.g. there are > something > > > like > > > >> 5 params on TI list endpoint that don't have an index. > > > >> > > > >> In contrast with queries from airflow itself, the API queries are > more > > > >> arbitrary -- user can combine in any way. So it's not so simple to > > > >> determine just from the endpoint params what indexes will be needed. > > > And > > > >> we're not going to add an index for all possible param combinations, > > > that > > > >> would be silly. So I think like it or not I think we will > necessarily > > > >> draw > > > >> the line at *some* point and say we're trying to cover the most > > likely / > > > >> normal / reasonable uses and if your usage pattern is an outlier you > > > might > > > >> need to add an index yourself. And I don't think that's > unreasonable > > > >> > > > >> Maybe at least one conclusion from this is to add an index for each > of > > > >> start_date, end_date, and updated_at on TI. Already the storage > > > occupied > > > >> by indexes on TI is more than the table data size -- I don't > remember > > it > > > >> might have been like 2x or something. So then index count will > > increase > > > >> from 8 to 11 and overhead increases. Whether this is a material > > > concern, > > > >> I > > > >> am honestly not sure -- I am not a db specialist. But it does feel > > icky > > > >> to > > > >> add all the indexes when they are used by, I suspect, a very small > > > >> percentage of airflow users overall. > > > >> > > > >> Part of this for me is also the reason these types of queries are > run. > > > >> Typically it is for analytics purposes, either on the fly or as part > > of > > > a > > > >> replication process to a DWH or something. But there are more > > efficient > > > >> mechanisms for this than pulling the data through the webserver, > e.g. > > > >> read-only replicas, or standard ETL processes, or just making better > > use > > > >> of > > > >> the indexes that are already there. Do we consider "replication" a > > > >> reasonable use of the REST API? > > > >> > > > >> So I may very well be alone on this. But (1) the API is not > something > > > >> everyone uses. And (2) even of users of the API, probably only a > very > > > >> small percentage are using it in a way that needs these indexes -- > > > >> otherwise why haven't we seen requests for indexes on these columns > on > > > >> github? So, to me, it doesn't seem crazy to just leave it to the > > user / > > > >> platform to add. But as usual, not a hill to die on. > > > >> > > > > > > > > > >