Thanks Gus! I appreciate this information. It is very helpful. From my POC, I can see that TRAs are very powerful. I am excited to build out a fuller implementation for our use case, which is right in line with the boundaries you described for standard TRA use.
Matt

On Sat, Aug 21, 2021 at 1:16 PM Gus Heck <gus.h...@gmail.com> wrote:

> Hi Matt,
>
> TRA's were put into use almost immediately by at least one organization as soon as Dave and I implemented them, and the CRA and DRA's followed because the same organization wanted to further subdivide their data. They have been around for a while, and I'm currently helping a client move to use them, and past clients have adopted them. Always hard to know exactly how much any feature is used, but I have heard other mentions of folks using them, and not heard a failure story yet, so I think so long as your use case is a good fit for them (non-trivial amounts of data, never re-index the same doc with a different routing date, typically data flows in over time, optionally old data needs periodic removal, etc) then they are good. Of course every particular case is individual and there's always the chance that YOU are the lucky one who discovers something subtle, or find a scale at which things break down, but you aren't the first to use it, that's for sure :). They definitely were meant to make it easier to handle large amounts of temporal data.
>
> Also, it's open source so if something needs tweaking the process for that is open and well defined :) As with everything technical, test often and test well.
>
> -Gus
>
> On Fri, Aug 20, 2021 at 10:07 AM Matt Kuiper <kuipe...@gmail.com> wrote:
>
> > Sending this question out again to learn about how well Time Routed Aliases have worked out for others.
> >
> > Would like to know if a number of others have used this approach successfully as our team is planning for the use of TRAs in a very large SolrCloud deployment.
> >
> > Thanks,
> > Matt
> >
> > On Fri, Aug 13, 2021, 2:56 PM Matt Kuiper <kuipe...@gmail.com> wrote:
> >
> > > Thanks David, this test link is helpful.
> > >
> > > @David @Gus - From your viewpoint do you see TRAs as an accepted/proven technique within SolrCloud?
> > > My small POC works great. Would like to hear if others are using TRA in production deployments successfully at scale.
> > >
> > > Thanks,
> > > Matt
> > >
> > > On Wed, Aug 11, 2021 at 8:10 PM David Smiley <dsmi...@apache.org> wrote:
> > >
> > >> I hope you have success with TRAs!
> > >>
> > >> You can delete some number of collections from the rear of the chain, but you must first update the TRA to exclude these collections. This is tested:
> > >>
> > >> https://github.com/apache/solr/blob/f6c4f8a755603c3049e48eaf9511041252f2dbad/solr/core/src/test/org/apache/solr/update/processor/TimeRoutedAliasUpdateProcessorTest.java#L184
> > >>
> > >> It'd be nice if it would remove itself from the alias.
> > >>
> > >> ~ David Smiley
> > >> Apache Lucene/Solr Search Developer
> > >> http://www.linkedin.com/in/davidwsmiley
> > >>
> > >> On Tue, Aug 10, 2021 at 9:26 PM Matt Kuiper <kuipe...@gmail.com> wrote:
> > >>
> > >> > I found some helpful information while testing TRAs:
> > >> >
> > >> > For our use-case I am hesitant to set up an autoDeleteAge (unless it can be modified - still need to test). So I wondered about a little more manual delete management approach.
> > >> >
> > >> > I confirmed that I cannot simply delete a collection that is registered as part of a TRA. The delete collection api call will fail with a message that the collection is a part of the alias.
> > >> >
> > >> > I did learn that I could use the same create TRA api call I used to create the TRA, but modify the router.start to a date more recent than one or more of the older collections associated with the TRA. Then when I queried the TRA, I only received documents from the collections after the new router.start date. Also, I was now able to successfully delete the older collections with a standard collection delete command.
> > >> >
> > >> > I think this satisfies my initial use-case requirements to be able to modify an existing TRA and delete older collections.
> > >> >
> > >> > Matt
> > >> >
> > >> > On Mon, Aug 9, 2021 at 11:27 AM Matt Kuiper <kuipe...@gmail.com> wrote:
> > >> >
> > >> > > Hi Gus, Jan,
> > >> > >
> > >> > > I am considering implementing TRA for a large-scale Solr deployment. Your Q&A is helpful!
> > >> > >
> > >> > > I am curious if you have experience/ideas regarding modifying the TR Alias when one desires to manually delete old collections or modify the router.autoDeleteAge to shorten or extend the delete age. Here are a few specific questions:
> > >> > >
> > >> > > 1) Can you manually delete an old collection (via collection api) and then edit the start date (to a more recent date) of the TRA so that it no longer sees/processes the deleted collection?
> > >> > > 2) Is the only way to manage the deletion of collections within a TRA using the automatic deletion configuration? The router.autoDeleteAge parameter.
> > >> > > 3) If you can only manage deletes using the router.autoDeleteAge parameter, are you able to update this parameter to either:
> > >> > >
> > >> > >    - Set the delete age earlier so that older collections are triggered for automatic deletion sooner?
> > >> > >    - Set the delete age to a larger value to extend the life of a collection? Say you originally would like the collections to stay around for 5 years, but then change your mind to 7 years.
> > >> > >
> > >> > > I will likely do some experimentation, but am interested to learn if you have covered these use-cases with TRA.
> > >> > >
> > >> > > Thanks,
> > >> > > Matt
> > >> > >
> > >> > > On Fri, Aug 6, 2021 at 8:08 AM Gus Heck <gus.h...@gmail.com> wrote:
> > >> > >
> > >> > >> Hi Jan,
> > >> > >>
> > >> > >> The key thing to remember about TRA's (or any Routed Alias) is that it only actively does two things:
> > >> > >> 1) Routes document updates to the correct collection by inspecting the routed field in the document
> > >> > >> 2) Detects when a new collection is required and creates it.
> > >> > >>
> > >> > >> If you don't send it data *nothing* happens. The collections are not created until data requires them (with an async create possible when it sees an update that has a timestamp "near" the next interval, see docs for router.preemptiveCreateMath).
> > >> > >>
> > >> > >> A) Dave's half of our talk at 2018 activate talks about it: https://youtu.be/RB1-7Y5NQeI?t=839
> > >> > >> B) Time Routed Aliases are a means by which to automate creation of collections and route documents to the created collections. Sizing, and performance of the individual collections is not otherwise special, and you can interact with the collections individually after they are created, with the obvious caveats that you probably don't want to be doing things that get them out of sync schema wise unless your client programs know how to handle documents of both types etc. A less obvious consequence of the routing is that your data must not ever republish the same document with a different route key (date for TRA), since that can lead to duplicate id's across collections.
> > >> > >> The "normal" use case is event data, things that happened and are done, and are correctly recorded (or at least their time is correctly recorded) the first time.
> > >> > >> C) Configure the higher number of replicas, remove old ones manually if not needed. At query time it's "just an alias". Managing collections based on recency could be automated here. Before autoscaling was deprecated, I was thinking that adding a couple of hooks into autoscaling such that it could react to collection creation by a TRA specifically would get us to a place much like Elastic's Hot/Warm architecture. I haven't kept track of what's being done to replace auto scaling however. I think Atri was interested in that at one point as well.
> > >> > >> D) TRA's create collections under the hood with a CREATE command just like you would manually (based on the config in the TRA). Anything in Solr that would influence that placement should apply.
> > >> > >> E) See D above for fill rate. Utilizing new nodes over time should be as simple as adding new nodes and waiting for new collections to be created. One could also manually move replicas as with any other collection (aside: be sure to refer to a current version of MOVEREPLICA docs, prior to something like 8.6 they were incomplete and even wrong in a few places).
> > >> > >> F) If you are talking about router.autoDeleteAge here, old collection removal is a regular DELETE (just automatically issued). Not sure what you mean by rotation interval.
> > >> > >> G) They are just collections with special names that can be parsed during update to select a destination for the incoming document.
> > >> > >> H) They are just collections, and there's nothing to prevent you from upgrading the schema; new collections will begin using it, individual collections would need to be reloaded, and non-safe schema changes (in the usual sense) require a re-index as usual. In a cloud environment where you can temporarily add machines or disk this is not so bad aside from the time to re-index of course. If you are on-prem then plan to have a significant level of spare disk to handle this case without running yourself into the danger zone for segment merging.
> > >> > >> H.2) TRA is just an alias with fancy collection creation (and naming). Once the collections exist, it's just an alias. All the action (at this point) happens at update. So long as the collection is listed in the TRA in zookeeper in aliases.json ***in the correct, (chronological, desc) order*** and the naming of the collection can be parsed by the TRA code you should be fine. Incoming updates iterate down the list of collections during an update, and stop at the first one where the collection name matches the date in the routing field for the document. For a normal TRA the vast majority of updates hit one of the most recent two or three collections.
> > >> > >> Frequent updates to old data in a TRA with very many time slices (sub collections) might suffer some since this is a simple linear iteration; optimizing that was deferred until it seemed important to someone's less normal use case :).
> > >> > >>
> > >> > >> Otherwise it's just an alias of collections with funky looking names (unless someone added something when I wasn't looking ;) ).
> > >> > >>
> > >> > >> -Gus
> > >> > >>
> > >> > >> On Fri, Aug 6, 2021 at 4:13 AM Jan Høydahl <jan....@cominvent.com> wrote:
> > >> > >>
> > >> > >> > Hi,
> > >> > >> >
> > >> > >> > I have never used TRA, but a client of mine is considering it. A few questions.
> > >> > >> >
> > >> > >> > A) Do you have links to talks (slides/video) on the feature? Or blog posts going into more detail than the RefGuide?
> > >> > >> > B) For ingestion performance, sharding may make sense. But only for the current collection. Has anyone tried merging "static" shards?
> > >> > >> > C) Is there a trick to have more replicas on recent collections than old ones?
> > >> > >> > D) Is there a way to manage which nodes get selected for new collections, or do you need to rely on replica placement policies?
> > >> > >> > E) How do you guys ensure you get a good fill-rate on the nodes, and what procedure do you use when adding more nodes in the cluster?
> > >> > >> >    * I.e. do you simply add a few new nodes and let Solr automatically place new collections onto those?
> > >> > >> > F) How many sub-collections/cores do you plan for on a single node?
> > >> > >> >    * You could try to configure the "rotation interval" such that a node gets filled by a single core, but that seems hard to predict
> > >> > >> >    * Having a too rapid "rotation interval" will leave behind too many cores per node, causing inefficiencies?
> > >> > >> >    * Have you found a strategy to balance this? I'd likely try to plan for 10 cores per node, and monitor fill-rate such that I (manually) add more HW once a threshold is reached.
> > >> > >> > G) Has anyone tried backup of a TRA? Does it even work, or do you need to run the command for each single collection?
> > >> > >> > H) A typical requirement is to migrate all data from one cluster to a new cluster on a newer version or with a new schema. Have you tried doing that with a TRA?
> > >> > >> >    * Would you need to migrate one sub-collection at a time?
> > >> > >> >    * Will TRA on the new cluster accept that someone "external" adds collections, and how is it initialized/bootstrapped to fill the internal collection registry?
> > >> > >> >
> > >> > >> > That's what I could think of before trying the feature. I'm sure there would be other questions after some trial and error :)
> > >> > >> >
> > >> > >> > Jan
> > >> > >>
> > >> > >> --
> > >> > >> http://www.needhamsoftware.com (work)
> > >> > >> http://www.the111shift.com (play)
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
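
The update-time routing Gus describes under H.2 (walk the alias's collection list from newest to oldest and stop at the first collection whose start date is not after the document's date) can be sketched in a few lines of Python. This is only an illustration of that description, not Solr's code. The `__TRA__` separator, the date-only name suffix, the `events` alias name, and the function names `collection_start` and `route` are all assumptions made for the example, and the sketch ignores the creation of new collections for documents newer than the latest interval.

```python
from datetime import datetime, timezone
from typing import List, Optional

# Assumed separator between alias name and start date in a collection name;
# verify against the naming your Solr version actually produces.
SEPARATOR = "__TRA__"


def collection_start(name: str) -> datetime:
    """Parse the start instant encoded in a collection name (date-only form assumed)."""
    stamp = name.split(SEPARATOR, 1)[1]
    return datetime.strptime(stamp, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def route(doc_time: datetime, collections_desc: List[str]) -> Optional[str]:
    """Linear scan, newest first: return the first collection whose start
    is not after the document's routed timestamp."""
    for name in collections_desc:
        if collection_start(name) <= doc_time:
            return name
    # The document predates the oldest collection still listed in the alias.
    return None


# Hypothetical alias contents, listed in descending chronological order.
collections = [
    "events__TRA__2021-08-01",
    "events__TRA__2021-07-01",
    "events__TRA__2021-06-01",
]
```

With these three monthly collections, a document dated 2021-07-15 would be routed to the 2021-07-01 collection after one skipped entry, while a document dated before 2021-06-01 finds no destination. The scan is linear in the number of collections, which is why updates to very old data in an alias with many time slices cost more than updates to recent data.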