I found some helpful information while testing TRAs. For our use case I am hesitant to set up router.autoDeleteAge (unless it can be modified later; I still need to test that), so I looked into a more manual approach to managing deletes.
I confirmed that I cannot simply delete a collection that is registered as part of a TRA. The delete collection API call fails with a message that the collection is part of the alias. I did learn that I could reuse the same API call I used to create the TRA, but with router.start changed to a date more recent than one or more of the older collections associated with the TRA. When I then queried the TRA, I only received documents from the collections after the new router.start date. I was also now able to delete the older collections with a standard collection delete command.

I think this satisfies my initial use-case requirement of being able to modify an existing TRA and delete older collections.

Matt

On Mon, Aug 9, 2021 at 11:27 AM Matt Kuiper <kuipe...@gmail.com> wrote:

> Hi Gus, Jan,
>
> I am considering implementing TRA for a large-scale Solr deployment. Your
> Q&A is helpful!
>
> I am curious if you have experience/ideas regarding modifying the TR Alias
> when one desires to manually delete old collections or modify the
> router.autoDeleteAge to shorten or extend the delete age. Here are a few
> specific questions:
>
> 1) Can you manually delete an old collection (via collection api) and then
> edit the start date (to a more recent date) of the TRA so that it no longer
> sees/processes the deleted collection?
> 2) Is the only way to manage the deletion of collections within a TRA
> using the automatic deletion configuration (the router.autoDeleteAge
> parameter)?
> 3) If you can only manage deletes using the router.autoDeleteAge
> parameter, are you able to update this parameter to either:
>
>    - Set the delete age earlier so that older collections are triggered
>    for automatic deletion sooner?
>    - Set the delete age to a larger value to extend the life of a
>    collection? Say you originally would like the collections to stay around
>    for 5 years, but then change your mind to 7 years.
>
> I will likely do some experimentation, but am interested to learn if you
> have covered these use cases with TRA.
>
> Thanks,
> Matt
>
> On Fri, Aug 6, 2021 at 8:08 AM Gus Heck <gus.h...@gmail.com> wrote:
>
>> Hi Jan,
>>
>> The key thing to remember about TRAs (or any Routed Alias) is that it only
>> actively does two things:
>> 1) Routes document updates to the correct collection by inspecting the
>> routed field in the document
>> 2) Detects when a new collection is required and creates it.
>>
>> If you don't send it data *nothing* happens. The collections are not
>> created until data requires them (with an async create possible when it
>> sees an update that has a timestamp "near" the next interval, see docs for
>> router.preemptiveCreateMath).
>>
>> A) Dave's half of our talk at Activate 2018 talks about it:
>> https://youtu.be/RB1-7Y5NQeI?t=839
>> B) Time Routed Aliases are a means by which to automate creation of
>> collections and route documents to the created collections. Sizing and
>> performance of the individual collections is not otherwise special, and you
>> can interact with the collections individually after they are created, with
>> the obvious caveat that you probably don't want to be doing things that
>> get them out of sync schema-wise unless your client programs know how to
>> handle documents of both types, etc. A less obvious consequence of the
>> routing is that your data must never republish the same document with a
>> different route key (date for TRA), since that can lead to duplicate ids
>> across collections. The "normal" use case is event data: things that
>> happened and are done, and are correctly recorded (or at least their time
>> is correctly recorded) the first time.
>> C) Configure the higher number of replicas, remove old ones manually if not
>> needed. At query time it's "just an alias".
>> Managing collections based on recency could be automated here. Before
>> autoscaling was deprecated I was thinking that adding a couple of hooks
>> into autoscaling, such that it could react to collection creation by a TRA
>> specifically, would get us to a place much like Elastic's Hot/Warm
>> architecture. I haven't kept track of what's being done to replace
>> autoscaling, however. I think Atri was interested in that at one point as
>> well.
>> D) TRAs create collections under the hood with a CREATE command just like
>> you would manually (based on the config in the TRA). Anything in Solr that
>> would influence that placement should apply.
>> E) See D above for fill rate. Utilizing new nodes over time should be as
>> simple as adding new nodes and waiting for new collections to be created.
>> One could also manually move replicas as with any other collection (aside:
>> be sure to refer to a current version of the MOVEREPLICA docs; prior to
>> something like 8.6 they were incomplete and even wrong in a few places).
>> F) If you are talking about router.autoDeleteAge here, old collection
>> removal is a regular DELETE (just automatically issued). Not sure what you
>> mean by rotation interval.
>> G) They are just collections with special names that can be parsed during
>> update to select a destination for the incoming document.
>> H) They are just collections, and there's nothing to prevent you from
>> upgrading the schema. New collections will begin using it, individual
>> collections would need to be reloaded, and non-safe schema changes (in the
>> usual sense) require a re-index as usual. In a cloud environment where you
>> can temporarily add machines or disk this is not so bad, aside from the
>> time to re-index of course. If you are on-prem then plan to have a
>> significant level of spare disk to handle this case without running
>> yourself into the danger zone for segment merging.
>> H.2) TRA is just an alias with fancy collection creation (and naming). Once
>> the collections exist, it's just an alias. All the action (at this point)
>> happens at update. So long as the collection is listed in the TRA in
>> ZooKeeper in aliases.json ***in the correct (chronological, desc) order***
>> and the naming of the collection can be parsed by the TRA code, you should
>> be fine. Incoming updates iterate down the list of collections during an
>> update, and stop at the first one where the collection name matches the
>> date in the routing field for the document. For a normal TRA the vast
>> majority of updates hit one of the most recent two or three collections.
>> Frequent updates to old data in a TRA with very many time slices (sub
>> collections) might suffer some since this is a simple linear iteration;
>> optimizing that was deferred until it seemed important to someone's less
>> normal use case :).
>>
>> Otherwise it's just an alias of collections with funky looking names
>> (unless someone added something when I wasn't looking ;) ).
>>
>> -Gus
>>
>> On Fri, Aug 6, 2021 at 4:13 AM Jan Høydahl <jan....@cominvent.com> wrote:
>>
>> > Hi,
>> >
>> > I have never used TRA, but a client of mine is considering it. A few
>> > questions.
>> >
>> > A) Do you have links to talks (slides/video) on the feature? Or blog posts
>> > going into more detail than the RefGuide?
>> > B) For ingestion performance, sharding may make sense. But only for the
>> > current collection. Has anyone tried merging "static" shards?
>> > C) Is there a trick to have more replicas on recent collections than old
>> > ones?
>> > D) Is there a way to manage which nodes get selected for new
>> > collections, or do you need to rely on replica placement policies?
>> > E) How do you guys ensure you get a good fill-rate on the nodes, and what
>> > procedure do you use when adding more nodes in the cluster?
>> >   * I.e.
>> >     do you simply add a few new nodes and let Solr automatically
>> >     place new collections onto those?
>> > F) How many sub-collections/cores do you plan for on a single node?
>> >   * You could try to configure the "rotation interval" such that a node
>> >     gets filled by a single core, but that seems hard to predict
>> >   * Having a too rapid "rotation interval" will leave behind too many
>> >     cores per node, causing inefficiencies?
>> >   * Have you found a strategy to balance this? I'd likely try to plan
>> >     for 10 cores per node, and monitor fill-rate such that I (manually) add
>> >     more HW once a threshold is reached.
>> > G) Has anyone tried backup of a TRA? Does it even work, or do you need to
>> > run the command for each single collection?
>> > H) A typical requirement is to migrate all data from one cluster to a new
>> > cluster on a newer version or with a new schema. Have you tried doing that
>> > with a TRA?
>> >   * Would you need to migrate one sub collection at a time?
>> >   * Will TRA on the new cluster accept that someone "external" adds
>> >     collections, and how is it initialized/bootstrapped to fill the internal
>> >     collection registry?
>> >
>> > That's what I could think of before trying the feature. I'm sure there
>> > would be other questions after some trial and error :)
>> >
>> > Jan
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
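
For reference, the manual cleanup described at the top of this message (re-issue the alias creation call with a later router.start, then delete the collections that now fall before it) could be sketched roughly as below. This only builds the Collections API URLs in the order they would be issued; it does not call Solr. The base URL, alias name, field name, interval, configset name and collection names are all made-up placeholders, and the parameter set is an assumption about what the original CREATEALIAS call looked like, so substitute whatever you actually used.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust for your cluster.
SOLR_URL = "http://localhost:8983/solr"


def create_alias_url(alias, router_start, router_field, router_interval,
                     config_name, num_shards=1, base=SOLR_URL):
    """Build a CREATEALIAS call for a time routed alias.

    Intended to mirror the call that created the alias in the first
    place, with only router.start changed to a more recent date.
    """
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",
        "router.field": router_field,
        "router.start": router_start,
        "router.interval": router_interval,
        "create-collection.collection.configName": config_name,
        "create-collection.numShards": num_shards,
    }
    return f"{base}/admin/collections?{urlencode(params)}"


def delete_collection_url(collection, base=SOLR_URL):
    """Build a standard collection DELETE call."""
    params = {"action": "DELETE", "name": collection}
    return f"{base}/admin/collections?{urlencode(params)}"


def cleanup_plan(alias, new_start, old_collections, **alias_settings):
    """Return the URLs in the order to issue them.

    The alias update comes first, because a collection cannot be
    deleted while it is still part of the alias.
    """
    plan = [create_alias_url(alias, new_start, **alias_settings)]
    plan += [delete_collection_url(name) for name in old_collections]
    return plan


if __name__ == "__main__":
    # All values below are placeholders, including the collection names;
    # use the names your cluster actually lists for the alias.
    plan = cleanup_plan(
        "events",
        "2021-01-01T00:00:00Z",
        ["events__TRA__2019-01-01", "events__TRA__2020-01-01"],
        router_field="timestamp_dt",
        router_interval="+1YEAR",
        config_name="events_conf",
    )
    for url in plan:
        print(url)
```

The list of old collections is passed in by hand rather than derived from the alias, since working out which collections fall before the new start date depends on how your collections are named and is safer to check by eye before deleting anything.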