Hi Gus and Jan,

I am considering implementing TRA for a large-scale Solr deployment. Your
Q&A is helpful!

I am curious whether you have experience or ideas regarding modifying a TRA
when one wants to manually delete old collections, or to change
router.autoDeleteAge to shorten or extend the delete age. Here are a few
specific questions:

1) Can you manually delete an old collection (via the Collections API) and
   then edit the start date of the TRA (to a more recent date) so that it no
   longer sees/processes the deleted collection?
2) Is the automatic deletion configuration (the router.autoDeleteAge
   parameter) the only way to manage deletion of collections within a TRA?
3) If you can only manage deletes using the router.autoDeleteAge parameter,
   are you able to update this parameter to either:
   - Set the delete age earlier, so that older collections are triggered for
     automatic deletion sooner?
   - Set the delete age to a larger value, to extend the life of a
     collection? Say you originally wanted the collections to stay around
     for 5 years, but then change your mind to 7 years.

I will likely do some experimentation, but I am interested to learn whether
you have covered these use cases with TRA.

Thanks,
Matt

On Fri, Aug 6, 2021 at 8:08 AM Gus Heck <gus.h...@gmail.com> wrote:

> Hi Jan,
>
> The key thing to remember about a TRA (or any routed alias) is that it
> only actively does two things:
> 1) Routes document updates to the correct collection by inspecting the
>    routed field in the document.
> 2) Detects when a new collection is required and creates it.
>
> If you don't send it data *nothing* happens. The collections are not
> created until data requires them (with an async create possible when it
> sees an update whose timestamp is "near" the next interval; see the docs
> for router.preemptiveCreateMath).
>
> A) Dave's half of our talk at Activate 2018 talks about it:
> https://youtu.be/RB1-7Y5NQeI?t=839
> B) Time Routed Aliases are a means by which to automate the creation of
> collections and route documents to the created collections. Sizing and
> performance of the individual collections are not otherwise special, and
> you can interact with the collections individually after they are created,
> with the obvious caveat that you probably don't want to be doing things
> that get them out of sync schema-wise unless your client programs know how
> to handle documents of both types, etc. A less obvious consequence of the
> routing is that your data must not ever republish the same document with a
> different route key (the date, for a TRA), since that can lead to
> duplicate IDs across collections. The "normal" use case is event data:
> things that happened, are done, and are correctly recorded (or at least
> their time is correctly recorded) the first time.
> C) Configure the higher number of replicas, and remove old ones manually
> if not needed. At query time it's "just an alias". Managing collections
> based on recency could be automated here; before autoscaling was
> deprecated I was thinking that adding a couple of hooks into autoscaling,
> so that it could react specifically to collection creation by a TRA, would
> get us to a place much like Elastic's Hot/Warm architecture. I haven't
> kept track of what's being done to replace autoscaling, however. I think
> Atri was interested in that at one point as well.
> D) TRAs create collections under the hood with a CREATE command just like
> you would manually (based on the config in the TRA). Anything in Solr that
> would influence that placement should apply.
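> To make D concrete, here is a rough, untested sketch of the kind of
> CREATEALIAS call that defines a TRA; the create-collection.* parameters
> are what feed the CREATE that gets issued under the hood. The alias name,
> field name, config name and sizing below are just placeholders:
>
>     import requests
>
>     # Placeholder Solr node; any node in the cluster will do.
>     COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"
>
>     # Define a daily time routed alias. Every collection the TRA creates
>     # uses the create-collection.* settings, exactly as if you had issued
>     # the CREATE yourself.
>     resp = requests.get(COLLECTIONS_API, params={
>         "action": "CREATEALIAS",
>         "name": "timedata",                    # placeholder alias name
>         "router.name": "time",
>         "router.field": "evt_timestamp_dt",    # placeholder date field
>         "router.start": "NOW/DAY",
>         "router.interval": "+1DAY",
>         "router.autoDeleteAge": "/DAY-90DAYS",
>         "create-collection.collection.configName": "timedata_conf",
>         "create-collection.numShards": "2",
>         "create-collection.replicationFactor": "2",
>     })
>     resp.raise_for_status()
>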
> E) See D above. For fill rate, utilizing new nodes over time should be as
> simple as adding new nodes and waiting for new collections to be created.
> One could also manually move replicas as with any other collection (aside:
> be sure to refer to a current version of the MOVEREPLICA docs; prior to
> something like 8.6 they were incomplete and even wrong in a few places).
> F) If you are talking about router.autoDeleteAge here, old collection
> removal is a regular DELETE (just automatically issued). I'm not sure what
> you mean by rotation interval.
> G) They are just collections with special names that can be parsed during
> update to select a destination for the incoming document.
> H) They are just collections, and there's nothing to prevent you from
> upgrading the schema; new collections will begin using it, existing
> collections would need to be reloaded, and non-safe schema changes (in the
> usual sense) require a re-index as usual. In a cloud environment where you
> can temporarily add machines or disk this is not so bad, aside from the
> time to re-index of course. If you are on-prem, plan to have a significant
> level of spare disk to handle this case without running yourself into the
> danger zone for segment merging.
> H.2) A TRA is just an alias with fancy collection creation (and naming).
> Once the collections exist, it's just an alias. All the action (at this
> point) happens at update time. So long as the collection is listed in the
> TRA in ZooKeeper in aliases.json ***in the correct (chronological,
> descending) order*** and the name of the collection can be parsed by the
> TRA code, you should be fine. Incoming updates iterate down the list of
> collections and stop at the first one whose name matches the date in the
> routing field of the document; for a normal TRA the vast majority of
> updates hit one of the most recent two or three collections. Frequent
> updates to old data in a TRA with very many time slices (sub-collections)
> might suffer some, since this is a simple linear iteration; optimizing
> that was deferred until it seemed important to someone's less normal use
> case :).
>
> Otherwise it's just an alias of collections with funky looking names
> (unless someone added something when I wasn't looking ;) ).
>
> -Gus
>
> On Fri, Aug 6, 2021 at 4:13 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> > Hi,
> >
> > I have never used TRA, but a client of mine is considering it. A few
> > questions.
> >
> > A) Do you have links to talks (slides/video) on the feature? Or blog
> > posts going into more detail than the RefGuide?
> > B) For ingestion performance, sharding may make sense. But only for the
> > current collection. Has anyone tried merging "static" shards?
> > C) Is there a trick to have more replicas on recent collections than
> > old ones?
> > D) Is there a way to manage which nodes get selected for new
> > collections, or do you need to rely on replica placement policies?
> > E) How do you guys ensure you get a good fill-rate on the nodes, and
> > what procedure do you use when adding more nodes to the cluster?
> >    * I.e. do you simply add a few new nodes and let Solr automatically
> >      place new collections onto those?
> > F) How many sub-collections/cores do you plan for on a single node?
> >    * You could try to configure the "rotation interval" such that a
> >      node gets filled by a single core, but that seems hard to predict.
> >    * Having too rapid a "rotation interval" will leave behind too many
> >      cores per node, causing inefficiencies?
> >    * Have you found a strategy to balance this? I'd likely try to plan
> >      for 10 cores per node, and monitor fill-rate such that I (manually)
> >      add more HW once a threshold is reached.
> > G) Has anyone tried backup of a TRA? Does it even work, or do you need
> > to run the command for each single collection?
> > H) A typical requirement is to migrate all data from one cluster to a
> > new cluster on a newer version or with a new schema. Have you tried
> > doing that with a TRA?
> >    * Would you need to migrate one sub-collection at a time?
> >    * Will a TRA on the new cluster accept that someone "external" adds
> >      collections, and how is it initialized/bootstrapped to fill the
> >      internal collection registry?
> >
> > That's what I could think of before trying the feature. I'm sure there
> > would be other questions after some trial and error :)
> >
> > Jan
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
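P.S. For question 3, the kind of change I plan to experiment with is roughly
this (untested; it assumes the router.* settings are stored as ordinary alias
properties in aliases.json and that ALIASPROP is allowed to change them on a
routed alias; please correct me if that's wrong):

    import requests

    # Placeholder Solr node and alias name.
    COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

    # Attempt to extend the retention of an existing TRA, e.g. from
    # roughly 5 years to 7 years.
    resp = requests.get(COLLECTIONS_API, params={
        "action": "ALIASPROP",
        "name": "timedata",
        "property.router.autoDeleteAge": "/DAY-7YEARS",
    })
    resp.raise_for_status()

Whether the automatic deletion logic then respects the new value for
collections that already exist is exactly what I want to verify.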