Hi Gus, Jan,

I am considering implementing TRA for a large-scale Solr deployment.  Your
Q&A is helpful!

I am curious if you have experience/ideas regarding modifying the TR Alias
when one desires to manually delete old collections or modify the
router.autoDeleteAge to shorten or extend the delete age.  Here are a few
specific questions:

1) Can you manually delete an old collection (via collection api) and then
edit the start date (to a more recent date) of the TRA so that it no longer
sees/processes the deleted collection?
2) Is the automatic deletion configuration (the router.autoDeleteAge
parameter) the only way to manage the deletion of collections within a TRA?
3) If you can only manage deletes using the router.autoDeleteAge parameter,
are you able to update this parameter to either:

   - Set the delete age to a smaller value so that older collections are
   triggered for automatic deletion sooner?
   - Set the delete age to a larger value to extend the life of a
   collection?  Say you originally wanted the collections to stay around
   for 5 years, but then changed your mind to 7 years.
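
To make questions 1 and 3 concrete, here is a minimal Python sketch of the
two Collections API requests involved: a DELETE for an old collection and an
ALIASPROP call that sets router.autoDeleteAge. It only builds the URLs and
sends nothing. The host, the alias name "events", the collection name and the
date math value are all hypothetical, and whether a TRA honors a changed
router.autoDeleteAge is exactly the open question here, not something this
sketch demonstrates.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration only; none come from this thread.
SOLR_BASE = "http://localhost:8983/solr"
ALIAS = "events"


def collections_api_url(base, action, params):
    """Build a Collections API URL (nothing is sent to Solr)."""
    query = {"action": action}
    query.update(params)
    return f"{base}/admin/collections?{urlencode(query)}"


def delete_collection_url(base, collection):
    # Question 1: manually delete one old collection.
    return collections_api_url(base, "DELETE", {"name": collection})


def set_auto_delete_age_url(base, alias, date_math):
    # Question 3: ALIASPROP sets alias properties as property.<key>=<value>.
    # Assumption: router.autoDeleteAge can be changed this way after creation.
    return collections_api_url(
        base,
        "ALIASPROP",
        {"name": alias, "property.router.autoDeleteAge": date_math},
    )


if __name__ == "__main__":
    print(delete_collection_url(SOLR_BASE, "events__TRA__2014-01-01"))
    # Extend retention from 5 to 7 years (date math value is an assumption).
    print(set_auto_delete_age_url(SOLR_BASE, ALIAS, "/DAY-7YEARS"))
```

The printed URLs could then be issued with curl or any HTTP client against a
test cluster before trying this on production data.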

I will likely do some experimentation, but am interested to learn if you
have covered these use-cases with TRA.

Thanks,
Matt


On Fri, Aug 6, 2021 at 8:08 AM Gus Heck <gus.h...@gmail.com> wrote:

> Hi Jan,
>
> The key thing to remember about TRA's (or any Routed Alias) is that it only
> actively does two things:
> 1) Routes document updates to the correct collection by inspecting the
> routed field in the document
> 2) Detects when a new collection is required and creates it.
>
> If you don't send it data *nothing* happens. The collections are not
> created until data requires them (an async create is possible when it
> sees an update with a timestamp "near" the next interval; see the docs for
> router.preemptiveCreateMath).
>
> A) Dave's half of our talk at Activate 2018 covers it:
> https://youtu.be/RB1-7Y5NQeI?t=839
> B) Time Routed Aliases are a means by which to automate creation of
> collections and route documents to the created collections. Sizing, and
> performance of the individual collections is not otherwise special, and you
> can interact with the collections individually after they are created, with
> the obvious caveats that you probably don't want to be doing things that
> get them out of sync schema wise unless your client programs know how to
> handle documents of both types etc. A less obvious consequence of the
> routing is that your data must not ever republish the same document with a
> different route key (date for TRA), since that can lead to duplicate id's
> across collections. The "normal" use case is event data, things that
> happened and are done, and are correctly recorded (or at least their time
> is correctly recorded) the first time
> C) Configure the higher number of replicas, and remove old ones manually if
> not needed. At query time it's "just an alias". Managing collections based
> on recency could be automated here. Before autoscaling was deprecated I was
> thinking that adding a couple of hooks into autoscaling, such that it could
> react to collection creation by a TRA specifically, would get us to a place
> much like Elastic's Hot/Warm architecture. I haven't kept track of what's
> being done to replace autoscaling, however. I think Atri was interested in
> that at one point as well.
> D) TRA's create collections under the hood with a CREATE command just like
> you would manually (based on the config in the TRA). Anything in Solr that
> would influence that placement should apply.
> E) See D above for fill rate. Utilizing new nodes over time should be as
> simple as adding new nodes and waiting for new collections to be created.
> One could also manually move replicas as with any other collection (aside:
> be sure to refer to a current version of the MOVEREPLICA docs; prior to
> something like 8.6 they were incomplete and even wrong in a few places).
> F) If you are talking about router.autoDeleteAge here, old collection
> removal is a regular DELETE (just automatically issued). I'm not sure what
> you mean by rotation interval.
> G) They are just collections with special names that can be parsed during
> update to select a destination for the incoming document.
> H) They are just collections, and there's nothing to prevent you from
> upgrading the schema. New collections will begin using it, existing
> individual collections would need to be reloaded, and non-safe schema
> changes (in the usual sense) require a re-index as usual. In a cloud
> environment where you can temporarily add machines or disk this is not so
> bad, aside from the time to re-index of course. If you are on-prem then plan
> to have a significant level of spare disk to handle this case without
> running yourself into the danger zone for segment merging.
> H.2) TRA is just an alias with fancy collection creation (and naming). Once
> the collections exist, it's just an alias. All the action (at this point)
> happens at update. So long as the collection is listed in the TRA in
> zookeeper in aliases.json ***in the correct (chronological, desc) order***
> and the naming of the collection can be parsed by the TRA code, you should
> be fine. Incoming updates iterate down the list of collections during an
> update, and stop at the first one where the collection name matches the
> date in the routing field for the document. For a normal TRA the vast
> majority of updates hit one of the most recent two or three collections.
> Frequent updates to old data in a TRA with very many time slices (sub
> collections) might suffer some since this is a simple linear iteration;
> optimizing that was deferred until it seemed important to someone's less
> normal use case :).
>
>
>
> Otherwise it's just an alias of collections with funky looking names
> (unless someone added something when I wasn't looking ;) ).
>
> -Gus
>
> On Fri, Aug 6, 2021 at 4:13 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> > Hi,
> >
> > I have never used TRA, but a client of mine is considering it. A few
> > questions.
> >
> > A) Do you have links to talks (slides/video) on the feature? Or blog
> > posts going into more detail than the RefGuide?
> > B) For ingestion performance, sharding may make sense. But only for the
> > current collection. Has anyone tried merging "static" shards?
> > C) Is there a trick to have more replicas on recent collections than old
> > ones?
> > D) Is there a way to manage which nodes get selected for new
> > collections, or do you need to rely on replica placement policies?
> > E) How do you guys ensure you get a good fill-rate on the nodes, and what
> > procedure do you use when adding more nodes in the cluster?
> >     * I.e. do you simply add a few new nodes and let Solr automatically
> > place new collections onto those?
> > F) How many sub-collections/cores do you plan for on a single node?
> >     * You could try to configure the "rotation interval" such that a node
> > gets filled by a single core, but that seems hard to predict
> >     * Having a too rapid "rotation interval" will leave behind too many
> > cores per node, causing inefficiencies?
> >     * Have you found a strategy to balance this? I'd likely try to plan
> > for 10 cores per node, and monitor fill-rate such that I (manually) add
> > more HW once a threshold is reached.
> > G) Has anyone tried backing up a TRA? Does it even work, or do you need
> > to run the command for each single collection?
> > H) A typical requirement is to migrate all data from one cluster to a new
> > cluster on a newer version or with a new schema. Have you tried doing
> > that with a TRA?
> >     * Would you need to migrate one sub collection at a time?
> >     * Will TRA on the new cluster accept that someone "external" adds
> > collections, and how is it initialized/bootstrapped to fill the internal
> > collection registry?
> >
> > That's what I could think of before trying the feature. I'm sure there
> > would be other questions after some trial and error :)
> >
> > Jan
>
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>

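The routing behaviour described in H.2 of the quoted reply (walk the alias's
collection list, newest first, and stop at the first collection whose name
covers the document's date) can be sketched roughly as below. This is an
illustration of the idea only, not Solr's actual code. It assumes collection
names of the form <alias>__TRA__<timestamp>, uses a made-up alias "events",
and ignores the creation of new collections for documents newer than the
latest one.

```python
from datetime import datetime, timezone

TRA_SEPARATOR = "__TRA__"  # assumed naming convention: <alias>__TRA__<timestamp>
STAMP_FORMATS = ("%Y-%m-%d_%H_%M_%S", "%Y-%m-%d_%H_%M", "%Y-%m-%d_%H", "%Y-%m-%d")


def collection_start(name):
    """Parse the start instant encoded in a collection name (UTC assumed)."""
    stamp = name.split(TRA_SEPARATOR, 1)[1]
    for fmt in STAMP_FORMATS:
        try:
            return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"cannot parse timestamp in collection name: {name}")


def route(collections_desc, doc_time):
    """Linear scan over collections listed newest first.

    Returns the first collection whose start is at or before doc_time,
    or None when the document is older than every collection.
    """
    for name in collections_desc:
        if collection_start(name) <= doc_time:
            return name
    return None


if __name__ == "__main__":
    alias_members = [  # hypothetical alias contents, chronological desc
        "events__TRA__2021-08-01",
        "events__TRA__2021-07-01",
        "events__TRA__2021-06-01",
    ]
    doc_time = datetime(2021, 7, 15, tzinfo=timezone.utc)
    print(route(alias_members, doc_time))
```

Because the scan starts at the newest collection, recent documents match
after one or two comparisons, while updates to very old data walk most of
the list, which is the cost mentioned in the quoted reply.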