Re: Time Routed Alias

David Smiley Wed, 11 Aug 2021 19:10:13 -0700

I hope you have success with TRAs!

You can delete some number of collections from the rear of the chain, but
you must first update the TRA to exclude these collections.  This is
tested:
https://github.com/apache/solr/blob/f6c4f8a755603c3049e48eaf9511041252f2dbad/solr/core/src/test/org/apache/solr/update/processor/TimeRoutedAliasUpdateProcessorTest.java#L184
It'd be nice if it would remove itself from the alias.
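
For reference, the update-then-delete sequence Matt describes below could be
sketched roughly like this in Python. It only builds the Collections API URLs
and does not send them. The host, alias name, routed field, interval, and
configset are made-up placeholders, and the "__TRA__" collection naming used
in the usage note is an assumption, so check the real collection names on
your cluster first.

```python
from urllib.parse import urlencode

# Hypothetical Solr base URL; adjust for your cluster.
SOLR = "http://localhost:8983/solr"


def create_alias_url(alias, router_start, field="timestamp_dt",
                     interval="+1MONTH", config="_default", num_shards=1):
    """Build a CREATEALIAS call for a time routed alias.

    Re-issuing the call that created the TRA with a later router.start is
    the approach reported in this thread; all default values here are
    placeholders, not recommendations.
    """
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",
        "router.field": field,
        "router.start": router_start,
        "router.interval": interval,
        "create-collection.collection.configName": config,
        "create-collection.numShards": num_shards,
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def delete_collection_url(collection):
    """Build a standard collection DELETE call."""
    params = {"action": "DELETE", "name": collection}
    return f"{SOLR}/admin/collections?{urlencode(params)}"
```

Usage would be along the lines of create_alias_url("events",
"2021-06-01T00:00:00Z") followed by
delete_collection_url("events__TRA__2021-05-01"). Send the alias update
first, confirm the alias no longer lists the old collections, and only then
send the delete.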


~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Aug 10, 2021 at 9:26 PM Matt Kuiper <[email protected]> wrote:

> I found some helpful information while testing TRAs:
>
> For our use-case I am hesitant to set up an autoDeleteAge (unless it can be
> modified - still need to test).  So I wondered about a little more manual
> delete management approach.
>
> I confirmed that I cannot simply delete a collection that is registered as
> part of a TRA.  The delete collection api call will fail with a message
> that the collection is a part of the alias.
>
> I did learn that I could use the same create TRA api call I used to create
> the TRA, but modify the router.start to a date more recent than one or more
> of the older collections associated with the TRA. Then when I queried the
> TRA, I only received documents from the collections after the new
> router.start date. Also, I was now able to successfully delete the older
> collections with a standard collection delete command.
>
> I think this satisfies my initial use-case requirements to be able to
> modify an existing TRA and delete older collections.
>
> Matt
>
> On Mon, Aug 9, 2021 at 11:27 AM Matt Kuiper <[email protected]> wrote:
>
> > Hi Gus, Jan,
> >
> > I am considering implementing TRA for a large-scale Solr deployment.
> > Your Q&A is helpful!
> >
> > I am curious if you have experience/ideas regarding modifying the TR
> > Alias when one desires to manually delete old collections or modify
> > the router.autoDeleteAge to shorten or extend the delete age.  Here are
> > a few specific questions:
> >
> > 1) Can you manually delete an old collection (via collection api) and
> > then edit the start date (to a more recent date) of the TRA so that it
> > no longer sees/processes the deleted collection?
> > 2) Is the only way to manage the deletion of collections within a TRA
> > using the automatic deletion configuration (the router.autoDeleteAge
> > parameter)?
> > 3) If you can only manage deletes using the router.autoDeleteAge
> > parameter, are you able to update this parameter to either:
> >
> >    - Set the delete age earlier so that older collections are triggered
> >    for automatic deletion sooner?
> >    - Set the delete age to a larger value to extend the life of a
> >    collection?  Say you originally would like the collections to stay
> >    around for 5 years, but then change your mind to 7 years.
> >
> > I will likely do some experimentation, but am interested to learn if you
> > have covered these use-cases with TRA.
> >
> > Thanks,
> > Matt
> >
> >
> > On Fri, Aug 6, 2021 at 8:08 AM Gus Heck <[email protected]> wrote:
> >
> >> Hi Jan,
> >>
> >> The key thing to remember about a TRA (or any Routed Alias) is that it
> >> only actively does two things:
> >> 1) Routes document updates to the correct collection by inspecting the
> >> routed field in the document
> >> 2) Detects when a new collection is required and creates it.
> >>
> >> If you don't send it data *nothing* happens. The collections are not
> >> created until data requires them (with an async create possible when it
> >> sees an update that has a timestamp "near" the next interval; see the
> >> docs for router.preemptiveCreateMath).
> >>
> >> A) Dave's half of our talk at Activate 2018 covers it:
> >> https://youtu.be/RB1-7Y5NQeI?t=839
> >> B) Time Routed Aliases are a means by which to automate creation of
> >> collections and route documents to the created collections. Sizing and
> >> performance of the individual collections are not otherwise special, and
> >> you can interact with the collections individually after they are
> >> created, with the obvious caveat that you probably don't want to do
> >> things that get them out of sync schema-wise unless your client programs
> >> know how to handle documents of both types. A less obvious consequence
> >> of the routing is that your data must not ever republish the same
> >> document with a different route key (date for TRA), since that can lead
> >> to duplicate ids across collections. The "normal" use case is event
> >> data: things that happened and are done, and are correctly recorded (or
> >> at least their time is correctly recorded) the first time.
> >> C) Configure the higher number of replicas, and remove old ones manually
> >> if not needed. At query time it's "just an alias". Managing collections
> >> based on recency could be automated here. Before autoscaling was
> >> deprecated, I was thinking that adding a couple of hooks into
> >> autoscaling, such that it could react to collection creation by a TRA
> >> specifically, would get us to a place much like Elastic's Hot/Warm
> >> architecture. I haven't kept track of what's being done to replace
> >> autoscaling, however. I think Atri was interested in that at one point
> >> as well.
> >> D) TRAs create collections under the hood with a CREATE command, just
> >> like you would manually (based on the config in the TRA). Anything in
> >> Solr that would influence that placement should apply.
> >> E) See D above. For fill rate, utilizing new nodes over time should be
> >> as simple as adding new nodes and waiting for new collections to be
> >> created. One could also manually move replicas as with any other
> >> collection (aside: be sure to refer to a current version of the
> >> MOVEREPLICA docs; prior to something like 8.6 they were incomplete and
> >> even wrong in a few places).
> >> F) If you are talking about router.autoDeleteAge here, old collection
> >> removal is a regular DELETE (just automatically issued). Not sure what
> >> you mean by rotation interval.
> >> G) They are just collections with special names that can be parsed
> >> during update to select a destination for the incoming document.
> >> H) They are just collections, and there's nothing to prevent you from
> >> upgrading the schema; new collections will begin using that. Individual
> >> collections would need to be reloaded, and non-safe schema changes (in
> >> the usual sense) require a re-index as usual. In a cloud environment
> >> where you can temporarily add machines or disk this is not so bad, aside
> >> from the time to re-index of course. If you are on-prem then plan to
> >> have a significant level of spare disk to handle this case without
> >> running yourself into the danger zone for segment merging.
> >> H.2) TRA is just an alias with fancy collection creation (and naming).
> >> Once the collections exist, it's just an alias. All the action (at this
> >> point) happens at update. So long as the collection is listed in the TRA
> >> in zookeeper in aliases.json ***in the correct (chronological, desc)
> >> order*** and the naming of the collection can be parsed by the TRA code,
> >> you should be fine. Incoming updates iterate down the list of
> >> collections during an update, and stop at the first one where the
> >> collection name matches the date in the routing field for the document.
> >> For a normal TRA the vast majority of updates hit one of the most recent
> >> two or three collections. Frequent updates to old data in a TRA with
> >> very many time slices (sub collections) might suffer some since this is
> >> a simple linear iteration; optimizing that was deferred until it seemed
> >> important to someone's less normal use case :).
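
To illustrate the routing walk described above, here is a minimal Python
sketch. It is not Solr's actual code: the "__TRA__" separator, the
day-granularity name format, and the reading of "matches" as "the first
collection whose start time is not after the document's time" are all
assumptions made for illustration.

```python
from datetime import datetime, timezone

SEP = "__TRA__"  # assumed separator between alias name and timestamp


def collection_start(name):
    """Parse the start time out of a collection name (day granularity assumed)."""
    stamp = name.split(SEP, 1)[1]
    return datetime.strptime(stamp, "%Y-%m-%d").replace(tzinfo=timezone.utc)


def route(doc_time, collections_desc):
    """Walk the alias's collection list, newest first, and return the first
    collection whose start time is at or before doc_time.

    Returns None when the document is older than every collection.
    This is a linear scan, so routing to very old slices costs more steps.
    """
    for name in collections_desc:
        if collection_start(name) <= doc_time:
            return name
    return None


# The list must already be in descending chronological order.
collections = [
    "events__TRA__2021-08-01",
    "events__TRA__2021-07-01",
    "events__TRA__2021-06-01",
]
target = route(datetime(2021, 7, 15, tzinfo=timezone.utc), collections)
```

With the list in descending order, recent documents are matched in the first
step or two, which is why the ordering in aliases.json matters.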
> >>
> >>
> >>
> >> Otherwise it's just an alias of collections with funky looking names
> >> (unless someone added something when I wasn't looking ;) ).
> >>
> >> -Gus
> >>
> >> On Fri, Aug 6, 2021 at 4:13 AM Jan Høydahl <[email protected]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have never used TRA, but a client of mine is considering it. A few
> >> > questions.
> >> >
> >> > A) Do you have links to talks (slides/video) on the feature? Or blog
> >> > posts going into more detail than the RefGuide?
> >> > B) For ingestion performance, sharding may make sense. But only for
> >> > the current collection. Has anyone tried merging "static" shards?
> >> > C) Is there a trick to have more replicas on recent collections than
> >> > old ones?
> >> > D) Is there a way to manage which nodes get selected for new
> >> > collections, or do you need to rely on replica placement policies?
> >> > E) How do you guys ensure you get a good fill-rate on the nodes, and
> >> > what procedure do you use when adding more nodes in the cluster?
> >> >     * I.e. do you simply add a few new nodes and let Solr
> >> > automatically place new collections onto those?
> >> > F) How many sub-collections/cores do you plan for on a single node?
> >> >     * You could try to configure the "rotation interval" such that a
> >> > node gets filled by a single core, but that seems hard to predict
> >> >     * Having a too rapid "rotation interval" will leave behind too
> >> > many cores per node, causing inefficiencies?
> >> >     * Have you found a strategy to balance this? I'd likely try to
> >> > plan for 10 cores per node, and monitor fill-rate such that I
> >> > (manually) add more HW once a threshold is reached.
> >> > G) Has anyone tried backup of a TRA? Does it even work, or do you
> >> > need to run the command for each single collection?
> >> > H) A typical requirement is to migrate all data from one cluster to a
> >> > new cluster on a newer version or with a new schema. Have you tried
> >> > doing that with a TRA?
> >> >     * Would you need to migrate each sub collection at a time?
> >> >     * Will TRA on the new cluster accept that someone "external" adds
> >> > collections, and how is it initialized/bootstrapped to fill the
> >> > internal collection registry?
> >> >
> >> > That's what I could think of before trying the feature. I'm sure
> >> > there would be other questions after some trial and error :)
> >> >
> >> > Jan
> >>
> >>
> >>
> >> --
> >> http://www.needhamsoftware.com (work)
> >> http://www.the111shift.com (play)
> >>
> >
>
