Re: Time Routed Alias

Matt Kuiper Tue, 10 Aug 2021 18:26:11 -0700

I found some helpful information while testing TRAs:

For our use-case I am hesitant to set up an autoDeleteAge (unless it can be
modified - still need to test).  So I wondered about a more manual approach
to managing deletes.


I confirmed that I cannot simply delete a collection that is registered as
part of a TRA.  The delete collection API call fails with a message that
the collection is part of the alias.

I did learn that I could re-issue the same create TRA API call I used to
create the TRA, but with router.start changed to a date more recent than one
or more of the older collections associated with the TRA. Then when I queried
the TRA, I only received documents from the collections after the new
router.start date. Also, I was now able to successfully delete the older
collections with a standard collection delete command.
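
The steps above could be scripted roughly as follows. This is a minimal
sketch, not the exact commands used in the test: it only builds the
Collections API URLs for re-issuing CREATEALIAS with a later router.start and
for deleting the collections that fall before it. The base URL, the alias name
"events", the field "timestamp_dt", the "_default" configset, the single shard
and all helper names are hypothetical. It also assumes day-granularity
collection names of the form <alias>__TRA__yyyy-MM-dd, which may differ on
your Solr version.

```python
from datetime import datetime
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed base URL


def create_tra_url(alias, field, start, interval, config_name, base=SOLR):
    """Build a CREATEALIAS request that (re)defines a time routed alias."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",
        "router.field": field,
        "router.start": start,        # move this forward to drop old slices
        "router.interval": interval,  # date math, e.g. "+1MONTH"
        "create-collection.collection.configName": config_name,
        "create-collection.numShards": 1,  # assumed value
    }
    return f"{base}/admin/collections?{urlencode(params)}"


def collections_before(collections, alias, new_start):
    """Return TRA collections whose name timestamp is before new_start."""
    prefix = alias + "__TRA__"  # assumed naming scheme
    cutoff = datetime.strptime(new_start, "%Y-%m-%dT%H:%M:%SZ")
    old = []
    for name in collections:
        if not name.startswith(prefix):
            continue  # not a slice of this alias
        slice_start = datetime.strptime(name[len(prefix):], "%Y-%m-%d")
        if slice_start < cutoff:
            old.append(name)
    return old


def delete_collection_url(collection, base=SOLR):
    """Build a standard collection DELETE request."""
    params = {"action": "DELETE", "name": collection}
    return f"{base}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    existing = [f"events__TRA__2021-0{m}-01" for m in range(1, 5)]
    new_start = "2021-03-01T00:00:00Z"
    print(create_tra_url("events", "timestamp_dt", new_start,
                         "+1MONTH", "_default"))
    for name in collections_before(existing, "events", new_start):
        print(delete_collection_url(name))
```

The printed URLs could then be sent with curl or any HTTP client, the
CREATEALIAS first and the DELETEs after it, since deleting a collection that
is still registered in the alias fails.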

I think this satisfies my initial use-case requirements to be able to
modify an existing TRA and delete older collections.

Matt

On Mon, Aug 9, 2021 at 11:27 AM Matt Kuiper <kuipe...@gmail.com> wrote:

> Hi Gus, Jan,
>
> I am considering implementing TRA for a large-scale Solr deployment.  Your
> Q&A is helpful!
>
> I am curious if you have experience/ideas regarding modifying the TR Alias
> when one desires to manually delete old collections or modify the
> router.autoDeleteAge to shorten or extend the delete age.  Here are a few
> specific questions:
>
> 1) Can you manually delete an old collection (via collection api) and then
> edit the start date (to a more recent date) of the TRA so that it no longer
> sees/processes the deleted collection?
> 2) Is the only way to manage the deletion of collections within a TRA
> the automatic deletion configuration, i.e. the router.autoDeleteAge
> parameter?
> 3) If you can only manage deletes using the router.autoDeleteAge
> parameter, are you able to update this parameter to either:
>
>    - Set the delete age earlier so that older collections are triggered
>    for automatic deletion sooner?
>    - Set the delete age to a larger value to extend the life of a
>    collection?  Say you originally would like the collections to stay around
>    for 5 years, but then change your mind to 7 years.
>
> I will likely do some experimentation, but am interested to learn if you
> have covered these use-cases with TRA.
>
> Thanks,
> Matt
>
>
> On Fri, Aug 6, 2021 at 8:08 AM Gus Heck <gus.h...@gmail.com> wrote:
>
>> Hi Jan,
>>
>> The key thing to remember about TRAs (or any Routed Alias) is that it only
>> actively does two things:
>> 1) Routes document updates to the correct collection by inspecting the
>> routed field in the document
>> 2) Detects when a new collection is required and creates it.
>>
>> If you don't send it data *nothing* happens. The collections are not
>> created until data requires them (with an async create possible when it
>> sees an update that has a timestamp "near" the next interval, see docs for
>> router.preemptiveCreateMath).
>>
>> A) Dave's half of our talk at Activate 2018 covers it:
>> https://youtu.be/RB1-7Y5NQeI?t=839
>> B) Time Routed Aliases are a means by which to automate creation of
>> collections and route documents to the created collections. Sizing and
>> performance of the individual collections are not otherwise special, and you
>> can interact with the collections individually after they are created, with
>> the obvious caveat that you probably don't want to do things that get them
>> out of sync schema-wise unless your client programs know how to handle
>> documents of both types. A less obvious consequence of the routing is that
>> your data must never republish the same document with a different route key
>> (date for TRA), since that can lead to duplicate IDs across collections. The
>> "normal" use case is event data: things that happened and are done, and are
>> correctly recorded (or at least their time is correctly recorded) the first
>> time.
>> C) Configure the higher number of replicas and remove old ones manually if
>> they are not needed. At query time it's "just an alias". Managing
>> collections based on recency could be automated here. Before autoscaling
>> was deprecated, I was thinking that adding a couple of hooks into
>> autoscaling, such that it could react to collection creation by a TRA
>> specifically, would get us to a place much like Elastic's Hot/Warm
>> architecture. I haven't kept track of what's being done to replace
>> autoscaling, however. I think Atri was interested in that at one point as
>> well.
>> D) TRAs create collections under the hood with a CREATE command just like
>> you would manually (based on the config in the TRA). Anything in Solr that
>> would influence that placement should apply.
>> E) See D above. For fill rate, utilizing new nodes over time should be as
>> simple as adding new nodes and waiting for new collections to be created.
>> One could also manually move replicas as with any other collection (aside:
>> be sure to refer to a current version of the MOVEREPLICA docs; prior to
>> something like 8.6 they were incomplete and even wrong in a few places).
>> F) If you are talking about router.autoDeleteAge here, old collection
>> removal is a regular DELETE (just automatically issued). I'm not sure what
>> you mean by rotation interval.
>> G) They are just collections with special names that can be parsed during
>> update to select a destination for the incoming document.
>> H) They are just collections, and there's nothing to prevent you from
>> upgrading the schema. New collections will begin using it, existing
>> individual collections would need to be reloaded, and non-safe schema
>> changes (in the usual sense) require a re-index as usual. In a cloud
>> environment where you can temporarily add machines or disk this is not so
>> bad, aside from the time to re-index of course. If you are on-prem then
>> plan to have a significant level of spare disk to handle this case without
>> running yourself into the danger zone for segment merging.
>> H.2) TRA is just an alias with fancy collection creation (and naming). Once
>> the collections exist, it's just an alias. All the action (at this point)
>> happens at update. So long as the collection is listed in the TRA in
>> zookeeper in aliases.json ***in the correct (chronological, desc) order***
>> and the naming of the collection can be parsed by the TRA code, you should
>> be fine. Incoming updates iterate down the list of collections during an
>> update and stop at the first one where the collection name matches the
>> date in the routing field for the document. For a normal TRA the vast
>> majority of updates hit one of the most recent two or three collections.
>> Frequent updates to old data in a TRA with very many time slices (sub
>> collections) might suffer some since this is a simple linear iteration;
>> optimizing that was deferred until it seemed important to someone's less
>> normal use case :).
>>
>>
>>
>> Otherwise it's just an alias of collections with funky looking names
>> (unless someone added something when I wasn't looking ;) ).
>>
>> -Gus
>>
>> On Fri, Aug 6, 2021 at 4:13 AM Jan Høydahl <jan....@cominvent.com> wrote:
>>
>> > Hi,
>> >
>> > I have never used TRA, but a client of mine is considering it. A few
>> > questions.
>> >
>> > A) Do you have links to talks (slides/video) on the feature? Or blog posts
>> > going into more detail than the RefGuide?
>> > B) For ingestion performance, sharding may make sense. But only for the
>> > current collection. Has anyone tried merging "static" shards?
>> > C) Is there a trick to have more replicas on recent collections than old
>> > ones?
>> > D) Is there a way to manage which nodes get selected for new
>> > collections, or do you need to rely on replica placement policies?
>> > E) How do you guys ensure you get a good fill-rate on the nodes, and what
>> > procedure do you use when adding more nodes in the cluster?
>> >     * I.e. do you simply add a few new nodes and let Solr automatically
>> > place new collections onto those?
>> > F) How many sub-collections/cores do you plan for on a single node?
>> >     * You could try to configure the "rotation interval" such that a node
>> > gets filled by a single core, but that seems hard to predict
>> >     * Too rapid a "rotation interval" will leave behind too many
>> > cores per node, causing inefficiencies?
>> >     * Have you found a strategy to balance this? I'd likely try to plan
>> > for 10 cores per node, and monitor fill-rate such that I (manually) add
>> > more HW once a threshold is reached.
>> > G) Has anyone tried backup of a TRA? Does it even work, or do you need to
>> > run the command for each single collection?
>> > H) A typical requirement is to migrate all data from one cluster to a new
>> > cluster on a newer version or with a new schema. Have you tried doing that
>> > with a TRA?
>> >     * Would you need to migrate each sub collection at a time?
>> >     * Will TRA on the new cluster accept that someone "external" adds
>> > collections, and how is it initialized/bootstrapped to fill the internal
>> > collection registry?
>> >
>> > That's what I could think of before trying the feature. I'm sure there
>> > would be other questions after some trial and error :)
>> >
>> > Jan
>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>>
>
