Considering SOLR as our new infra
Hello List! I'm new to the list, and this is my first message. We recently learned about Solr, and we are very excited about it as a replacement for our current Elasticsearch infra. Currently, our main issue is the data and model size running on each machine.

*Our setup:*
1. We use the following search architecture: 1st tier, the fast search (low response time) with the data most likely to be retrieved,
2. 2nd tier with the rest (including on-disk data)

We saw all the features on the Solr webpage, and we would like to ask about them. More specifically, we would like to know:
1. Can we do text search and vector similarity?
2. Can we filter by metadata?
3. How about index/memory consumption? The 1st tier needs around 4000M embedding vectors (128 fp32) plus metadata stored in memory.
4. Can we execute models in the DB itself (not outside Solr)? We have per-user models, and we need a way of executing TensorFlow models on the database to prevent moving data outside of the DB.
5. Subsecond queries
6. Real-time indexing (or near real-time) of new data
7. Easily scalable

Thank you so much!!
UnifiedHighlighter BreakIterator
Hello!

*Problem:* I have a multivalued field that stores the paragraphs of a text (1 paragraph = 1 value), with a position gap of 5000 between values. Right now I use the FastVectorHighlighter and it works as expected for queries like "Big Bang Theory"~5000 (because of the 5000 slop it searches only inside one value, i.e. one paragraph). But apparently the FastVectorHighlighter doesn't support phrase queries where the words are out of order. So if I search "Big Theory Bang"~5000, it will find the document, but it won't find the snippet.

*Possible solution:* I noticed that the UnifiedHighlighter supports the slop and returns snippets for the query above. But sometimes the snippets are empty for queries where even the FastVectorHighlighter returned something. I assume it's because of how the UnifiedHighlighter splits the text.

*Question:* I want to make the UnifiedHighlighter search for snippets in each value of the field separately. As I understand it, by default the UH splits text by sentences. You can modify that using the parameter *hl.bs.type*. Possible values are CHARACTER, WORD, LINE, SENTENCE, WHOLE, SEPARATOR. But how can I tell it to "treat each value of my multivalued field separately and search inside of it"? I could probably add a special constant symbol at the end of each paragraph and split by SEPARATOR, but that feels like an ugly hack, and it would require me to reindex millions of documents.
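For reference, the SEPARATOR workaround described at the end would look roughly like the request parameters below. This is only a sketch under stated assumptions: the field is called paragraphs, and each paragraph value has been reindexed with a trailing "¶" marker (hl.bs.separator accepts a single break character; neither the field name nor the marker is something Solr provides by default):

  hl=true
  &hl.method=unified
  &hl.fl=paragraphs
  &hl.bs.type=SEPARATOR
  &hl.bs.separator=¶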
Re: Delete using Streaming Expressions
Please see https://solr.apache.org/guide/8_9/stream-decorator-reference.html#delete Jan > 12. aug. 2021 kl. 21:05 skrev mtn search : > > Hello, > > I have heard that there can be issues when using the Solr delete by query > approach ( *:*) for large sets of > documents. That it may block other indexing activities... > > I have also heard that using Solr Streaming Expressions would be a better > approach, more efficient for large doc sets to be deleted. I did a quick > search for the syntax of a streaming expression for delete, but did not > find it. Any tips on how to construct this expression? > > Thanks, > Matthew
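The page linked above documents the delete stream decorator, which wraps another stream and sends a delete-by-id command for each tuple's uniqueKey, in batches. A minimal sketch, assuming a collection named techproducts with uniqueKey id and a placeholder query (sent to the /stream handler):

  delete(techproducts,
         batchSize=500,
         search(techproducts,
                q="category:obsolete",
                fl="id",
                sort="id asc",
                qt="/export"))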
Re: Considering SOLR as our new infra
On 8/13/2021 2:25 AM, Albert Dfm wrote:

We got to know about SOLR, and we are very excited about it to replace our current elasticsearch infra. Currently, our main issue is regarding data and model size running on each machine.

*Our setup:*
1. We use the following search arch: 1st tier, the fast search (low response time) with most likely data to be retrieved,
2. 2nd tier with the rest (including on-disk data)

We saw the all features (solr wabpage) provided by SOLr, and we would like to ask about them, more specifically we would like to know:
1. Can we do text search and vector similarity?
2. Can we filter by metadata?
3. How about index/memory consumption? 1st tier needs around 4000M embeddings vector (128 fp32) + metadata stored in memory
4. Can we execute models in the DB itself? (not outside SOLr). We have per-user models, and we need a way of executing TensorFlow models on the database to prevent moving data outside of the DB
5. Subsecond queries
6. Real-time indexing (or near real-time) of new data
7. Easily scalable

As Solr and ES both use Lucene for the vast majority of their functionality, they have nearly identical overall capabilities. If ES can do it, Solr most likely can too. If the configs are nearly the same, Solr and ES will have similar performance.

Number 3: The bottom line here is that we do not know, and we can't know. Any guess made by us about Solr or the ES team about ES would be just that -- a guess. What works for one user with an index of a particular size might be way too low or way too high for another user with a similar size index. When we guess, we're always going to err on the side of caution -- recommend significantly more resources than what might actually be required, so we know there will be enough. And we generally need a lot of information that you might not have yet in order to make a guess. If it works in ES with X amount of resources, it will probably also work in Solr with those resources too -- assuming that the configs are substantially similar. In example configs, Solr tends to have a lot more features enabled than ES does, which is one reason that ES can claim that they perform better "out of the box". When the configs are actually similar, performance tends to be similar.

https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/

First 1 and 2: You could set up different indexes for this purpose. Solr doesn't provide a way to automatically move older data from one index to another. You would have to do that in your indexing software. For time-series data (think logs or similar), SolrCloud has the "Time Routed Aliases" feature -- it creates a new collection for the most recent data, and then later another new collection will be created. I have never used the feature, though I do understand the concept.

1: Text search, definitely. Vector similarity, probably ... but because I do not know what this is, I do not want to say the answer is definitely yes. Solr provides a way to utilize Lucene TermVectors.

2: Generally, yes. How you set up the schema and the nature of the data will determine exactly what you can do with filters. This would be the case for ES too.

3: See above.

4: I have no idea what you mean by this. But as I have said before, if ES can do it, Solr probably can too.

5: If you have enough resources, particularly memory, Solr performs great. If the index is REALLY big, it might be difficult to arrange to have enough unallocated memory for the OS to reliably cache the index. Neither Solr nor ES do that caching themselves, they rely on the OS to handle it.

6: Faster indexing generally means taking a hit on query performance whenever you update the index and commit changes. This would be the case for ES too.

7: This is such a vague question that I cannot answer it without knowing EXACTLY what you mean.

Additional reading (disclaimer: I wrote this wiki page):

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems

Thanks,
Shawn
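As a small illustration of point 2 above, metadata filtering in Solr is ordinarily expressed with filter queries (fq) alongside the main text query. The field names and collection below are hypothetical, just to show the shape of such a request:

  http://localhost:8983/solr/mycollection/select
      ?q=title:relativity
      &fq=user_id:12345
      &fq=doc_type:article
      &rows=10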
Re: ConcurrentUpdateSolrClient stall prevention bug in Solr 8.4+
Hi Team,

Have any of you found a solution for the error "Task Queue processing has stalled for 20077 ms with 0 remaining elements to process"? We are using Solr 8.8.2, and we get this error randomly while indexing. Is there any way we need to tune solr.autocommit.maxtime? For a few cores we have it set to 15000, and for a few others it is 500, 1, or -1. Also, maxWarmingSearchers is set to 2.

Kindly advise on how to tune this.

Thanks
Reej
Re: ConcurrentUpdateSolrClient stall prevention bug in Solr 8.4+
On 8/13/2021 7:36 AM, Reej M wrote:
Have any of you found a solution for the Task Queue processing has stalled for 20077 ms with 0 remaining elements to process. We are using solr 8.8.2, randomly we get this error while indexing. Is there any way we need to tune the solr.autocommit.maxtime? For few cores we have it as 15000, for few we have it as 500, 1 or -1. Also the maxwarmingsearchers is set to 2 Kindly advise on how to tune this.

A value of 500 or 1 is far too aggressive. It means that a commit will fire off either half a second or one millisecond after any updates are made. A value of -1 turns the feature off, and you do not want that. The example configs have it set to 15000 (15 seconds); I personally would go with 60000. And I would make sure that openSearcher is set to false on autoCommit. That option on autoSoftCommit makes no sense, so you won't find it in example configs. For autoSoftCommit the maxTime value should be much larger -- two minutes (120000) is about as low as I would dare go on that.

I don't know that any autoCommit setting is going to help with that particular message, though. The message was added in 8.4 by this issue:

https://issues.apache.org/jira/browse/SOLR-13975

With versions before 8.4, the stall would still occur but you would not be notified, and it would probably stall forever -- 8.4 can apparently break the stall.

As for what might be causing it, I do not know. It might be that your heap is too small, causing Java to spend a lot of time doing garbage collection in order to keep Solr running. Or it might be a general performance issue. If you can provide the GC logs that Solr writes, I can look into the possibility of garbage collection pauses. You'll need to put them on a file-sharing site and provide links -- the mailing list eats message attachments.

Thanks,
Shawn
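For concreteness, the commit settings discussed above live in the <updateHandler> section of solrconfig.xml and would look roughly like this with the values suggested in this thread (a starting point, not universal defaults):

  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit every 60 seconds -->
    <openSearcher>false</openSearcher> <!-- flush to disk without opening a new searcher -->
  </autoCommit>

  <autoSoftCommit>
    <maxTime>120000</maxTime>          <!-- soft commit (document visibility) every 2 minutes -->
  </autoSoftCommit>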
Re: Considering SOLR as our new infra
Thanks a lot Shawn for the very detailed reply, very informative and much appreciated!! I will check the link for performance problems. Regarding executing models (question number 4), let me explain this a bit better: Can SOLr run custom tensorflow/pytorch models? This is not a feature in lucene, it is something on top of it. Thanks!! On Fri, Aug 13, 2021 at 2:44 PM Shawn Heisey wrote: > On 8/13/2021 2:25 AM, Albert Dfm wrote: > > We got to know about SOLR, and we are very excited about it to replace > our > > current elasticsearch infra.Currently, our main issue is regarding data > and > > model size running on each machine. > > > > *Our setup:* > > 1. We use the following search arch: 1st tier, the fast search (low > > response time) with most likely data to be retrieved, > > 2. 2nd tier with the rest (including on-disk data) > > > > We saw the all features (solr wabpage) provided by SOLr, and we would > like > > to ask about them, more specifically we would like to know: > > 1. Can we do text search and vector similarity? > > 2. Can we filter by metadata? > > 3. How about index/memory consumption? 1st tier needs around 4000M > > embeddings vector (128 fp32) + metadata stored in memory > > 4. Can we execute models in the DB itself? (not outside SOLr). We have > > per-user models, and we need a way of executing TensorFlow models on the > > database to prevent moving data outside of the DB > > 5. Subsecond queries > > 6. Real-time indexing (or near real-time) of new data > > 7. Easily scalable > > > As Solr and ES both use Lucene for the vast majority of their > functionality, they have nearly identical overall capabilities. If ES > can do it, Solr most likely can too. If the configs are nearly the > same, Solr and ES will have similar performance. > > Number 3: The bottom line here is that we do not know, and we can't > know. Any guess made by us about Solr or the ES team about ES would be > just that -- a guess. What works for one user with an index of a > particular size might be way too low or way too high for another user > with a similar size index. When we guess, we're always going to err on > the side of caution -- recommend significantly more resources than what > might actually be required, so we know there will be enough. And we > generally need a lot of information that you might not have yet in order > to make a guess. If it works in ES with X amount of resources, it will > probably also work in Solr with those resources too -- assuming that the > configs are substantially similar. In example configs, Solr tends to > have a lot more features enabled than ES does, which is one reason that > ES can claim that they perform better "out of the box". When the > configs are actually similar, performance tends to be similar. > > > https://lucidworks.com/post/solr-sizing-guide-estimating-solr-sizing-hardware/ > > First 1 and 2: You could set up different indexes for this purpose. > Solr doesn't provide a way to automatically move older data from one > index to another. You would have to do that in your indexing software. > For time-series data (think logs or similar), SolrCloud has the "Time > Routed Aliases" feature -- it creates a new collection for the most > recent data, and then later another new collection will be created. I > have never used the feature, though I do understand the concept. > > 1: Text search, definitely. Vector similarity, probably ... but because > I do not know what this is, I do not want to say the answer is > definitely yes. 
Solr provides a way to utilize Lucene TermVectors. > 2: Generally, yes. How you set up the schema and the nature of the data > will determine exactly what you can do with filters. This would be the > case for ES too. > 3: See above. > 4: I have no idea what you mean by this. But as I have said before, if > ES can do it, Solr probably can too. > 5: If you have enough resources, particularly memory, Solr performs > great. If the index is REALLY big, it might be difficult to arrange to > have enough unallocated memory for the OS to reliably cache the index. > Neither Solr nor ES do that caching themselves, they rely on the OS to > handle it. > 6: Faster indexing generally means taking a hit on query performance > whenever you update the index and commit changes. This would be the case > for ES too. > 7: This is such a vague question that I cannot answer it without knowing > EXACTLY what you mean. > > Additional reading (disclaimer: I wrote this wiki page): > > https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems > > Thanks, > Shawn > >
Re: Considering SOLR as our new infra
On 8/13/2021 7:59 AM, Albert Dfm wrote:
Regarding executing models (question number 4), let me explain this a bit better: Can SOLr run custom tensorflow/pytorch models? This is not a feature in lucene, it is something on top of it.

With that info, I am even less familiar with what you're doing than I was before. I have no idea what either of those things are. Google wasn't helpful ... I probably would have to spend a week or two researching to even have a minimal understanding. I was able to tell that it's probably related to machine learning, but that's all. I have zero experience in that arena.

It's unlikely that Solr has any direct support for those software programs, but if they can build queries that Solr understands, you could probably get something going.

Thanks,
Shawn
Re: Considering SOLR as our new infra
For example, for relevance ranking the usual approach is to execute a machine-learned model, e.g. built with XGBoost or LightGBM. TensorFlow and PyTorch are other frameworks for building machine learning models. While XGBoost and LightGBM produce ensembles of decision trees, TensorFlow and PyTorch are mainly used for neural networks.

Elasticsearch allows executing XGBoost models for relevance ranking, for example. The question applies similarly to Solr: can we use PyTorch or TensorFlow in the relevance ranking phase?

On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey wrote: > On 8/13/2021 7:59 AM, Albert Dfm wrote: > > Regarding executing models (question number 4), let me explain this a bit > > better: > > Can SOLr run custom tensorflow/pytorch models? This is not a feature in > > lucene, it is something on top of it. > > With that info, I am even less familiar with what you're doing than I > was before. I have no idea what either of those things are. Google > wasn't helpful ... I probably would have to spend a week or two > researching to even have a minimal understanding. I was able to tell > that it's probably related to machine learning, but that's all. I have > zero experience in that arena. > > It's unlikely that Solr has any direct support for those software > programs, but if they can build queries that Solr understands, you could > probably get something going. > > Thanks, > Shawn > >
Re: Considering SOLR as our new infra
On 8/13/2021 8:26 AM, Albert Dfm wrote: The question could be applied similarly to SOLr: can we use pytorch or tensorflow at relevance ranking phase? I have no idea. I have never touched that functionality. Those terms are not mentioned in the docs: https://solr.apache.org/guide/8_9/learning-to-rank.html Thanks, Shawn
Re: Considering SOLR as our new infra
pytorch and tensorflow are both written in Python and both Solr and Elasticsearch are written in Java, so that seems like an obvious “no” for executing them internally. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Aug 13, 2021, at 7:26 AM, Albert Dfm wrote: > > For example, for relevance ranking the usual approach is to execute a > machine learned model, e.g. using xgboost, or lightgbm. Tensorflow and > pytorch are other frameworks to build machine learning models. > While xgboost and lightgbm are ensembles of decision trees, tensorflow and > pytorch are mainly related to neutal networks. > > Elasticsearch allows to execute xgboost models for example for relevance > ranking. > The question could be applied similarly to SOLr: can we use pytorch or > tensorflow at relevance ranking phase? > > On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey wrote: > >> On 8/13/2021 7:59 AM, Albert Dfm wrote: >>> Regarding executing models (question number 4), let me explain this a bit >>> better: >>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in >>> lucene, it is something on top of it. >> >> With that info, I am even less familiar with what you're doing than I >> was before. I have no idea what either of those things are. Google >> wasn't helpful ... I probably would have to spend a week or two >> researching to even have a minimal understanding. I was able to tell >> that it's probably related to machine learning, but that's all. I have >> zero experience in that arena. >> >> It's unlikely that Solr has any direct support for those software >> programs, but if they can build queries that Solr understands, you could >> probably get something going. >> >> Thanks, >> Shawn >> >>
Re: Considering SOLR as our new infra
You probably need to write a plugin for this - both can also be used from within Java. Some of the models in e.g. TensorFlow Ranking, such as SVM, may be directly usable in Solr without a plugin.

> Am 13.08.2021 um 16:33 schrieb Shawn Heisey : > > On 8/13/2021 8:26 AM, Albert Dfm wrote: >> The question could be applied similarly to SOLr: can we use pytorch or >> tensorflow at relevance ranking phase? > > I have no idea. I have never touched that functionality. Those terms are > not mentioned in the docs: > > https://solr.apache.org/guide/8_9/learning-to-rank.html > > Thanks, > Shawn >
Re: Considering SOLR as our new infra
TensorFlow and PyTorch have Java bindings. However, this is also not really needed: if the trained model weights are exported to JSON, which I see is at least possible for TensorFlow Ranking, then they can be used out of the box, e.g. SVM and lambda models exist both in TensorFlow Ranking and Solr. XGBoost could work with the MultipleAdditiveTreesModel.

> Am 13.08.2021 um 17:05 schrieb Walter Underwood : > > pytorch and tensorflow are both written in Python and both Solr and > Elasticsearch > are written in Java, so that seems like an obvious “no” for executing them > internally. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > >> On Aug 13, 2021, at 7:26 AM, Albert Dfm wrote: >> >> For example, for relevance ranking the usual approach is to execute a >> machine learned model, e.g. using xgboost, or lightgbm. Tensorflow and >> pytorch are other frameworks to build machine learning models. >> While xgboost and lightgbm are ensembles of decision trees, tensorflow and >> pytorch are mainly related to neutal networks. >> >> Elasticsearch allows to execute xgboost models for example for relevance >> ranking. >> The question could be applied similarly to SOLr: can we use pytorch or >> tensorflow at relevance ranking phase? >> >>> On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey wrote: >>> >>> On 8/13/2021 7:59 AM, Albert Dfm wrote: Regarding executing models (question number 4), let me explain this a bit better: Can SOLr run custom tensorflow/pytorch models? This is not a feature in lucene, it is something on top of it. >>> >>> With that info, I am even less familiar with what you're doing than I >>> was before. I have no idea what either of those things are. Google >>> wasn't helpful ... I probably would have to spend a week or two >>> researching to even have a minimal understanding. I was able to tell >>> that it's probably related to machine learning, but that's all. I have >>> zero experience in that arena. >>> >>> It's unlikely that Solr has any direct support for those software >>> programs, but if they can build queries that Solr understands, you could >>> probably get something going. >>> >>> Thanks, >>> Shawn >>> >>> >
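To make that concrete: Solr's Learning To Rank module takes models as JSON uploaded to the model store, so an externally trained linear/SVM-style model only needs its weights exported into that format. A rough sketch with hypothetical feature names and weights (the features would have to be defined in the feature store first, and the collection name is a placeholder):

  curl -XPUT 'http://localhost:8983/solr/mycollection/schema/model-store' \
    -H 'Content-type: application/json' --data-binary '{
      "class": "org.apache.solr.ltr.model.LinearModel",
      "name": "myExportedModel",
      "features": [
        { "name": "originalScore" },
        { "name": "titleMatch" }
      ],
      "params": {
        "weights": {
          "originalScore": 0.6,
          "titleMatch": 1.4
        }
      }
    }'

Queries would then rerank their top results with something like rq={!ltr model=myExportedModel reRankDocs=100} added to the request.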
Re: Considering SOLR as our new infra
Although you could export models to the ONNX format and then use the Java API for the ONNX Runtime to run the models in Java. On Fri, Aug 13, 2021 at 11:11 AM Walter Underwood wrote: > pytorch and tensorflow are both written in Python and both Solr and > Elasticsearch > are written in Java, so that seems like an obvious “no” for executing them > internally. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > > > On Aug 13, 2021, at 7:26 AM, Albert Dfm wrote: > > > > For example, for relevance ranking the usual approach is to execute a > > machine learned model, e.g. using xgboost, or lightgbm. Tensorflow and > > pytorch are other frameworks to build machine learning models. > > While xgboost and lightgbm are ensembles of decision trees, tensorflow > and > > pytorch are mainly related to neutal networks. > > > > Elasticsearch allows to execute xgboost models for example for relevance > > ranking. > > The question could be applied similarly to SOLr: can we use pytorch or > > tensorflow at relevance ranking phase? > > > > On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey > wrote: > > > >> On 8/13/2021 7:59 AM, Albert Dfm wrote: > >>> Regarding executing models (question number 4), let me explain this a > bit > >>> better: > >>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in > >>> lucene, it is something on top of it. > >> > >> With that info, I am even less familiar with what you're doing than I > >> was before. I have no idea what either of those things are. Google > >> wasn't helpful ... I probably would have to spend a week or two > >> researching to even have a minimal understanding. I was able to tell > >> that it's probably related to machine learning, but that's all. I have > >> zero experience in that arena. > >> > >> It's unlikely that Solr has any direct support for those software > >> programs, but if they can build queries that Solr understands, you could > >> probably get something going. > >> > >> Thanks, > >> Shawn > >> > >> > >
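For reference, calling an ONNX-exported model from Java with the ONNX Runtime looks roughly like the sketch below. The model path, input name, and tensor shape are placeholders, and this is not a Solr integration by itself -- it would have to be wrapped in a custom plugin or run in the indexing/query pipeline around Solr:

  import ai.onnxruntime.OnnxTensor;
  import ai.onnxruntime.OrtEnvironment;
  import ai.onnxruntime.OrtSession;
  import java.util.Map;

  public class OnnxScoringSketch {
      public static void main(String[] args) throws Exception {
          OrtEnvironment env = OrtEnvironment.getEnvironment();
          // Load a model previously exported to ONNX from TensorFlow or PyTorch.
          try (OrtSession session = env.createSession("/path/to/model.onnx",
                  new OrtSession.SessionOptions())) {
              // One 128-dimensional float vector, as a batch of size 1.
              float[][] features = new float[1][128];
              try (OnnxTensor input = OnnxTensor.createTensor(env, features);
                   OrtSession.Result result = session.run(Map.of("input", input))) {
                  // Assumes the first output is a score tensor of shape [1][1].
                  float[][] scores = (float[][]) result.get(0).getValue();
                  System.out.println("score = " + scores[0][0]);
              }
          }
      }
  }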
Re: Considering SOLR as our new infra
I know you are in the Solr forum here, but I'll take the chance of mentioning the new kid on the block wrt open source search engines, namely Vespa. Since your use case seems to be highly geared towards personalization, it may be worth checking it out as they seem to push Tensors and personalized results as key differentiator. It is not Lucene based and may be quite different from what you already know with ES and Solr, and to be honest I have never tested it, nor am I affiliated in any way. Here's the link: https://vespa.ai/ Jan > 13. aug. 2021 kl. 16:26 skrev Albert Dfm : > > For example, for relevance ranking the usual approach is to execute a > machine learned model, e.g. using xgboost, or lightgbm. Tensorflow and > pytorch are other frameworks to build machine learning models. > While xgboost and lightgbm are ensembles of decision trees, tensorflow and > pytorch are mainly related to neutal networks. > > Elasticsearch allows to execute xgboost models for example for relevance > ranking. > The question could be applied similarly to SOLr: can we use pytorch or > tensorflow at relevance ranking phase? > > > > On Fri, Aug 13, 2021 at 4:18 PM Shawn Heisey wrote: > >> On 8/13/2021 7:59 AM, Albert Dfm wrote: >>> Regarding executing models (question number 4), let me explain this a bit >>> better: >>> Can SOLr run custom tensorflow/pytorch models? This is not a feature in >>> lucene, it is something on top of it. >> >> With that info, I am even less familiar with what you're doing than I >> was before. I have no idea what either of those things are. Google >> wasn't helpful ... I probably would have to spend a week or two >> researching to even have a minimal understanding. I was able to tell >> that it's probably related to machine learning, but that's all. I have >> zero experience in that arena. >> >> It's unlikely that Solr has any direct support for those software >> programs, but if they can build queries that Solr understands, you could >> probably get something going. >> >> Thanks, >> Shawn >> >>
Re: Delete using Streaming Expressions
Thanks Jan - Exactly what I was looking for. Matthew On Fri, Aug 13, 2021 at 3:35 AM Jan Høydahl wrote: > Please see > https://solr.apache.org/guide/8_9/stream-decorator-reference.html#delete > > Jan > > > 12. aug. 2021 kl. 21:05 skrev mtn search : > > > > Hello, > > > > I have heard that there can be issues when using the Solr delete by query > > approach ( *:*) for large sets of > > documents. That it may block other indexing activities... > > > > I have also heard that using Solr Streaming Expressions would be a better > > approach, more efficient for large doc sets to be deleted. I did a quick > > search for the syntax of a streaming expression for delete, but did not > > find it. Any tips on how to construct this expression? > > > > Thanks, > > Matthew > >
Re: Time Routed Alias
Thanks David, this test link is helpful. @David @Gus - From your viewpoint do you see TRAs as an accepted/proven technique within SolrCloud? My small POC works great. Would like to hear if others are using TRA in production deployments successfully at scale. Thanks, Matt On Wed, Aug 11, 2021 at 8:10 PM David Smiley wrote: > I hope you have success with TRAs! > > You can delete some number of collections from the rear of the chain, but > you must first update the TRA to exclude these collections. This is > tested: > > https://github.com/apache/solr/blob/f6c4f8a755603c3049e48eaf9511041252f2dbad/solr/core/src/test/org/apache/solr/update/processor/TimeRoutedAliasUpdateProcessorTest.java#L184 > It'd be nice if it would remove itself from the alias. > > ~ David Smiley > Apache Lucene/Solr Search Developer > http://www.linkedin.com/in/davidwsmiley > > > On Tue, Aug 10, 2021 at 9:26 PM Matt Kuiper wrote: > > > I found some helpful information while testing TRAs: > > > > For our use-case I am hesitant to set up an autoDeleteAge (unless it can > be > > modified - still need to test). So I wondered about a little more manual > > delete management approach. > > > > I confirmed that I cannot simply delete a collection that is registered > as > > part of a TRA. The delete collection api call will fail with a message > > that the collection is a part of the alias. > > > > I did learn that I could use the same create TRA api call I used to > create > > the TRA, but modify the router.start to date more recent than one or more > > of the older collections associated with the TRA. Then when I queried the > > TRA, I only received documents from the collections after the new > > router.start date. Also, I was now able to successfully delete the older > > collections with a standard collection delete command. > > > > I think this satisfies my initial use-case requirements to be able to > > modify an existing TRA and delete older collections. > > > > Matt > > > > On Mon, Aug 9, 2021 at 11:27 AM Matt Kuiper wrote: > > > > > Hi Gus, Jan, > > > > > > I am considering implementing TRA for a large-scale Solr deployment. > > Your > > > Q&A is helpful! > > > > > > I am curious if you have experience/ideas regarding modifying the TR > > Alias > > > when one desires to manually delete old collections or modify the > > > router.autoDeleteAge to shorten or extend the delete age. Here's a few > > > specific questions? > > > > > > 1) Can you manually delete an old collection (via collection api) and > > then > > > edit the start date (to a more recent date) of the TRA so that it no > > longer > > > sees/processes the deleted collection? > > > 2) Is the only way to manage the deletion of collections within a TRA > > > using the automatic deletion configuration? The router.autoDeleteAge > > > parameter. > > > 3) If you can only manage deletes using the router.autoDeleteAge > > > parameter, are you able to update this parameter to either: > > > > > >- Set the delete age earlier so that older collections are triggered > > >for automatic deletion sooner? > > >- Set the delete age to a larger value to extend the life of a > > >collection? Say you originally would like the collections to stay > > around > > >for 5 years, but then change your mind to 7 years. > > > > > > I will likely do some experimentation, but am interested to learn if > you > > > have covered these use-cases with TRA. 
> > > > > > Thanks, > > > Matt > > > > > > > > > On Fri, Aug 6, 2021 at 8:08 AM Gus Heck wrote: > > > > > >> Hi Jan, > > >> > > >> The key thing to remember about TRA's (or any Routed Alias) is that it > > >> only > > >> actively does two things: > > >> 1) Routes document updates to the correct collection by inspecting the > > >> routed field in the document > > >> 2) Detects when a new collection is required and creates it. > > >> > > >> If you don't send it data *nothing* happens. The collections are not > > >> created until data requires them (with an async create possible when > it > > >> sees an update that has a timestamp "near" the next interval, see docs > > for > > >> router.preemptiveCreateMath ) > > >> > > >> A) Dave's half of our talk at 2018 activate talks about it: > > >> https://youtu.be/RB1-7Y5NQeI?t=839 > > >> B) Time Routed Aliases are a means by which to automate creation of > > >> collections and route documents to the created collections. Sizing, > and > > >> performance of the individual collections is not otherwise special, > and > > >> you > > >> can interact with the collections individually after they are created, > > >> with > > >> the obvious caveats that you probably don't want to be doing things > that > > >> get them out of sync schema wise unless your client programs know how > to > > >> handle documents of both types etc. A less obvious consequence of the > > >> routing is that your data must not ever republish the same document > > with a > > >> different route key (date for TRA), since that can