FileBasedSpellChecker does not seem to build an index of its dictionary

2024-11-18 Thread Andrew Witt
Hello -

I am running Solr v9.4, attempting to add file-based spellchecking.

I have set up FileBasedSpellChecker following the Solr documentation, and have 
read quite a number of related articles.

Many sources state that in order for Solr to build the index it will use to 
perform these spellchecks, I need to issue a request to the URL:

http://HOST:PORT/solr/CORE/ENDPOINT?spellcheck=on&spellcheck.build=true
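I've been issuing the request along these lines (host, port, core, and handler names below are placeholders, not my actual values):

```python
from urllib.parse import urlencode, urlunsplit

# Placeholder names -- substitute your own host, port, core, and handler.
host, port, core, endpoint = "localhost", 8983, "mycore", "spellFile"

params = {
    "q": "anyterm",              # the build request still carries a query
    "spellcheck": "on",
    "spellcheck.build": "true",  # asks the spellcheck component to (re)build its index
}
query = urlencode(params)
url = urlunsplit(("http", f"{host}:{port}", f"/solr/{core}/{endpoint}", query, ""))
print(url)
```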

I've done this many times, but no trace of anything named after this parameter 
...

spellcheckerFile

... has ever appeared.

Solr logs the request containing "spellcheck.build=true", but records nothing 
further - i.e., there is no error message suggesting anything went wrong.

What is the correct way to tell Solr to build its index for 
FileBasedSpellChecker?

What indicators should I expect to see that this index was built?

Thank you.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com



file-based spellcheck dictionary not working

2024-12-02 Thread Andrew Witt
Hello -

I have a working Solr v9.4.1 installation, to which I want to add file-based 
spellchecking.

I want to add file-based spellchecking so that I can identify search terms that 
are correctly spelled, even if they don't appear in the main index of 
searchable documents.

Following the Apache Solr documentation, I have set up file-based 
spellchecking, and a request handler endpoint referring to the defined 
file-based spellchecking.
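For concreteness, my setup follows the pattern in the reference guide, roughly 
like this (the component name, dictionary file, index directory, and handler 
name here are illustrative, not my exact config):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <!-- plain-text dictionary, one word per line, in the conf/ directory -->
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>

<requestHandler name="/spellFile" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="spellcheck.dictionary">file</str>
    <str name="spellcheck">on</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```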

However, every query results in a false negative: ... "correctlySpelled":false ...

This even occurs when supplying, as a query term, one of the suggestions for 
another term.  For example, "ablation" and "oblation" are each offered as a 
suggested correct spelling for the other, yet both are deemed incorrectly 
spelled when used as the query term.

Even with these results, I'm certain that Solr is in fact referring to the 
dictionary file I have defined, because words like "ablation" and "oblation" do 
not appear anywhere in the main index of searchable documents, so the only way 
that Solr can know about them to offer them as suggestions is by reading them 
from the specified dictionary file.

Is anyone successfully using file-based spellchecking?

Are there specific steps that need to be taken which aren't detailed, or aren't 
perfectly clear, in the Solr documentation?

Can someone share their Solr config for a working file-based spellchecking?

Thanks in advance for any hints, info, or details folks can share.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com



RE: Multiply connected data search

2024-12-30 Thread Andrew Witt
I have a few observations:

First, long experience tells me to avoid field names like "other_ids", because 
what "other" means is context-dependent.  Sure, it's all clear to you now, but 
would it be clear to someone else?  Will it be clear to you a year from now?  
What will you do when there is a third thing -- will you add a field called 
"the_other_other_ids"?  Just name the fields for what they are: give books an 
"author_ids" field and authors a "book_ids" field.  That way, the data schema 
speaks for itself, without needing to reference any logic that operates over it.
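In schema terms that's nothing exotic, just two explicitly named multivalued 
fields (the field type here is illustrative):

```xml
<!-- Books carry the ids of their authors ... -->
<field name="author_ids" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- ... and authors carry the ids of their books. -->
<field name="book_ids" type="string" indexed="true" stored="true" multiValued="true"/>
```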

Second, look into Solr's "nested documents".  We use them to great success 
(although for a somewhat different use case).  Per Solr's nested documents page, 
"indexing the relationships between documents usually yields much faster 
queries than an equivalent 'query time join'", so they might be exactly what 
you are looking for.
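To make the idea concrete, here is a sketch of the JSON shape for a book with 
nested author children, assuming a schema configured for nested documents (all 
ids and field names here are made up for illustration):

```python
import json

# A parent "book" document with author child documents nested under a
# pseudo-field, as Solr's JSON update format allows when the schema is
# configured for nested documents. All ids and field names are made up.
book = {
    "id": "book-1",
    "type_s": "book",
    "title_t": "An Example Title",
    "authors": [
        {"id": "author-1", "type_s": "author", "name_s": "First Author"},
        {"id": "author-2", "type_s": "author", "name_s": "Second Author"},
    ],
}

# Body you would POST to /solr/CORE/update?commit=true
payload = json.dumps([book])
print(payload)
```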

Third, I suspect you are prematurely optimizing.  Don’t be afraid of needing to 
query Solr with a follow-up query.  It's so fast.  As with any engineering 
problem, you should measure before deciding whether or not it's actually a 
problem.  It could very well turn out that a simpler solution which requires a 
primary query and a follow-up query is negligibly slower than a solution that 
takes you weeks or months to design, build, test, and finalize.

Hope this helps.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com


-Original Message-
From: Nikola Smolenski  
Sent: Thursday, December 26, 2024 6:17 AM
To: users@solr.apache.org
Subject: Re: Multiply connected data search

I agree, solr should not be used as the primary data store. However, it would 
still be handy to be able to retrieve as much information in a single query as 
possible.

I am experimenting with a solution where every solr document has "otherids"
multivalued field, with books having the ids of all the authors who 
contributed, authors having the ids of all the books they authored, and every 
document including its own id in the list; then, everything can be extracted 
using a single join query.

Does anyone see any drawbacks to this solution?

On Tue, Dec 24, 2024 at 6:22 PM Walter Underwood 
wrote:

> Do not use Solr as your primary data store. Solr is not a database. 
> Put your data in a relational database where it is easy to track all 
> those relationships and update them correctly.
>
> Extract the needed fields and load them into Solr.
>
> This can be a daily full dump and load job. That is what I did at 
> Chegg with millions of books. That is simple and fast, should be under 
> an hour for the whole job.
>
> An alternative to the all-in-one _text_ field is to use edismax and 
> give different weights to the different fields. Something like this, 
> with higher weighting for phrase matches.
>
> title^4 authors
> title^8 authors^2
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Dec 23, 2024, at 10:07 PM, Nikola Smolenski 
> wrote:
> >
> > Thank you for the suggestion, but that wouldn't work because there 
> > could
> be
> > multiple authors with the same name, who differ only by ID. If I 
> > were to change the name of an author, I wouldn't know which one 
> > should I change
> and
> > which one should stay. Additionally, there could be additional 
> > author information, such as external identifiers, that needs to be 
> > connected to the author.
> >
> > On Mon, Dec 23, 2024 at 11:07 PM Dmitri Maziuk 
> > 
> > wrote:
> >
> >> On 12/23/24 15:49, Nikola Smolenski wrote:
> >> ...
> >>> About the only way of doing this I can think of is to perform the
> search,
> >>> get all the found books and authors, then perform another query 
> >>> that fetches all the books and authors referenced by any of books 
> >>> or authors
> >> in
> >>> the first query. Is there a smarter way of doing this? What are 
> >>> the
> best
> >>> practices?
> >>>
> >>
> >> A book is a "document" that has a title and authors as separate fields.
> >> Documents usually also have a "big search" field, called _text_ in 
> >> the default config.
> >>
> >> Copy both author list and title into _text_, search in _text_, 
> >> facet on authors and/or titles.
> >>
> >> Dima
> >>
> >>
>
>


RE: index source code with solr combined with a SAST for cyber security purpose

2025-03-20 Thread Andrew Witt
Since you say, "I already started with Zoekt", I'm guessing you are facing some 
problem large enough to cause you to abandon your work so far.

I say this because to "rewrite it in solr" means, among other things, rewriting 
all your Go code in Java, since Zoekt is in Go and Solr is in Java.

I am guessing that instead, your best bet is to ask within the Zoekt community 
how to resolve the problem you are facing that's motivating you to switch.

If, after that, you still want to switch, then before you switch, you should 
recreate the same scenario using Solr, and ask here *a specific question about 
how to use Solr* to address the issue.

Only very rarely is the best approach to a problem to re-implement everything 
you've done so far in another language / another codebase.

Said differently, if you cannot resolve your issues using Zoekt, it's doubtful 
they can be resolved with Solr.  Whatever challenges you are facing working 
with the Zoekt code are likely to be the same challenges working with the Solr 
code.  What you are attempting is challenging, plain and simple.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com


-Original Message-
From: anon anon  
Sent: Wednesday, March 19, 2025 7:33 PM
To: users@solr.apache.org
Subject: index source code with solr combined with a SAST for cyber security 
purpose

Hello,

I wanted to combine the power of a search engine with a SAST. I already started 
with Zoekt. I want to know if I should fork Zoekt or Solr. I am
wondering:

- if I can rewrite it in Solr for more maintainability
- if I SHOULD actually maintain Zoekt instead of Solr in order to not have to 
reinvent the wheel (exactly like Solr)
- AND THE MOST IMPORTANT: would it be easier to implement, maintain, and use the 
SAST from the Zoekt code base instead of Solr? Maybe I could move the SAST part 
to a separate program and keep only the search in Zoekt?

What is the best deal for the open source community, please?

Best regards.


RE: Solr 9.7: Pb with restore from s3

2025-04-04 Thread Andrew Witt
First thought, off the top of my head: does the process running the Solr 
restore have permission / visibility to the new S3 location?


(If I'm understanding you correctly ...)  The Solr backup process clearly had 
permission to the old location; you can see the new location, I'm guessing with 
an admin user; and the tool used to copy from one S3 environment to the other 
can access both, though I'm guessing that was also done with an S3 admin tool 
or account.

But the user / identity used by the Solr process might not have the same 
awareness and permissions as what you use to look at S3, and the new S3 
location might not have the same access settings as the prior S3 location.

Just something generic to check, ahead of any Solr-specific diagnostics.  
Hope that helps.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-Original Message-
From: L'HARHANT Mikael  
Sent: Monday, March 17, 2025 6:00 PM
To: users@solr.apache.org
Subject: Solr 9.7: Pb with restore from s3

Hello,
I'm having trouble restoring to Solr from S3 storage.
The backup was performed on an environment. The contents of the S3 were copied 
from one space to another to point to the new environment.
When running the restore, I get the following error:

"Operation restore caused exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Couldn't restore since doesn't exist: s3:/x"

However, the directory is indeed present.
Do you have any ideas on how to verify that everything is OK?
Thanks.



RE: Advice on ways forward with or without Data Import Handler

2025-06-02 Thread Andrew Witt
About a year ago, we were in a similar position, running Solr v7 and using 
built-in DIH.

We upgraded to Solr v9.4 and updated DIH using the contrib package from the 
SearchScale GitHub repo.  It was a very successful upgrade.

(Solr v9.5 was available at the time, but in early 2024, the latest SearchScale 
DIH release was for v9.4.)

We did need to update our DIH definitions to remove the JavaScript transformers 
we had been using, and rework them into the SQL and built-in Solr transformers, 
since the contrib package does not (or did not) support JavaScript transformers.

I'm not familiar with the custom code issue you describe, but you can 
successfully use DIH with Solr v9.  I see that SearchScale now has releases for 
v9.6 and v9.7.  Based on our v9.4 upgrade, I would assume that DIH can be used 
just fine with Solr v9.6 or v9.7.

HOWEVER, we've just recently taken on a use case where we need near-real-time 
updates to the Solr index.  The way we use DIH (essentially, a nightly 
re-indexing of all of our Solr cores -- which is okay for our situation because 
each core takes only a few minutes worst-case) is not easily amenable to 
on-the-fly index updates.  We found that replacing DIH with an importer (a 
basic ETL, as others have alluded to in other replies in this thread) was much 
simpler than re-working our use of DIH, and has the benefit of moving us away 
from the DIH add-on.
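For anyone weighing the same trade-off, the heart of such an importer is 
genuinely small.  A sketch using only the Python standard library (the URL, 
core name, and field names are placeholders; in our real importer the 
documents come from SQL queries):

```python
import json
from urllib import request

# Placeholder endpoint -- substitute your own host, port, and core.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"

def push_docs(docs, url=SOLR_UPDATE_URL):
    """POST a batch of documents to Solr's JSON update endpoint."""
    body = json.dumps(docs).encode("utf-8")
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# In the real importer these rows come from the database; inlined here.
docs = [
    {"id": "1", "title_t": "First Document"},
    {"id": "2", "title_t": "Second Document"},
]
# push_docs(docs)  # requires a running Solr instance
```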

(I want to emphasize that we don't have any problems at all with the 
SearchScale DIH add-on.  The only motivations to move away from it are (a) it's 
not always at the latest Solr version, and (b) the general motivation to reduce 
the number of moving parts.)

In the end, even though the use of the SearchScale DIH has been quite 
successful, if a year ago we could have seen a year into the future, we 
probably would have leapfrogged the DIH update, and gone to the ETL / importer. 
 Thus, based on our experience, I recommend (as others have) dropping DIH and 
implementing your own importer.  I am certain that it will be less effort 
overall.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-Original Message-
From: Sarah Weissman  
Sent: Thursday, May 29, 2025 12:43 PM
To: users@solr.apache.org
Subject: Advice on ways forward with or without Data Import Handler

Hi all,

We’ve been using Solr with DIH for about 8 years or so but now we’re hitting an 
impasse with DIH being deprecated in Solr 9. Additionally, I’m looking to move 
our Solr deploy to Kubernetes and I’ve been struggling to figure out what to do 
with the DIH component in a cloud setting. I was hoping to get something that 
replicates our current setup up and running pretty quickly, but our DIH 
implementation has some custom code and I’m unable to get the jar dependency to 
load as a runtime library from the blob store with Solr 8. Maybe this isn’t 
possible with DIH? I’ve never used the runtimelib feature before and I have 
been unable to get the examples from the docs to work because the jars are too 
old. The next thing I would try is building my own custom image of Solr that 
includes the jar I need, but I’m also hesitant to spend a bunch more time on 
making deprecated features in Solr 8 work.

Unfortunately, I’ve also been unable to get the new DIH 3rd party plugin to 
work with Solr 9 and I’ve found the plugin commands with the solr script to be 
pretty finicky and the syntax changes between 8 and 9 frustrating as I switch 
between versions trying to get something to work as documented. I’m really not 
in a position where writing my own plugin is feasible at this point.

I’ve been banging my head against this all week and I’m trying to figure out 
the best way forward at this point. Is DIH still a viable option or should I be 
moving off of that something else? Any advice or perspectives on this would be 
appreciated.

Thanks
Sarah