FileBasedSpellChecker does not seem to build an index of its dictionary
Hello - I am running Solr v9.4 and attempting to add file-based spellchecking. I have set up FileBasedSpellChecker following the Solr documentation, and have read quite a number of related articles.

Many sources state that in order for Solr to build the index it will use to perform these spellchecks, I need to issue a request to the URL:

http://HOST:PORT/solr/CORE/ENDPOINT?spellcheck=on&spellcheck.build=true

I've done this many times, but no trace of anything named after this parameter ... spellcheckerFile ... has ever appeared on disk. The Solr log records the "spellcheck.build=true" request, but logs nothing additional - i.e., there is no error message suggesting anything went wrong.

What is the correct way to tell Solr to build its index for FileBasedSpellChecker? What indicators should I expect to see that this index was built?

Thank you.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com
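[For reference, a minimal sketch of the kind of setup being described, based on the Solr reference guide. The component name, dictionary name "file", source file spellings.txt, index directory, and endpoint /spell are illustrative assumptions, not taken from the poster's config:]

  <!-- solrconfig.xml: a FileBasedSpellChecker reading a plain word list. -->
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <!-- "file" is an assumed dictionary name; spellcheck.dictionary must match it. -->
      <str name="name">file</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <!-- Where the built spellcheck index lands, relative to the core's data dir. -->
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>

  # Build request; note spellcheck.dictionary selecting the named dictionary:
  curl "http://HOST:PORT/solr/CORE/spell?q=test&spellcheck=true&spellcheck.dictionary=file&spellcheck.build=true"

If the build succeeds and spellcheckIndexDir is set, a small Lucene index directory with that name should appear under the core's data directory; that directory is the indicator to look for.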
file-based spellcheck dictionary not working
Hello - I have a working Solr v9.4.1 installation to which I want to add file-based spellchecking, so that I can identify search terms that are correctly spelled even if they don't appear in the main index of searchable documents. Following the Apache Solr documentation, I have set up file-based spellchecking and a request handler endpoint that refers to the defined file-based spellchecker.

However, every query results in a false negative:

"correctlySpelled":false

This occurs even when the query term is one of the suggestions offered for another term. For example, "ablation" and "oblation" are each offered as a suggested correct spelling for the other, yet both are deemed incorrectly spelled when used as the query term.

Even with these results, I'm certain that Solr is in fact reading the dictionary file I have defined, because words like "ablation" and "oblation" do not appear anywhere in the main index of searchable documents, so the only way Solr can know about them and offer them as suggestions is by reading them from the specified dictionary file.

Is anyone successfully using file-based spellchecking? Are there specific steps that need to be taken which aren't detailed in, or aren't perfectly clear in, the Solr documentation? Can someone share their Solr config for working file-based spellchecking?

Thanks in advance for any hints, info, or details folks can share.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com
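[A minimal sketch of the request handler side, again with illustrative names ("/spell", dictionary "file") rather than the poster's actual config; spellcheck.extendedResults is the parameter that produces the "correctlySpelled" flag in the response:]

  <!-- solrconfig.xml: handler that runs the spellcheck component on every query. -->
  <requestHandler name="/spell" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <!-- Must match the <str name="name"> of the spellchecker definition. -->
      <str name="spellcheck.dictionary">file</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">5</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>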
RE: Multiply connected data search
I have a few observations.

First, long experience tells me to avoid field names like "other_ids", because what "other" means is context-dependent. Sure, it's all clear to you now, but would it be clear to someone else? Will it be clear to you a year from now? What will you do when there is a third thing -- will you add a field called "the_other_other_ids"? Just name the fields for what they are: give books an "author_ids" field and authors a "book_ids" field. That way, the data schema speaks for itself, without needing to reference any logic that operates over it.

Second, look into Solr's "nested documents". We use them to great success (although for a somewhat different use case). Per Solr's nested documents page, "indexing the relationships between documents usually yields much faster queries than an equivalent 'query time join'", so they might be exactly what you are looking for. (For what it's worth, a sketch of the single-join-query approach from your message appears after the quoted thread below.)

Third, I suspect you are prematurely optimizing. Don't be afraid of needing to query Solr with a follow-up query. It's so fast. As with any engineering problem, you should measure before deciding whether or not it's actually a problem. It could very well turn out that a simpler solution which requires a primary query and a follow-up query is negligibly slower than a solution that takes you weeks or months to design, build, test, and finalize.

Hope this helps.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-----Original Message-----
From: Nikola Smolenski
Sent: Thursday, December 26, 2024 6:17 AM
To: users@solr.apache.org
Subject: Re: Multiply connected data search

I agree, solr should not be used as the primary data store. However, it would still be handy to be able to retrieve as much information in a single query as possible.

I am experimenting with a solution where every solr document has an "otherids" multivalued field, with books having the ids of all the authors who contributed, authors having the ids of all the books they authored, and every document including its own id in the list; then, everything can be extracted using a single join query. Does anyone see any drawbacks to this solution?

On Tue, Dec 24, 2024 at 6:22 PM Walter Underwood wrote:

> Do not use Solr as your primary data store. Solr is not a database.
> Put your data in a relational database where it is easy to track all
> those relationships and update them correctly.
>
> Extract the needed fields and load them into Solr.
>
> This can be a daily full dump and load job. That is what I did at
> Chegg with millions of books. That is simple and fast, should be under
> an hour for the whole job.
>
> An alternative to the all-in-one _text_ field is to use edismax and
> give different weights to the different fields. Something like this,
> with higher weighting for phrase matches.
>
> title^4 authors
> title^8 authors^2
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Dec 23, 2024, at 10:07 PM, Nikola Smolenski wrote:
> >
> > Thank you for the suggestion, but that wouldn't work because there
> > could be multiple authors with the same name, who differ only by ID.
> > If I were to change the name of an author, I wouldn't know which one
> > I should change and which one should stay. Additionally, there could
> > be additional author information, such as external identifiers, that
> > needs to be connected to the author.
> >
> > On Mon, Dec 23, 2024 at 11:07 PM Dmitri Maziuk wrote:
> >
> >> On 12/23/24 15:49, Nikola Smolenski wrote:
> >> ...
> >>> About the only way of doing this I can think of is to perform the
> >>> search, get all the found books and authors, then perform another
> >>> query that fetches all the books and authors referenced by any of
> >>> the books or authors in the first query. Is there a smarter way of
> >>> doing this? What are the best practices?
> >>>
> >>
> >> A book is a "document" that has a title and authors as separate fields.
> >> Documents usually also have a "big search" field, called _text_ in
> >> the default config.
> >>
> >> Copy both author list and title into _text_, search in _text_,
> >> facet on authors and/or titles.
> >>
> >> Dima
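[For concreteness, a minimal sketch of the single-join-query idea from the quoted thread, assuming the multivalued id field is called "other_ids" (per the naming suggestion above), the unique key is "id", and the core name and query term are placeholders:]

  # Return every document (book or author) whose id appears in the other_ids
  # list of any document matching the inner query. Because each document also
  # lists its own id, the matching books come back along with their authors.
  curl "http://localhost:8983/solr/books/select" \
    --data-urlencode "q={!join from=other_ids to=id}title:dune"

One drawback to keep in mind: the plain {!join} parser does not, by default, carry relevance scores across the join, so result ranking may need separate thought.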
RE: index source code with solr combined with a SAST for cyber security purpose
Since you say, "I already started with Zoekt", I'm guessing you are facing some problem large enough to cause you to abandon your work so far. I say this because to "rewrite it in solr" means, among other things, rewriting all your Go code in Java, since Zoekt is in Go and Solr is in Java.

I am guessing that instead, your best bet is to ask within the Zoekt community how to resolve the problem that's motivating you to switch. If, after that, you still want to switch, then before you do, you should recreate the same scenario using Solr, and ask here *a specific question about how to use Solr* to address the issue.

Only very rarely is the best approach to a problem to re-implement everything you've done so far in another language / another codebase. Said differently, if you cannot resolve your issues using Zoekt, it's doubtful they can be resolved with Solr. Whatever challenges you are facing working with the Zoekt code are likely to be the same challenges working with the Solr code. What you are attempting is challenging, plain and simple.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-----Original Message-----
From: anon anon
Sent: Wednesday, March 19, 2025 7:33 PM
To: users@solr.apache.org
Subject: index source code with solr combined with a SAST for cyber security purpose

Hello,

I wanted to combine the power of a search engine with a SAST. I already started with Zoekt. I want to know if I should fork zoekt or solr. I am wondering:

- if I can rewrite it in Solr for more maintainability
- if I SHOULD actually maintain Zoekt instead of Solr, in order to not have to reinvent the wheel (exactly like Solr)
- AND THE MOST IMPORTANT: would it be easier to implement, maintain and use the SAST from the Zoekt code base instead of Solr? Maybe could I transfer the SAST part to another program and use the Zoekt search from the Zoekt software only?

What is the best deal for the open source community please?

Best regards.
RE: Solr 9.7: Pb with restore from s3
First thought, off the top of my head: does the process running the Solr restore have permission / visibility to the new S3 location?

(If I'm understanding you correctly ...) The Solr backup process clearly had permission to the old location. You can see the new location, I'm guessing with an admin user. And the tool used to copy the one S3 environment to the other can access both, but I'm guessing that copy was also done with an S3 admin tool or account. The user / identity used by the Solr process might not have the same awareness and permissions as what you use to look at S3, and the new S3 location might not have the same access settings as the prior S3 location.

Just something generic to check, ahead of any Solr-specific things to check. (A quick way to test this from the Solr host is sketched after the quoted message below.)

Hope that helps.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-----Original Message-----
From: L'HARHANT Mikael
Sent: Monday, March 17, 2025 6:00 PM
To: users@solr.apache.org
Subject: Solr 9.7: Pb with restore from s3

Hello,

I'm having trouble restoring to Solr from S3 storage. The backup was performed on one environment, and the contents of the S3 bucket were then copied from one space to another to point to the new environment. When running the restore, I get the following error:

"Operation restore caused exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Couldn't restore since doesn't exist: s3:/x"

However, the directory is indeed present. Do you have any ideas on how to verify that everything is OK?

Thanks.
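[A generic sketch of that permission check, assuming the AWS CLI is available on the Solr host; the bucket and prefix are placeholders for the real backup location:]

  # Run this as the same identity the Solr JVM uses (instance profile,
  # environment credentials, etc.), from the Solr host itself:
  aws s3 ls s3://YOUR-BUCKET/YOUR-BACKUP-PREFIX/ --recursive | head

  # If the backup files list here but Solr still says the location doesn't
  # exist, compare the exact path in the restore request against this listing;
  # a mismatched prefix can produce a similar "doesn't exist" error.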
RE: Advice on ways forward with or without Data Import Handler
About a year ago, we were in a similar position, running Solr v7 and using the built-in DIH. We upgraded to Solr v9.4 and updated DIH using the contrib package from the SearchScale GitHub repo. It was a very successful upgrade. (Solr v9.5 was available at the time, but in early 2024 the latest SearchScale DIH release was for v9.4.) We did need to update our DIH definitions to remove the JavaScript transformers we had been using, and rework them into SQL and built-in Solr transformers, since the contrib package does not (or did not) support JavaScript transformers.

I'm not familiar with the custom code issue you describe, but you can successfully use DIH with Solr v9. I see that SearchScale now has releases for v9.6 and v9.7. Based on our v9.4 upgrade, I would assume that DIH can be used just fine with Solr v9.6 or v9.7.

HOWEVER, we've just recently taken on a use case where we need near-real-time updates to the Solr index. The way we use DIH (essentially, a nightly re-indexing of all of our Solr cores -- which is okay for our situation because each core takes only a few minutes worst-case) is not easily amenable to on-the-fly index updates. We found that replacing DIH with an importer (a basic ETL, as others have alluded to in other replies in this thread) was much simpler than re-working our use of DIH, and has the benefit of moving us away from the DIH add-on. (A bare-bones sketch of what such an importer boils down to appears after the quoted message below.)

(I want to emphasize that we don't have any problems at all with the SearchScale DIH add-on. The only motivations to move away from it are (a) it's not always at the latest Solr version, and (b) the general motivation to reduce the number of moving parts.)

In the end, even though our use of the SearchScale DIH has been quite successful, if a year ago we could have seen a year into the future, we probably would have leapfrogged the DIH update and gone straight to the ETL / importer. Thus, based on our experience, I recommend (as others have) dropping DIH and implementing your own importer. I am certain that it will be less effort overall.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-----Original Message-----
From: Sarah Weissman
Sent: Thursday, May 29, 2025 12:43 PM
To: users@solr.apache.org
Subject: Advice on ways forward with or without Data Import Handler

Hi all,

We’ve been using Solr with DIH for about 8 years or so, but now we’re hitting an impasse with DIH being deprecated in Solr 9. Additionally, I’m looking to move our Solr deploy to Kubernetes, and I’ve been struggling to figure out what to do with the DIH component in a cloud setting. I was hoping to get something that replicates our current setup up and running pretty quickly, but our DIH implementation has some custom code and I’m unable to get the jar dependency to load as a runtime library from the blob store with Solr 8. Maybe this isn’t possible with DIH? I’ve never used the runtimelib feature before, and I have been unable to get the examples from the docs to work because the jars are too old.

The next thing I would try is building my own custom image of Solr that includes the jar I need, but I’m also hesitant to spend a bunch more time on making deprecated features in Solr 8 work. Unfortunately, I’ve also been unable to get the new DIH 3rd-party plugin to work with Solr 9, and I’ve found the plugin commands with the solr script to be pretty finicky, and the syntax changes between 8 and 9 frustrating as I switch between versions trying to get something to work as documented.

I’m really not in a position where writing my own plugin is feasible at this point. I’ve been banging my head against this all week and I’m trying to figure out the best way forward. Is DIH still a viable option, or should I be moving off of that to something else? Any advice or perspectives on this would be appreciated.

Thanks
Sarah
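[For anyone finding this thread later, a bare-bones sketch of what the importer / ETL boils down to, assuming a core named "mycore" and the stock JSON update endpoint; the extract step stands in for whatever the DIH SQL config used to do:]

  # 1. Extract: dump rows from the database to a JSON array of documents,
  #    e.g. docs.json: [ {"id":"1","title":"..."}, {"id":"2","title":"..."} ]

  # 2. Load: post the documents to Solr's update handler and commit.
  curl -X POST "http://localhost:8983/solr/mycore/update?commit=true" \
    -H "Content-Type: application/json" \
    --data-binary @docs.json

For the near-real-time case, the same POST can be issued per changed record, with a commitWithin parameter in place of the explicit commit.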