Limits + locks when multiple clients /update at once

2021-09-07 Thread Andy Lester
* Are there any constraints as to how to safely call the /update handler for a 
core from multiple clients?

* Are there locking issues we should be aware of?

* Is there any way multiple /update calls can corrupt a core?

Backstory:

Back in 2013, when we first started using Solr 4.2.0, we had problems with our 
core getting corrupted if we tried to have multiple processes run the 
DataImportHandler at the same time.  We updated our app to make sure that could 
no longer happen.  Everything was fine, and we lived on Solr 4.2 for years.

Now, we are running 8.9, and we have moved our indexer from using the DIH to 
using /update handlers.  That works very nicely as well.  We have kept the same 
app constraints that guarantee that our app can only POST to /update one at a
time.

However, we are adding a new core (I’ll call it userinfo) to our Solr instance 
that will require multiple clients to be updating a core at the same time.  
Each time a web user logs in to the site, we will /update a record in the 
userinfo core.  We could have, say, 100 users all updating the same core at the 
same time.  It’s also possible that there could be two clients updating the 
same record in the userinfo core at the same time.

My questions:

1) Are there any limits as to how many clients can post to 
/solr/userinfo/update at once?

2) Are there any problems with multiple clients trying to update the same 
record at the same time?  Will Solr just handle the requests sequentially, and 
the last client POSTing is the one that “wins”?  (I’m talking about updating 
the entire record, not doing partial updates at the field level)

3) Is there any way we could corrupt our Solr core through /update POSTs?

My assumption is that the answers are “No, this is safe to do.”  However, I 
can’t find anything in the docs that explicitly say that.  I also can’t find 
anything in the docs saying “Don’t do that.”  We want to make sure before we 
move forward.
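
For concreteness, here is the shape of the update each login would fire (a sketch; the field names are made up for illustration):

curl 'http://localhost:8983/solr/userinfo/update' \
    -H 'Content-Type: application/json' \
    -d '[ { "id": "user1234", "last_login_dt": "2021-09-07T12:34:56Z" } ]'

(I'm aware of the optimistic-concurrency _version_ field; part of what I'm asking is whether we'd even need it here.)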

Can someone please help point to something to address these questions?

Thanks,
Andy

How can I performance-tune my warming queries?

2021-10-18 Thread Andy Lester
I’m trying to figure out why my warming is taking so long.  It’s taking about 
20-40 seconds on average.   Can I measure where it’s spending its time?


I’ve got my firstSearcher and newSearcher set up like this:




<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      ...
      <str name="q">world</str>
      <str name="sort">popular_score desc, grouping asc, copyrightyear desc, flrid asc</str>
      <str name="rows">2500</str>
      <str name="fq">(languagecode:"eng")</str>
      <str name="fq">(titletype:"BK")</str>
      <str name="fq">((grouping:"1" OR grouping:"2" OR grouping:"4"))</str>
      <str name="fq">(languagecode:"eng" OR solrtype:"N")</str>
      <str name="fq">(ib_searchable:"Y")</str>
      <str name="fq">((grouping:"1" OR grouping:"2"))</str>
    </lst>
    ...
    <lst>
      <str name="q">arrl</str>
      <str name="facet.range.start">0</str>
      <str name="facet.range.end">17.9</str>
      <str name="facet.range.gap">2</str>
      <str name="facet.range.other">before</str>
      <str name="facet.field">itemtypesubcode</str>
      <str name="facet.method">fc</str>
    </lst>
  </arr>
</listener>



All the FQs are the most common FQs that come out of analyzing our app logs.  
There are about 35 of them.  The facet queries are all the facets that our app 
requests.  There are about 25 of them, spread across facet.field, facet.range 
and facet.query.

What I’m afraid of is that one of the warming facets or FQs is taking up all 
the time.  Can I tell where the warmer is spending its time?
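
If no such tool exists, my fallback plan is to replay the warming entries by hand, one fq or facet at a time, and compare QTime with debug=timing. Something along these lines, with placeholder host and core:

curl 'http://localhost:8983/solr/mycore/select' \
    -d 'q=world' \
    -d 'rows=0' \
    -d 'fq=(languagecode:"eng")' \
    -d 'debug=timing' \
    -d 'wt=json'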

Thanks,
Andy

Re: How can I performance-tune my warming queries?

2021-10-18 Thread Andy Lester
Thanks very much for this.  This is a huge help.


> What I would recommend is that (at a time when query traffic is lowest) you 
> turn off all warming, restart, and then do some manual queries where you 
> check each fq and facet individually.  Rebooting or clearing the OS disk 
> cache before each query test would give you worst-case information.

That’s my plan, but I wanted to check first to see if there was a tool to save 
me from the drudgery.


> I would personally remove all the fqs from the newSearcher config and let the 
> filterCache autowarming take care of warming the most commonly used fq 
> values.  Leave them in the firstSearcher and configure solr to use a cold 
> searcher. 

What should we have in the newSearcher startup query, if the new searcher is 
going to bring over the cached FQs from an existing searcher?

Thanks,
Andy




Re: How can I performance-tune my warming queries?

2021-10-18 Thread Andy Lester


> On Oct 18, 2021, at 2:38 PM, Shawn Heisey  wrote:
> 
>> What should we have in the newSearcher startup query, if the new searcher is 
>> going to bring over the cached FQs from an existing searcher?
> 
> I know that filterCache handles autowarming for fq parameters.  I do not know 
> whether queryResultCache stores anything related to facets, but I would guess 
> that it doesn't.
> 
> If your facet fields all have docValues, then I would expect OS disk caching 
> to be the most important thing to have to speed those up, not Solr/Lucene 
> caching.  In the absence of docValues, the data structures required for 
> faceting must be generated in the Java heap, which takes time and memory, 
> potentially a large amount.


This sounds like you’re saying there is no value in having warming queries in 
the newSearcher.  Is that correct?

First cross-join faceting after a commit is slow

2021-11-03 Thread Andy Lester
I’m on Solr 8.10.1 and having a performance problem with my cross-core join 
facets.

Here’s my basic query; the interesting parts are the joins in the two 
facet.query parameters:


curl "$URL" --silent --show-error \
-X POST \
-F "q=($word AND -parent_tracings:($word))" \
-F 'df=title_tracings_t' \
-F 'fl=flrid,nodeid' \
\
-F 'fq=((grouping:"1" OR grouping:"2" OR grouping:"3") OR solrtype:"N")' \
-F 'fq={!tag=grouping}((grouping:"1" OR grouping:"2") OR solrtype:"N")' \
-F 'fq={!tag=languagecode}(languagecode:"eng" OR solrtype:"N")' \
-F 'fq=(tw_searchable:"Y")' \
-F 'facet=on' \
-F "facet.query=(solrtype:T AND !{!join fromIndex=collmatchagg from=flrid 
to=flrid}id:$ID)" \
-F "facet.query=(solrtype:T AND {!join fromIndex=collmatchagg from=flrid 
to=flrid}id:$ID)" \
-F 'rows=0' \
-F 'wt=json' \
-F 'debugQuery=on’ \


The faceting works wonderfully, except for when I make a commit to the 
collmatchagg core, the “to” core in the facet.  The first query we make after a 
commit blocks for multiple seconds, depending on how many records are in the 
core.  With 10,000 records, which is standard load, that blocking is about 8-9 
seconds, which is too long.  If it’s only ten records in the core, the 
blocking is about 1 second.

Here are some things I’ve observed about these slow first queries after a 
commit:

* I’ve tried many different combinations of autowarmCount in the caching for 
collmatchagg, from 0 to do no warming, to 100 to pull in a minimal number of 
records.  Changing these settings does not seem to have any impact on the 
length of the block.

* If q=… in the query matches nothing, then the join is not slow.

* Making a direct query against collmatchagg is not slow after the commit.

* The only thing that seems to control the length of the block on the query is 
the number of records in the collmatchagg core.  It feels like Solr is reading 
the entire core into RAM each time.
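
One workaround I’m experimenting with is firing the join facet myself right after each commit, so that a real user doesn’t eat the cost. A sketch, using the same $URL and $ID as above; I don’t yet know whether it warms the same caches the real query uses:

curl "$URL" --silent \
-X POST \
-F 'q=*:*' \
-F 'rows=0' \
-F 'facet=on' \
-F "facet.query=(solrtype:T AND {!join fromIndex=collmatchagg from=flrid to=flrid}id:$ID)" \
-F 'wt=json' > /dev/null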

Any suggestions or pointers would be helpful, because this is blocking us from 
releasing the feature. 

Thanks,
Andy

Re: NRT Searching (getting it working)

2021-11-17 Thread Andy Lester
> 
> I'm not sure how to understand the solr_gc.log file (but I'd like to)

There’s a product called gceasy at gceasy.io.  You can get a basic report on 
your GC log by uploading it to them for analysis.

Andy

Re: NRT Searching (getting it working)

2021-11-17 Thread Andy Lester


> On Nov 17, 2021, at 12:41 PM, Derek C  wrote:
> 
> That's an amazing online tool - thanks Andy

It was Shawn Heisey who pointed me to it.  There are many other JVM GC tools out 
there if you search a bit.

https://sematext.com/blog/java-gc-log-analysis-tools/

Re: Searcher and autoSoftCommits + softCommit

2021-11-24 Thread Andy Lester


> 
> You were spot on, commitWithin was being set on each commit. I was able to
> verify by temporarily turning on debug logging for DirectUpdateHandler2



What did you do to enable that? I didn’t know you could do such a thing.  Is it 
specific to that handler?

Thanks
Andy 


Is the showItems argument for fieldValueCache used for anything?

2021-12-03 Thread Andy Lester
It looks to me like the showItems argument for the fieldValueCache is not used. 
 I can’t find any documentation of it, although it was mentioned in the 
changelog for v8.1.0.

I looked through the source and I can’t see where the value is used. It gets a 
default value, but never seems to be used, based on my understanding of the 
source code.

$ grep -w showItems -R .
./solr/core/src/java/org/apache/solr/core/SolrConfig.java:    args.put("showItems", "-1");
./solr/core/src/java/org/apache/solr/search/SolrCache.java:  String SHOW_ITEMS_PARAM = "showItems";
./solr/core/src/test-files/solr/configsets/exitable-directory/conf/solrconfig.xml: ... showItems="0" />
./solr/server/solr/configsets/_default/conf/solrconfig.xml: ... showItems="32" />
./solr/server/solr/configsets/sample_techproducts_configs/conf/solrconfig.xml: ... showItems="32" />
./solr/CHANGES.txt:* SOLR-13432: Add .toString methods to BitDocSet and SortedIntDocSet so that enabling "showItems" on the filter caches

$ grep -w SHOW_ITEMS_PARAM -R .
./solr/core/src/java/org/apache/solr/search/SolrCache.java:  String SHOW_ITEMS_PARAM = "showItems";

Is it a leftover from long ago?  If so, I’ll put in a patch to remove its last 
vestiges from the source and from the sample config files.

Thanks,
Andy

Re: Is the showItems argument for fieldValueCache used for anything?

2021-12-03 Thread Andy Lester


> On Dec 3, 2021, at 9:39 AM, Mikhail Khludnev  wrote:
> 
> It seems it's gone https://issues.apache.org/jira/browse/SOLR-15762 
> 
> I will miss showItems, it was really useful a long ago.

I don’t understand how that ticket relates.  I don’t see any mention of 
showItems in it.

Andy

Re: Not able to write solr logs in json format

2021-12-03 Thread Andy Lester


> On Dec 3, 2021, at 9:55 AM, Kakolu, Karthik  
> wrote:
> 
> Trying to write solr.log in json format but unsuccessful.


How exactly are you unsuccessful?

Do you get logs but they aren’t in JSON as expected? If so, what format are 
they in?

Are you able to write logs in other formats than JSON?

Do you get any logs at all?




Re: multiple values encountered for non multiValued field Solr 8.10.1

2021-12-09 Thread Andy Lester

> We are trying to index documents for a collection. This worked in Solr 3.6.1, 
> but running this search under Solr 8.10.1 generates the below error.
> 
> multiple values encountered for non multiValued field
> 
> Any help would be appreciated, and feel free to ask if you need further info.


We need to know a lot more to help you track it down.

What query are you running that gives you the error?  Do all queries cause that 
error or only certain ones?

Have you tried modifying the query to see if there’s a certain field that 
causes the error to appear or disappear?

How did you convert from Solr 3.6.1 to 8.10.1?  Did you do a full reindex of 
the cores?

Have you searched for that error message online?  This StackOverflow answer is 
the first thing that shows up, and summarizes what I expect is the problem: 
https://stackoverflow.com/questions/17521287/solr-3-5-mutliple-values-encountered-for-non-multivalued-field
 


Andy

Re: 0-day Apache log4j RCE vulnerability

2021-12-10 Thread Andy Lester
I trust that by now you’ve seen the discussion earlier today on this mailing 
list about it. 


Re: Zookeeper and Solr and CVE-2021-44228

2021-12-13 Thread Andy Lester


> On Dec 13, 2021, at 8:20 AM, Michael Conrad  wrote:
> 
> I presume this also needs fixing for zookeeper nodes?

Anything that logs with log4j.

Re: [ANNOUNCEMENT] Solr's Docker images were updated to remediate a CVE

2021-12-13 Thread Andy Lester
For those of you who, like me, want to set the variable explicitly rather than 
rely on which of the two Docker images with the same tag you’re pulling down, 
and who use a Dockerfile to build your own Solr image, add these lines:

# Add option to mitigate log4j security vulnerability.
USER root
RUN echo 'SOLR_OPTS="$SOLR_OPTS -Dlog4j2.formatMsgNoLookups=true"' >> /etc/default/solr.in.sh
USER solr

Andy

Re: When will solr 8.11.1 become available?

2021-12-13 Thread Andy Lester
> It is impossible to give you an accurate prediction for the release date of 
> Solr 8.11.1.

It sounds like it’s safe to say that the release will be “on the order of at 
least a week from now,” right? 

That might be all the accuracy that someone needs.



Re: Question Apache Solr 7.7.0, 8.7 and 8.9 - log4j vulnerability

2021-12-14 Thread Andy Lester


> On Dec 14, 2021, at 9:00 AM, Manisha Rahatadkar 
>  wrote:
> 
> We are using  Apache Solr 7.7.0, 8.7 and 8.9 on Windows and Linux 
> environment. What mitigation option do we need to take for this vulnerability?


https://solr.apache.org/security.html#apache-solr-affected-by-apache-log4j-cve-2021-44228

Re: Log4J saga (CVE-2021-45046)

2021-12-15 Thread Andy Lester


> 
> Is there already an Idea when 8.11.1 is supposed to be released ?


This was discussed yesterday. Check the archives for the full explanation. 

Short version: can’t give a definite date but it will be no sooner than a week 
from now. 




Log4j remediation in the Docker image

2021-12-16 Thread Andy Lester


> On Dec 16, 2021, at 8:26 AM, Carlos Cueto  wrote:
> 
> Any idea when it will be available on Docker Hub? 8.11.1 tag is still not
> added.


I don’t know, but yesterday I went and changed my build process for our Docker 
image of Solr to delete the JNDI classes from the jar files as a stopgap until 
a proper 8.11.1 came out. See 
https://logging.apache.org/log4j/2.x/security.html for details.

This is how I did it.

To be able to delete the class files, one must use zip, so I had to install 
that in the container. To install zip, I had to set a fake User-Agent for 
apt-get so the mirrors wouldn’t refuse the requests.

Here is my Dockerfile:

FROM solr:8.11.0

# https://hub.docker.com/_/solr/
# https://github.com/docker-solr/docker-solr#extending-the-image
# https://solr.apache.org/docs/8_11_0/changes/Changes.html

# The SOLR_ vars override defaults.  See /etc/default/solr.in.sh in the container for more.

ENV \
TZ=America/Chicago \
SOLR_TIMEZONE=America/Chicago \
SOLR_HEAP=20g


USER root
RUN \
echo 'Installing additional packages' \
&& echo 'Create new agent to get around apt-get bugs per https://lists.debian.org/debian-user/2019/10/msg00629.html' \
&& echo 'Acquire' > /etc/apt/apt.conf.d/99useragent \
&& echo '{' >> /etc/apt/apt.conf.d/99useragent \
&& echo '  http::User-Agent "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0";' >> /etc/apt/apt.conf.d/99useragent \
&& echo '};' >> /etc/apt/apt.conf.d/99useragent \
&& echo 'Done populating user agent' \
&& echo 'Installing zip' \
&& apt-get update \
&& apt-get install -y zip \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean all \
&& echo 'Done installing additional packages' \
&& echo 'Delete JNDI from the log4j files in both Solr and the exporter' \
&& zip -q -d /opt/solr-8.11.0/contrib/prometheus-exporter/lib/log4j-core-2.14.1.jar org/apache/logging/log4j/core/lookup/JndiLookup.class \
&& zip -q -d /opt/solr-8.11.0/server/lib/ext/log4j-core-2.14.1.jar org/apache/logging/log4j/core/lookup/JndiLookup.class \
&& echo 'Deleted JNDI from jars'
USER solr
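
To double-check that the class is really gone afterward, you can list the jar contents (unzip installs the same way zip did):

unzip -l /opt/solr-8.11.0/server/lib/ext/log4j-core-2.14.1.jar | grep JndiLookup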

I hope this helps someone.

Andy

Re: Question about unintended deletions of Solr documents

2021-12-18 Thread Andy Lester
Can you please post specific queries that you are trying? Cut and pasted, not 
paraphrases. 

> On Dec 18, 2021, at 3:13 PM, Claire Burke  wrote:
> 
> If I enter a query in the q field (which is associated with the 
> Request-Handler (qt) type /select), then I enter a delete query (which is 
> associated with the Request-Handler (qt) type /update), I am finding that I 
> did not just delete what was in the q field, based on the before and after 
> snapshots of total documents and lw_data_source_s.
> 
> How can I construct my delete query so that I am only deleting what is in the 
> q field in my /select query?
> 
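
(For reference, a delete-by-query that mirrors a /select takes the same query string. A sketch, with a hypothetical query:

curl 'http://localhost:8983/solr/mycore/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '{ "delete": { "query": "lw_data_source_s:some_source" } }'

Note that any fq parameters on your /select do NOT apply to the delete; if you filtered with fq, those clauses have to be folded into the delete query itself.)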



Re: reloading all the cores

2022-01-03 Thread Andy Lester
> Is there an http request that can make all the cores on a solr server
> reload?

I don’t think there is, but you can use the STATUS API call 
(https://solr.apache.org/guide/8_11/coreadmin-api.html) to get a list of all 
the cores so that you can call the RELOAD command.
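
A sketch of that, assuming jq is available, with a placeholder host and port:

for core in $(curl -s 'http://localhost:8983/solr/admin/cores?action=STATUS&wt=json' | jq -r '.status | keys[]'); do
    curl -s "http://localhost:8983/solr/admin/cores?action=RELOAD&core=$core"
done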

Andy

Re: Problem with Join query and FilterCache

2022-01-20 Thread Andy Lester


> On Jan 20, 2022, at 8:02 AM, Mike Drob  wrote:
> 
> Yep, you should change from LRUCache to CaffeineCache in your solrconfig.xml


And, the CaffeineCache has to be set as async. It defaults to async, but I 
added async="true" to be explicit in my solrconfig.xml



Re: SOLR 8.11.1 :: VELOCITY :: Can't access JAVA-object's static methods

2022-02-03 Thread Andy Lester


> On Feb 3, 2022, at 3:03 AM, Jan Høydahl  wrote:
> 
> This is/was a security hole and a big anti-pattern.

Is this still possible in 8.x? If so, I think it would be worth putting in the 
docs that it can be a security problem.  I can probably do that.

Andy

Re: Vulnerability on solr port

2022-02-14 Thread Andy Lester


> On Feb 14, 2022, at 3:35 AM, Anchal Sharma2 wrote:
> 
> We have got following vulnerability on port where apache solr is running on 
> few of our servers .Does anyone have any ideas/suggestions on how to mitigate 
> this ?
> Vulnerability ->  Web Server HTTP Header Internal IP Disclosure 8983


We’d need to know more about your situation. It sounds to me like your boss or 
maybe your corporate IT department ran some kind of vulnerability scanner and 
handed you the report saying “Fix this”.  

What exactly does the report say?  Why does it think your HTTP headers are 
disclosing an IP address?

Andy

Re: Solr 8.6.2 - download full data

2022-02-25 Thread Andy Lester


> On Feb 25, 2022, at 8:19 AM, Anuj Bhargava  wrote:
> 
> I tried with 1 records. Need to download all
> http://xx.xxx.xxx.xxx:8983/solr/data_2019/select?q=*%3A*&rows=1&wt=csv 
> 
Use a command-line utility like curl or wget to pull it down.
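
For example, with your core name and a rows value bumped past the total document count:

curl 'http://xx.xxx.xxx.xxx:8983/solr/data_2019/select?q=*:*&rows=10000000&wt=csv' -o data_2019.csv

If the core is very large, paging through it with the cursorMark parameter in a loop is kinder to the server than one giant request.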

Re: Solr 8.11.1 upgrading LOG4J from 2.16 to 2.17

2022-03-23 Thread Andy Lester


> On Mar 23, 2022, at 1:36 PM, Heller, George A III CTR (USA) 
>  wrote:
> 
> Can someone tell me where I can download an upgrade or patch for LOG4J and 
> instructions on how to implement it?
> 


See https://solr.apache.org/security.html

Re: Solr 8.11.1 upgrading LOG4J from 2.16 to 2.17

2022-03-23 Thread Andy Lester
Go to the https://solr.apache.org/security.html URL and you will find 
instructions there on what to do.

Andy

Re: Problem with facet in SOLR

2022-03-31 Thread Andy Lester
> I have indexed 4 fields and want to use facet on "taxo_domain_mother" but i 
> am not getting any result

It looks like you don’t have facet.field=taxo_domain_mother specified in your 
query.   It’s hard to tell exactly because screenshots make it difficult to 
figure things out.



> Do i need to configure something for "taxo_domain_mother" field so that it 
> can be used as facet ?

For performance, you want docValues="true" on any field you facet on.

Try adding a facet.field=taxo_domain_mother to your query.  If it doesn’t work, 
and you need more help, please cut & paste the exact text of the query, and 
then the exact text of the response.  Or make a Gist on GitHub.
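
If you want a quick sanity check first, a minimal facet query looks like this (placeholder host and core):

curl 'http://localhost:8983/solr/mycore/select' \
    -d 'q=*:*' \
    -d 'rows=0' \
    -d 'facet=on' \
    -d 'facet.field=taxo_domain_mother' \
    -d 'wt=json'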

Andy



Re: Problem with facet in SOLR

2022-04-01 Thread Andy Lester


> On Apr 1, 2022, at 3:59 AM, Neha Gupta  wrote:
> 
> Now I have set docValues=true for this attribute and now facet is working.
> 
> Just want to know is it necessary to set the docValue to true to make facet 
> working for the attribute?

I don’t think docValues is strictly necessary, but it will make things faster. 
The field is going to have to be either indexed=true or docValues=true.

If you’re still looking for help, we need to see exact queries you’re making, 
and the exact responses you’re getting.

Andy

Re: Regarding maximum number of documents that can be returned safely from SOLR to Java Application.

2022-04-27 Thread Andy Lester


> On Apr 27, 2022, at 3:23 PM, Neha Gupta  wrote:
> 
> Just for information I will be firing queries from Java application to SOLR 
> using SOLRJ and would like to know how much maximum documents (i.e  maximum 
> number of rows that i can request in the query) can be returned safely from 
> SOLR.

It’s impossible to answer that. First, how do you mean “safe”? How big are your 
documents?

Let’s turn it around. Do you have a number in mind where you’re wondering if 
Solr can handle it? Like you’re thinking “Can Solr handle 10 million documents 
averaging 10K each”?  That’s much easier to address.

Andy

Re: Regarding maximum number of documents that can be returned safely from SOLR to Java Application.

2022-04-27 Thread Andy Lester
> 
> So  my question is if i request in one request lets say approximate 10K 
> documents using SOLRJ will that be OK. By safe here i mean approx. maximum 
> number of documents that i can request without causing any problem in 
> receiving a response from SOLR.

I’m still not clear what you’re asking. Are you asking if Solr can handle 
returning 10K docs in a result set? It seems like it to me, but “causing any 
problem” could be pretty much anything.

Do you have reason to think that it would be a problem? Did you try something 
that failed? If so, what did you try and what happened?

And if you haven’t tried it out, then I’d suggest you do that.

Andy

Re: Regarding maximum number of documents that can be returned safely from SOLR to Java Application.

2022-04-28 Thread Andy Lester
> 1. Why do you need to return so many search results at the same time? If
> it's a typical search usecase, could you not work with some manageable list
> of documents, say 50/100? But I'm guessing this is not a typical search
> that you're planning to support.


I’d just like to point out that Neha may have a use case that isn’t the typical 
“do a keyword search and return search results, Google-like 10 at a time.” I 
know that may well be the most common use case for Solr, but some of us don’t 
do that.
that.

For example, we use Solr to find up to 2500 matching documents, return their 
IDs and some facets, and then the app takes it from there to do the 
presentation and paging. For us, Solr can’t do the paging we need it to do. 
Getting back thousands of records that are fairly small (just some IDs) is 
something Solr does just fine at.

Andy

Collapsing on a field works, but expand=true does nothing

2022-05-31 Thread Andy Lester
I’m working on the collapse/expand functionality. I query for books, and I want 
to collapse my search results on their tf1flrid, basically a family ID.  I also 
want to do an expand so I can see what titles were collapsed out.  I’m looking 
at the docs here: 
https://solr.apache.org/guide/8_11/collapse-and-expand-results.html 


Here’s a gist of my working query at 
https://gist.github.com/petdance/34259dee2944a455341748a0e2ef2092 


I make the query, I get back 18 titles.  There are rows in the result that have 
duplicated tf1flrid fields. This is what I expect.

Then I try it again, 

curl "$URL" --silent --show-error \
-X POST \
-F "q=title_tracings_t:\"$Q\"" \
-F 'fl=title,flrid,tf1flrid' \
-F 'fq={!collapse field=tf1flrid nullPolicy=expand}' \
-F 'expand=true' \
-F 'rows=50' \
-F 'sort=tf1flrid asc' \
-F 'wt=json' \
| jq -S .

and here’s the results: 
https://gist.github.com/petdance/f203c7c2bf0178e0d0c1596999801ae5 


I get back 12 titles, and the rows with duplicated tf1flrid values that were 
duplicated in the first query (1638JX7 and 1638PQ3) have been collapsed. 
That’s also as I expect. However, I don’t have an “expand” section that shows 
what the collapsed fields were.  I don’t see any errors in my solr.log.

What am I doing wrong?  What do I need to do to get the expand section to be 
returned?

Thanks,
Andy

Re: Collapsing on a field works, but expand=true does nothing

2022-06-01 Thread Andy Lester
> Is it possible that the expand component isn't registered in your
> deployment? The expand component is a default component but have you
> overridden the defaults?

Yes, that’s exactly what happened.  Turns out that I had pulled out the unused 
handlers.

Thanks,
Andy 

Re: Solr compatibility with Oracle Database 19c Database

2022-06-08 Thread Andy Lester


> On Jun 8, 2022, at 2:35 PM, Yennam, M  wrote:
> 
> We are currently using Solr 4.9.0 which is connecting Oracle 12cR1 and we are 
> planning to upgrade our Database to Oracle 19c. So, the question that I have 
> is – Is SOLR 4.9.0 compatible with Oracle 19c, if not what is the minimum 
> version of SOLR that supports Oracle 19c database.

How are you getting data from Oracle into Solr? Are you using the 
DataImportHandler? If you’re not using the DIH, then I don’t think you’re 
connecting to Oracle directly, and then it’s a non-issue.

Andy

Re: Solr compatibility with Oracle Database 19c Database

2022-06-08 Thread Andy Lester
> some folks who care enough to contribute fixes to it. Using another tool or
> custom code to query the database and submit updates via the solr JSON api
> or SolrJ client is currently recommended over DIH.

That’s why I wrote a tool that does the exporting from Oracle, massages the 
results into JSON, and POSTs them to Solr. We did that before we migrated from 
Solr 4 to Solr 8, which is the same migration OP is looking at.

The big benefit of this is that it allowed me to have multiple importers 
running at once.  A full reindex went from taking 8 hours via the DIH to taking 
about 90 minutes with 10 importers running.

It also means that we don’t have to worry about the DIH connection as we 
migrate from Oracle 12 to Oracle 19, as OP is.  OP seems to be in the same 
situation I was in a year ago.

Andy

Re: Solr faceting

2022-07-15 Thread Andy Lester


> I would like to know if solr adds faceting fields by default when we do any
> search. We have an id field which is  still coming up in field cache even
> after it is removed from sorting and faceting queries in the code.
> Therefore, I would like to know if solr is adding this by any chance.

No, there is no faceting by default.

Can you give us more details about the query you’re calling and how you have 
unexpected activity in the field cache?

You should be able to narrow it down to where you run one query, and it shows 
things in the field cache. When you get that, then show us the query, and we 
can help figure out why it’s not doing what you expect.

Andy

Re: Solr faceting

2022-07-18 Thread Andy Lester


> On Jul 18, 2022, at 3:11 AM, Poorna Murali  wrote:
> 
> There is a  solr search api which does not have either sorting or faceting
> done with id field. But, after we execute the API, we do see id field entry
> in field cache.  I checked the solrconfig file too, we have not added any
> id configuration that could have caused this.
> Please help me clarify how this is happening.

We need to see actual code that you’re executing.

You’ve described the problem in English, and now we need to see the actual code 
that is being called. Maybe you’ve made a mistake in the query, but we can’t 
tell that from just your natural language description of what you did.  We need 
to see actual code.

So, please cut & paste an actual query that you are making to Solr that causes 
the results that you think are incorrect.  Also, please tell us exactly what 
the effects of the code are. For example it might be something like:

“I started from a fresh restart of the Solr database. I looked on the 
(whatever) screen and it showed that such-and-such counter was 1.  Then I 
executed the following query. The query gave me the results I expected, which 
I’ve pasted below. However, when I went back to the (whatever) screen, I saw 
that the counter for (whatever) showed 3, where before it had only shown 1. My 
understanding is that the counter should not have changed as a result of my 
query.”

It’s all about explaining in detail what you did, and what you expect, and what 
happened that was unexpected.

Thanks,
Andy

Re: Retain Data Import Handler In Solr9.0

2022-07-22 Thread Andy Lester


> On Jul 22, 2022, at 1:19 PM, dmitri maziuk  wrote:
> 
>> The DIH does not yet support Solr 9 but I don't think it'll be long before
>> it does.
> 
> FWIW I've been gradually migrating our DIH imports to little python scripts; 
> with all the extra things you can do in those, and less bloat in the main 
> JVM, you gotta wonder how much interest there's gonna be in keeping that 
> alive long-term.


And I’m sure the DIH is slower, too.

We used to have the DIH pull from our Oracle database.  It took about 10 hours 
to do all 45M records.

I migrated to simple Perl program that pulled from Oracle, created JSON and 
sent it to the update handlers. We can easily run 10 in parallel and finish it 
off in about 45 minutes.

Andy

Re: Retain Data Import Handler In Solr9.0

2022-07-22 Thread Andy Lester


> On Jul 22, 2022, at 1:39 PM, Dave  wrote:
> 
> Oh look into perls fork manager module, 
> 
> https://metacpan.org/pod/Parallel::ForkManager 
> 

I’m aware of the numerous tools like that (I’ve been doing Perl since the 90s 
https://metacpan.org/author/PETDANCE), but for as often as we have to do the 
full import (maybe every couple of months on a schema change) it was easier to 
just assign 1/10th of the records to each of ten updaters that run 
concurrently.  For normal day-to-day incremental, our updater runs every five 
or ten minutes and sends them to Solr.
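
The launcher is about as simple as it gets. Roughly this, where --slice and --of are our own indexer’s arguments, not anything Solr provides:

# Launch ten indexers, each taking a tenth of the records.
for i in $(seq 0 9); do
    index-titles --slice $i --of 10 &
done
wait    # block until all ten finish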

The other huge win was getting core swapping working.  Build the new core with 
the new schema, index it for an hour, and swap old with new.  So nice.  No 
downtime for schema changes.

Andy

Re: Fastest way to index data to solr

2022-09-29 Thread Andy Lester



> On Sep 29, 2022, at 4:17 AM, Jan Høydahl  wrote:
> 
> * Index with multiple threads on the client, experiment to find a good number 
> based on the number of CPUs on receiving side

That may also mean having multiple clients. We went from taking about 8 hours 
to index our entire 42M rows to about 1.5 hours because we ran 10 indexer 
clients at once. Each indexer takes roughly 1/10th of the data and churns away. 
We don't have any of the clients do a commit. After the indexers are done, we 
run one more time through the queue with a commit at the end.
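
That final commit is just one more update request with commit=true. For example, with a placeholder host and core:

curl 'http://localhost:8983/solr/mycore/update?commit=true'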

As Jan says, make sure it's not your database that is the bottleneck, and 
experiment with how many clients you want to have going at once.

Andy

Re: Fastest way to index data to solr

2022-09-30 Thread Andy Lester
I can’t imagine a case where the speed of parsing the input data isn’t dwarfed 
by the time spent on everything else. Parsing is an in-memory operation; the 
rest of the indexing pipeline does a ton of I/O. 

It’s not going to make a noticeable difference one way or the other. 

> I have a followup question. Is JSON parsed faster than XML by Solr



Re: HTTP errors POSTing to 8.11.2

2022-10-27 Thread Andy Lester



> On Oct 27, 2022, at 5:44 PM, dmitri maziuk  wrote:
> 
> has anyone gone through the exercise of replacing Data Import Handler with 
> scripts that POST JSON and if so, are your scripts still working OK with 
> 8.11.2?

That's exactly what I've done a couple of years ago and they work just fine on 
our install of 8.11.1.




Re: Using the fq parameter to filter for a value that is multivalued field.

2022-12-09 Thread Andy Lester



> On Dec 9, 2022, at 11:22 AM, Matthew Castrigno  wrote:
> 
> "myField":["apple, pear"]


That's not multivalued.  That's a single value and the value is "apple, pear".

You need to pass multiple values to Solr for the field when you do your 
indexing.  Basically, you need to pass one myField:apple and another 
myField:pear to the indexer when you add records.
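
In JSON update terms, the difference is a real array versus one comma-joined string. A sketch, with a hypothetical id:

curl 'http://localhost:8983/solr/mycore/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[ { "id": "doc1", "myField": ["apple", "pear"] } ]'

That indexes two values. Sending "myField": ["apple, pear"], one element containing a comma, produces the single value you're seeing now.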

Andy

Re: Solr Query time performance

2023-01-29 Thread Andy Lester



> On Jan 29, 2023, at 4:45 AM, marc nicole  wrote:
> 
> Let's say you're right about the 200 rows being too few. From which row
> count I can see the difference reflected in the results as it should (Solr
> faster)?

It depends on how much data is in each record, but I'd think 10,000 - 100,000 
is a starting point.

Andy

Re: Query time

2023-02-08 Thread Andy Lester
Please include your schema and some sample queries so we have specifics to go 
on. 

> On Feb 8, 2023, at 9:00 AM, Mike  wrote:
> 
> I have a standalone Solr server and an index of millions of documents.
> Some queries that e.g. more than 1 million times exist takes a long time.
> I only need the first 100 results, can I make solr stop ranking and sort by
> the first 100 hits?
> How can i limit the search time of sometimes more than 10 seconds?
> 
> Thanks
> Mike



Re: SOLR security scan question

2023-02-15 Thread Andy Lester
> Any news on this?
> 
> We know some of them are covered in 
> https://solr.apache.org/security.html#cve-reports-for-apache-solr-dependencies
>  but not all.
> We have also seen the 
> https://lists.apache.org/thread/539bkq8r11msjpl3yo1ssvy77kmdrps7
> Can we have a resolution for the above?


What sort of resolution are you looking for?

Re: join query parser performance

2023-05-25 Thread Andy Lester



> On May 25, 2023, at 7:51 AM, Ron Haines  wrote:
> 
> So, when this feature is enabled, this negative &fq gets added:
> -{!join fromIndex=primary_rollup from=group_id_mv to=group_member_id
> score=none}${q}


Can we see collection definitions of both the source collection and the join? 
Also, a sample query, not just the one parameter? Also, how often are either of 
these collections updated? One thing that killed off an entire project that we 
were doing was that the join table was getting updated about once a minute, and 
this destroyed all our caching, and made the queries we wanted to do unusable.


Thanks,
Andy

Re: End of Life of Solr 8.11?

2023-06-01 Thread Andy Lester
Do we have a rough idea of when we think Solr 10 will be? A few months? A year?

I just have an upgrade project to Solr 9 on the horizon and might hold off on 
it a bit if Solr 10 is imminent. 

> Yes, it is EOL the same date that 10.0 is shipped. The release date of 10.0 
> is yet to be determined. I expect a few more 9.x releases to happen first 
> though, and most likely we'll wait for Lucene 10 to ship first.



Re: Solr Crawl Error

2023-08-31 Thread Andy Lester
I don't know what tool you're using to do the crawling, because Solr itself 
does not crawl. It just indexes text.  So you must have some other tool that is 
fetching data and feeding it to Solr.


> WARNING: Solr returned an error #404 Not Found
> WARNING: IOException while reading response: java.io.FileNotFoundException: 
> https://localhost:8986/solr/childrens/update/extract?literal.id=https%3A%2F%2Fdev103-choa.choa.org%2Ftestpodcasts%2Fam-test-11-10-22&literal.url=https%3A%2F%2Fdev103-choa.choa.org%2Ftestpodcasts%2Fam-test-11-10-22
> ERROR: THREAD 4: https://dev103-choa.choa.org/testpodcasts/am-test-11-10-22 
> FAILED


"404 Not Found" is an HTTP status code that says "I tried to fetch content from 
a URL, but there is no content at the URL." Beyond that, we can't say.  Maybe 
the crawling tool fetched a page, but that page has a link to a non-existent 
URL.

Andy

Re: Performance of solr 9.3 vs 8.11

2023-10-12 Thread Andy Lester



> On Oct 12, 2023, at 12:54 PM, Natarajan, Rajeswari 
>  wrote:
> 
> Does anyone see any query performance degradation from 8.11 to 9.3. Please 
> let me know


What is it that you are really asking?

Are you wondering if there will be a slowdown if you upgrade from 8.11 to 9.3? 
If so, we need to know more details.

Are you observing what seems to be a query performance degradation after 
upgrading from 8.11 to 9.3, and you're trying to figure out why?  If so, we 
need to know more details.

Andy

Re: How to do fastest loading and indexing

2023-11-12 Thread Andy Lester



> On Nov 12, 2023, at 9:16 AM, Vince McMahon  
> wrote:
> 
> So, if I split the single cvs into two and using two programs sending each
> of the splits, Solr will handle the parallel loading with multiple
> threads.  I don't have to make changes to Solr, right?


Yes, that's correct.

We were loading 40M records in about 8 hours through the DIH. That's about 5M 
records per hour, which is roughly what you are getting (100M records in 20 
hours).

When the DIH was removed from core Solr, it gave us the impetus to switch over 
to the update handlers. Switching to the update handler let us run multiple 
importers at a time.  Now, if I run 10 importers simultaneously, importing 
about 4M records each, we can load those 40M records in about 90 minutes.  
That's about 25M rows per hour.  Note that 10 importers didn't speed things up 
10x.  It sped up about 5x.  

I don't know what kind of speed target you're trying to hit. If you're hoping 
to do 100M rows in 30 minutes, that may not be possible. It may be that down 
the road after experimenting with different levels of concurrency and JVM and 
tuning and whatnot, you find that the best that you can do is 100M rows in, 
say, 3 hours, and you'll have to be OK with that.  Or your boss may have to be 
OK with that. There's a joke that says "If you tell a programmer they have to 
run a mile in 3 minutes, the programmer will start putting on his running 
shoes", without considering "Is what I'm being asked to do even possible."

If you're trying to speed up a process, you're going to need to run a lot of 
tests and track a lot of numbers. Try it with 5 indexers, and see what kind of 
throughput you get. Then try it with 10 and see what happens. Measure measure 
measure.

Also, the best way to make things go faster is to do less work. Are all the 
fields you're creating necessary? Can you turn some of them into non-indexed 
fields? Do you really have to do all 100M records every time? What if only 20M 
of those records change each time. Maybe you write some code that determines 
which 20M rows need to be updated, and only index those. You'll immediately get 
a 5x speedup because you're only doing 1/5th the work.

For example, sometimes we have to do a bulk load and I have a program that 
queries each record in the Oracle database against what is indexed in Solr, and 
compares them. The records that differ get dumped in to a file and that's the 
file that gets loaded. If it takes 20 minutes to run that process, but I find I 
only need to load 10% of the data, then that's a win.

An excellent book that I'm currently reading is "How To Make Things Faster" and 
it's filled with all sorts of tips and lessons about things like this: 
https://www.amazon.com/How-Make-Things-Faster-Performance/dp/1098147065

Finally, somewhere you asked if JSON would be faster than CSV to load. I have 
not measured, but I am certain that the bottleneck in the indexing process is 
not in the parsing of the input data.  So, no, CSV vs. JSON doesn't matter.

Andy



Re: facet query question

2023-11-16 Thread Andy Lester



> Does Solr have something caching results for facet queries over large
> dataset?  Is there example how to make facet query faster?



Yes.  There are many articles about query caching in Solr, plus the docs.  
https://solr.apache.org/guide/8_8/query-settings-in-solrconfig.html for one 
version.

Andy

Re: Will solr support in AWS/Azure Cloud platform

2024-08-27 Thread Andy Lester
> Our application moving to cloud... So all the components including solr we
> planning to move to cloud.
> 
> Can we able to install Solr in AWS/Azure Cloud platform, anybody did that
> before.

Yes, you can run Solr on cloud platforms. Yes, people have done that.

"Cloud" just means that it's a machine that you don't own. It's all just 
computers. There is nothing special about Azure computers or AWS computers that 
make it so that Solr can't run on them. Note that AWS and Azure are two 
different platforms and not at all related.

Depending on what your Solr needs are and what platform you're moving to, you 
may have simplified solutions that are prebuilt.

Andy

Re: Advice on ways forward with or without Data Import Handler

2025-05-29 Thread Andy Lester


> We’ve been using Solr with DIH for about 8 years or so but now we’re hitting 
> an impasse with DIH being deprecated in Solr 9. Additionally, I’m looking to 
> move our Solr deploy to Kubernetes and I’ve been struggling to figure out 
> what to do with the DIH component in a cloud setting.

I suggest abandoning the DIH. I've done it and I'm glad we did. It makes things 
faster, more flexible and easier to maintain.

Here's what we did.

We were using the DIH to go and do SQL queries against our Oracle DB and then 
import them into our main searching core (40M book titles). We had a homemade 
scheduling mechanism set up to make the DIH run fairly often throughout the day 
to get updates out of Oracle, but we also had semaphores set up to disallow 
multiple runs of the DIH at once because that was Very Bad to do.

We threw that all away in favor of a tool (imaginatively called index-titles) 
that does the same basic query against the Oracle DB. index-titles massages the 
query results into appropriate JSON format, 5,000 at a time, and then POSTs 
them to /core/update and they get imported. When it's all done, the final POST 
is to /core/update?commit=true to make the commit happen. Typically we'll have 
100,000 titles that need updating, a few times throughout the day.
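
The Solr side of that is plain HTTP. Roughly this, with placeholder host and batch file names:

# POST one batch of 5,000 docs as a JSON array.
curl 'http://solrhost:8983/solr/core/update' \
    -H 'Content-Type: application/json' \
    -d @batch-0001.json

# ...one POST per batch, then a single commit at the end.
curl 'http://solrhost:8983/solr/core/update?commit=true'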

There are many advantages to this.

1) Having the indexing program push data to Solr gives much more flexibility. I 
can run that indexer on any box that can make Oracle queries and POST to the 
Solr box.

2) It stops Solr from having to talk to Oracle itself. This was actually what 
triggered us making this happen, because we were moving from local hardware to 
Azure, and we wanted to be able to containerize Solr and not have to have Solr 
be able to talk to an Oracle client. Now the indexer program does the 
connecting to Oracle, which many of our other programs do already. Solr doesn't 
know anything about where its records are coming from, nor does it need to.

2a) Not having to build a custom Solr that can talk to Oracle means we can now 
run a stock Docker container that doesn't need to have an Oracle client 
installed.

3) We can run multiple instances of index-titles, which is a huge speedup if we 
have to do a full reindex. I can start up 10 different index-title runs (on 
different machines if I wanted) and tell each index-title instance to take 
1/10th a slice of the queue of records to import. Reindexing the full 40M 
titles into a new core used to take 8+ hours. With 10 index-title running, it's 
just over an hour.

4) All this speed and flexibility has given us the ability to easily have 
different developers have their own Solr core if they want. Now, it's easy to 
start up a Docker container with an empty core in it and reindex your own copy 
of the core in an hour. It used to be a nightmare to work on core schema 
changes. Now that work can happen in isolation.

Abandon the DIH. It will take some work but you'll be so glad you did down the 
line.

Andy