What does the CPU utilization look like while that query is executing? If it is using 100% of one CPU, then it is CPU limited. If it is using less than 100% of one CPU, then it is IO limited.
Regardless, that is a VERY expensive query. A shared EFS disk is a poor system design for Solr. Each node should have its own EBS volume, preferably GP3.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 28, 2024, at 7:51 PM, Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID> wrote:
>
> I did send the query. Here it is:
>
> http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true
>
> I suppose all the indexes are about 150 GB, so you are close.
>
> I set the limit to 10,000 or 5000 for these tests. Setting the limit at 10 or 50 would mean that there would need to be 1000-2000 requests. That seems like an awful lot to me.
>
> That is interesting about the export. I will look into other types of data collection.
>
> Also, there is no quota on the EFS. It is apparently encrypted both ways. But if it is fast the one time, rebooting Solr shouldn't affect how it uses disk access.
>
> Jim Beale
> Lead Software Engineer
> hibu.com
> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
> Office: 610-879-3864
> Mobile: 610-220-3067
>
> -----Original Message-----
> From: Gus Heck <gus.h...@gmail.com>
> Sent: Wednesday, February 28, 2024 9:22 PM
> To: users@solr.apache.org
> Subject: Re: [EXTERNAL] Re: Is this list alive? I need help
>
> Your description leads me to believe that at worst you have ~20M docs in one index. If the average doc size is 5k or so, that sounds like 100 GB. This is smallish, and across 3 machines it ought to be fine. Your time 1 values are very slow to begin with. Unfortunately you didn't send us the query, only the code that generates the query. A key bit not shown is what value you are passing in for limit (which is then set for rows). It *should* be something like 10 or 25 or 50. It should NOT be 1000 or 99999, etc., but the fact that you have hardcoded the start to zero makes me think you are not paging and you are doing something in the "NOT" realm. If you are trying to export ALL matches to a query, you'd be better off using /export rather than /select (requires docValues for all fields involved), or if you don't have docValues, use the cursorMark feature to iteratively fetch pages of data (a sketch follows this message).
>
> If you say rows=10000 then each node sends back 10000, the coordinator sorts all 30000 and then sends the top 10000 to the client....
>
> Note that the grouping feature you are using can be heavy too. To do that in an /export context you would probably have to use streaming expressions, and even there you would have to design carefully to avoid trying to hold large fractions of the index in memory while you formed groups...
>
> As for the change in speed, I'm still betting on some sort of quota for your EFS access (R5 instances have fixed CPU availability, so that's not it). However, it's worth looking at your GC logs in case your (probable) large queries are getting you into trouble with memory/GC. As with any performance troubleshooting, you'll want to have eyes on the CPU load, disk IO bytes, disk IOPS and network bandwidth.
>
> Oh, one more thing that comes to mind. Make sure you don't configure ANY swap drive on your server. If the OS starts trying to put Solr's cached memory on a swap disk, the query times just go in the trash instantly. In most cases (YMMV) you would MUCH rather crash the server than have it start using swap (because then you know you need a bigger server, rather than silently serving dog-slow results while you limp along).
>
> -Gus
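For reference, a minimal sketch of the cursorMark paging mentioned above, written in the same style as the getCalls() code later in this thread. It assumes axios is available, reuses the host and field names from the posted query, and assumes call_id is the collection's uniqueKey (not confirmed in the thread). Grouping is left out because cursors cannot be combined with group=true, so any de-duplication by call_callerno would have to happen client-side (or via the collapse query parser instead).

const axios = require("axios"); // assumed; the posted code does not show its imports

// Fetch every match in small pages using Solr's cursorMark deep paging.
async function fetchAllCalls(businessId) {
  const url = "http://samisolrcld.aws01.hibu.int:8983/solr/calls/select";
  let cursorMark = "*";            // "*" starts a new cursor
  const docs = [];
  while (true) {
    const rsp = await axios.get(url, {
      params: {
        q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
        fl: "business_id,call_id,call_date,call_callerno,caller_name,dialog_merged",
        rows: 100,                            // small pages, many cheap requests
        sort: "call_date desc, call_id asc",  // sort must include the uniqueKey as a tiebreaker
        cursorMark: cursorMark,               // do NOT send start with a cursor
      },
    });
    docs.push(...rsp.data.response.docs);
    const next = rsp.data.nextCursorMark;
    if (next === cursorMark) break;           // cursor did not advance: no more results
    cursorMark = next;
  }
  return docs;
}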
> On Wed, Feb 28, 2024 at 4:09 PM Beale, Jim (US-KOP) <jim.be...@hibu.com.invalid> wrote:
>
>> Here is the performance for this query on these nodes. You saw the code in a previous email.
>>
>> http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true
>>
>> The two times given are right after a restart and the next day, or sometimes a few hours later. The only difference is how long Solr has been running. I can’t understand what makes it run so slowly after a short while.
>>
>> Business_id   Time 1 (s)   Time 2 (s)
>> 7016274253    11.572       23.397
>> 7010707194    21.941       21.414
>> 7000001491     9.516       39.051
>> 7029931968    10.755       59.196
>> 7014676602    14.508       14.083
>> 7004551760    12.873       36.856
>> 7016274253     1.792       17.415
>> 7010707194     5.671       25.442
>> 7000001491     6.84        36.244
>> 7029931968     6.291       38.483
>> 7014676602     7.643       12.584
>> 7004551760     5.669       21.977
>> 7029931968     8.293       36.688
>> 7008606979    16.976       30.569
>> 7002264530    13.862       35.113
>> 7017281920    10.1         31.914
>> 7000001491     8.665       35.141
>> 7058630709    11.236       38.104
>> 7011363889    10.977       19.72
>> 7016319075    15.763       26.023
>> 7053262466    10.917       48.3
>> 7000313815     9.786       24.617
>> 7015187150     8.312       29.485
>> 7016381845    11.51        34.545
>> 7016379523    10.543       29.27
>> 7026102159     6.047       30.381
>> 7010707194     8.298       27.069
>> 7016508018     7.98        34.48
>> 7016280579     5.443       26.617
>> 7016302809     3.491       12.578
>> 7016259866     7.723       33.462
>> 7016390730    11.358       32.997
>> 7013498165     8.214       26.004
>> 7016392929     6.612       19.711
>> 7007737612     2.198        4.19
>> 7012687678     8.627       35.342
>> 7016606704     5.951       21.732
>> 7007870203     2.524       16.534
>> 7016268227     6.296       25.651
>> 7016405011     3.288       18.541
>> 7016424246     9.756       31.243
>> 7000336592     5.465       31.486
>> 7004696397     4.713       29.528
>> 7016279283     2.473       24.243
>> 7016623672     6.958       35.96
>> 7016582537     5.112       33.475
>> 7015713947     5.162       25.972
>> 7003530665     8.223       26.549
>> 7012825693     7.4         16.849
>> 7010707194     6.781       23.835
>> 7079272278     7.793       24.686
>>
>> Jim Beale
>> Lead Software Engineer
>> hibu.com
>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>> Office: 610-879-3864
>> Mobile: 610-220-3067
>>
>> From: Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID>
>> Sent: Wednesday, February 28, 2024 3:29 PM
>> To: users@solr.apache.org
>> Subject: RE: [EXTERNAL] Re: Is this list alive? I need help
>>
>> I didn't see these responses because they were buried in my clutter folder.
>>
>> We have 12,541,505 docs for calls, 9,144,862 form fills, 53,838 SMS and 12,752 social leads. These are all in a single Solr 9.1 cluster of three nodes, with PROD and UAT all on a single server. As follows:
>>
>> The three nodes are r5.xlarge and we’re not sure if those are large enough. The documents are not huge, from 1K to 25K each.
>>
>> samisolrcld.aws01.hibu.int is a load balancer.
>>
>> The request is
>>
>> async function getCalls(businessId, limit) {
>>   const config = {
>>     method: 'GET',
>>     url: "http://samisolrcld.aws01.hibu.int:8983/solr/calls/select",
>>     params: {
>>       q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
>>       fl: "business_id, call_id, call_day, call_date, dialog_merged, call_callerno, call_duration, call_status, caller_name, caller_address, caller_state, caller_city, caller_zip",
>>       rows: limit,
>>       start: 0,
>>       group: true,
>>       "group.main": true,
>>       "group.field": "call_callerno",
>>       sort: "call_day desc"
>>     }
>>   };
>>   //console.log(config);
>>
>>   let rval = [];
>>   while (true) {
>>     try {
>>       //console.log(config.params.start);
>>       const rsp = await axios(config);
>>       if (rsp.data && rsp.data.response) {
>>         let docs = rsp.data.response.docs;
>>         if (docs.length == 0) break;
>>         config.params.start += limit;
>>         rval = rval.concat(docs);
>>       }
>>     } catch (err) {
>>       console.log("Error: " + err.message);
>>     }
>>   }
>>   return rval;
>> }
>>
>> You wrote:
>>
>> Note that EFS is an encrypted file system, and stunnel is encrypted transport, so for each disk read you are likely causing:
>>
>> - read raw encrypted data from disk to memory (at AWS)
>> - decrypt the disk data in memory (at AWS)
>> - encrypt the memory data for stunnel transport (at AWS)
>> - send the data over the wire
>> - decrypt the data for use by solr. (Hardware you specify)
>>
>> That's guaranteed to be slow, and worse yet, you have no control at all over the size or loading of the hardware performing anything but the last step. You are completely at the mercy of AWS's cost/speed tradeoffs, which are unlikely to be targeting the level of performance usually desired for search disk IO.
>>
>> This is interesting. I can copy the data to local and try it from there.
>>
>> Jim Beale
>> Lead Software Engineer
>> hibu.com
>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>> Office: 610-879-3864
>> Mobile: 610-220-3067
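One thing worth noting about the getCalls() loop above: it only breaks out of while(true) when docs.length == 0, so a repeated request error, or a response without a response section, will resend the same request forever. A minimal sketch of a loop that terminates, keeping the same names and the same start/rows paging (the retry cap of 3 is an arbitrary choice, not anything from the thread):

// Replacement for the while(true) loop inside the async getCalls() above.
let attempts = 0;
while (true) {
  try {
    const rsp = await axios(config);
    if (!rsp.data || !rsp.data.response) break;  // unexpected response shape: stop
    const docs = rsp.data.response.docs;
    if (docs.length === 0) break;                // no more results: stop
    rval = rval.concat(docs);
    config.params.start += limit;                // advance to the next page
    attempts = 0;                                // reset the error counter after a good page
  } catch (err) {
    console.log("Error: " + err.message);
    if (++attempts >= 3) break;                  // give up instead of looping forever
  }
}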
>> -----Original Message-----
>> From: Gus Heck <gus.h...@gmail.com>
>> Sent: Sunday, February 25, 2024 9:15 AM
>> To: users@solr.apache.org
>> Subject: [EXTERNAL] Re: Is this list alive? I need help
>>
>> Hi Jim,
>>
>> Welcome to the Solr user list. Not sure why you are asking about list liveliness? I don't see prior messages from you:
>> https://lists.apache.org/list?users@solr.apache.org:lte=1M:jim
>>
>> Probably the most important thing you haven't told us is the current size of your indexes. You said 20k/day input, but at the start do you have 0 days, 1 day, 10 days, 100 days, 1000 days, or 10000 days (27y) on disk already?
>>
>> If you are starting from zero, then there is likely a 20x or more growth in the size of the index between the first and second measurement. Indexes do get slower with size, though you would need fantastically large documents or some sort of disk problem to explain it that way.
>>
>> However, maybe you do have huge documents or disk issues, since your query time at time1 is already abysmal? Either you are creating a fantastically expensive query, or your system is badly overloaded. New systems, properly sized with moderate-sized documents, ought to be serving simple queries in tens of milliseconds.
>>
>> As others have said, it is *critical you show us the entire query request*.
>> If you are doing something like attempting to return the entire index with rows=999999, that would almost certainly explain your issues...
>>
>> How large are your average documents (in terms of bytes)?
>>
>> Also, what version of Solr?
>>
>> r5.xlarge only has 4 CPUs and 32 GB of memory. That's not very large (despite the name). However, since it's unclear what your total index size looks like, it might be OK.
>>
>> What are your IOPS constraints with EFS? Are you running out of a quota there? (bursting mode?)
>>
>> Note that EFS is an encrypted file system, and stunnel is encrypted transport, so for each disk read you are likely causing:
>>
>> - read raw encrypted data from disk to memory (at AWS)
>> - decrypt the disk data in memory (at AWS)
>> - encrypt the memory data for stunnel transport (at AWS)
>> - send the data over the wire
>> - decrypt the data for use by solr. (Hardware you specify)
>>
>> That's guaranteed to be slow, and worse yet, you have no control at all over the size or loading of the hardware performing anything but the last step. You are completely at the mercy of AWS's cost/speed tradeoffs, which are unlikely to be targeting the level of performance usually desired for search disk IO.
>>
>> I'll also echo others and say that it's a bad idea to allow solr instances to compete for disk IO in any way. I've seen people succeed with setups that use invisibly provisioned disks, but one typically has to run more hardware to compensate. Having a shared disk creates competition, and it also creates a single point of failure, partially invalidating the notion of running 3 servers in cloud mode for high availability. If you can't have more than one disk, then you might as well run a single node, especially at small data sizes like 20k/day. A single node on well-chosen hardware can usually serve tens of millions of normal-sized documents, which would be several years of data for you (assuming low query rates; handling high rates of course starts to require hardware).
>>
>> Finally, you will want to get away from using single queries as a measurement of latency. If you care about response time I HIGHLY suggest you watch this YouTube video on how NOT to measure latency:
>> https://www.youtube.com/watch?v=lJ8ydIuPFeU
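In that spirit, a minimal sketch of sampling latency instead of timing single queries: repeat the same set of requests many times and look at percentiles rather than one number. It assumes axios and a hypothetical buildConfig(businessId) helper that returns a request config like the one posted earlier in the thread.

// Issue each query several times and report p50/p95/p99 in milliseconds.
async function measureLatency(businessIds, rounds = 20) {
  const samples = [];
  for (let r = 0; r < rounds; r++) {
    for (const id of businessIds) {
      const t0 = Date.now();
      await axios(buildConfig(id));     // buildConfig() is a hypothetical helper
      samples.push(Date.now() - t0);
    }
  }
  samples.sort((a, b) => a - b);
  const pct = (p) => samples[Math.min(samples.length - 1, Math.floor(p * samples.length))];
  console.log(`n=${samples.length} p50=${pct(0.5)}ms p95=${pct(0.95)}ms p99=${pct(0.99)}ms`);
}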
>>
>> On Fri, Feb 23, 2024 at 6:44 PM Jan Høydahl <jan....@cominvent.com> wrote:
>>
>>> I think EFS is a terribly slow file system to use for Solr, who recommended it? :) Better use one EBS volume per node.
>>> Not sure if the gradually slower performance is due to EFS though. We need to know more about your setup to get a clue. What role does stunnel play here? How are you indexing the content, etc.?
>>>
>>> Jan
>>>
>>>> On Feb 23, 2024, at 19:58, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>
>>>> First, a shared disk is not a good idea. Each node should have its own local disk. Solr makes heavy use of the disk.
>>>>
>>>> If the indexes are shared, I’m surprised it works at all. Solr is not designed to share indexes.
>>>>
>>>> Please share the full query string.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Feb 23, 2024, at 10:01 AM, Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID> wrote:
>>>>>
>>>>> I have a SolrCloud installation of three servers on three r5.xlarge EC2 instances with a shared disk drive using EFS and stunnel.
>>>>>
>>>>> I have documents coming in at about 20000 per day, and I am trying to perform indexing along with some regular queries and some special queries for some new functionality.
>>>>>
>>>>> When I just restart Solr, these queries run very fast, but over time they become slower and slower.
>>>>>
>>>>> This is typical for the numbers. At time1, the request only took 2.16 sec, but overnight the response took 18.137 sec. That is just typical.
>>>>>
>>>>> businessId, all count, reduced count, time1, time2
>>>>> 7016274253,8433,4769,2.162,18.137
>>>>>
>>>>> The same query gives very different times. Overnight the Solr servers slow down and give terrible responses. I don’t even know if this list is alive.
>>>>>
>>>>> Jim Beale
>>>>> Lead Software Engineer
>>>>> hibu.com
>>>>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>>>>> Office: 610-879-3864
>>>>> Mobile: 610-220-3067
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)