What does the CPU utilization look like while that query is executing? If it is using 100% of one CPU, then it is CPU limited. If it is using less than 100% of one CPU, then it is IO limited.
Regardless, that is a VERY expensive query. A shared EFS disk is a poor system design for Solr. Each node should have its own EBS volume, preferably GP3.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 28, 2024, at 7:51 PM, Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID> wrote:
>
> I did send the query. Here it is:
>
> http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true
>
> I suppose all the indexes are about 150 GB, so you are close.
>
> I set the limit to 10,000 or 5000 for these tests. Setting the limit at 10 or 50 would mean that there would need to be 1000-2000 requests. That seems like an awful lot to me.
>
> That is interesting about the export. I will look into other types of data collection.
>
> Also, there is no quota on the EFS. It is apparently encrypted both ways. But if it is fast the one time, rebooting Solr shouldn't affect how it uses disk access.
>
> Jim Beale
> Lead Software Engineer
> hibu.com
> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
> Office: 610-879-3864
> Mobile: 610-220-3067
>
> -----Original Message-----
> From: Gus Heck <gus.h...@gmail.com>
> Sent: Wednesday, February 28, 2024 9:22 PM
> To: users@solr.apache.org
> Subject: Re: [EXTERNAL] Re: Is this list alive? I need help
>
> Your description leads me to believe that at worst you have ~20M docs in one index. If the average doc size is 5k or so, that sounds like 100 GB. This is smallish, and across 3 machines it ought to be fine. Your time 1 values are very slow to begin with. Unfortunately you didn't send us the query, only the code that generates the query. A key bit not shown is what value you are passing in for limit (which is then set for rows). It *should* be something like 10 or 25 or 50. It should NOT be 1000 or 99999, etc., but the fact that you have hardcoded the start to zero makes me think you are not paging and you are doing something in the "NOT" realm. If you are trying to export ALL matches to a query, you'd be better off using /export rather than /select (requires docValues for all fields involved), or if you don't have docValues, use the cursorMark feature to iteratively fetch pages of data (a sketch follows this message).
>
> If you say rows=10000 then each node sends back 10000, the coordinator sorts all 30000 and then sends the top 10000 to the client....
>
> Note that the grouping feature you are using can be heavy too. To do that in an /export context you would probably have to use streaming expressions, and even there you would have to design carefully to avoid trying to hold large fractions of the index in memory while you formed groups...
>
> As for the change in speed, I'm still betting on some sort of quota for your EFS access (R5 instances have fixed CPU availability, so that's not it). However, it's worth looking at your GC logs in case your (probable) large queries are getting you into trouble with memory/GC. As with any performance troubleshooting, you'll want to have eyes on the CPU load, disk IO bytes, disk IOPS and network bandwidth.
>
> Oh, one more thing that comes to mind. Make sure you don't configure ANY swap drive on your server. If the OS starts trying to put Solr's cached memory on a swap disk, the query times just go in the trash instantly. In most cases (YMMV) you would MUCH rather crash the server than have it start using swap (because then you know you need a bigger server, rather than silently serving dog-slow results while you limp along).
>
> -Gus
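For reference, a minimal sketch of the cursorMark paging mentioned above, written in the same style as the getCalls() code later in this thread. It assumes axios is available, reuses the host and field names from the posted query, and assumes call_id is the collection's uniqueKey (not confirmed in the thread). Grouping is left out because cursors cannot be combined with group=true, so any de-duplication by call_callerno would have to happen client-side (or via the collapse query parser instead).

const axios = require("axios"); // assumed; the posted code does not show its imports

// Fetch every match in small pages using Solr's cursorMark deep paging.
async function fetchAllCalls(businessId) {
  const url = "http://samisolrcld.aws01.hibu.int:8983/solr/calls/select";
  let cursorMark = "*";            // "*" starts a new cursor
  const docs = [];
  while (true) {
    const rsp = await axios.get(url, {
      params: {
        q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
        fl: "business_id,call_id,call_date,call_callerno,caller_name,dialog_merged",
        rows: 100,                            // small pages, many cheap requests
        sort: "call_date desc, call_id asc",  // sort must include the uniqueKey as a tiebreaker
        cursorMark: cursorMark,               // do NOT send start with a cursor
      },
    });
    docs.push(...rsp.data.response.docs);
    const next = rsp.data.nextCursorMark;
    if (next === cursorMark) break;           // cursor did not advance: no more results
    cursorMark = next;
  }
  return docs;
}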
> On Wed, Feb 28, 2024 at 4:09 PM Beale, Jim (US-KOP) <jim.be...@hibu.com.invalid> wrote:
>
>> Here is the performance for this query on these nodes. You saw the code in a previous email.
>>
>> http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true
>>
>> The two times given are right after a restart and the next day, or sometimes a few hours later. The only difference is how long Solr has been running. I can’t understand what makes it run so slowly after a short while.
>>
>> Business_id   Time 1 (s)   Time 2 (s)
>> 7016274253    11.572       23.397
>> 7010707194    21.941       21.414
>> 7000001491     9.516       39.051
>> 7029931968    10.755       59.196
>> 7014676602    14.508       14.083
>> 7004551760    12.873       36.856
>> 7016274253     1.792       17.415
>> 7010707194     5.671       25.442
>> 7000001491     6.84        36.244
>> 7029931968     6.291       38.483
>> 7014676602     7.643       12.584
>> 7004551760     5.669       21.977
>> 7029931968     8.293       36.688
>> 7008606979    16.976       30.569
>> 7002264530    13.862       35.113
>> 7017281920    10.1         31.914
>> 7000001491     8.665       35.141
>> 7058630709    11.236       38.104
>> 7011363889    10.977       19.72
>> 7016319075    15.763       26.023
>> 7053262466    10.917       48.3
>> 7000313815     9.786       24.617
>> 7015187150     8.312       29.485
>> 7016381845    11.51        34.545
>> 7016379523    10.543       29.27
>> 7026102159     6.047       30.381
>> 7010707194     8.298       27.069
>> 7016508018     7.98        34.48
>> 7016280579     5.443       26.617
>> 7016302809     3.491       12.578
>> 7016259866     7.723       33.462
>> 7016390730    11.358       32.997
>> 7013498165     8.214       26.004
>> 7016392929     6.612       19.711
>> 7007737612     2.198        4.19
>> 7012687678     8.627       35.342
>> 7016606704     5.951       21.732
>> 7007870203     2.524       16.534
>> 7016268227     6.296       25.651
>> 7016405011     3.288       18.541
>> 7016424246     9.756       31.243
>> 7000336592     5.465       31.486
>> 7004696397     4.713       29.528
>> 7016279283     2.473       24.243
>> 7016623672     6.958       35.96
>> 7016582537     5.112       33.475
>> 7015713947     5.162       25.972
>> 7003530665     8.223       26.549
>> 7012825693     7.4         16.849
>> 7010707194     6.781       23.835
>> 7079272278     7.793       24.686
>>
>> Jim Beale
>> Lead Software Engineer
>> hibu.com
>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>> Office: 610-879-3864
>> Mobile: 610-220-3067
>>
>> From: Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID>
>> Sent: Wednesday, February 28, 2024 3:29 PM
>> To: users@solr.apache.org
>> Subject: RE: [EXTERNAL] Re: Is this list alive? I need help
>>
>> I didn't see these responses because they were buried in my clutter folder.
>>
>> We have 12,541,505 docs for calls, 9,144,862 form fills, 53,838 SMS and 12,752 social leads. These are all in a single Solr 9.1 cluster of three nodes, with PROD and UAT all on a single server. As follows:
>>
>> The three nodes are r5.xlarge and we’re not sure if those are large enough. The documents are not huge, from 1K to 25K each.
>>
>> samisolrcld.aws01.hibu.int is a load balancer.
>>
>> The request is
>>
>> async function getCalls(businessId, limit) {
>>   const config = {
>>     method: 'GET',
>>     url: "http://samisolrcld.aws01.hibu.int:8983/solr/calls/select",
>>     params: {
>>       q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
>>       fl: "business_id, call_id, call_day, call_date, dialog_merged, call_callerno, call_duration, call_status, caller_name, caller_address, caller_state, caller_city, caller_zip",
>>       rows: limit,
>>       start: 0,
>>       group: true,
>>       "group.main": true,
>>       "group.field": "call_callerno",
>>       sort: "call_day desc"
>>     }
>>   };
>>   //console.log(config);
>>
>>   let rval = [];
>>   while (true) {
>>     try {
>>       //console.log(config.params.start);
>>       const rsp = await axios(config);
>>       if (rsp.data && rsp.data.response) {
>>         let docs = rsp.data.response.docs;
>>         if (docs.length == 0) break;
>>         config.params.start += limit;
>>         rval = rval.concat(docs);
>>       }
>>     } catch (err) {
>>       console.log("Error: " + err.message);
>>     }
>>   }
>>   return rval;
>> }
>>
>> You wrote:
>>
>> Note that EFS is an encrypted file system, and stunnel is encrypted transport, so for each disk read you are likely causing:
>>
>> - read raw encrypted data from disk to memory (at AWS)
>> - decrypt the disk data in memory (at AWS)
>> - encrypt the memory data for stunnel transport (at AWS)
>> - send the data over the wire
>> - decrypt the data for use by solr. (Hardware you specify)
>>
>> That's guaranteed to be slow, and worse yet, you have no control at all over the size or loading of the hardware performing anything but the last step. You are completely at the mercy of AWS's cost/speed tradeoffs, which are unlikely to be targeting the level of performance usually desired for search disk IO.
>>
>> This is interesting. I can copy the data to local and try it from there.
>>
>> Jim Beale
>> Lead Software Engineer
>> hibu.com
>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>> Office: 610-879-3864
>> Mobile: 610-220-3067
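One thing worth noting about the getCalls() loop above: it only breaks out of while(true) when docs.length == 0, so a repeated request error, or a response without a response section, will resend the same request forever. A minimal sketch of a loop that terminates, keeping the same names and the same start/rows paging (the retry cap of 3 is an arbitrary choice, not anything from the thread):

// Replacement for the while(true) loop inside the async getCalls() above.
let attempts = 0;
while (true) {
  try {
    const rsp = await axios(config);
    if (!rsp.data || !rsp.data.response) break;  // unexpected response shape: stop
    const docs = rsp.data.response.docs;
    if (docs.length === 0) break;                // no more results: stop
    rval = rval.concat(docs);
    config.params.start += limit;                // advance to the next page
    attempts = 0;                                // reset the error counter after a good page
  } catch (err) {
    console.log("Error: " + err.message);
    if (++attempts >= 3) break;                  // give up instead of looping forever
  }
}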
>> -----Original Message-----
>> From: Gus Heck <gus.h...@gmail.com>
>> Sent: Sunday, February 25, 2024 9:15 AM
>> To: users@solr.apache.org
>> Subject: [EXTERNAL] Re: Is this list alive? I need help
>>
>> Hi Jim,
>>
>> Welcome to the Solr user list. Not sure why you are asking about list liveliness? I don't see prior messages from you:
>> https://lists.apache.org/list?users@solr.apache.org:lte=1M:jim
>>
>> Probably the most important thing you haven't told us is the current size of your indexes. You said 20k/day input, but at the start do you have 0 days, 1 day, 10 days, 100 days, 1000 days, or 10000 days (27y) on disk already?
>>
>> If you are starting from zero, then there is likely a 20x or more growth in the size of the index between the first and second measurement. Indexes do get slower with size, though you would need fantastically large documents or some sort of disk problem to explain it that way.
>>
>> However, maybe you do have huge documents or disk issues, since your query time at time1 is already abysmal? Either you are creating a fantastically expensive query, or your system is badly overloaded. New systems, properly sized with moderate-sized documents, ought to be serving simple queries in tens of milliseconds.
>>
>> As others have said, it is *critical you show us the entire query request*.
>> If you are doing something like attempting to return the entire index with rows=999999, that would almost certainly explain your issues...
>>
>> How large are your average documents (in terms of bytes)?
>>
>> Also, what version of Solr?
>>
>> r5.xlarge only has 4 CPUs and 32 GB of memory. That's not very large (despite the name). However, since it's unclear what your total index size looks like, it might be OK.
>>
>> What are your IOPS constraints with EFS? Are you running out of a quota there? (bursting mode?)
>>
>> Note that EFS is an encrypted file system, and stunnel is encrypted transport, so for each disk read you are likely causing:
>>
>> - read raw encrypted data from disk to memory (at AWS)
>> - decrypt the disk data in memory (at AWS)
>> - encrypt the memory data for stunnel transport (at AWS)
>> - send the data over the wire
>> - decrypt the data for use by solr. (Hardware you specify)
>>
>> That's guaranteed to be slow, and worse yet, you have no control at all over the size or loading of the hardware performing anything but the last step. You are completely at the mercy of AWS's cost/speed tradeoffs, which are unlikely to be targeting the level of performance usually desired for search disk IO.
>>
>> I'll also echo others and say that it's a bad idea to allow solr instances to compete for disk IO in any way. I've seen people succeed with setups that use invisibly provisioned disks, but one typically has to run more hardware to compensate. Having a shared disk creates competition, and it also creates a single point of failure, partially invalidating the notion of running 3 servers in cloud mode for high availability. If you can't have more than one disk, then you might as well run a single node, especially at small data sizes like 20k/day. A single node on well-chosen hardware can usually serve tens of millions of normal-sized documents, which would be several years of data for you (assuming low query rates; handling high rates of course starts to require hardware).
>>
>> Finally, you will want to get away from using single queries as a measurement of latency. If you care about response time I HIGHLY suggest you watch this YouTube video on how NOT to measure latency:
>> https://www.youtube.com/watch?v=lJ8ydIuPFeU
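In that spirit, a minimal sketch of sampling latency instead of timing single queries: repeat the same set of requests many times and look at percentiles rather than one number. It assumes axios and a hypothetical buildConfig(businessId) helper that returns a request config like the one posted earlier in the thread.

// Issue each query several times and report p50/p95/p99 in milliseconds.
async function measureLatency(businessIds, rounds = 20) {
  const samples = [];
  for (let r = 0; r < rounds; r++) {
    for (const id of businessIds) {
      const t0 = Date.now();
      await axios(buildConfig(id));     // buildConfig() is a hypothetical helper
      samples.push(Date.now() - t0);
    }
  }
  samples.sort((a, b) => a - b);
  const pct = (p) => samples[Math.min(samples.length - 1, Math.floor(p * samples.length))];
  console.log(`n=${samples.length} p50=${pct(0.5)}ms p95=${pct(0.95)}ms p99=${pct(0.99)}ms`);
}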
>>
>> On Fri, Feb 23, 2024 at 6:44 PM Jan Høydahl <jan....@cominvent.com> wrote:
>>
>>> I think EFS is a terribly slow file system to use for Solr, who recommended it? :) Better use one EBS volume per node.
>>> Not sure if the gradually slower performance is due to EFS though. We need to know more about your setup to get a clue. What role does stunnel play here? How are you indexing the content, etc.?
>>>
>>> Jan
>>>
>>>> On Feb 23, 2024, at 19:58, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>
>>>> First, a shared disk is not a good idea. Each node should have its own local disk. Solr makes heavy use of the disk.
>>>>
>>>> If the indexes are shared, I’m surprised it works at all. Solr is not designed to share indexes.
>>>>
>>>> Please share the full query string.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On Feb 23, 2024, at 10:01 AM, Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID> wrote:
>>>>>
>>>>> I have a SolrCloud installation of three servers on three r5.xlarge EC2 instances with a shared disk drive using EFS and stunnel.
>>>>>
>>>>> I have documents coming in at about 20000 per day, and I am trying to perform indexing along with some regular queries and some special queries for some new functionality.
>>>>>
>>>>> When I just restart Solr, these queries run very fast, but over time they become slower and slower.
>>>>>
>>>>> This is typical for the numbers. At time1, the request only took 2.16 sec, but overnight the response took 18.137 sec. That is just typical.
>>>>>
>>>>> businessId, all count, reduced count, time1, time2
>>>>> 7016274253,8433,4769,2.162,18.137
>>>>>
>>>>> The same query gives very different times. Overnight the Solr servers slow down and give terrible responses. I don’t even know if this list is alive.
>>>>>
>>>>> Jim Beale
>>>>> Lead Software Engineer
>>>>> hibu.com
>>>>> 2201 Renaissance Boulevard, King of Prussia, PA, 19406
>>>>> Office: 610-879-3864
>>>>> Mobile: 610-220-3067
>>>>
>>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)