We are planning to move to EBS to see if it is different. I suppose it will 
take a few days to get it done.

Jim Beale
Lead Software Engineer
hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067


From: Walter Underwood <wun...@wunderwood.org>
Sent: Thursday, February 29, 2024 12:27 PM
To: users@solr.apache.org
Subject: Re: [EXTERNAL] Is this list alive? I need help

Also on EFS performance, because EFS is mounted with NFS, the one time I 
accidentally ran Solr with indexes on an NFS-mounted volume, it was 100X slower 
than local. It looks like they’ve improved that to only 10X slower than a local 
EBS volume.

So get off of EFS. Use local GP3 volumes.

How to get comparable performance to gp2/gp3 on EFS?
https://repost.aws/questions/QUqyZD98d0TbiluqPBW_zALw/how-to-get-comparable-performance-to-gp2-gp3-on-efs

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Feb 28, 2024, at 11:18 PM, Gus Heck <gus.h...@gmail.com> wrote:

Ah, sorry, my eyes flew past the long, hard-to-read link straight to the
pretty table.

Yeah, so a 10,000-row grouping query is not a good idea. If you paginated it
with cursorMark, you would want to play around with trading off the number of
requests against the size of each request. Very likely the optimal page size
is a lot less than 10,000 as long as the looping code isn't crazy inefficient,
but it might be as high as 100 or even 500. There's no way to know for any
particular system and query other than testing it.
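
For reference, a minimal sketch of what cursorMark paging could look like from
Node/axios. It assumes call_id is the collection's uniqueKey field (cursorMark
requires the uniqueKey as a sort tiebreaker), and it drops group=true, since
cursorMark cannot be combined with grouping; any one-row-per-caller reduction
would have to happen client side.

// Minimal cursorMark paging sketch (assumes call_id is the uniqueKey field).
const axios = require('axios');

async function getCallsWithCursor(businessId, pageSize) {
    const baseParams = {
        q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
        fl: "business_id,call_id,call_date,call_callerno,caller_name,dialog_merged",
        sort: "call_date desc, call_id asc",   // uniqueKey tiebreaker is required
        rows: pageSize
    };

    let cursorMark = "*";
    const docs = [];
    while (true) {
        const rsp = await axios.get(
            "http://samisolrcld.aws01.hibu.int:8983/solr/calls/select",
            { params: { ...baseParams, cursorMark } });
        docs.push(...rsp.data.response.docs);
        const next = rsp.data.nextCursorMark;
        if (next === cursorMark) break;        // same cursor returned: no more results
        cursorMark = next;
    }
    return docs;
}

Benchmarking pageSize somewhere in the 100-500 range against the total number
of requests, as suggested above, would show where the sweet spot is for a
particular system.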

As for how EFS could change its performance on you, check out references
to "bursting credits" here:
https://docs.aws.amazon.com/efs/latest/ug/performance.html

On Wed, Feb 28, 2024 at 10:55 PM Beale, Jim (US-KOP) <jim.be...@hibu.com.invalid> wrote:


I did send the query. Here it is:


http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true

I suppose all the indexes are about 150 GB so you are close.

I set the limit to 10,000 or 5000 for these tests. Setting the limit at 10
or 50 would mean that there would need to be 1000-2000 requests. That seems
like an awful lot to me.

That is interesting about the export. I will look into other types of data
collection.

Also, there is no quota on the EFS, and it is apparently encrypted both ways.
But if it is fast that one time right after a restart, restarting Solr
shouldn't change how it accesses the disk.


Jim Beale
Lead Software Engineer
hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067



-----Original Message-----
From: Gus Heck <gus.h...@gmail.com>
Sent: Wednesday, February 28, 2024 9:22 PM
To: users@solr.apache.org
Subject: Re: [EXTERNAL] Re: Is this list alive? I need help


Your description leads me to believe that at worst you have ~20M docs in
one index. If the average doc size is 5k or so, that sounds like 100GB. This
is smallish, and across 3 machines it ought to be fine. Your time 1 values
are very slow to begin with. Unfortunately you didn't send us the query,
only the code that generates the query. A key bit not shown is what value
you are passing in for limit (which is then set for rows). It *should* be
something like 10 or 25 or 50. It should NOT be 1000 or 99999 etc., but
the fact that you have hardcoded the start to zero makes me think you are not
paging and you are doing something in the "NOT" realm. If you are trying to
export ALL matches to a query you'd be better off using /export rather than
/select (requires docValues for all fields involved), or if you don't have
docValues, use the cursorMark feature to iteratively fetch pages of data.
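
A rough sketch of that /export route (using the same field names as the query
above), assuming every fl field and the sort field have docValues enabled:

// Rough sketch of pulling all matches through /export instead of /select.
// Assumes docValues on every fl field and on the sort field.
const axios = require('axios');

async function exportCalls(businessId) {
    const rsp = await axios.get(
        "http://samisolrcld.aws01.hibu.int:8983/solr/calls/export",
        { params: {
            q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
            fl: "business_id,call_id,call_date,call_callerno,caller_name,dialog_merged",
            sort: "call_date desc"
        } });
    // /export streams the full result set in a single response
    return rsp.data.response.docs;
}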

If you say rows=10000, then each node sends back 10,000 documents, the
coordinator sorts all 30,000, and then sends the top 10,000 to the client.

Note that the grouping feature you are using can be heavy too. To do that
in an /export context you would probably have to use streaming expressions
and even there you would have to design carefully to avoid trying to hold
large fractions of the index in memory while you formed groups...
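
One possible sketch of that, assuming docValues on the listed fields: a
unique() stream over an /export search, sorted by call_callerno so that the
first tuple per caller is the most recent call. This only approximates the
group.main=true behavior of the original query.

// Sketch: one row per call_callerno (the most recent call) via a streaming expression.
// Assumes docValues on all listed fields; only approximates group.main=true.
const axios = require('axios');

async function latestCallPerCaller(businessId) {
    const expr = `
      unique(
        search(calls,
               q="business_id:${businessId} AND call_day:[20230101 TO 20240101}",
               fl="business_id,call_id,call_date,call_callerno,caller_name,dialog_merged",
               sort="call_callerno asc, call_date desc",
               qt="/export"),
        over="call_callerno")`;
    const rsp = await axios.get(
        "http://samisolrcld.aws01.hibu.int:8983/solr/calls/stream",
        { params: { expr } });
    // The last tuple in result-set.docs is an EOF marker; drop it.
    return rsp.data["result-set"].docs.filter(d => !d.EOF);
}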

As for the change in speed, I'm still betting on some sort of quota for
your EFS access (R5 instances have fixed CPU availability, so that's not it).
However, it's worth looking at your GC logs in case your (probable) large
queries are getting you into trouble with memory/GC. As with any performance
troubleshooting, you'll want to have eyes on the CPU load, disk IO bytes,
disk IOPS and network bandwidth.

Oh, one more thing that comes to mind: make sure you don't configure ANY
swap drive on your server. If the OS starts trying to put Solr's cached
memory on a swap disk, the query times go in the trash instantly. In
most cases (YMMV) you would MUCH rather crash the server than have it start
using swap, because then you know you need a bigger server, rather than
silently serving dog-slow results while you limp along.

-Gus

On Wed, Feb 28, 2024 at 4:09 PM Beale, Jim (US-KOP) <jim.be...@hibu.com.invalid> wrote:


Here is the performance for this query on these nodes. You saw the
code in a previous email.




http://samisolrcld.aws01.hibu.int:8983/solr/calls/select?indent=true&q.op=OR&fl=business_id,call_id,call_date,call_callerno,caller_name,dialog_merged&q=business_id%3A7016655681%20AND%20call_day:[20230101%20TO%2020240101}&group=true&group.field=call_callerno&sort=call_date%20desc&rows=10000&group.main=true



The two times given are right after a restart and then the next day, or
sometimes a few hours later. The only difference is how long Solr has been
running. I can't understand what makes it run so slowly after a short
while.




Business_id   Time 1 (sec)   Time 2 (sec)
7016274253    11.572         23.397
7010707194    21.941         21.414
7000001491     9.516         39.051
7029931968    10.755         59.196
7014676602    14.508         14.083
7004551760    12.873         36.856
7016274253     1.792         17.415
7010707194     5.671         25.442
7000001491     6.840         36.244
7029931968     6.291         38.483
7014676602     7.643         12.584
7004551760     5.669         21.977
7029931968     8.293         36.688
7008606979    16.976         30.569
7002264530    13.862         35.113
7017281920    10.100         31.914
7000001491     8.665         35.141
7058630709    11.236         38.104
7011363889    10.977         19.720
7016319075    15.763         26.023
7053262466    10.917         48.300
7000313815     9.786         24.617
7015187150     8.312         29.485
7016381845    11.510         34.545
7016379523    10.543         29.270
7026102159     6.047         30.381
7010707194     8.298         27.069
7016508018     7.980         34.480
7016280579     5.443         26.617
7016302809     3.491         12.578
7016259866     7.723         33.462
7016390730    11.358         32.997
7013498165     8.214         26.004
7016392929     6.612         19.711
7007737612     2.198          4.190
7012687678     8.627         35.342
7016606704     5.951         21.732
7007870203     2.524         16.534
7016268227     6.296         25.651
7016405011     3.288         18.541
7016424246     9.756         31.243
7000336592     5.465         31.486
7004696397     4.713         29.528
7016279283     2.473         24.243
7016623672     6.958         35.960
7016582537     5.112         33.475
7015713947     5.162         25.972
7003530665     8.223         26.549
7012825693     7.400         16.849
7010707194     6.781         23.835
7079272278     7.793         24.686



Jim Beale
Lead Software Engineer
hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067





From: Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID>
Sent: Wednesday, February 28, 2024 3:29 PM
To: users@solr.apache.org
Subject: RE: [EXTERNAL] Re: Is this list alive? I need help




I didn't see these responses because they were buried in my clutter
folder.




We have 12,541,505 docs for calls, 9,144,862 form fills, 53,838 SMS and
12,752 social leads. These are all in a single Solr 9.1 cluster of three
nodes, with PROD and UAT all on a single server. As follows:








The three nodes are r5.xlarge and we’re not sure if those are large
enough. The documents are not huge, from 1K to 25K each.



samisolrcld.aws01.hibu.int is a load-balancer



The request is



const axios = require('axios');

async function getCalls(businessId, limit) {
    const config = {
        method: 'GET',
        url: 'http://samisolrcld.aws01.hibu.int:8983/solr/calls/select',
        params: {
            q: `business_id:${businessId} AND call_day:[20230101 TO 20240101}`,
            fl: "business_id, call_id, call_day, call_date, dialog_merged, call_callerno, call_duration, call_status, caller_name, caller_address, caller_state, caller_city, caller_zip",
            rows: limit,
            start: 0,
            group: true,
            "group.main": true,
            "group.field": "call_callerno",
            sort: "call_day desc"
        }
    };

    let rval = [];
    while (true) {
        try {
            const rsp = await axios(config);
            if (!rsp.data || !rsp.data.response) break;
            const docs = rsp.data.response.docs;
            if (docs.length === 0) break;      // no more results
            config.params.start += limit;      // advance to the next page
            rval = rval.concat(docs);
        } catch (err) {
            console.log("Error: " + err.message);
            break;                             // don't retry the same page forever
        }
    }
    return rval;
}



You wrote:



Note that EFS is an encrypted file system, and stunnel is encrypted
transport, so each disk read is likely causing:



  - read raw encrypted data from disk to memory (at AWS)

  - decrypt the disk data in memory (at AWS)

  - encrypt the memory data for stunnel transport (at AWS)

  - send the data over the wire

  - decrypt the data for use by solr. (Hardware you specify)



That's guaranteed to be slow, and worse yet, you have no control at
all over the size or loading of the hardware performing anything but
the last step. You are completely at the mercy of AWS's cost/speed
tradeoffs which are unlikely to be targeting the level of performance
usually desired for search disk IO.



This is interesting. I can copy the data to a local disk and try it from there.







Jim Beale

Lead Software Engineer

hibu.com

2201 Renaissance Boulevard, King of Prussia, PA, 19406

Office: 610-879-3864

Mobile: 610-220-3067







-----Original Message-----
From: Gus Heck <gus.h...@gmail.com>
Sent: Sunday, February 25, 2024 9:15 AM
To: users@solr.apache.org
Subject: [EXTERNAL] Re: Is this list alive? I need help






Hi Jim,



Welcome to the Solr user list. I'm not sure why you are asking about list
liveliness; I don't see prior messages from you:

https://lists.apache.org/list?users@solr.apache.org:lte=1M:jim



Probably the most important thing you haven't told us is the current
size of your indexes. You said 20k/day input, but at the start do you
have 0 days, 1 day, 10 days, 100 days, 1000 days, or 10000 days (27y)
on disk already?



If you are starting from zero, then there is likely a 20x or more
growth in the size of the index between the first and second
measurement. Indexes do get slower with size, though you would need
fantastically large documents or some sort of disk problem to explain it
that way.




However, maybe you do have huge documents or disk issues since your
query time at time1 is already abysmal? Either you are creating a
fantastically expensive query, or your system is badly overloaded. New
systems, properly sized with moderate sized documents ought to be
serving simple queries in tens of milliseconds.



As others have said it is *critical you show us the entire query
request*.


If you are doing something like attempting to return the entire index
with rows=999999, that would almost certainly explain your issues...



How large are your average documents (in terms of bytes)?



Also what version of Solr?



r5.xlarge only has 4 CPUs and 32 GB of memory. That's not very large
(despite the name). However, since it's unclear what your total index
size looks like, it might be OK.



What are your IOPS constraints with EFS? Are you running out of a
quota there? (bursting mode?)



Note that EFS is an encrypted file system, and stunnel is encrypted
transport, so each disk read is likely causing:



  - read raw encrypted data from disk to memory (at AWS)

  - decrypt the disk data in memory (at AWS)

  - encrypt the memory data for stunnel transport (at AWS)

  - send the data over the wire

  - decrypt the data for use by solr. (Hardware you specify)



That's guaranteed to be slow, and worse yet, you have no control at
all over the size or loading of the hardware performing anything but
the last step. You are completely at the mercy of AWS's cost/speed
tradeoffs which are unlikely to be targeting the level of performance
usually desired for search disk IO.



I'll also echo others and say that it's a bad idea to allow Solr
instances to compete for disk IO in any way. I've seen people succeed
with setups that use invisibly provisioned disks, but one typically
has to run more hardware to compensate. Having a shared disk creates
competition, and it also creates a single point of failure, partially
invalidating the notion of running 3 servers in cloud mode for high
availability. If you can't have more than one disk, then you might as
well run a single node, especially at small data sizes like 20k/day.
A single node on well-chosen hardware can usually serve tens of
millions of normal-sized documents, which would be several years of
data for you (assuming low query rates; handling high rates of course
starts to require hardware).



Finally, you will want to get away from using single queries as a
measurement of latency. If you care about response time I HIGHLY
suggest you watch this YouTube video on how NOT to measure latency:

https://www.youtube.com/watch?v=lJ8ydIuPFeU



On Fri, Feb 23, 2024 at 6:44 PM Jan Høydahl <jan....@cominvent.com> wrote:





I think EFS is a terribly slow file system to use for Solr, who recommended it? :) Better use one EBS per node.

Not sure if the gradually slower performance is due to EFS though. We need to know more about your setup to get a clue. What role does stunnel play here? How are you indexing the content etc.

Jan





On Feb 23, 2024, at 19:58, Walter Underwood <wun...@wunderwood.org> wrote:





First, a shared disk is not a good idea. Each node should have its own local disk. Solr makes heavy use of the disk.

If the indexes are shared, I'm surprised it works at all. Solr is not designed to share indexes.

Please share the full query string.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)





On Feb 23, 2024, at 10:01 AM, Beale, Jim (US-KOP) <jim.be...@hibu.com.INVALID> wrote:





I have a SolrCloud installation of three servers on three r5.xlarge EC2 instances with a shared disk drive using EFS and stunnel.

I have documents coming in at about 20,000 per day, and I am trying to perform indexing along with some regular queries and some special queries for some new functionality.

When I just restart Solr, these queries run very fast, but over time they become slower and slower.

This is typical for the numbers. At time1, the request only took 2.16 sec, but overnight the response took 18.137 sec. That is just typical.

businessId, all count, reduced count, time1, time2
7016274253,8433,4769,2.162,18.137

The same query behaves so differently. Overnight the Solr servers slow down and give terrible response times. I don't even know if this list is alive.









Jim Beale
Lead Software Engineer
hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067























--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
