cat /proc/sys/fs/file-max
9223372036854775807

The file descriptor limit of dirsrv is 8192. Should I bump this up too? I just used what was on the 389 website.

Thanks,

Gary

On 10/9/24 03:06, Joe Fletcher wrote:

Just a thought: what are the open files limits set to on the system?

I’ve seen plenty of cases where increasing descriptor limits can help in these instances.

*From:*Thierry Bordaz via 389-users <389-users@lists.fedoraproject.org>
*Sent:* Wednesday, October 9, 2024 10:30 AM
*To:* Gary Waters <gwat...@caltech.edu>; General discussion list for the 389 Directory server project. <389-users@lists.fedoraproject.org>; Marc Sauton <msau...@redhat.com>; James Chapman <jacha...@redhat.com>
*Cc:* gwaters-...@caltech.edu; Thierry Bordaz <tbor...@redhat.com>
*Subject:* [389-users] Re: Inconsistent Ldap connection issues

**** The e-mail below is from an external source. Please do not open attachments or click links from an unknown or suspicious origin. ****

Hi Gary,

My understanding is that a single client opens ~10K connections in ~6min, that is an average of 26 connections per second. It creates a set of established connections that spike up to 961 parallel connections. The problem being that from time to time the client reports 'can not contact LDAP server' and while you are running 2.4 this problem did not occur in 1.4.3. Am I correct ?

My first guess is that Marc is correct and tuning of tcp param should be the culprit. Now it does not explain why it occurs only in 2.4 but may be you were running 1.4.3 on RHEL7 and are running 2.4 on RHEL8.

2.4 contains significant changes around that area. I do not recall all commit in connection handling side. AFAIK what is "new" is a dedicated listener thread (polling on ports 389/636/ldapi for new connections) and a multiple polling threads (polling the established connections) My understanding is that your problem impacts the listener thread as it is responsible to accept new connections and it is the new problem you detect in 2.4. @james did I miss something ?

If the culprit is not in TCP tuning, @james do you remember known bugs that would limit the listener thread ?

best regards
thierry

On 10/8/24 7:36 PM, Gary Waters wrote:

    Hi Thierry,

    Ah yes of course. Here is 1 run of their web app load test, it is
    6 minutes long, and it should mostly be only the test it self. I
    will start looking for

    We encountered 2 "Can not contact ldap server" errors during this
    run.

    2 cant contact ldap server errors in this run

    ----------- Access Log Output ------------

    Start of Logs:    08/Oct/2024:09:53:35.810833927
    End of Logs:      08/Oct/2024:09:59:52.361830449

    Processed Log Time:  0 Hours, 6 Minutes, 16.550998016 Seconds

    Restarts:                      1
    Secure Protocol Versions:
      - TLS1.2 128-bit AES-GCM (9833 connections)

    Peak Concurrent Connections:   689
    Total Operations:              86412
    Total Results:                 86412
    Overall Performance:           100.0%

    Total Connections:             9933          (26.38/sec) (1582.73/min)
     - LDAP Connections:           9933          (26.38/sec) (1582.73/min)
     - LDAPI Connections:          0             (0.00/sec) (0.00/min)
     - LDAPS Connections:          0             (0.00/sec) (0.00/min)
     - StartTLS Extended Ops:      9833          (26.11/sec) (1566.80/min)

    Searches:                      66647         (176.99/sec)
    (10619.60/min)
    Modifications:                 0             (0.00/sec) (0.00/min)
    Adds:                          0             (0.00/sec) (0.00/min)
    Deletes:                       0             (0.00/sec) (0.00/min)
    Mod RDNs:                      0             (0.00/sec) (0.00/min)
    Compares:                      0             (0.00/sec) (0.00/min)
    Binds:                         9932          (26.38/sec) (1582.57/min)

    Average wtime (wait time):     0.001407368
    Average optime (op time):      0.003186859
    Average etime (elapsed time):  0.004591048

    Multi-factor Authentications:  0
    Proxied Auth Operations:       0
    Persistent Searches:           0
    Internal Operations:           0
    Entry Operations:              0
    Extended Operations:           9833
    Abandoned Requests:            0
    Smart Referrals Received:      0

    VLV Operations:                0
    VLV Unindexed Searches:        0
    VLV Unindexed Components:      0
    SORT Operations:               0

    Entire Search Base Queries:    0
    Paged Searches:                0
    Unindexed Searches:            0
    Unindexed Components:          0
    Invalid Attribute Filters:     0
    FDs Taken:                     9933
    FDs Returned:                  9932
    Highest FD Taken:              961

    Broken Pipes:                  0
    Connections Reset By Peer:     0
    Resource Unavailable:          0
    Max BER Size Exceeded:         0

    Binds:                         9932
    Unbinds:                       9225
    -----------------------------------
     - LDAP v2 Binds:              0
     - LDAP v3 Binds:              9932
     - AUTOBINDs(LDAPI):           0
     - SSL Client Binds:           0
     - Failed SSL Client Binds:    0
     - SASL Binds:                 0
     - Directory Manager Binds:    0
     - Anonymous Binds:            99

    ================

    After this run I bumped up these from 4096,

    net.ipv4.tcp_max_syn_backlog = 6144
    net.core.somaxconn = 6144

    Yet we still get the ldap errors (this one and the start tls
    request error previously mentioned.)

    Should I bump up the

    On 10/8/24 03:47, Thierry Bordaz wrote:

        Hi Gary,

        I meant that the access logs covered 5hours. It would be
        helpful to capture/focus on the logs from the few minutes
        before/after the time when the problem occurred. Then check
        from those limited logs if there is a pattern or unexpected
        events (long operation, no operation, abandon, ...)

        best regards
        thierry

        On 10/7/24 7:37 PM, Gary Waters wrote:

            Hi Thierry,

            Ok I ll decrease the timeout to 15 seconds then.

                Reducing the size of the logs will help.

            Which log and how do I do this ?

            Thanks Marc and Theirry!

            -Gary

            On 10/7/24 00:26, Thierry Bordaz wrote:

                Hi,

                Those slap_poll error means that the server was unable
                to send back PDU to the client. It can occur if the
                client sends a request and does not read fast enough
                the results. The timeout is high 30s (30000), could it
                be that the problem is on the client side (app) ?

                I suggest that you focus on the timestamp when the
                application reports a failure. Then look in the
                access/error logs from 1-3min before and after the
                time of the failure. Logconv from that limited scope
                will be more helpful than a global one.

                The pattern looks to be an app opens a connection,
                switch to secure connection (start-tls), issue 6-8
                SRCH then close. etime/wtime/optime looks fine but as
                it is an average (over 1M op) it is not helpful.
                Reducing the size of the logs will help.
                I found interesting the abandon op as it is possibly
                related to a performance issue.

                best regards
                thierry

                On 10/4/24 11:54 PM, Gary Waters via 389-users wrote:

                    Hi Marc,

                    I have made nsslapd-listen-backlog-size to 512.

                    For the ioblocktimeout, I increased it because of
                    an error I was seeing:

                    [30/Sep/2024:16:26:55.987681019 -0700] - ERR -
                    slapd_poll - (743) - Timed out
                    [30/Sep/2024:16:34:49.646922635 -0700] - ERR -
                    slapd_poll - (568) - Timed out

                    Googling stated that I should increase the
                    ioblocktimeout. So I bumped it up from 20000 to 30000.

                    Since then, those slapd_poll timed out errors have
                    not occurred. Should I have changed something else?

                    What should I increase these to?

                    net.core.somaxconn = 4096
                    net.ipv4.tcp_max_syn_backlog = 4096

                    Thanks so much for your help!

                    -Gary

                    On 10/4/24 11:55, Marc Sauton wrote:

                        tune up nsslapd-listen-backlog-size

                        and verify the net.core.somaxconn
                        and net.ipv4.tcp_max_syn_backlog are high
                        enough ( sysctl -a )

                        possibly tune down the nsslapd-ioblocktimeout
                        value

                        Thanks,

                        M.

                        On Fri, Oct 4, 2024 at 11:06 AM gwaters-web---
                        via 389-users
                        <389-users@lists.fedoraproject.org> wrote:

                            Hello,

                            We are experiencing a new issue since we
                            upgraded from 389-ds-base from
                            1.4~ish to 2.0.15 on RHEL 8. I couldnt
                            figure how to fix it, so I
                            switched to RHEL9 and are on 2.4.5-9.

                            The issue is during a performance load
                            test of a web application. The
                            app logs into a website and does some
                            things that searches against ldap,
                            and does some transactions. This app has
                            been performing fine for years,
                            the app has changed so it could be
                            something there, but I am not sure
                            about that because of the percentage of
                            the traffic that is successful.

                            The errors for the web app are "Can't
                            contact Ldap Server" and sometimes
                            "Can't contact LDAP server. Start TLS
                            request accepted.Server willing to
                            negotiate SSL. (0xFFFF [-1])". Out of the
                            128k connections below, these
                            errors will happen like 5 or 6 times, so
                            its wildly inconsistent and random.

                            I did a logconv analysis with 6 hours of a
                            day of testing, see below.
                            One thing that really stood out to me was
                            the peak concurrent
                            connections = 22.. That peak is so low, I
                            dont know how these errors are
                            happening.

                            I dont see any errors in the access log (
                            grepping for err=1).
                            I looked for cache warnings/errors in the
                            access/errors logs, but didnt
                            find any. I dont see things like
                            unavailable connections in the access logs.

                            Suggestions on what to change or look for
                            in the logs ?

                            Thanks,
                            Gary


                            information:
                            Machine Size: 16G of ram, 4 core AMD  (its
                            an EC2.m5.large, gp3 disk type)

                            kernel:
                            Linux  5.14.0-427.35.1.el9_4.x86_64 #1 SMP
                            PREEMPT_DYNAMIC
                            packages:
                            389-ds-base-libs-2.4.5-9.el9_4.x86_64
                            389-ds-base-2.4.5-9.el9_4.x86_64

                            single instance of dirsrv running
                            dirsrv modifcations from default:

                            nsslapd-logging-backend: dirsrv-log,syslog
                            nsslapd-maxdescriptors: 8192
                            nsslapd-listen-backlog-size: 256
                            nsslapd-allow-hashed-passwords: on
                            nsslapd-idletimeout: 30
                            nsslapd-ioblocktimeout: 30000
                            nsslapd-sizelimit: -1
                            nsslapd-auditlog-logging-enabled: off
                            nsslapd-lookthroughlimit: -1

                            dirsrv.systemd:
                            limitNOFILE=8192

                             >Total Log Lines Analszed:  2694287
                             >
                             >
                             >
                             > ---------- Access Log Output ------------
                             >
                             > Start of Logs:
                            26/Sep/2024:10:07:32.089983378
                             > End of Logs: 26/Sep/2024:15:54:29.895403688
                             >
                             > Processed Log Time:  5 Hours, 46
                            Minutes, 57.805426688 Seconds
                             >
                             > Restarts:                      0
                             > Secure Protocol Versions:
                             >   - TLS1.2 128-bit AES-GCM (123117
                            connections)
                             >
                             > Peak Concurrent Connections:   22
                             > Total Operations:              1097043
                             > Total Results:                 1097044
                             > Overall Performance:           100.0%
                             >
                             > Total Connections: 128646       
                            (6.18/sec) (370.78/min)
                             >  - LDAP Connections: 128646       
                            (6.18/sec) (370.78/min)
                             >  - LDAPI Connections: 0            
                            (0.00/sec) (0.00/min)
                             >  - LDAPS Connections: 0            
                            (0.00/sec) (0.00/min)
                             >  - StartTLS Extended Ops: 123116       
                            (5.91/sec) (354.84/min)
                             >
                             > Searches: 845279        (40.60/sec)
                            (2436.22/min)
                             > Modifications: 0             (0.00/sec)
                            (0.00/min)
                             > Adds: 0             (0.00/sec) (0.00/min)
                             > Deletes: 0             (0.00/sec)
                            (0.00/min)
                             > Mod RDNs: 0             (0.00/sec)
                            (0.00/min)
                             > Compares: 0             (0.00/sec)
                            (0.00/min)
                             > Binds: 128647        (6.18/sec)
                            (370.78/min)
                             >
                             > Average wtime (wait time): 0.001560856
                             > Average optime (op time): 0.003310453
                             > Average etime (elapsed time): 0.004868040
                             >
                             > Multi-factor Authentications:  0
                             > Proxied Auth Operations:       0
                             > Persistent Searches:           0
                             > Internal Operations:           0
                             > Entry Operations:              0
                             > Extended Operations:           123116
                             > Abandoned Requests:            1
                             > Smart Referrals Received:      0
                             >
                             > VLV Operations:                0
                             > VLV Unindexed Searches:        0
                             > VLV Unindexed Components:      0
                             > SORT Operations:               0
                             >
                             > Entire Search Base Queries:    0
                             > Paged Searches:                0
                             > Unindexed Searches:            0
                             > Unindexed Components:          0
                             > Invalid Attribute Filters:     0
                             > FDs Taken:                     128646
                             > FDs Returned:                  129318
                             > Highest FD Taken:              968
                             >
                             > Broken Pipes:                  0
                             > Connections Reset By Peer:     0
                             > Resource Unavailable:          0
                             > Max BER Size Exceeded:         0
                             >
                             > Binds:                         128647
                             > Unbinds:                       119206
                             > -------------------------------------
                             >  - LDAP v2 Binds:              0
                             >  - LDAP v3 Binds:              128647
                             >  - AUTOBINDs(LDAPI):           0
                             >  - SSL Client Binds:           0
                             >  - Failed SSL Client Binds:    0
                             >  - SASL Binds:                 0
                             >  - Dir

-- _______________________________________________
                            389-users mailing list --
                            389-users@lists.fedoraproject.org
                            To unsubscribe send an email to
                            389-users-le...@lists.fedoraproject.org
                            Fedora Code of Conduct:
                            
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
                            
<https://urldefense.com/v3/__https:/docs.fedoraproject.org/en-US/project/code-of-conduct/__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEiq_mj_I$>
                            List Guidelines:
                            
https://fedoraproject.org/wiki/Mailing_list_guidelines
                            
<https://urldefense.com/v3/__https:/fedoraproject.org/wiki/Mailing_list_guidelines__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEsTfSxSj$>
                            List Archives:
                            
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
                            
<https://urldefense.com/v3/__https:/lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEslJO53N$>
                            Do not reply to spam, report it:
                            https://pagure.io/fedora-infrastructure/new_issue
                            
<https://urldefense.com/v3/__https:/pagure.io/fedora-infrastructure/new_issue__;!!JE2nyJ8!1qCobdltMGmX_XrFdPGoEKXfZdGVOi6A5VWjbWtmSJ7AwkqY-PydSmvJfs-4UYWfupW5of5u87v7mBMWQCBDEvlOa8tQ$>


-- 
_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue

Reply via email to