On 11/18/2011 05:31 AM, "Reddy, Gandikota Rushendra (ES&N)"
<gandikota-rushendra.re...@hp.com> wrote:
This was a bounce - please check your list membership before posting again.
On a multi CPU system, ns-slapd process aborts. Stack trace says it aborted
calling PR_ASSERT() at the following location in connection.c (Release version
of 389 dirsrv is 1.2.0).
Is there any chance you could attempt to reproduce this issue with the
latest 1.2.9.9 (stable) or 1.2.10.a5 (testing)? Or provide a
reproducer. 1.2.0 is quite old.
Also, please file a bug at
https://bugzilla.redhat.com/enter_bug.cgi?product=389
-----
Connection.c
1690 int connection_read_operation(Connection *conn, Operation *op, ber_tag_t
*tag, int *remaining_data)
1691 {
1734 }
1735 /* If we still haven't seen a complete PDU, read from the network
*/
1736 while (*tag == LBER_DEFAULT) {
1737 int ioblocktimeout_waits = config_get_ioblocktimeout() /
CONN_TURBO_TIMEOUT_INTERVAL;
1738 /* We should never get here with data remaining in the
buffer */
1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes
- conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
1740 /* We make a non-blocking read call */
-----
Using instrumented image I have collected data and found that:
While a thread is working on connection's private buffer, another thread which
belongs to older connection makes the connection readable again by modifying
c_gettingber to zero.
This results in waking up one more thread. Finally more than one thread works
simultaneously on connection's private buffer and results in aborting ns-slapd.
It happens as below:
1. Thead X is working on Connection-1's last opereration.
2. After performing the LDAP operation ( connection_dispatch_operation() )
decrementes the connection's c_refcnt by calling connection_release_nolock()
in connection_threadmain() function.
3. Since this is the last operation, c_refcnt becomes zero.
4. Thread X yet to execute the code that updates c_threadnumber and
c_gettingber. But as soon as c_refcnt of the connection becomes zero, main
thread allocates the same connection object/structure to new incoming
connection connection-2. Thread-Y started executing operation belongs to this
newer connection (Connection-2).
-----
connection.c
2008 static void
2009 connection_threadmain()
2010 {
2187 if (!thread_turbo_flag&& !more_data) {
2188 connection_release_nolock
(conn);<------------------ Thread-X decremented c_refcnt and now c_refcnt = 0
2189 }
2190 PR_Unlock( conn->c_mutex );
2191 }
2192 /* Since we didn't do so earlier, we need to make a
replication connection readable again here */
2193 if ( ((1 == is_timedout) || (replication_connection&&
!thread_turbo_flag))&& !more_data)
2194 connection_make_readable(conn);
2195 pb = NULL;
2196
2197 if (!thread_turbo_flag&& !more_data) { /* Don't do this
in turbo mode */
2198 PR_Lock( conn->c_mutex );
2199 /* if the threadnumber of now below the maximum,
wakeup
2200 * the listener thread so that we start polling on
this
2201 * connection again
2202 */
2203 /* DBDB I think this code is bogus -- we already
signaled the listener above here */
2204 if (conn->c_threadnumber ==
config_get_maxthreadsperconn())
2205 need_wakeup = 1;
2206 else
2207 need_wakeup = 0;
2208 conn->c_threadnumber--;
2209 PR_Unlock( conn->c_mutex );
2210
2211 if (need_wakeup)
2212 signal_listner();
2213 }
2214
2215
2216 } /* while (1) */
2217 }
----
5. Thread-Y started processing operation data belongs to newer connection
Connection-2 (connection_read_operation() ).
6. Now Thread-X of connection-1 called make_connection_readable() and made the
connection readable again by setting c_gettingbuffer to 0.
7. This wakes up another thread of connection-2 and it too starts processing
operation data (connection_read_operation() ).
8. So at this point of time more than one thread is working on connection's
private data.
9. As two threads are working on the connections c_buffer, c_bufferbytes and
c_offset, at some time it meats conditions to abort() at the following location:
Connection.c: connection_read_operation()
1734 }
1735 /* If we still haven't seen a complete PDU, read from the network
*/
1736 while (*tag == LBER_DEFAULT) {
1737 int ioblocktimeout_waits = config_get_ioblocktimeout() /
CONN_TURBO_TIMEOUT_INTERVAL;
1738 /* We should never get here with data remaining in the
buffer */
1739 PR_ASSERT( !new_operation || 0 == (conn->c_private->c_buffer_bytes
- conn->c_private->c_buffer_offset) );<------------------ aborted calling PR_ASSERT()
1740 /* We make a non-blocking read call */
-----
The probable fix could be:
In the connection_threadmain() function, if it is CONN_DONE don't make the
connection readable again after decrementing the c_refcnt.
So should we handle CONN_DONE and CONN_TIMEDOUT situation differently as below?
Present code:
2097 switch (ret) {
2098 case CONN_DONE:
2099 /* This means that the connection was
closed, so clear turbo mode */
2100 /*FALLTHROUGH*/
2101 case CONN_TIMEDOUT:
2102 thread_turbo_flag = 0;
2103 is_timedout = 1;
new code:
2097 switch (ret) {
2098 case CONN_DONE:
2099 /* This means that the connection was
closed, so clear turbo mode */
is_conn_done = 0;<-----------------
thread_turbo_flag = 0;<-------------
2101 case CONN_TIMEDOUT:
2102 thread_turbo_flag = 0;
2103 is_timedout = 1;
Note that you don't have to set thread_turbo_flag = 0 in the CONN_DONE
case because it will be set when it falls through to the CONN_TIMEDOUT case.
Present:
2193 if ( ((1 == is_timedout) || (replication_connection&&
!thread_turbo_flag))&& !more_data )
2194 connection_make_readable(conn);
2195 pb = NULL;
new:
2193 ---------> if ( !is_conn_done&& ((1 == is_timedout) ||
(replication_connection&& !thread_turbo_flag))&& !more_data )
2194 connection_make_readable(conn);
2195 pb = NULL;
Is it right way to fix it? Please provide your inputs.
If ret is CONN_DONE, but you set is_conn_done = 0, how does the code
know the connection is really done?
Thanks& regards,
Rushendra
--
389 users mailing list
389-us...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users