Hi all,

I am using JGDI with EventClient on an SGE 6.2u5 installation. Every few
days my process dies with a segmentation fault at a seemingly random time.
The core dumps always show the crash in this code path:

Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
#0  0x00007fbd5113e097 in cl_raw_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
#1  0x00007fbd511214c2 in cl_message_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51130d04 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so


So it looks to me like the commlib message list is getting corrupted
(possibly a locking issue?).

Here are the backtraces for the other JGDI threads:

Thread 67 (Thread 0x7fbd50ed2700 (LWP 23344)):
#0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51130b84 in cl_com_trigger_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 18 (Thread 0x7fbd3abfd700 (LWP 23347)):
#0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 11 (Thread 0x7fbd38dfa700 (LWP 23369)):
#0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
#0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51130d41 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 5 (Thread 0x7fbd3b5fe700 (LWP 23346)):
#0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
#1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 4 (Thread 0x7fbd397fb700 (LWP 23368)):
#0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
#1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 3 (Thread 0x7fbd50498700 (LWP 23366)):
#0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2  0x00007fbd5112d7c8 in cl_commlib_receive_message () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3  0x00007fbd5108b9b5 in sge_gdi2_get_any_request () from /opt/sge/lib/lx24-amd64/libjgdi.so
#4  0x00007fbd51084223 in get_event_list () from /opt/sge/lib/lx24-amd64/libjgdi.so
#5  0x00007fbd510839d6 in ec2_get () from /opt/sge/lib/lx24-amd64/libjgdi.so
#6  0x00007fbd50f84cbc in waitEVC () from /opt/sge/lib/lx24-amd64/libjgdi.so
#7  0x00007fbd50f83662 in Java_com_sun_grid_jgdi_jni_EventClientImpl_fillEvents () from /opt/sge/lib/lx24-amd64/libjgdi.so
--


I found similar old bug reports, but nothing indicating that the problem
was ever fixed:

   - http://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026145.html
   - https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026086.html

Is anyone aware of this issue? If it really is a concurrency bug, is there
a way to force commlib to run in single-threaded mode?

Regards,
Julien