Hi all,

I am using JGDI with EventClient on an SGE 6.2u5 installation. My process randomly dies every few days with a segmentation fault in the following code (taken from the core dumps):
Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
#0 0x00007fbd5113e097 in cl_raw_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
#1 0x00007fbd511214c2 in cl_message_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130d04 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so

So it looks to me like the commlib message list gets corrupted (a locking issue?).

Here are the backtraces for the other JGDI threads:

Thread 67 (Thread 0x7fbd50ed2700 (LWP 23344)):
#0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130b84 in cl_com_trigger_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 18 (Thread 0x7fbd3abfd700 (LWP 23347)):
#0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 11 (Thread 0x7fbd38dfa700 (LWP 23369)):
#0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
#0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130d41 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 5 (Thread 0x7fbd3b5fe700 (LWP 23346)):
#0 0x00000038ea0dc053 in poll () from /lib64/libc.so.6
#1 0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 4 (Thread 0x7fbd397fb700 (LWP 23368)):
#0 0x00000038ea0dc053 in poll () from /lib64/libc.so.6
#1 0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
--
Thread 3 (Thread 0x7fbd50498700 (LWP 23366)):
#0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd5112d7c8 in cl_commlib_receive_message () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd5108b9b5 in sge_gdi2_get_any_request () from /opt/sge/lib/lx24-amd64/libjgdi.so
#4 0x00007fbd51084223 in get_event_list () from /opt/sge/lib/lx24-amd64/libjgdi.so
#5 0x00007fbd510839d6 in ec2_get () from /opt/sge/lib/lx24-amd64/libjgdi.so
#6 0x00007fbd50f84cbc in waitEVC () from /opt/sge/lib/lx24-amd64/libjgdi.so
#7 0x00007fbd50f83662 in Java_com_sun_grid_jgdi_jni_EventClientImpl_fillEvents () from /opt/sge/lib/lx24-amd64/libjgdi.so
--

I found similar old bug reports, but nothing indicating that this was ever fixed:

- http://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026145.html
- https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026086.html

Is anyone aware of this problem? If it really is a concurrency bug, is there any way to force commlib to run in a single-threaded mode?

Regards,
Julien
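
P.S. In case it helps anyone compare against their own core dumps: the script below is a rough sketch for summarizing which function each thread was started in. It assumes gdb backtrace text in the same shape as the traces above (frames of the form "#N 0xADDR in func () from lib", with no source or argument information). The names parse_backtraces and summarize are my own and are not part of any SGE or gdb tooling.

```python
import re
from collections import Counter

# Thread headers, e.g. "Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):"
THREAD_RE = re.compile(r"Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")
# Frames, e.g. "#0 0x00007fbd5113e097 in cl_raw_list_get_next_elem () from /path/lib.so"
# Assumption: frames carry no arguments or file:line details.
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in (\S+) \(\) from (\S+)")

# Two of the threads from the dump, used as sample input.
SAMPLE = """
Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
#0 0x00007fbd5113e097 in cl_raw_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
#1 0x00007fbd511214c2 in cl_message_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130d04 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
#0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
#2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
#3 0x00007fbd51130d41 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
"""


def parse_backtraces(text):
    """Return {gdb thread number: [function names, innermost frame first]}."""
    threads = {}
    headers = list(THREAD_RE.finditer(text))
    for i, header in enumerate(headers):
        # A thread's frames run from its header up to the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        threads[int(header.group(1))] = [m.group(2) for m in FRAME_RE.finditer(body)]
    return threads


def summarize(threads):
    """Count threads by their outermost frame, i.e. the entry function visible in the dump."""
    return Counter(frames[-1] for frames in threads.values() if frames)


if __name__ == "__main__":
    for func, count in summarize(parse_backtraces(SAMPLE)).most_common():
        print(count, func)
```

Run against the full dump, this groups the threads into two service threads, two write threads, two read threads, one trigger thread and the Java event client thread. That may mean two commlib handles are active in the process, but I have not verified this.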