As a side note, I also frequently get errors like this:

com.sun.grid.jgdi.JGDIException: failed receiving gdi request response for mid=65527 (can't send response for this message id - protocol error).
    at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
    at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
    at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)

and:

com.sun.grid.jgdi.JGDIException: GDI mismatch
    at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
    at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
    at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)

They might not be related to the crash, though.

2016-11-09 13:52 GMT+01:00 Julien Nicoulaud <julien.nicoul...@gmail.com>:

> Hi all,
>
> I am using JGDI with EventClient on an SGE 6.2u5 installation. My process
> randomly dies every few days with a segmentation fault in this code (from
> the core dumps):
>
> Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
> #0 0x00007fbd5113e097 in cl_raw_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #1 0x00007fbd511214c2 in cl_message_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51130d04 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>
> So it looks to me like the commlib message list gets corrupted (a locking
> issue?).
>
> Here are the backtraces for the other JGDI threads:
>
> Thread 67 (Thread 0x7fbd50ed2700 (LWP 23344)):
> #0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51130b84 in cl_com_trigger_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 18 (Thread 0x7fbd3abfd700 (LWP 23347)):
> #0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 11 (Thread 0x7fbd38dfa700 (LWP 23369)):
> #0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
> #0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51130d41 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 5 (Thread 0x7fbd3b5fe700 (LWP 23346)):
> #0 0x00000038ea0dc053 in poll () from /lib64/libc.so.6
> #1 0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 4 (Thread 0x7fbd397fb700 (LWP 23368)):
> #0 0x00000038ea0dc053 in poll () from /lib64/libc.so.6
> #1 0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 3 (Thread 0x7fbd50498700 (LWP 23366)):
> #0 0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2 0x00007fbd5112d7c8 in cl_commlib_receive_message () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #3 0x00007fbd5108b9b5 in sge_gdi2_get_any_request () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #4 0x00007fbd51084223 in get_event_list () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #5 0x00007fbd510839d6 in ec2_get () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #6 0x00007fbd50f84cbc in waitEVC () from /opt/sge/lib/lx24-amd64/libjgdi.so
> #7 0x00007fbd50f83662 in Java_com_sun_grid_jgdi_jni_EventClientImpl_fillEvents () from /opt/sge/lib/lx24-amd64/libjgdi.so
> --
>
> I found similar old bug reports, but nothing indicating that this was
> ever fixed:
>
> - http://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026145.html
> - https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026086.html
>
> Is anyone aware of this? If this is actually a concurrency bug, is there
> any way to force commlib into a single-threaded mode?
>
> Regards,
> Julien
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users