For what it's worth, I was able to reproduce the bug with JGDI debug logs
activated. I posted the logs here (truncated to the last 50MB):

https://gist.github.com/nicoulaj/0d8eaa2c437ff408b28b79a6e4fb87e9

I can see a lot of commlib errors, but I'm not sure this helps, since I see
the same errors when everything works fine.

2016-11-09 13:54 GMT+01:00 Julien Nicoulaud <julien.nicoul...@gmail.com>:

> As a side note, I also frequently get this kind of errors:
>
> com.sun.grid.jgdi.JGDIException: failed receiving gdi request response for mid=65527 (can't send response for this message id - protocol error).
>         at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
>         at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
>         at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)
>
> and:
>
> com.sun.grid.jgdi.JGDIException: GDI mismatch
>         at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
>         at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
>         at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)
>
> It might not be related though.
>
> 2016-11-09 13:52 GMT+01:00 Julien Nicoulaud <julien.nicoul...@gmail.com>:
>
>> Hi all,
>>
>> I am using JGDI with EventClient, on a SGE 6.2u5 installation. My process
>> randomly dies every few days with a segmentation fault in this code (from
>> the core dumps):
>>
>> Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
>> #0  0x00007fbd5113e097 in cl_raw_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #1  0x00007fbd511214c2 in cl_message_list_get_next_elem () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130d04 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>>
>> So it looks to me like the commlib messages list gets corrupted (lock
>> issue ?).
>>
>> Here are the backtraces for the other JGDI threads:
>>
>> Thread 67 (Thread 0x7fbd50ed2700 (LWP 23344)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130b84 in cl_com_trigger_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 18 (Thread 0x7fbd3abfd700 (LWP 23347)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 11 (Thread 0x7fbd38dfa700 (LWP 23369)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51131cbd in cl_com_handle_write_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130d41 in cl_com_handle_service_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 5 (Thread 0x7fbd3b5fe700 (LWP 23346)):
>> #0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
>> #1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 4 (Thread 0x7fbd397fb700 (LWP 23368)):
>> #0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
>> #1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130f91 in cl_com_handle_read_thread () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 3 (Thread 0x7fbd50498700 (LWP 23366)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd5112d7c8 in cl_commlib_receive_message () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd5108b9b5 in sge_gdi2_get_any_request () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #4  0x00007fbd51084223 in get_event_list () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #5  0x00007fbd510839d6 in ec2_get () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #6  0x00007fbd50f84cbc in waitEVC () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #7  0x00007fbd50f83662 in Java_com_sun_grid_jgdi_jni_EventClientImpl_fillEvents () from /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>>
>> I could find similar old bug reports, but nothing showing this could have
>> been fixed:
>>
>>    - http://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026145.html
>>    - https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026086.html
>>
>> Is anyone aware of this ? In the hypothesis this is actually a
>> concurrency bug, are there any ways to force commlib to use a
>> single-threaded mode ?
>>
>> Regards,
>> Julien
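
On the single-threaded question: pending an answer about a commlib-level
switch, one application-side mitigation would be to funnel every JGDI call
through a single Java thread. Below is a minimal, untested-against-SGE sketch
of that idea. GridClient, SerializedClient and the getJob signature are
hypothetical stand-ins for illustration, not the real JGDI API. It also only
serializes calls made from Java; it cannot control commlib's own internal
threads (such as cl_com_handle_service_thread in the backtrace), so it may
well not prevent the crash.

```java
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerializedCalls {

    /** Hypothetical stand-in for the JGDI handle; NOT the real JGDI API. */
    interface GridClient {
        String getJob(int jobId) throws Exception;
    }

    /** Forwards every call to one dedicated thread, so calls never overlap. */
    static final class SerializedClient implements GridClient, AutoCloseable {
        private final GridClient delegate;
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        SerializedClient(GridClient delegate) {
            this.delegate = delegate;
        }

        @Override
        public String getJob(int jobId) throws Exception {
            Callable<String> task = () -> delegate.getJob(jobId);
            // Block the caller until the worker thread has run the call.
            return worker.submit(task).get();
        }

        @Override
        public void close() {
            // Stop accepting new calls; already queued calls still complete.
            worker.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> threadsSeen = ConcurrentHashMap.newKeySet();
        // Fake client that records which thread actually executes each call.
        GridClient fake = id -> {
            threadsSeen.add(Thread.currentThread().getName());
            return "job-" + id;
        };
        try (SerializedClient client = new SerializedClient(fake)) {
            Thread[] callers = new Thread[4];
            for (int i = 0; i < callers.length; i++) {
                final int id = i;
                callers[i] = new Thread(() -> {
                    try {
                        client.getJob(id);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
                callers[i].start();
            }
            for (Thread t : callers) {
                t.join();
            }
        }
        System.out.println("threads that executed calls: " + threadsSeen.size());
    }
}
```

In a real setup the delegate would wrap the actual JGDI connection, and the
event client loop would need the same treatment, since it runs on its own
thread.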