For what it's worth, I was able to reproduce the bug with JGDI debug logging
enabled.
I posted the log here (truncated to the last 50 MB):
https://gist.github.com/nicoulaj/0d8eaa2c437ff408b28b79a6e4fb87e9
I can see a lot of commlib errors, but I'm not sure they help, since the same
errors also show up when everything works fine.
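
One way to check whether any of those commlib errors are specific to the crashing run would be to compare the distinct error patterns of both logs. Below is a minimal sketch of that idea. The class name `CommlibLogDiff`, the "error" marker, and the normalization rules are assumptions made for illustration; the real JGDI debug log layout may need different rules.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch only: the log line layout is an assumption, not the real JGDI format.
class CommlibLogDiff {

    // Replace hex addresses and digit runs so that lines differing only in
    // timestamps, thread ids or message ids collapse into one pattern.
    static String normalize(String line) {
        return line.replaceAll("0x[0-9a-fA-F]+", "<hex>")
                   .replaceAll("\\d+", "<n>")
                   .trim();
    }

    // Distinct normalized patterns of all lines containing "error"
    // (case-insensitive); the marker itself is an assumption.
    static Set<String> errorPatterns(List<String> lines) {
        Set<String> patterns = new LinkedHashSet<>();
        for (String line : lines) {
            if (line.toLowerCase(Locale.ROOT).contains("error")) {
                patterns.add(normalize(line));
            }
        }
        return patterns;
    }

    // Patterns seen in the crashing run but never in the healthy run.
    static Set<String> onlyInCrashingRun(List<String> crashing, List<String> healthy) {
        Set<String> result = errorPatterns(crashing);
        result.removeAll(errorPatterns(healthy));
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java CommlibLogDiff <crashing.log> <healthy.log>");
            return;
        }
        // ISO-8859-1 accepts any byte sequence, so odd bytes in the log do not abort the read.
        List<String> crashing = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        List<String> healthy = Files.readAllLines(Paths.get(args[1]), StandardCharsets.ISO_8859_1);
        for (String pattern : onlyInCrashingRun(crashing, healthy)) {
            System.out.println(pattern);
        }
    }
}
```

If the output is empty, the errors in the crashing log are the same kinds that appear in a healthy run and probably say little about the segmentation fault.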

2016-11-09 13:54 GMT+01:00 Julien Nicoulaud <julien.nicoul...@gmail.com>:

> As a side note, I also frequently get errors like this:
>
> com.sun.grid.jgdi.JGDIException: failed receiving gdi request response
> for mid=65527 (can't send response for this message id - protocol error).
>   at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
>   at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
>   at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)
>
>
> and:
>
> com.sun.grid.jgdi.JGDIException: GDI mismatch
>   at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
>   at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
>   at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)
>
> It might not be related though.
>
> 2016-11-09 13:52 GMT+01:00 Julien Nicoulaud <julien.nicoul...@gmail.com>:
>
>> Hi all,
>>
>> I am using JGDI with EventClient on an SGE 6.2u5 installation. My process
>> randomly dies every few days with a segmentation fault in this code (from
>> the core dumps):
>>
>> Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
>> #0  0x00007fbd5113e097 in cl_raw_list_get_next_elem () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #1  0x00007fbd511214c2 in cl_message_list_get_next_elem () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130d04 in cl_com_handle_service_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>>
>>
>> So it looks to me like the commlib message list gets corrupted (a locking
>> issue?).
>>
>> Here are the backtraces for the other JGDI threads:
>>
>> Thread 67 (Thread 0x7fbd50ed2700 (LWP 23344)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130b84 in cl_com_trigger_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 18 (Thread 0x7fbd3abfd700 (LWP 23347)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51131cbd in cl_com_handle_write_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 11 (Thread 0x7fbd38dfa700 (LWP 23369)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51131cbd in cl_com_handle_write_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130d41 in cl_com_handle_service_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 5 (Thread 0x7fbd3b5fe700 (LWP 23346)):
>> #0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
>> #1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler ()
>> from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130f91 in cl_com_handle_read_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 4 (Thread 0x7fbd397fb700 (LWP 23368)):
>> #0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
>> #1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler ()
>> from /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd51130f91 in cl_com_handle_read_thread () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>> Thread 3 (Thread 0x7fbd50498700 (LWP 23366)):
>> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #2  0x00007fbd5112d7c8 in cl_commlib_receive_message () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #3  0x00007fbd5108b9b5 in sge_gdi2_get_any_request () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #4  0x00007fbd51084223 in get_event_list () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #5  0x00007fbd510839d6 in ec2_get () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #6  0x00007fbd50f84cbc in waitEVC () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> #7  0x00007fbd50f83662 in
>> Java_com_sun_grid_jgdi_jni_EventClientImpl_fillEvents () from
>> /opt/sge/lib/lx24-amd64/libjgdi.so
>> --
>>
>>
>> I found similar old bug reports, but nothing indicating that this has
>> been fixed:
>>
>>    - http://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026145.html
>>    - https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026086.html
>>
>> Is anyone aware of this? Assuming this is actually a concurrency bug, is
>> there any way to force commlib to run in a single-threaded mode?
>>
>> Regards,
>> Julien
>>
>
>
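
Regarding the single-threaded question above: whether commlib itself can be switched to a single-threaded mode is left open here. What can be done on the application side is to funnel every JGDI call through one dedicated thread, so that the application never enters the native library from two threads at once. The sketch below shows that pattern; `SerializedClient` is a hypothetical helper, not part of JGDI. It does not change commlib's own service, read and write threads seen in the backtraces, so it may not prevent the crash.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper (not part of JGDI): runs every submitted request on one
// dedicated worker thread, one request at a time.
class SerializedClient implements AutoCloseable {

    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Blocks the caller until the request has run on the worker thread.
    // Do not call this from inside a request: the single worker would then
    // wait for itself and deadlock.
    <T> T call(Callable<T> request) throws Exception {
        try {
            return worker.submit(request).get();
        } catch (ExecutionException e) {
            // Rethrow the original failure of the request where possible.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    // Stops accepting new requests; requests already queued still run.
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

In use, each native call would be wrapped in a lambda, for example the `getJob` call from the stack traces above as `client.call(() -> jgdi.getJob(jobId))`, where `jgdi` and `jobId` stand for whatever the application already holds.
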
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
