As a side note, I also frequently get these kinds of errors:

com.sun.grid.jgdi.JGDIException: failed receiving gdi request response for
mid=65527 (can't send response for this message id - protocol error).
  at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
  at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
  at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)


and:

com.sun.grid.jgdi.JGDIException: GDI mismatch
  at com.sun.grid.jgdi.jni.JGDIImpl.fillJobListWithAnswer(Native Method)
  at com.sun.grid.jgdi.jni.JGDIImpl.fillJobList(JGDIImpl.java:3279)
  at com.sun.grid.jgdi.jni.JGDIImpl.getJob(JGDIImpl.java:3330)

It might not be related though.
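
If these exceptions come from several Java threads sharing one JGDI connection (an assumption I have not verified), one workaround to try is funnelling every JGDI call through a single dedicated thread. Below is a minimal sketch of that idea. `SerializedCalls` is a hypothetical helper, not part of JGDI, and it only serializes calls made from the Java side. It cannot change how commlib schedules its own native threads, so it may well not prevent the segmentation fault.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of JGDI): runs every submitted call on one dedicated thread. */
final class SerializedCalls implements AutoCloseable {
    // A single worker thread, so no two submitted calls ever overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Runs the task on the worker thread, waits for it, and returns its result. */
    <T> T call(Callable<T> task) throws Exception {
        try {
            return worker.submit(task).get();
        } catch (ExecutionException e) {
            // Rethrow the original failure instead of the ExecutionException wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) throw (Exception) cause;
            if (cause instanceof Error) throw (Error) cause;
            throw e;
        }
    }

    /** Stops accepting new calls and waits up to 10 seconds for queued ones to finish. */
    @Override
    public void close() throws InterruptedException {
        worker.shutdown();
        worker.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Usage would look like `serialized.call(() -> jgdi.getJob(jobId))`, where `serialized`, `jgdi` and `jobId` are placeholders for your own objects. The original exception is rethrown to the caller unchanged, so existing error handling keeps working.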

2016-11-09 13:52 GMT+01:00 Julien Nicoulaud <julien.nicoul...@gmail.com>:

> Hi all,
>
> I am using JGDI with EventClient on an SGE 6.2u5 installation. My process
> randomly dies every few days with a segmentation fault in this code (taken
> from the core dumps):
>
> Thread 1 (Thread 0x7fbd3bfff700 (LWP 23345)):
> #0  0x00007fbd5113e097 in cl_raw_list_get_next_elem () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #1  0x00007fbd511214c2 in cl_message_list_get_next_elem () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd51132734 in cl_commlib_app_message_queue_cleanup () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51130d04 in cl_com_handle_service_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
>
>
> So it looks to me like the commlib message list gets corrupted (a locking
> issue?).
>
> Here are the backtraces for the other JGDI threads:
>
> Thread 67 (Thread 0x7fbd50ed2700 (LWP 23344)):
> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51130b84 in cl_com_trigger_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 18 (Thread 0x7fbd3abfd700 (LWP 23347)):
> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51131cbd in cl_com_handle_write_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 11 (Thread 0x7fbd38dfa700 (LWP 23369)):
> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51131cbd in cl_com_handle_write_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 6 (Thread 0x7fbd3a1fc700 (LWP 23367)):
> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd511401d2 in cl_thread_wait_for_event () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51130d41 in cl_com_handle_service_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 5 (Thread 0x7fbd3b5fe700 (LWP 23346)):
> #0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
> #1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler ()
> from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51130f91 in cl_com_handle_read_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 4 (Thread 0x7fbd397fb700 (LWP 23368)):
> #0  0x00000038ea0dc053 in poll () from /lib64/libc.so.6
> #1  0x00007fbd51113c10 in cl_com_tcp_open_connection_request_handler ()
> from /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd511199d9 in cl_com_open_connection_request_handler () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd51130f91 in cl_com_handle_read_thread () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
> Thread 3 (Thread 0x7fbd50498700 (LWP 23366)):
> #0  0x00000038ea80b75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x00007fbd5113faac in cl_thread_wait_for_thread_condition () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #2  0x00007fbd5112d7c8 in cl_commlib_receive_message () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #3  0x00007fbd5108b9b5 in sge_gdi2_get_any_request () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #4  0x00007fbd51084223 in get_event_list () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #5  0x00007fbd510839d6 in ec2_get () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #6  0x00007fbd50f84cbc in waitEVC () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> #7  0x00007fbd50f83662 in
> Java_com_sun_grid_jgdi_jni_EventClientImpl_fillEvents () from
> /opt/sge/lib/lx24-amd64/libjgdi.so
> --
>
>
> I found similar old bug reports, but nothing indicating that this was ever
> fixed:
>
>    - http://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026145.html
>    - https://arc.liv.ac.uk/pipermail/gridengine-users/2009-July/026086.html
>
> Is anyone aware of this? If this is actually a concurrency bug, is there
> any way to force commlib into a single-threaded mode?
>
> Regards,
> Julien
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
