At 08:51 02/05/2012, you wrote:
Hi,

I am trying to execute the following code on a cluster.

run_kernel is a CUDA call with the signature int run_kernel(int array[], int nelements, int taskid, char hostname[]);

... deleted code

mysum = run_kernel(&onearray[20000000], chunksize, taskid, myname);

... more deleted code

I am simply trying to calculate the sum of the array elements using a kernel function. Each task has its own data and calculates its own sum.

I am getting a segmentation fault on the master task, but all the other tasks calculate their sums successfully.

Here is the output


MPI task 0 has started on host node4
MPI task 1 has started on host node4
MPI task 2 has started on host node5
MPI task 6 has started on host node6
MPI task 5 has started on node5
MPI task 9 has started on host node6
MPI task 8 has started on host node6
MPI task 3 has started on node5
MPI task 4 has started on host node5
MPI task 7 has started on node6
[node4] *** Process received signal ***
[node4] Signal: Segmentation fault (11)
[node4] Signal code: Address not mapped (1)
[node4] Failing at address: 0xb7866000
[node4] [ 0] [0xbc040c]
[node4] [ 1] /usr/lib/libcuda.so(+0x13a0f6) [0x10640f6]
[node4] [ 2] /usr/lib/libcuda.so(+0x146912) [0x1070912]
[node4] [ 3] /usr/lib/libcuda.so(+0x147231) [0x1071231]
[node4] [ 4] /usr/lib/libcuda.so(+0x13cb64) [0x1066b64]
[node4] [ 5] /usr/lib/libcuda.so(+0x11863c) [0x104263c]
[node4] [ 6] /usr/lib/libcuda.so(+0x11d93b) [0x104793b]
[node4] [ 7] /usr/lib/libcuda.so(cuMemcpyHtoD_v2+0x64) [0x1037264]
[node4] [ 8] /usr/local/cuda/lib/libcudart.so.4(+0x20336) [0x224336]
[node4] [ 9] /usr/local/cuda/lib/libcudart.so.4(cudaMemcpy+0x230) [0x257360]
[node4] [10] mpi_array_distributed(run_kernel+0x9a) [0x804a482]
[node4] [11] mpi_array_distributed(main+0x325) [0x804a139]
[node4] [12] /lib/libc.so.6(__libc_start_main+0xe6) [0x4dece6]
[node4] [13] mpi_array_distributed() [0x8049d81]
[node4] *** End of error message ***

It fails in cuMemcpyHtoD inside the CUDA code, which may mean the host-to-device copy is reading past the end of the host array. Perhaps one of these changes can fix your problem:

a) mysum = run_kernel(onearray, chunksize, taskid, myname);

b) mysum = run_kernel(&onearray[20000000-1], chunksize, taskid, myname);
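One way to test this guess without valgrind is to check, before calling run_kernel, that the chunk you pass really lies inside the array. Below is a minimal sketch in plain C. The helper names chunk_fits and chunk_start are hypothetical (they are not in your code), and "total" stands for however many ints onearray was actually allocated with, which your post does not show.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the chunk [offset, offset + count)
   lies entirely inside an array of `total` elements, 0 otherwise. */
static int chunk_fits(size_t total, size_t offset, size_t count)
{
    if (offset > total)
        return 0;
    /* Compare against the remaining space so offset + count cannot overflow. */
    return count <= total - offset;
}

/* Hypothetical helper: returns a pointer to the start of the chunk and
   aborts (via assert) if the chunk would run past the end of the array. */
static const int *chunk_start(const int *array, size_t total,
                              size_t offset, size_t count)
{
    assert(chunk_fits(total, offset, count));
    return array + offset;
}
```

In your case the check would be chunk_fits(total, 20000000, chunksize) on the master task. If it returns 0, the cudaMemcpy inside run_kernel is being asked to read host memory beyond the array, which would be consistent with the "Address not mapped" fault in the trace.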

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3054 on node ecm-c-l-207-004.uniwa.uwa.edu.au exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Sadly, I can't install a memory checker such as valgrind on my machine due to some restrictions. I could not spot any error by looking at the code.

Can anyone help me? What is wrong in the above code?

Thanks

