Issue 170504
Summary [flang][OpenMP] Dead code with !$omp loop bind(parallel) causes a core dump at runtime
Labels flang
Assignees
Reporter jfuchs-kmt
I found an obscure bug where dead code causes a crash at runtime when `!$omp loop bind(parallel)` is used in nested loops as part of a function call. Taking the snippet below as reference, we have two subroutines, `singleloop` and `outerloop`, that are completely unrelated and do not operate on the same data. Additionally, `outerloop` and the related `innerloop` are dead code and **never called** (the crash also happens if that code is called).

When we loop through some data using `!$omp target teams loop` in `singleloop`, we get a core dump:
```
"PluginInterface" error: Failure to synchronize stream (nil): "unknown or internal error" error in cuStreamSynchronize: an illegal memory access was encountered
omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.
omptarget error: Source location information not present. Compile with -g or -gline-tables-only.
omptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted (core dumped)
```
The error is tied to several different parts of the code. Here are some (very weird) ways to get rid of the bug:

1. remove `bind(parallel)` in the completely unrelated `innerloop` subroutine (see the sketch after this list)
2. remove `val=1.0` in `singleloop`
3. remove any of the three lines assigning `xa`, `xb`, or `xc` (the code crashes with >=3 assignments and runs with <=2)
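
To make workaround 1 concrete: it amounts to a one-line directive change in the otherwise unchanged `innerloop` (see the full reproducer below); a minimal before/after sketch:

```fortran
! Before (crashes at runtime):
!$omp loop bind(parallel)
! After (runs):
!$omp loop
```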

These workarounds are completely obscure and impractical, since in a real program we want to use the `outerloop`/`innerloop` structure to work on some data. In summary, dead code is influencing the runtime behavior of code that actually runs, causing strange memory issues.

Additional remark: using `bind(thread)` in `innerloop` works perfectly fine, but it raises the question of how flang actually applies the parallelization. We are evaluating multiple compilers and found no agreement on whether `bind(thread)` or `bind(parallel)` is supported at all, or on how it performs. For example, `nvfortran` works best with `bind(parallel)` in `innerloop`, while Intel `ifx` produces wrong results with `bind(parallel)`.
We found that flang always produces correct results but performs terribly with `bind(parallel)`, which raises the question of whether this usage is actually supported at all.
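
For reference, the working `bind(thread)` variant mentioned above differs from the reproducer below only in the bind clause on the loop in `innerloop`; a minimal sketch:

```fortran
! Runs correctly with flang: bind the loop to the encountering thread
! instead of the innermost enclosing parallel region.
!$omp loop bind(thread)
DO i = 1, cpd
    DO j = 1, cpd
        DO k = 1, cpd
            a(k, j, i) = 0.5 * a(k, j, i) + val
        END DO
    END DO
END DO
!$omp end loop
```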

I am using a very recent `flang version 22.0.0git (git@github.com:llvm/llvm-project.git 045331e4a035fa5dd4e91db03c5c7d6335443c03)`.
Compile the snippet with `flang -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -fopenmp-version=52 main.F90` and run it with `OMP_TARGET_OFFLOAD=mandatory ./a.out`.

```fortran

PROGRAM main
    IMPLICIT NONE

    INTEGER, PARAMETER :: ngrids = 12
    INTEGER, PARAMETER :: cpd = 64 + 2 + 2
    INTEGER, PARAMETER :: cpg = cpd**3
    INTEGER, PARAMETER :: n = ngrids * cpg

    CALL singleloop()
CONTAINS
    ! Dead code: outerloop (and the innerloop it calls) is never called,
    ! yet its presence triggers the crash in singleloop.
    SUBROUTINE outerloop()
        REAL, ALLOCATABLE, TARGET, DIMENSION(:) :: a
        INTEGER :: igrid, ip3

        ALLOCATE(a(n), source=0.0)
        !$omp target enter data map(to: a)

        !$omp target teams loop bind(teams) shared(a) private(ip3)
        DO igrid = 1, ngrids
            ip3 = (igrid - 1) * cpg + 1
            CALL innerloop(a(ip3), REAL(igrid))
        END DO
        !$omp end target teams loop

        !$omp target exit data map(delete: a)
        DEALLOCATE(a)
    END SUBROUTINE outerloop

    SUBROUTINE innerloop(a, val)
        !$omp declare target
        REAL, INTENT(INOUT), DIMENSION(cpd, cpd, cpd) :: a
        REAL, INTENT(IN) :: val

        INTEGER :: i, j, k

        ! Removing bind(parallel) here avoids the crash (workaround 1).
        !$omp loop bind(parallel)
        DO i = 1, cpd
            DO j = 1, cpd
                DO k = 1, cpd
                    ! This is just some random computation for show;
                    ! one can also comment this out.
                    a(k, j, i) = 0.5 * a(k, j, i) + val
                END DO
            END DO
        END DO
        !$omp end loop
    END SUBROUTINE innerloop

    SUBROUTINE singleloop()
        REAL, ALLOCATABLE, TARGET, DIMENSION(:) :: a, b, c
        INTEGER :: idx
        REAL :: xa, xb, xc, val

        ALLOCATE(a(n), source=0.0)
        ALLOCATE(b(n), source=0.0)
        ALLOCATE(c(n), source=0.0)
        !$omp target enter data map(to: a, b, c)

        !$omp target teams loop shared(a, b, c) private(val, xa, xb, xc)
        DO idx = 1, n
            val = 1.0   ! removing this line avoids the crash (workaround 2)
            xa = a(idx) ! removing any one of these three assignments
            xb = b(idx) ! avoids the crash (workaround 3)
            xc = c(idx)
        END DO
        !$omp end target teams loop

        !$omp target exit data map(delete: a, b, c)
        DEALLOCATE(a)
        DEALLOCATE(b)
        DEALLOCATE(c)
    END SUBROUTINE singleloop
END PROGRAM main

```
