Rocblas autopkgtest log:
(total size 252MB) [ SKIPPED ] 4909 tests. [ PASSED ] 1205568 tests. [ FAILED ] 0 tests. rocBLAS version: 5.1.1.07564667-dirty rocBLAS-commit-hash: Tensile-commit-hash: hipBLASLt: N/A, as rocBLAS was built without hipBLASLt command line: /usr/libexec/rocm/librocblas5-tests/rocblas-test autopkgtest [18:26:12]: test librocblas5-tests: -----------------------] autopkgtest [18:26:13]: test librocblas5-tests: - - - - - - - - - - results - - - - - - - - - - librocblas5-tests PASS autopkgtest [18:26:14]: @@@@@@@@@@@@@@@@@@@@ summary librocblas5-tests PASS 2026-05-02 18:26:18 - Autopkg tests ended for rocblas. Tests took: 1h 51m 59s. ** Description changed: [ Impact ] - rocblas 7.1.1 (librocblas5) fixes two correctness/performance regressions - introduced in ROCm 7.0/7.1.0 on GPU compute workloads, and corrects several - API implementation bugs: + rocblas 7.1.1 (librocblas5) fixes two correctness/performance regressions + introduced in ROCm 7.0/7.1.0 on GPU compute workloads, and corrects several + API implementation bugs: - 1. fp16/bf16 GEMV precision regression on MI200 (SWDEV-560127) — A ROCm 7.0 - optimisation incorrectly allowed half/bf16 input with fp32 output gemm_ex - calls to use the 16-bit GEMV kernel. Because the 16-bit kernel performs - accumulation in 16-bit arithmetic, cumulative rounding errors caused - numerically incorrect results for any workload using - hpa_half_in_single_out or hpa_bf16_in_single_out precision - with rocblas_gemm_ex on gfx90a (MI200) and gfx942 targets. - Fix: operands are now explicitly cast to the execution type (Tex) before - multiplication inside rocblas_gemvt_kernel_calc and - rocblas_gemvt_reduce_kernel_calc, restoring 32-bit precision. + 1. fp16/bf16 GEMV precision regression on MI200 (SWDEV-560127) — A ROCm 7.0 + optimisation incorrectly allowed half/bf16 input with fp32 output gemm_ex + calls to use the 16-bit GEMV kernel. Because the 16-bit kernel performs + accumulation in 16-bit arithmetic, cumulative rounding errors caused + numerically incorrect results for any workload using + hpa_half_in_single_out or hpa_bf16_in_single_out precision + with rocblas_gemm_ex on gfx90a (MI200) and gfx942 targets. + Fix: operands are now explicitly cast to the execution type (Tex) before + multiplication inside rocblas_gemvt_kernel_calc and + rocblas_gemvt_reduce_kernel_calc, restoring 32-bit precision. - 2. rocHPL multi-GPU performance regression from stream-order allocation - default (SWDEV-558744) — ROCm 7.1.0 made hipMallocAsync/hipFreeAsync - (stream-order allocation) the default memory scheme for rocBLAS handles. - This caused 15%–47% throughput drops in rocHPL-MxP on 2-, 4-, and 8-GPU - configurations. Stream-order allocation is now opt-in again via the - environment variable ROCBLAS_STREAM_ORDER_ALLOC; the default reverts to - hipMalloc/hipFree. This behaviour change is documented in the 7.1.1 - CHANGELOG entry. HIP graph capture (beta feature) now explicitly enables - stream-order allocation internally for the duration of the capture window - (client_utility.cpp), so it continues to work correctly when - ROCBLAS_STREAM_ORDER_ALLOC is not set. + 2. rocHPL multi-GPU performance regression from stream-order allocation + default (SWDEV-558744) — ROCm 7.1.0 made hipMallocAsync/hipFreeAsync + (stream-order allocation) the default memory scheme for rocBLAS handles. + This caused 15%–47% throughput drops in rocHPL-MxP on 2-, 4-, and 8-GPU + configurations. Stream-order allocation is now opt-in again via the + environment variable ROCBLAS_STREAM_ORDER_ALLOC; the default reverts to + hipMalloc/hipFree. This behaviour change is documented in the 7.1.1 + CHANGELOG entry. HIP graph capture (beta feature) now explicitly enables + stream-order allocation internally for the duration of the capture window + (client_utility.cpp), so it continues to work correctly when + ROCBLAS_STREAM_ORDER_ALLOC is not set. - 3. rocblas_is_user_managing_device_memory was broken — In 7.1.0 the function - body was hardcoded to `return false` regardless of handle state. It now - correctly inspects device_memory_owner. Applications that relied on this - function to detect user-managed memory were silently getting wrong results. + 3. rocblas_is_user_managing_device_memory was broken — In 7.1.0 the function + body was hardcoded to `return false` regardless of handle state. It now + correctly inspects device_memory_owner. Applications that relied on this + function to detect user-managed memory were silently getting wrong results. - 4. rocblas_set_device_memory_size was a near-no-op — In 7.1.0 the function - returned success without performing any allocation. It now actually - allocates the requested size via hipMalloc and marks the handle as - user_managed. A new "user_managed" ownership state is introduced alongside - the existing "user_owned" (rocblas_set_workspace) scheme, with a - ROCBLAS_REALLOC_ON_DEMAND=1 compile-time flag enabling on-demand - reallocation for the rocblas_managed path. + 4. rocblas_set_device_memory_size was a near-no-op — In 7.1.0 the function + returned success without performing any allocation. It now actually + allocates the requested size via hipMalloc and marks the handle as + user_managed. A new "user_managed" ownership state is introduced alongside + the existing "user_owned" (rocblas_set_workspace) scheme, with a + ROCBLAS_REALLOC_ON_DEMAND=1 compile-time flag enabling on-demand + reallocation for the rocblas_managed path. - 5. Deprecation message cleanup — rocblas_set_device_memory_size and - rocblas_is_user_managing_device_memory had "[Do not use]" removed from - their deprecation strings, signalling these APIs are being rehabilitated - rather than removed. + 5. Deprecation message cleanup — rocblas_set_device_memory_size and + rocblas_is_user_managing_device_memory had "[Do not use]" removed from + their deprecation strings, signalling these APIs are being rehabilitated + rather than removed. - The 7.1.1 release also carries a documentation-only fix (logging environment - variable include path and reference link corrections) with no runtime impact. + The 7.1.1 release also carries a documentation-only fix (logging environment + variable include path and reference link corrections) with no runtime impact. - Packaging fixes included in this upload (no user-visible behaviour change - once installed; they only restore working build/test plumbing): + Packaging fixes included in this upload (no user-visible behaviour change + once installed; they only restore working build/test plumbing): - a. Tensile kernel install path / runtime lookup mismatch — d/rules pinned - the Tensile data install dir to a hardcoded internal version (5.1.0) - while the runtime patch (move-tensile-library-into-versioned-subdir) - derived its lookup path from upstream's ROCBLAS_VERSION_* macros. - Upstream bumped VERSION_STRING from 5.1.0 to 5.1.1 in 7.1.1, so the - install path and the runtime lookup path would drift apart on this - upload. Symptom would be rocblas-test (and any consumer of librocblas) - aborting at startup with - "Cannot read /usr/lib/<multiarch>/rocblas/library/ - TensileLibrary.dat". - Fix: install kernels at the unversioned - /usr/lib/<multiarch>/rocblas/library so the runtime finds them via - upstream's natural fallback path, drop the versioned-subdir patch - entirely, and update the install/not-installed/lintian-override - globs to match. This removes the version coupling that caused the - drift in the first place. - b. Build-time test could not find rocblas_gtest.data — - Enable-changing-directory-for-test-data.patch had hard-replaced - upstream's "look next to the test binary" lookup with a hardcoded - INSTALL_TEST_DATA_DIR. dh_auto_test runs before dh_auto_install, so - the file is not yet at the install path. Fix: prefer the install - path, fall back to rocblas_exepath() when the file is not present - there. Post-install behaviour is unchanged. - c. d/rules: BUILD_CLIENTS_TESTS expansion emitted "ON ON" when both - FEATURE_CHECK and FEATURE_INSTTEST were ON, which CMake parsed as a - stray extra source path. Replaced $(or $(filter ON,...),OFF) with - $(if ...) so the variable always expands to a single token. + a. Tensile kernel install path / runtime lookup mismatch — d/rules pinned + the Tensile data install dir to a hardcoded internal version (5.1.0) + while the runtime patch (move-tensile-library-into-versioned-subdir) + derived its lookup path from upstream's ROCBLAS_VERSION_* macros. + Upstream bumped VERSION_STRING from 5.1.0 to 5.1.1 in 7.1.1, so the + install path and the runtime lookup path would drift apart on this + upload. Symptom would be rocblas-test (and any consumer of librocblas) + aborting at startup with + "Cannot read /usr/lib/<multiarch>/rocblas/library/ + TensileLibrary.dat". + Fix: install kernels at the unversioned + /usr/lib/<multiarch>/rocblas/library so the runtime finds them via + upstream's natural fallback path, drop the versioned-subdir patch + entirely, and update the install/not-installed/lintian-override + globs to match. This removes the version coupling that caused the + drift in the first place. + b. Build-time test could not find rocblas_gtest.data — + Enable-changing-directory-for-test-data.patch had hard-replaced + upstream's "look next to the test binary" lookup with a hardcoded + INSTALL_TEST_DATA_DIR. dh_auto_test runs before dh_auto_install, so + the file is not yet at the install path. Fix: prefer the install + path, fall back to rocblas_exepath() when the file is not present + there. Post-install behaviour is unchanged. + c. d/rules: BUILD_CLIENTS_TESTS expansion emitted "ON ON" when both + FEATURE_CHECK and FEATURE_INSTTEST were ON, which CMake parsed as a + stray extra source path. Replaced $(or $(filter ON,...),OFF) with + $(if ...) so the variable always expands to a single token. - Items (b) and (c) only manifest when building on a host with /dev/kfd - accessible (i.e. with an AMD GPU present). Launchpad's amd64 builders - have no GPU, so override_dh_auto_test-arch is skipped there and the - bugs were latent. They were uncovered while validating this upload on - a gfx1151 (Strix Halo) developer machine. + Items (b) and (c) only manifest when building on a host with /dev/kfd + accessible (i.e. with an AMD GPU present). Launchpad's amd64 builders + have no GPU, so override_dh_auto_test-arch is skipped there and the + bugs were latent. They were uncovered while validating this upload on + a gfx1151 (Strix Halo) developer machine. - Reverse dependencies: librocblas-dev, libtorch-rocm-2.9, - librocwmma-tests-validate, librocsolver0-tests, librocsolver0-bench, - librocsolver0, librocblas5-tests, librocblas5-bench, libggml0-backend-hip, - libmiopen1-tests, libmiopen1, libhipsolver1, libhipblas3. + Reverse dependencies: librocblas-dev, libtorch-rocm-2.9, + librocwmma-tests-validate, librocsolver0-tests, librocsolver0-bench, + librocsolver0, librocblas5-tests, librocblas5-bench, libggml0-backend-hip, + libmiopen1-tests, libmiopen1, libhipsolver1, libhipblas3. - [ Test Plan ] + [ Test Plan ] - 1. Build: - - sbuild or dpkg-buildpackage the package successfully. - - Verify dpkg --compare-versions shows the new version is greater. - - Run dpkg-gensymbols and confirm no symbols are added/removed/changed - (the .symbols file should remain identical — SONAME remains - librocblas.so.5). - - On a host with an AMD GPU available (/dev/kfd readable), confirm - override_dh_auto_test-arch runs the rocblas-test suite to - completion. Locally verified on gfx1151 (Radeon 8060S / Strix - Halo): 211778 tests across 196 suites, all PASSED. - 2. Installability: - - apt install librocblas5. - - Confirm reverse dependencies remain installable without rebuild. - - Verify the Tensile data is at - /usr/lib/x86_64-linux-gnu/rocblas/library/ - (no version subdirectory) and that - dpkg -L librocblas5 | grep TensileLibrary - lists the per-architecture .dat files. - 3. Run autopkgtest (librocblas5-tests) on a GPU-equipped testbed and - confirm it passes. - - <RUNNING> + 1. Build: + - sbuild or dpkg-buildpackage the package successfully. + - Verify dpkg --compare-versions shows the new version is greater. + - Run dpkg-gensymbols and confirm no symbols are added/removed/changed + (the .symbols file should remain identical — SONAME remains + librocblas.so.5). + - On a host with an AMD GPU available (/dev/kfd readable), confirm + override_dh_auto_test-arch runs the rocblas-test suite to + completion. Locally verified on gfx1151 (Radeon 8060S / Strix + Halo): 211778 tests across 196 suites, all PASSED. + 2. Installability: + - apt install librocblas5. + - Confirm reverse dependencies remain installable without rebuild. + - Verify the Tensile data is at + /usr/lib/x86_64-linux-gnu/rocblas/library/ + (no version subdirectory) and that + dpkg -L librocblas5 | grep TensileLibrary + lists the per-architecture .dat files. + 3. Run autopkgtest (librocblas5-tests) on a GPU-equipped testbed and + confirm it passes. Output: - [ Where problems could occur ] + [ SKIPPED ] 4909 tests. + [ PASSED ] 1205568 tests. + [ FAILED ] 0 tests. + rocBLAS version: 5.1.1.07564667-dirty + rocBLAS-commit-hash: + Tensile-commit-hash: + hipBLASLt: N/A, as rocBLAS was built without hipBLASLt + command line: /usr/libexec/rocm/librocblas5-tests/rocblas-test + autopkgtest [18:26:12]: test librocblas5-tests: -----------------------] + autopkgtest [18:26:13]: test librocblas5-tests: - - - - - - - - - - results - - - - - - - - - - + librocblas5-tests PASS + autopkgtest [18:26:14]: @@@@@@@@@@@@@@@@@@@@ summary + librocblas5-tests PASS + 2026-05-02 18:26:18 - Autopkg tests ended for rocblas. + Tests took: 1h 51m 59s. - 1. Applications relying on stream-order allocation being the default - (low risk, correctness neutral): Any application that depended on the - ROCm 7.1.0 behaviour where rocBLAS_managed implicitly used - hipMallocAsync may now observe different memory allocation timing. - In practice this only matters for HIP graph capture, which the library - now handles internally. Symptom: none expected for well-behaved apps; - a graph capture that manually assumed stream-order alloc was active - without the env var may need updating. - 2. rocblas_set_device_memory_size now triggers allocation (low risk): - Applications that called this function expecting a no-op will now - trigger a hipMalloc. Symptom: slightly higher memory usage at handle - creation if the application calls rocblas_set_device_memory_size with - a non-zero size before it is needed. - 3. rocblas_is_user_managing_device_memory returning true where it - previously always returned false (low risk): Any application that - worked around the broken return value by never checking it will be - unaffected. Applications that did check it and coded logic around - "always false" may behave differently. Symptom: unexpected branch - taken in application code that queries memory ownership. - 4. GEMV precision fix kernel path change (very low risk): The explicit - Tex() cast changes the instruction sequence in the GEMV transposed - kernel. On architectures other than gfx90a/gfx942 the cast is a - no-op so no behaviour change is expected. Symptom: none expected; - a pre_checkin test failure would be the indicator. - [ Other Info ] + [ Where problems could occur ] - * No ABI/API breakage: the debian/librocblas5.symbols file is identical - between 7.1.0 and 7.1.1. No symbols were added, removed, or changed. - The SONAME remains librocblas.so.5. - * Minor: example_solver_rocblas.cpp copyright year reverted 2025→2024 as - a cherry-pick artefact; no functional impact. - * Upstream comparison (rocBLAS changes): - https://github.com/ROCm/rocblas/compare/rocm-7.1.0...rocm-7.1.1 - * Tensile: no changes between rocm-7.1.0 and rocm-7.1.1. - https://github.com/ROCm/Tensile/compare/rocm-7.1.0...rocm-7.1.1 + 1. Applications relying on stream-order allocation being the default + (low risk, correctness neutral): Any application that depended on the + ROCm 7.1.0 behaviour where rocBLAS_managed implicitly used + hipMallocAsync may now observe different memory allocation timing. + In practice this only matters for HIP graph capture, which the library + now handles internally. Symptom: none expected for well-behaved apps; + a graph capture that manually assumed stream-order alloc was active + without the env var may need updating. + 2. rocblas_set_device_memory_size now triggers allocation (low risk): + Applications that called this function expecting a no-op will now + trigger a hipMalloc. Symptom: slightly higher memory usage at handle + creation if the application calls rocblas_set_device_memory_size with + a non-zero size before it is needed. + 3. rocblas_is_user_managing_device_memory returning true where it + previously always returned false (low risk): Any application that + worked around the broken return value by never checking it will be + unaffected. Applications that did check it and coded logic around + "always false" may behave differently. Symptom: unexpected branch + taken in application code that queries memory ownership. + 4. GEMV precision fix kernel path change (very low risk): The explicit + Tex() cast changes the instruction sequence in the GEMV transposed + kernel. On architectures other than gfx90a/gfx942 the cast is a + no-op so no behaviour change is expected. Symptom: none expected; + a pre_checkin test failure would be the indicator. + + [ Other Info ] + + * No ABI/API breakage: the debian/librocblas5.symbols file is identical + between 7.1.0 and 7.1.1. No symbols were added, removed, or changed. + The SONAME remains librocblas.so.5. + * Minor: example_solver_rocblas.cpp copyright year reverted 2025→2024 as + a cherry-pick artefact; no functional impact. + * Upstream comparison (rocBLAS changes): + https://github.com/ROCm/rocblas/compare/rocm-7.1.0...rocm-7.1.1 + * Tensile: no changes between rocm-7.1.0 and rocm-7.1.1. + https://github.com/ROCm/Tensile/compare/rocm-7.1.0...rocm-7.1.1 -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/2150579 Title: SRU: New upstream version 7.1.1 To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/rocblas/+bug/2150579/+subscriptions -- ubuntu-bugs mailing list [email protected] https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
