Hello everyone,
In my previous email
(https://lists.debian.org/debian-python/2025/11/msg00002.html), I
introduced my work on bringing free-threaded Python to Debian. I'm now
pleased to announce that the integration of the nogil build alongside the
GIL build has been completed and has passed the necessary tests.
Benefits of Introducing Free-Threaded Python:
1. Significant improvement in multi-threaded parallel performance (a quick
runtime check is sketched right after this list)
2. No need to migrate to multiprocessing or alternative languages -
currently multiprocessing is required to utilize multi-core performance,
but it comes with substantial overhead
3. Reduced maintenance burden for C/C++ extensions - currently complex
GIL management or custom thread pools are needed to work around the GIL
4. Alignment with the future direction - PEP 703 has been accepted, and
upstream has signaled that the free-threaded build is intended to become
the default in the long run
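As a quick runtime check (my own illustration, not part of the packaging
changes), the snippet below reports whether an interpreter was built
free-threaded and whether the GIL is actually disabled at the moment;
'sysconfig.get_config_var("Py_GIL_DISABLED")' and 'sys._is_gil_enabled()'
are the upstream-documented hooks, and the guard keeps it runnable on
older interpreters as well:

import sys
import sysconfig

# 1 on free-threaded (nogil) builds, 0 or None on regular builds
print("free-threaded build:",
      bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# Available since 3.13; reports the current GIL state at runtime
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())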
Current Progress:
A merge request has been opened on salsa:
https://salsa.debian.org/cpython-team/python3/-/merge_requests/41
An immediately testable package repository is available at:
https://salsa.debian.org/ben0i0d/python3t-repo
The related Debian bug report is #1117718.
Design Decisions:
Based on discussions with Stefano, the current approach:
- Maintains isolated standard libraries
- Does NOT isolate dist-packages (see the snippet below for how this looks
from the interpreter side)
- Is not a separate new package, but rather a variant of python3
My guiding principle remains: introducing free-threaded Python should
not create new problems, break existing functionality, or hinder
anyone's work.
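To make the first two decisions concrete from the interpreter side, here
is a small sketch of mine (not part of the packaging itself): each variant
reports its own standard-library location, while the dist-packages
location is the one that is not duplicated. The exact paths depend on
Debian's sysconfig schemes, so treat the printed values as informational
only.

import sysconfig

# Where this interpreter looks for its standard library (per variant)
print("stdlib :", sysconfig.get_path("stdlib"))

# Where pure-Python packages are installed for this interpreter
print("purelib:", sysconfig.get_path("purelib"))

Running it under python3.14 and python3.14t shows which locations differ
between the two variants.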
Technical Implementation Details:
Key changes in python3.14:
1. Introduced ABI variant control: 'ABI_VARIANTS := gil nogil', with
'--disable-gil --with-suffix=t' enabled for nogil
(Note: '--enable-experimental-jit' cannot be used with '--disable-gil')
2. Build system generalization: the build rules are instantiated once per
variant via '$(foreach abi,$(ABI_VARIANTS),$(eval $(call BUILD_STATIC,$(abi))))'
to avoid duplicated rule text
3. Test adjustments: Added 'TEST_EXCLUDES += test_tools' since
'Tools/freeze/test/Makefile' uses a hardcoded 'python'
4. venv fix: Added 'add-abiflags-sitepackages.diff' to fix venv
site-packages recognition after the nogil build; this issue has been fixed
upstream, and the patch will be removed after the 3.14.1 release (a small
verification sketch follows this list)
5. Minimal installation: The nogil version excludes documentation, idle,
tk, desktop, binfmt, etc., so it remains an extension of the original
python3 rather than a duplicate of it
6. Code readability: Made harmless textual adjustments to the rules file,
such as consolidating scattered 'TEST_EXCLUDES +=' statements
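As a quick way to verify the venv behavior that item 4 patches (a sketch
of mine, using only standard-library calls): create a venv with the nogil
interpreter, e.g. 'python3.14t -m venv nogil', then run the following
inside it and check whether the reported paths reflect the 't' ABI suffix,
which is what the patch addresses.

import sys
import site
import sysconfig

print("executable:", sys.executable)
print("abiflags  :", sys.abiflags)   # includes 't' on the nogil build
print("purelib   :", sysconfig.get_path("purelib"))
print("site dirs :", site.getsitepackages())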
Build Process:
git clone [email protected]:ben0i0d/python3t.git
cd python3t
git checkout python3
uscan --download-current-version --verbose
dpkg-source -b .
sudo env DEB_BUILD_OPTIONS="nocheck nobench" pbuilder build
../python3.14_3.14.0-5.dsc 2>&1 | tee ../log.txt
Test Environment:
- OS: Debian GNU/Linux forky/sid (forky) x86_64
- CPU: AMD Ryzen 5 9600X (12) @ 5.68 GHz
- Memory: 30.47 GiB
- Kernel: 6.16.12+deb14+1-amd64
Test Results:
1. Basic Performance Test
# GIL version, single-threaded
python3.14 benchmark.py --n 512 --threads 1
Elapsed: 1.836 s
# GIL version, 8 threads
python3.14 benchmark.py --n 512 --threads 8
Elapsed: 2.026 s
# nogil version, single-threaded
python3.14t benchmark.py --n 512 --threads 1
Elapsed: 2.408 s
# nogil version, 8 threads
python3.14t benchmark.py --n 512 --threads 8
Elapsed: 0.674 s
(With 8 threads, the nogil build is roughly 3.6x faster than its own
single-threaded run, while the GIL build gains nothing from extra threads.)
2. NumPy Compatibility Test
Both versions successfully create virtual environments and install numpy:
GIL environment:
python3.14 -m venv gil
source gil/bin/activate
pip install numpy
# The attached NumPy benchmark runs normally, Elapsed: 0.003 s
nogil environment:
python3.14t -m venv nogil
source nogil/bin/activate
pip install numpy
# The attached NumPy benchmark runs normally, Elapsed: 0.009 s
Important Notes:
- The GIL version's build results are identical to those of the current
master branch
- Building only the GIL version is supported, but building only nogil is
not (this ensures nogil does not become the default)
- No backport to python3.13 is planned, since nogil support there is still
experimental
Regarding dist-packages:
I understand some colleagues have concerns about not isolating
dist-packages. Currently, I recommend users employ the nogil version
within venv environments. I will fully support subsequent migration
efforts to help more people transition smoothly.
The current implementation requires more testing and refinement. I'm not
a CPython expert, so I genuinely welcome valuable suggestions from the
community and commit to actively participating in improvement efforts.
Please refer to the attachments for detailed build logs and test scripts.
I look forward to your feedback!
Best regards,
Xu Chen (ben0i0d)
"""
Multi-threaded matrix multiplication benchmark (pure Python loops).
This script measures how much parallel speedup multi-threading provides in a
CPU-bound workload, which makes it useful for comparing GIL and no-GIL
Python interpreters.
Example:
python3.14 benchmark.py --n 512 --threads 1
python3.14 benchmark.py --n 512 --threads 8
python3.14t benchmark.py --n 512 --threads 1
python3.14t benchmark.py --n 512 --threads 8
"""
import time
import random
import argparse
from concurrent.futures import ThreadPoolExecutor
def matmul(A, B):
n = len(A)
m = len(B[0])
p = len(B)
# Transpose B for better cache locality
Bt = list(zip(*B))
C = [[0.0] * m for _ in range(n)]
t0 = time.perf_counter()
for i in range(n):
row_A = A[i]
row_C = C[i]
for j in range(m):
col_B = Bt[j]
s = 0.0
# Use local variables in innermost loop
for k in range(p):
s += row_A[k] * col_B[k]
row_C[j] = s
t1 = time.perf_counter()
return C, t1 - t0
def matmul_range(A, Bt, start_row, end_row):
n = len(A)
m = len(Bt) # Bt is transposed, so number of columns in B is len(Bt)
p = len(Bt[0]) # This is the original number of rows in B, which is the same as the number of columns in A
C_part = [[0.0] * m for _ in range(end_row - start_row)]
for i_local, i_global in enumerate(range(start_row, end_row)):
row_A = A[i_global]
row_C = C_part[i_local]
for j in range(m):
col_B = Bt[j]
s = 0.0
for k in range(p):
s += row_A[k] * col_B[k]
row_C[j] = s
return start_row, C_part
def matmul_threaded(A, B, threads=1):
n = len(A)
if threads == 1:
return matmul(A, B)
# Transpose B for better cache locality
Bt = list(zip(*B))
m = len(Bt)
# Multi-threaded path
step = (n + threads - 1) // threads
futures = []
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=threads) as executor:
for i in range(threads):
start = i * step
end = min((i + 1) * step, n)
if start < end:
futures.append(executor.submit(matmul_range, A, Bt, start, end))
# Collect and combine results
C = []
for future in futures:
start_row, C_part = future.result()
C.extend(C_part)
t1 = time.perf_counter()
return C, t1 - t0
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Optimized matrix multiplication benchmark.")
parser.add_argument("--n", type=int, default=300, help="Matrix size (n x n)")
parser.add_argument("--threads", type=int, default=4, help="Number of threads")
args = parser.parse_args()
n = args.n
print(f"Matrix size: {n}x{n}, threads: {args.threads}")
# Generate matrices
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
if args.threads == 1:
_, elapsed = matmul(A, B)
else:
_, elapsed = matmul_threaded(A, B, threads=args.threads)
print(f"Elapsed: {elapsed:.3f} s")
"""
NumPy matrix multiplication benchmark.
This script provides a comparable version of the pure-Python benchmark,
but using NumPy for the core matmul operation.
"""
import time
import argparse
import numpy as np
def matmul_numpy(A, B):
t0 = time.perf_counter()
C = A @ B
t1 = time.perf_counter()
return C, t1 - t0
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="NumPy matrix multiplication benchmark.")
parser.add_argument("--n", type=int, default=300, help="Matrix size (n x n)")
args = parser.parse_args()
n = args.n
print(f"Matrix size: {n}x{n}")
# Generate matrices (keep behavior same as Python version)
A = np.random.rand(n, n).astype(float)
B = np.random.rand(n, n).astype(float)
_, elapsed = matmul_numpy(A, B)
print(f"Elapsed: {elapsed:.3f} s")