Hello everyone,
In my previous email
(https://lists.debian.org/debian-python/2025/11/msg00002.html), I
introduced my work on bringing free-threaded Python to Debian. I'm now
pleased to announce that the integration of the nogil build alongside the
GIL build has been completed and has passed the necessary tests.
Benefits of Introducing Free-Threaded Python:
1. Significant improvement in multi-threaded parallel performance (a quick
runtime check is sketched right after this list)
2. No need to migrate to multiprocessing or alternative languages -
currently multiprocessing is required to utilize multi-core performance,
but it comes with substantial overhead
3. Reduced maintenance burden for C/C++ extensions - currently complex
GIL management or custom thread pools are needed to work around the GIL
4. Alignment with the future direction - PEP 703 has been accepted, and
upstream has signaled that the free-threaded build is intended to become
the default in the long run
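As a quick runtime check (my own illustration, not part of the packaging
changes), the snippet below reports whether an interpreter was built
free-threaded and whether the GIL is actually disabled at the moment;
'sysconfig.get_config_var("Py_GIL_DISABLED")' and 'sys._is_gil_enabled()'
are the upstream-documented hooks, and the guard keeps it runnable on
older interpreters as well:

import sys
import sysconfig

# 1 on free-threaded (nogil) builds, 0 or None on regular builds
print("free-threaded build:",
      bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# Available since 3.13; reports the current GIL state at runtime
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())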
Current Progress:
A merge request has been opened on salsa:
https://salsa.debian.org/cpython-team/python3/-/merge_requests/41
An immediately testable package repository is available at:
https://salsa.debian.org/ben0i0d/python3t-repo
The related Debian bug report is #1117718.
Design Decisions:
Based on discussions with Stefano, the current approach:
- Maintains isolated standard libraries
- Does NOT isolate dist-packages (see the snippet below for how this looks
from the interpreter side)
- Is not a separate new package, but rather a variant of python3
My guiding principle remains: introducing free-threaded Python should
not create new problems, break existing functionality, or hinder
anyone's work.
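To make the first two decisions concrete from the interpreter side, here
is a small sketch of mine (not part of the packaging itself): each variant
reports its own standard-library location, while the dist-packages
location is the one that is not duplicated. The exact paths depend on
Debian's sysconfig schemes, so treat the printed values as informational
only.

import sysconfig

# Where this interpreter looks for its standard library (per variant)
print("stdlib :", sysconfig.get_path("stdlib"))

# Where pure-Python packages are installed for this interpreter
print("purelib:", sysconfig.get_path("purelib"))

Running it under python3.14 and python3.14t shows which locations differ
between the two variants.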
Technical Implementation Details:
Key changes in python3.14:
1. Introduced ABI variant control: 'ABI_VARIANTS := gil nogil', with
'--disable-gil --with-suffix=t' enabled for nogil
(Note: '--enable-experimental-jit' cannot be used with '--disable-gil')
2. Build system generalization: the build rules are instantiated once per
variant via '$(foreach abi,$(ABI_VARIANTS),$(eval $(call BUILD_STATIC,$(abi))))'
to avoid duplicated rule text
3. Test adjustments: Added 'TEST_EXCLUDES += test_tools' since
'Tools/freeze/test/Makefile' uses a hardcoded 'python'
4. venv fix: Added 'add-abiflags-sitepackages.diff' to fix venv
site-packages recognition after the nogil build; this issue has been fixed
upstream, and the patch will be removed after the 3.14.1 release (a small
verification sketch follows this list)
5. Minimal installation: The nogil version excludes documentation, idle,
tk, desktop, binfmt, etc., so it remains an extension of the original
python3 rather than a duplicate of it
6. Code readability: Made harmless textual adjustments to the rules file,
such as consolidating scattered 'TEST_EXCLUDES +=' statements
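As a quick way to verify the venv behavior that item 4 patches (a sketch
of mine, using only standard-library calls): create a venv with the nogil
interpreter, e.g. 'python3.14t -m venv nogil', then run the following
inside it and check whether the reported paths reflect the 't' ABI suffix,
which is what the patch addresses.

import sys
import site
import sysconfig

print("executable:", sys.executable)
print("abiflags  :", sys.abiflags)   # includes 't' on the nogil build
print("purelib   :", sysconfig.get_path("purelib"))
print("site dirs :", site.getsitepackages())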
Build Process:
git clone [email protected]:ben0i0d/python3t.git
cd python3t
git checkout python3
uscan --download-current-version --verbose
dpkg-source -b .
sudo env DEB_BUILD_OPTIONS="nocheck nobench" pbuilder build
../python3.14_3.14.0-5.dsc 2>&1 | tee ../log.txt
Test Environment:
- OS: Debian GNU/Linux forky/sid (forky) x86_64
- CPU: AMD Ryzen 5 9600X (12) @ 5.68 GHz
- Memory: 30.47 GiB
- Kernel: 6.16.12+deb14+1-amd64
Test Results:
1. Basic Performance Test
# GIL version, single-threaded
python3.14 benchmark.py --n 512 --threads 1
Elapsed: 1.836 s
# GIL version, 8 threads
python3.14 benchmark.py --n 512 --threads 8
Elapsed: 2.026 s
# nogil version, single-threaded
python3.14t benchmark.py --n 512 --threads 1
Elapsed: 2.408 s
# nogil version, 8 threads
python3.14t benchmark.py --n 512 --threads 8
Elapsed: 0.674 s
(With 8 threads, the nogil build is roughly 3.6x faster than its own
single-threaded run, while the GIL build gains nothing from extra threads.)
2. NumPy Compatibility Test
Both versions successfully create virtual environments and install numpy:
GIL environment:
python3.14 -m venv gil
source gil/bin/activate
pip install numpy
# The attached NumPy benchmark runs normally, Elapsed: 0.003 s
nogil environment:
python3.14t -m venv nogil
source nogil/bin/activate
pip install numpy
# The attached NumPy benchmark runs normally, Elapsed: 0.009 s
Important Notes:
- The GIL version's build results are identical to those of the current
master branch
- Building only the GIL version is supported, but building only nogil is
not (this ensures nogil does not become the default)
- No backport to python3.13 is planned, since nogil support there is still
experimental
Regarding dist-packages:
I understand some colleagues have concerns about not isolating
dist-packages. Currently, I recommend users employ the nogil version
within venv environments. I will fully support subsequent migration
efforts to help more people transition smoothly.
The current implementation requires more testing and refinement. I'm not
a CPython expert, so I genuinely welcome valuable suggestions from the
community and commit to actively participating in improvement efforts.
Please refer to the attachments for detailed build logs and test scripts.
I look forward to your feedback!
Best regards,
Xu Chen (ben0i0d)
"""
Multi-threaded matrix multiplication benchmark (pure Python loops).
This script measures how much parallel speedup multi-threading provides in a
CPU-bound workload, which makes it useful for comparing GIL and no-GIL
Python interpreters.
Example:
python3.14 benchmark.py --n 512 --threads 1
python3.14 benchmark.py --n 512 --threads 8
python3.14t benchmark.py --n 512 --threads 1
python3.14t benchmark.py --n 512 --threads 8
"""
import time
import random
import argparse
from concurrent.futures import ThreadPoolExecutor
def matmul(A, B):
n = len(A)
m = len(B[0])
p = len(B)
# Transpose B for better cache locality
Bt = list(zip(*B))
C = [[0.0] * m for _ in range(n)]
t0 = time.perf_counter()
for i in range(n):
row_A = A[i]
row_C = C[i]
for j in range(m):
col_B = Bt[j]
s = 0.0
# Use local variables in innermost loop
for k in range(p):
s += row_A[k] * col_B[k]
row_C[j] = s
t1 = time.perf_counter()
return C, t1 - t0
def matmul_range(A, Bt, start_row, end_row):
n = len(A)
m = len(Bt) # Bt is transposed, so number of columns in B is len(Bt)
p = len(Bt[0]) # This is the original number of rows in B, which is the same as the number of columns in A
C_part = [[0.0] * m for _ in range(end_row - start_row)]
for i_local, i_global in enumerate(range(start_row, end_row)):
row_A = A[i_global]
row_C = C_part[i_local]
for j in range(m):
col_B = Bt[j]
s = 0.0
for k in range(p):
s += row_A[k] * col_B[k]
row_C[j] = s
return start_row, C_part
def matmul_threaded(A, B, threads=1):
n = len(A)
if threads == 1:
return matmul(A, B)
# Transpose B for better cache locality
Bt = list(zip(*B))
m = len(Bt)
# Multi-threaded path
step = (n + threads - 1) // threads
futures = []
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=threads) as executor:
for i in range(threads):
start = i * step
end = min((i + 1) * step, n)
if start < end:
futures.append(executor.submit(matmul_range, A, Bt, start, end))
# Collect and combine results
C = []
for future in futures:
start_row, C_part = future.result()
C.extend(C_part)
t1 = time.perf_counter()
return C, t1 - t0
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Optimized matrix multiplication benchmark.")
parser.add_argument("--n", type=int, default=300, help="Matrix size (n x n)")
parser.add_argument("--threads", type=int, default=4, help="Number of threads")
args = parser.parse_args()
n = args.n
print(f"Matrix size: {n}x{n}, threads: {args.threads}")
# Generate matrices
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
if args.threads == 1:
_, elapsed = matmul(A, B)
else:
_, elapsed = matmul_threaded(A, B, threads=args.threads)
print(f"Elapsed: {elapsed:.3f} s")
"""
NumPy matrix multiplication benchmark.
This script provides a comparable version of the pure-Python benchmark,
but using NumPy for the core matmul operation.
"""
import time
import argparse
import numpy as np
def matmul_numpy(A, B):
t0 = time.perf_counter()
C = A @ B
t1 = time.perf_counter()
return C, t1 - t0
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="NumPy matrix multiplication benchmark.")
parser.add_argument("--n", type=int, default=300, help="Matrix size (n x n)")
args = parser.parse_args()
n = args.n
print(f"Matrix size: {n}x{n}")
# Generate matrices (keep behavior same as Python version)
A = np.random.rand(n, n).astype(float)
B = np.random.rand(n, n).astype(float)
_, elapsed = matmul_numpy(A, B)
print(f"Elapsed: {elapsed:.3f} s")