Re: Optimize LISTEN/NOTIFY

Joel Jacobson Thu, 24 Jul 2025 14:04:51 -0700

On Wed, Jul 23, 2025, at 04:44, Thomas Munro wrote:
> On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <j...@compiler.org> wrote:
>> In their patch, in asyn.c's SignalBackends(), they do
>> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
>> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
>> seem to check if the backend is already signalled or not, but maybe
>> SendInterrupt() has signal coalescing built-in so it would be a noop
>> with almost no cost?
>
> Yeah:
>
> + old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, 
> interruptMask);
> +
> + /*
> + * If the process is currently blocked waiting for an interrupt to arrive,
> + * and the interrupt wasn't already pending, wake it up.
> + */
> + if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) ==
> SLEEPING_ON_INTERRUPTS)
> +     WakeupOtherProc(proc);


Thanks for confirming the coalescing logic in SendInterrupt. That's a
great low-level optimization. It's clear we're both targeting the same
problem of redundant wake-ups under contention, but approaching it from
different architectural levels.

The core difference, as I see it, is *where* the state management
resides. The "Interrupts vs signals" patch set creates a unified
machinery where the 'pending' state for all subsystems is combined into
a single atomic bitmask. This is a valid approach.

However, I've been exploring an alternative pattern that decouples the
state management from the signaling machinery, allowing each subsystem
to manage its own state independently. I believe this leads to a
simpler, more modular migration path. I've developed a two-patch series
for `async.c` to demonstrate this concept.

1. The first patch introduces a lock-free, atomic finite state machine
   (FSM) entirely within async.c. By using a subsystem-specific atomic
   integer and CAS operations, async.c can now robustly manage its own
   listener states (IDLE, SIGNALLED, PROCESSING). This solves the
   redundant signal problem at the source, as notifiers can now observe
   a listener's state and refrain from sending a wakeup if one is
   already pending.

2. The second patch demonstrates that once state is managed locally, the
   wakeup mechanism becomes trivial.** The expensive `SendProcSignal`
   call is replaced with a direct `SetLatch`. This leverages the
   existing, highly-optimized `WaitEventSet` infrastructure as a simple,
   efficient "poke."

This suggests a powerful, incremental migration pattern: first, fix a
subsystem's state management internally; second, replace its wakeup
mechanism. This vertical, module-by-module approach seems complementary
to the horizontal, layer-by-layer refactoring in the "Interrupts vs
signals" thread.

I'll post a more detailed follow-up in that thread to discuss the
broader architectural implications. Attached are the two patches,
reframed to better illustrate this two-step pattern.

/Joel

#!/bin/bash

# Configuration for the PostgreSQL instances using absolute paths.
# This script does NOT modify the shell's PATH variable.

# --- Master Config ---
MASTER_NAME="master"
MASTER_PORT=5432
MASTER_BIN_PATH="$HOME/pg-master/bin"
MASTER_DATA="$HOME/pg-master-data"
MASTER_LOG="/tmp/pg-master.log"

# --- Patch v1 Config ---
PATCH_NAME="patch-v1"
PATCH_PORT=5432
PATCH_BIN_PATH="$HOME/pg-patch-v1/bin"
PATCH_DATA="$HOME/pg-patch-v1-data"
PATCH_LOG="/tmp/pg-patch-v1.log"

# --- Patch v2 Config ---
PATCH_V2_NAME="patch-v2"
PATCH_V2_PORT=5432
PATCH_V2_BIN_PATH="$HOME/pg-patch-v2/bin"
PATCH_V2_DATA="$HOME/pg-patch-v2-data"
PATCH_V2_LOG="/tmp/pg-patch-v2.log"

# Benchmark settings
CHANNEL_NAME="mychannel"
CONNECTIONS=(64 128)
DURATION=10 # Benchmark duration in seconds for each run
MEASUREMENTS=3 # Number of measurements per configuration

# CSV output file
CSV_OUTPUT="benchmark_results.csv"

# Temporary files
PGBENCH_SCRIPT=$(mktemp)

# --- Cleanup Function ---
# Ensures that servers are stopped and temp files are removed on script exit.
cleanup() {
  echo ""
  echo "Cleaning up..."
  # Ensure all servers are stopped, silencing errors if they are not running.
  # Use absolute paths and explicit data directories.
  "$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -m fast stop &> /dev/null
  "$PATCH_BIN_PATH/pg_ctl" -D "$PATCH_DATA" -m fast stop &> /dev/null
  "$PATCH_V2_BIN_PATH/pg_ctl" -D "$PATCH_V2_DATA" -m fast stop &> /dev/null
  rm -f "$PGBENCH_SCRIPT"
  echo "Cleanup complete."
}

# Trap the script's exit (normal or interrupted) to run the cleanup function
trap cleanup EXIT

# Initialize CSV file with headers
# echo "version,connections,jobs,tps,run" > "$CSV_OUTPUT"

# --- Benchmark Function ---
# A generic function to run the benchmark for a given configuration.
# It starts, benchmarks, and then stops the specified server instance.
run_benchmark() {
  local name=$1
  local port=$2
  local bin_path=$3
  local data_path=$4
  local log_file=$5

  echo "--- Starting benchmark for: $name ---"

  # Set PGPORT for client tools (pgbench, psql) for this run
  export PGPORT=$port

  # 1. Start the server using absolute path and explicit data directory
  echo "Starting $name server on port $port..."
  "$bin_path/pg_ctl" -D "$data_path" -l "$log_file" -o "-p $port" start
  sleep 2 # Give server a moment to become available

  # Create the pgbench script content
  cat > "$PGBENCH_SCRIPT" << EOF
NOTIFY ${CHANNEL_NAME};
EOF

  # 2. Start the listener in the background for this server
  (echo "LISTEN ${CHANNEL_NAME};"; sleep 100000) | "$bin_path/psql" -d postgres 
&> /dev/null &
  local listener_pid=$!
  
  # 3. Run the benchmark loop
  echo "Running pgbench for connection counts: ${CONNECTIONS[*]}"
  for c in "${CONNECTIONS[@]}"; do
    echo "  Testing with $c connections ($MEASUREMENTS measurements per run)..."
    # Run multiple measurements for each connection count
    for m in $(seq 1 $MEASUREMENTS); do
      # Run pgbench and extract TPS value
      tps=$("$bin_path/pgbench" -d postgres -f "$PGBENCH_SCRIPT" -c "$c" -j 
"$c" -T "$DURATION" -n \
        | grep -E '^tps' \
        | awk '{printf "%.0f", $3}')
      
      # Write to CSV: version,connections,jobs,tps,run
      echo "$name,$c,$c,$tps,$m" >> "$CSV_OUTPUT"
    done
  done
  
  # 4. Stop the listener and the server
  kill "$listener_pid" &> /dev/null
  echo "Stopping $name server..."
  "$bin_path/pg_ctl" -D "$data_path" -m fast stop &> /dev/null
  echo "--- Benchmark for $name complete ---"
  echo ""
}

# --- Main Execution ---

# 1. Run benchmark for master
# run_benchmark "$MASTER_NAME" "$MASTER_PORT" "$MASTER_BIN_PATH" "$MASTER_DATA" 
"$MASTER_LOG"

# 2. Run benchmark for patch-v1
# run_benchmark "$PATCH_NAME" "$PATCH_PORT" "$PATCH_BIN_PATH" "$PATCH_DATA" 
"$PATCH_LOG"

# 3. Run benchmark for patch-v2
run_benchmark "$PATCH_V2_NAME" "$PATCH_V2_PORT" "$PATCH_V2_BIN_PATH" 
"$PATCH_V2_DATA" "$PATCH_V2_LOG"

# 4. Generate report using PostgreSQL
echo ""
echo "# BENCHMARK"
echo ""
echo "## TPS"

# Start the master server to run the analysis
export PGPORT=$MASTER_PORT
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -l "$MASTER_LOG" -o "-p 
$MASTER_PORT" start &> /dev/null
sleep 2

# Create analysis database and load data
"$MASTER_BIN_PATH/psql" -d postgres -q << EOF
-- Create a temporary database for analysis
DROP DATABASE IF EXISTS bench_analysis;
CREATE DATABASE bench_analysis;
\c bench_analysis

-- Create table for benchmark results
CREATE TABLE benchmark_results (
    version TEXT,
    connections INT,
    jobs INT,
    tps NUMERIC,
    run INT
);

-- Load CSV data
\COPY benchmark_results FROM '$CSV_OUTPUT' CSV HEADER

-- Generate comparison report
WITH avg_results AS (
    SELECT 
        version,
        connections,
        AVG(tps) AS avg_tps,
        STDDEV(tps) AS stddev_tps,
        COUNT(*) AS runs
    FROM benchmark_results
    GROUP BY version, connections
),
comparison AS (
    SELECT 
        m.connections,
        m.avg_tps AS master_tps,
        p1.avg_tps AS patch_v1_tps,
        p2.avg_tps AS patch_v2_tps,
        CASE 
            WHEN m.avg_tps > 0 THEN ((p1.avg_tps - m.avg_tps) / m.avg_tps * 100)
            ELSE 0
        END AS relative_diff_patch_v1_pct,
        CASE 
            WHEN m.avg_tps > 0 THEN ((p2.avg_tps - m.avg_tps) / m.avg_tps * 100)
            ELSE 0
        END AS relative_diff_patch_v2_pct
    FROM avg_results m
    JOIN avg_results p1 ON m.connections = p1.connections
    JOIN avg_results p2 ON m.connections = p2.connections
    WHERE m.version = 'master' AND p1.version = 'patch-v1' AND p2.version = 
'patch-v2'
    ORDER BY m.connections
)

SELECT 
    connections AS "N backends",
    ROUND(master_tps) AS "master",
    ROUND(patch_v1_tps) AS "patch-v1",
    ROUND(patch_v2_tps) AS "patch-v2"
FROM comparison
ORDER BY connections;
EOF

echo ""
echo "## TPS speed-up vs master"

"$MASTER_BIN_PATH/psql" -d bench_analysis -q << EOF
SELECT 
    connections AS "N backends",
    CASE WHEN relative_diff_patch_v1_pct >= 0 THEN '+' ELSE '' END || 
    ROUND(relative_diff_patch_v1_pct) || '%' AS "patch-v1",
    CASE WHEN relative_diff_patch_v2_pct >= 0 THEN '+' ELSE '' END || 
    ROUND(relative_diff_patch_v2_pct) || '%' AS "patch-v2"
FROM (
    WITH avg_results AS (
        SELECT 
            version,
            connections,
            AVG(tps) AS avg_tps
        FROM benchmark_results
        GROUP BY version, connections
    )
    SELECT 
        m.connections,
        CASE 
            WHEN m.avg_tps > 0 THEN ((p1.avg_tps - m.avg_tps) / m.avg_tps * 100)
            ELSE 0
        END AS relative_diff_patch_v1_pct,
        CASE 
            WHEN m.avg_tps > 0 THEN ((p2.avg_tps - m.avg_tps) / m.avg_tps * 100)
            ELSE 0
        END AS relative_diff_patch_v2_pct
    FROM avg_results m
    JOIN avg_results p1 ON m.connections = p1.connections
    JOIN avg_results p2 ON m.connections = p2.connections
    WHERE m.version = 'master' AND p1.version = 'patch-v1' AND p2.version = 
'patch-v2'
) AS comparison
ORDER BY connections;
EOF

# Stop the server
"$MASTER_BIN_PATH/pg_ctl" -D "$MASTER_DATA" -m fast stop &> /dev/null

echo ""
echo "CSV results saved to: $CSV_OUTPUT"

# BENCHMARK

A single backend does `LISTEN mychannel;` and stays idle,
then pgbench is run 3 times for each <N backends>.

script.sql: NOTIFY mychannel;

% pgbench" -f script.sql -c <N backends> -j <N backends> -T 10 -n

## TPS

 N backends | master | patch-v1 | patch-v2
------------+--------+----------+----------
          1 | 117343 |   151422 |   150735
          2 | 158427 |   236705 |   239004
          4 | 177454 |   250783 |   250782
          8 | 116521 |   155466 |   180418
         16 |  45627 |   144740 |   163491
         32 |  37281 |   135602 |   146659
         64 |  36608 |   123870 |   131202
        128 |  34798 |   120302 |   119041
(8 rows)


## TPS speed-up vs master

 N backends | patch-v1 | patch-v2
------------+----------+----------
          1 | +29%     | +28%
          2 | +49%     | +51%
          4 | +41%     | +41%
          8 | +33%     | +55%
         16 | +217%    | +258%
         32 | +264%    | +293%
         64 | +238%    | +258%
        128 | +246%    | +242%
(8 rows)

0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
Description: Binary data

0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch
Description: Binary data

Re: Optimize LISTEN/NOTIFY

Reply via email to