# Parameters.recycle() does not reset queryStringCharset, causing
cross-request charset leakage

## Summary

`org.apache.tomcat.util.http.Parameters#recycle()` resets the body
`charset` field to `DEFAULT_BODY_CHARSET`, but **does not reset the
`queryStringCharset` field**. When a `Request` object is returned to
Tomcat's per-connection recycle pool and later reused for a subsequent
request on the same thread/connection, the previously set
`queryStringCharset` persists. If an application (or a `Valve`) sets
`queryStringCharset` on a per-request basis only when a specific condition
is met, the next request that does *not* meet that condition will decode
its query string using the stale charset from the previous request,
producing mojibake.

This is a latent issue: applications that consistently set
`queryStringCharset` on every request are unaffected. Applications that set
it conditionally (a common pattern for legacy-URL compatibility valves) are
affected.

- **Component**: Catalina / Coyote
- **Affected class**: `org.apache.tomcat.util.http.Parameters`
- **Confirmed version**: 11.0.14 (`tomcat-embed-core`, bytecode-verified)
- **Also present on `main`**: verified against `apache/tomcat` `main`
branch (post-11.0.22) — `Parameters.recycle()` still omits
`queryStringCharset`:

https://github.com/apache/tomcat/blob/main/java/org/apache/tomcat/util/http/Parameters.java
- **Likely affected**: every 11.x release to date, and probably 10.1.x /
9.0.x as well (please confirm)
- **Severity**: Medium — intermittent, silent data corruption for non-ASCII
query parameters under multi-threaded load

---

## Bytecode evidence (tomcat-embed-core 11.0.14)

Disassembly of `Parameters.recycle()`:

```
public void recycle();
  Code:
     0: aload_0
     1: iconst_0
     2: putfield #49 // Field parameterCount:I
     5: aload_0
     6: getfield #10 // Field paramHashValues:Ljava/util/Map;
     9: invokeinterface ... // Map.clear()
    14: aload_0
    15: iconst_0
    16: putfield #16 // Field didQueryParameters:Z
    19: aload_0
    20: getstatic #71 // Field DEFAULT_BODY_CHARSET
    23: putfield #36 // Field charset:Ljava/nio/charset/Charset;
    26: aload_0
    27: getfield #26 // Field decodedQuery
    30: invokevirtual ... // MessageBytes.recycle()
    33: return
```

Only the body `charset` field (offset 19–23) is reset. The
`queryStringCharset` field is never assigned in this method, so its value
from the previous request lifecycle is preserved.

For comparison, the body `charset` is correctly reset to
`DEFAULT_BODY_CHARSET` (`ISO-8859-1`). The expected fix is the symmetric
reset of `queryStringCharset` to its own default (`UTF-8`).

---

## How it manifests in practice

A common application pattern is a `Valve` that detects legacy
percent-encoded URLs (e.g. EUC-KR, Shift_JIS, GBK, or other non-UTF-8
escapes embedded in historical bookmarks) and switches the parameter
decoder accordingly:

```java
// Simplified illustrative pattern — sets charset only for legacy URLs
if (queryString != null && looksLikeLegacyEncoding(queryString)) {
    request.getCoyoteRequest().getParameters()
        .setQueryStringCharset(StandardCharsets.ISO_8859_1);
}
// else: no setQueryStringCharset call → relies on Tomcat default
```

The author's mental model is "if I don't call `setQueryStringCharset`,
Tomcat uses its default (UTF-8) for this request." With the recycle bug,
that assumption holds only for the *first* request a `Request` instance
handles. After the legacy branch fires once, the `ISO-8859-1` charset leaks
into every subsequent request served by the same recycled `Request` object
until the JVM restarts (or until something else overwrites it).

### Concrete leakage sequence

```
Thread T1, Request object R (newly allocated)
  Request #1: /search?q=%C7%D1%B3%AA (legacy-encoded "한나" in EUC-KR)
    → valve detects legacy → setQueryStringCharset(ISO_8859_1)
    → request completes
    → Parameters.recycle() called
       - charset → reset to DEFAULT_BODY_CHARSET ✓
       - queryStringCharset → NOT reset ✗ (still ISO_8859_1)

  Request #2: /search?q=%EC%98%AC%EB%A6%AC%EA%B3%A0 (UTF-8 "올리고")
    → valve sees valid UTF-8 → does NOT call setQueryStringCharset
    → Tomcat decodes %EC%98%AC%EB%A6%AC%EA%B3%A0 using ISO_8859_1
    → application receives "ì¬ë¦¬ê³ " instead of "올리고"
```

### Intermittency under load balancing

Behind a load balancer with N instances and M worker threads per instance,
the leakage propagates only when two requests land on the **same instance**
*and* the **same connection/recycled Request object**. With two instances
and round-robin connection assignment, this reproduces at roughly 50 %
observed rate in our pre-production environment. On a single-instance local
setup, reproduction is 100 % deterministic when issuing legacy and UTF-8
requests in alternation.

---

## Minimal reproduction (independent Spring Boot project)

The following is a self-contained Spring Boot 4.0.0 project that reproduces
the bug without any application-specific code.

### `build.gradle`

```groovy
plugins {
    id 'java'
    id 'org.springframework.boot' version '4.0.0'
    id 'io.spring.dependency-management' version '1.1.6'
}

group = 'com.example'
version = '0.0.1'
java { toolchain { languageVersion = JavaLanguageVersion.of(21) } }

repositories { mavenCentral() }

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-web'
}
```

### `src/main/java/com/example/recyclebug/RecycleBugApplication.java`

```java
package com.example.recyclebug;

import org.apache.catalina.connector.Request;
import org.apache.catalina.valves.ValveBase;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import
org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import jakarta.servlet.ServletException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

@SpringBootApplication
public class RecycleBugApplication {

    public static void main(String[] args) {
        SpringApplication.run(RecycleBugApplication.class, args);
    }

    @Bean
    WebServerFactoryCustomizer<TomcatServletWebServerFactory>
registerValve() {
        return factory -> factory.addContextValves(new
LegacyCharsetValve());
    }

    @RestController
    static class EchoController {
        @GetMapping("/echo")
        public String echo(@RequestParam("q") String q) {
            return "q=" + q;
        }
    }

    /**
     * Reproduces the application-side pattern: conditionally switches the
     * query-string decoder to ISO_8859_1 for legacy-encoded URLs, and
otherwise
     * leaves it untouched (expecting Tomcat's UTF-8 default to apply).
     */
    static class LegacyCharsetValve extends ValveBase {
        // Heuristic: any %XX byte that cannot be the leading byte of a
UTF-8 sequence
        private static final Pattern LEGACY =
Pattern.compile("%[89A-F][0-9A-F]", Pattern.CASE_INSENSITIVE);

        @Override
        public void invoke(Request request,
org.apache.catalina.connector.Response response)
                throws IOException, ServletException {
            String qs = request.getQueryString();
            if (qs != null && looksLegacy(qs)) {
                request.getCoyoteRequest().getParameters()
                       .setQueryStringCharset(StandardCharsets.ISO_8859_1);
            }
            // else: deliberately do nothing — relying on Tomcat default
per request.
            getNext().invoke(request, response);
        }

        private static boolean looksLegacy(String qs) {
            // Crude detector: legacy percent-encoded byte that is invalid
as a
            // UTF-8 leading or continuation byte standalone. Not
exhaustive —
            // sufficient for triggering the bug deterministically.
            return LEGACY.matcher(qs).find() && !isValidUtf8Percent(qs);
        }

        private static boolean isValidUtf8Percent(String qs) {
            try {
                java.net.URLDecoder.decode(qs, StandardCharsets.UTF_8);
                return true;
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
    }
}
```

### Reproduction steps

```bash
./gradlew bootRun
```

In another shell:

```bash
# Legacy EUC-KR percent-encoded "한나"
LEGACY='http://localhost:8080/echo?q=%C7%D1%B3%AA'
# UTF-8 percent-encoded "올리고"
UTF8='http://localhost:8080/echo?q=%EC%98%AC%EB%A6%AC%EA%B3%A0'

# 1) Baseline: UTF-8 request on a fresh connection
curl -s "$UTF8"
# → q=올리고

# 2) Trigger the legacy branch
curl -s "$LEGACY" >/dev/null

# 3) UTF-8 request again, reusing the same connection
curl -s "$UTF8"
# Expected: q=올리고
# Actual: q=ì¬ë¦¬ê³ ← bug: queryStringCharset leaked from request 2 to
request 3
```

To eliminate per-request connection churn, use HTTP keep-alive explicitly:

```bash
curl -s "$LEGACY" "$UTF8" "$UTF8" "$UTF8"
# Requests 2–4 all return mojibake.
```

On a single-instance local server, the bug reproduces 100 % of the time
once the legacy branch has fired on a given connection.

---

## Suggested fix

Symmetric reset of `queryStringCharset` in `Parameters.recycle()`:

```java
public void recycle() {
    parameterCount = 0;
    paramHashValues.clear();
    didQueryParameters = false;
    charset = DEFAULT_BODY_CHARSET;
    queryStringCharset = DEFAULT_URI_CHARSET; // ← add this line (UTF-8)
    decodedQuery.recycle();
}
```

(`DEFAULT_URI_CHARSET` is the existing static used to initialise
`queryStringCharset` at field declaration time, so the post-recycle state
matches the post-construction state — which is the contract implied by
`recycle()`.)

### Application-side workaround (for users on affected versions)

Set `queryStringCharset` explicitly on **every** request rather than only
on the conditional branch:

```java
if (qs != null && looksLegacy(qs)) {
    params.setQueryStringCharset(StandardCharsets.ISO_8859_1);
} else {
    params.setQueryStringCharset(StandardCharsets.UTF_8); // defensive
}
```

This restores the per-request invariant the application author originally
assumed Tomcat was providing.

---

## Why this is hard to notice

1. **Single-developer-machine testing rarely triggers it** — most
developers test one URL at a time on a fresh JVM.
2. **Unit tests of the valve in isolation pass** — the `Request` object is
freshly constructed per test, so no recycled state exists.
3. **The mojibake is silent** — there is no exception; the application just
stores or echoes garbled text.
4. **Load balancing dilutes reproduction rate** to roughly `1/N` per
identical request, which is easily dismissed as a transient client-side
encoding issue.

We spent non-trivial time tracing this through application code, filters,
and `CharacterEncodingFilter` configuration before bytecode inspection of
`recycle()` revealed the asymmetry between `charset` and
`queryStringCharset`.

---

## Environment

- Tomcat: `tomcat-embed-core` 11.0.14
- JDK: 21
- Spring Boot: 4.0.0 (embedded Tomcat)
- OS: Linux x86_64 / macOS arm64 (both reproduce)

Happy to provide additional traces or test on other Tomcat versions if
useful.

Reply via email to