chaokunyang commented on code in PR #3734:
URL: https://github.com/apache/fory/pull/3734#discussion_r3418720810
##########
THREAT_MODEL.md:
##########
@@ -0,0 +1,182 @@
+<!--
+SPDX-License-Identifier: Apache-2.0
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+  https://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Apache Fory — Threat Model (v0 draft)
+
+## §1 Header
+
+- **Project:** Apache Fory (`apache/fory`), `main`, against which this draft was written. Fory is a multi-language serialization framework (Java, C++, Python, Go, Rust, JavaScript, Kotlin, Scala, Swift, Dart, C#).
+- **Date:** 2026-06-02. **Status:** draft — for Apache Fory PMC review. **Author:** ASF Security team (drafted via the Scovetta threat-model rubric), for PMC ratification.
+- **Version binding:** versioned with the project; a report against Fory version *N* is triaged against the model as it stood at *N*, not at HEAD.
+- **Reporting cross-reference:** findings that violate a §8 property should be reported privately per the ASF process (`[email protected]` → `[email protected]`); findings under §3 or §9 are closed citing this document.
+- **Provenance legend:** *(documented)* = stated in Fory's own docs/repo; *(maintainer)* = confirmed by a Fory PMC member through this process; *(inferred)* = reasoned from architecture/domain knowledge, not yet confirmed — every *(inferred)* claim has a matching §14 open question.
+- **Draft confidence:** ~20 documented / 0 maintainer / ~26 inferred.
+- **What Fory is:** Apache Fory is a high-performance, multi-language object/data serialization framework. An application uses it in-process to serialize its objects to bytes and deserialize bytes back into objects, either within one language ("native" mode) or across languages ("xlang" mode), with optional zero-copy and a row format. *(documented — README, docs/guide)*
+
+## §2 Scope and intended use
+
+- **Primary use:** an **in-process library** linked into a host application that calls `serialize()` / `deserialize()` on its own data types. *(documented — guides)*
+- **It is not a network service or daemon.** It has no listening surface, no auth, no users — the embedding application owns where the bytes come from and go. *(inferred)*
+- **Caller / trust level:** a single caller — the embedding application — which is **trusted** (it links the library and registers its types). The security-relevant question is not "who calls Fory" but **"where do the bytes handed to `deserialize()` come from"** — trusted producer, or attacker-controlled. *(inferred; the registration guidance is documented)*
+
+**Component-family table** *(in/out of this model):*
+
+| Family | Entry point | Notes | In model? |
+| --- | --- | --- | --- |
+| Object-graph serialization (native, per language) | `fory.serialize` / `deserialize` | the core; instantiates registered types from bytes | **In** *(documented)* |
+| Cross-language (xlang) serialization | xlang `serialize`/`deserialize` | type mapping across languages | **In** *(documented)* |
+| Row format / zero-copy | row encoders | reads fields in place from a buffer | **In** *(documented)* |
+| Class/type registration + "secure mode" | `requireClassRegistration`, `register(...)` | the primary defense | **In** *(documented)* |
+| Per-language implementations | `java/`, `cpp/`, `python/`, `go/`, `rust/`, `javascript/`, `kotlin/`, `scala/`, `swift/`, `dart/`, `csharp/` | each is a separate impl of the same model | **In** — but memory-safety profile differs by language (see §5/§8) *(documented: dirs exist)* |
+| `examples/`, `benchmarks/`, `integration_tests/` | demo/bench/test | not production surface | **Out** *(see §3)* |
+
+## §3 Out of scope (explicit non-goals)
+
+- **The integrity / authenticity / confidentiality of the serialized bytes.** Fory deserializes what it is given; it does not authenticate, MAC, or encrypt payloads. If bytes can be tampered with in transit/at rest, that is the application's problem to solve (sign/encrypt before handing to Fory). *(inferred)*
+- **Anything when the caller disables class registration on an untrusted payload source.** `requireClassRegistration(false)` is a documented, deliberately-available footgun; using it against attacker-controlled bytes is out of the model's protection (see §5a/§9). *(documented — config: "Disabling may allow unknown classes to be deserialized, potentially causing security risks")*
+- **The behaviour of the application's own registered classes.** Fory instantiates and populates registered types; if a registered class has dangerous side effects in its constructors/setters/finalizers, that is the application's design, not Fory's. *(inferred)*
+- **`examples/`, `benchmarks/`, `integration_tests/`** — shipped but not a production trust surface. *(inferred)*
+
+## §4 Trust boundaries and data flow
+
+- **The trust boundary is the byte buffer passed to `deserialize()`** (and the row-format buffer). Everything Fory does on the serialize side operates on the application's own in-memory objects (trusted); the deserialize side is where attacker-controlled bytes, if any, enter. *(inferred)*
+- **Data flow:** untrusted bytes → format/header parse → (class id / type resolution → **registration check**) → field decode → object graph construction → returned to caller. The registration check is the gate that decides whether an arbitrary type may be instantiated. *(inferred; registration mechanism documented)*
+- **Reachability precondition:** a deserialize-side finding is **in-model** only if it is reachable from the byte buffer under the **default secure configuration** (`requireClassRegistration(true)`). A finding that requires `requireClassRegistration(false)`, or that requires the *serialize* side to be fed attacker-controlled live objects, is out-of-model (§5a / trusted-input). *(inferred)*
+
+## §5 Assumptions about the environment
+
+- **In-process, no ambient I/O.** Fory does not (by design) open sockets, spawn processes, or read the network; it operates on in-memory buffers handed to it. *(inferred — high-priority confirmation; negative claim)*
+- **Per-language memory model differs.** In managed runtimes (Java, Python, Go, JS, …) memory safety is the runtime's; in the **C++** (and unsafe-Rust FFI) paths, malformed input reaching the decoder is a memory-safety surface in a way it is not on the JVM. The model's "memory safety on malformed input" property is therefore language-conditional (see §8). *(inferred)*
+- **Codegen / JIT:** on ordinary JVMs Fory generates serializer code at runtime (`codeGenEnabled` default true); disabled on Android / GraalVM native image. This is a performance mechanism over the application's own registered types, not a path for executing attacker bytes. *(documented — config table)*
+
+## §5a Build-time and configuration variants
+
+The security envelope is set by runtime configuration, not build flags. The load-bearing knobs *(documented — docs/guide/java/configuration.md)*:
+
+| Knob | Default | Effect on the model |
+| --- | --- | --- |
+| `requireClassRegistration` | **`true`** (secure) | When true, only registered types are deserialized — the primary defense against deserializing arbitrary/gadget classes. Disabling "may allow unknown classes to be deserialized, potentially causing security risks." |
+| `maxDepth` | **`50`** | Bounds deserialization recursion depth; "can be used to refuse deserialization DDOS attack." |
+| `deserializeUnknownClass` | `true` in compatible mode, else `false` | Whether data for unknown/non-existent classes is skipped/deserialized. |
+| `compatible` | xlang: `true`; native: `false` | Schema forward/backward compatibility. |
+| `suppressClassRegistrationWarnings` | `true` | Registration warnings are useful for security audit but suppressed by default. |
+
+**The default is the *secure* posture here** (registration required) — the inverse of the usual insecure-default case. The model's §8 properties hold *under the defaults*; a report that only manifests under `requireClassRegistration(false)` is `OUT-OF-MODEL: non-default-build`. Confirm this framing with the PMC (§14).
+
+## §6 Assumptions about inputs
+
+Per-entry-point trust table *(registration mechanism + defaults documented; trust framing inferred):*
+
+| Entry point | Input | Attacker-controllable? | Caller must enforce |
+| --- | --- | --- | --- |
+| `deserialize(bytes)` / `deserialize(bytes, Class)` | serialized byte buffer | **yes, if the application sources bytes from an untrusted producer** | keep `requireClassRegistration(true)`; register only safe types; integrity-check bytes upstream |
+| row-format readers | buffer | **yes** (same as above) | same |
+| `serialize(obj)` | a live application object | no — the app's own trusted object | n/a |
+| `register(Class, …)` | type registered at setup | no — controlled by the app developer | register only types safe to instantiate from untrusted data |
+
+- **Size/shape/rate:** `maxDepth` (default 50) bounds nesting; whether total allocation / output size is otherwise bounded against a hostile payload is open (see §8 resource line). *(maxDepth documented; broader bound inferred)*
+
+## §7 Adversary model
+
+- **Primary adversary:** a party who controls the **serialized bytes** an application later passes to `deserialize()` (e.g. data arriving over a network the app feeds to Fory, or persisted data an attacker can tamper with). Goal: instantiate dangerous types (gadget-chain RCE), corrupt memory in the native paths, or exhaust CPU/memory. *(inferred — the canonical serialization-framework adversary)*
+- **Capabilities:** can craft arbitrary/malformed byte buffers; cannot change the application's Fory configuration or its registered-type set (those are set by the trusted app at startup). *(inferred)*
+- **Out of scope:** an attacker who controls the embedding application, its configuration, or the objects passed to `serialize()` — already trusted; an attacker who has set `requireClassRegistration(false)` themselves. *(inferred)*
+
+## §8 Security properties the project provides
+
+*(Registration + depth defenses documented; the guarantees framed below are for PMC confirmation.)*
+
+- **Registered-type-only instantiation (default).** With `requireClassRegistration(true)` (the default), deserialization instantiates only types the application registered, so attacker bytes cannot drive Fory to construct an arbitrary class. *Violation symptom:* an unregistered/unexpected type is instantiated from input under the default config. *Severity:* security-critical (this is the deserialization-RCE defense). *(documented that registration is required by default + that disabling causes risk; the unbypassability guarantee is the claim to confirm)*
+- **Bounded recursion depth.** Deserialization beyond `maxDepth` (default 50) throws rather than recursing unbounded. *Violation symptom:* stack overflow / unbounded recursion from crafted nesting under the default. *Severity:* security-critical (DoS). *(documented — config table)*
+- **Memory safety on malformed input — language-conditional.** In managed-runtime implementations, malformed bytes yield an exception, not memory corruption. For the **C++** implementation this is the load-bearing property to confirm (malformed-input fuzzing of the C++ decoder). *Violation symptom:* OOB read/write, crash. *Severity:* security-critical. *(inferred — confirm per language)*
+- **Resource bounds beyond depth — UNSPECIFIED.** Whether a crafted payload can force large allocation / CPU blowup within the depth limit (e.g. huge declared collection sizes) is a bug or expected is open; the model needs a line (§14). *(inferred; maxDepth documented)*

Review Comment:
This is stale against the current security doc.
`docs/security/deserialization.md` now draws the resource line: no disproportionate allocation before bytes are supplied or proven readable, no stream buffer growth to attacker-declared sizes before exact read/skip, and proportional checks before collection preallocation. I would replace this open question with a link to that model, otherwise future triage will treat an already-settled rule as undefined.
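
For readers unfamiliar with the last of those rules, here is a minimal Java sketch of a proportional check before collection preallocation. It is an illustration only, not Fory's decoder: the class `ProportionalPrealloc`, the method `readByteList`, the 4-byte length prefix, and the one-byte-per-element lower bound are all assumptions made for this example. The declared element count is compared against the bytes actually readable before any capacity is reserved, so a hostile count cannot force a large allocation on its own.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; this is not Fory's decoder or API. */
final class ProportionalPrealloc {
    // Assumption for this sketch: every encoded element occupies at least one byte.
    static final int MIN_BYTES_PER_ELEMENT = 1;

    /** Reads a list encoded as a 4-byte count followed by one byte per element. */
    static List<Byte> readByteList(ByteBuffer in) {
        if (in.remaining() < Integer.BYTES) {
            throw new IllegalArgumentException("truncated length prefix");
        }
        int declared = in.getInt();
        if (declared < 0) {
            throw new IllegalArgumentException("negative element count: " + declared);
        }
        // Proportional check: the declared count must be backed by readable bytes
        // before any capacity is reserved. The long cast avoids int overflow.
        long minBytesNeeded = (long) declared * MIN_BYTES_PER_ELEMENT;
        if (minBytesNeeded > in.remaining()) {
            throw new IllegalArgumentException(
                "declared " + declared + " elements but only " + in.remaining() + " bytes remain");
        }
        // Safe to preallocate: capacity is bounded by the size of the input itself.
        List<Byte> out = new ArrayList<>(declared);
        for (int i = 0; i < declared; i++) {
            out.add(in.get());
        }
        return out;
    }

    public static void main(String[] args) {
        // A 4-byte payload that claims Integer.MAX_VALUE elements.
        ByteBuffer hostile = ByteBuffer.allocate(Integer.BYTES);
        hostile.putInt(Integer.MAX_VALUE);
        hostile.flip();
        try {
            readByteList(hostile);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Without the check, `new ArrayList<>(declared)` would try to reserve capacity for every claimed element from a four-byte input, which is the disproportionate-allocation case the comment describes.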
