This is an automated email from the ASF dual-hosted git repository.

chaokunyang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fory-site.git
commit 1e1b652717c44da7f8bce1cc1f350afa08b46643
Author: chaokunyang <[email protected]>
AuthorDate: Wed May 13 11:15:30 2026 +0000

🔄 synced local 'docs/specification/' with remote 'docs/specification/'
---
 docs/specification/java_serialization_spec.md | 964 ++++++++++++++-----------
 docs/specification/xlang_serialization_spec.md | 47 +-
 2 files changed, 559 insertions(+), 452 deletions(-)

diff --git a/docs/specification/java_serialization_spec.md b/docs/specification/java_serialization_spec.md
index 1cba88ebc4..2d759e1c8e 100644
--- a/docs/specification/java_serialization_spec.md
+++ b/docs/specification/java_serialization_spec.md
@@ -19,563 +19,667 @@ license: |
 limitations under the License.
 ---
-## Spec overview
+## Scope
-Apache Fory Java serialization is a dynamic binary format for Java object graphs. It supports
-shared references, circular references, polymorphism, and optional schema evolution. The format is
-stream friendly: shared type metadata is written inline when needed and there is no meta start
-offset.
+This document specifies the Apache Fory Java native binary format: the format
+used by Java when `withXlang(false)` is configured. The format is optimized for
+Java object graphs, Java collection implementations, Java primitive arrays,
+Java class registration, Java serialization hooks, and optional schema
+evolution.
-The Java native format is an extension of the xlang wire format and reuses the same core framing
-and encodings; see `docs/specification/xlang_serialization_spec.md` for the shared baseline.
+Java native mode and xlang mode share low-level building blocks such as
+little-endian numeric payloads, variable-length integer encodings, reference
+flags, meta string encodings, and TypeDef/ClassDef concepts. They are different
+wire formats. In Java native mode, only the scalar type IDs from `BOOL` through
+`STRING` are shared with xlang.
Collection, map, struct, array, enum, and
+native Java implementation type IDs are Java native IDs unless this document
+explicitly says otherwise.
-Overall layout:
+See [Xlang Serialization Format](xlang_serialization_spec.md) for the
+cross-language format.
-```
-| fory header | object ref meta | object type meta | object value data |
-```
-
-All data is encoded in little endian byte order. When running on a big endian platform, array
-serializers swap byte order on write/read so the on-wire layout remains little endian.
+## Stream Layout
-## Fory header
+A Java native stream contains one header byte followed by one or more root
+objects. Each root object is encoded as a normal object slot:
-Java native serialization writes a one byte bitmap header. The header layout mirrors the xlang
-bitmap and uses the same flag bits.
+```text
+| header | root_0 | root_1 | ... |
+root:
+| reference flag | [type metadata] | [value payload] |
 ```
-| 6 bits | 1 bit | 1 bit |
-+---------------+-------+-------+
-| reserved | oob | xlang |
-```
-
-- xlang flag: bit 0, set when serialization uses xlang format and clear for Java native format.
-- oob flag: bit 1, set when `BufferCallback` is not null.
-- reserved bits: bits 2-7, must be zero.
-The header is always a single byte; no language ID is written.
+All multi-byte fixed-width values are little endian. A big-endian Java runtime
+must still write and read little-endian payloads.
+
+The stream is stateful. Type metadata, class definitions, and object references
+are assigned indexes as they are first encountered and may be referenced later
+in the same stream.
+
+## Header
+
+The header is a single byte:
+
+```text
+| bits 7..2 reserved | bit 1 out-of-band | bit 0 xlang |
+```
+
+- `xlang` must be `0` for Java native mode.
+- `out-of-band` is `1` when a `BufferCallback` is configured.
+- Reserved bits must be `0`.
+
+Java native mode does not write a language ID after the header.
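The header rules in the added `## Header` section can be sketched as a small validator. This is a minimal illustration, not Fory's implementation: the class `HeaderSketch` and its method names are hypothetical, and only the bit positions are taken from the spec text above.

```java
// Hypothetical sketch: these names are illustrative and not part of the Fory API.
final class HeaderSketch {
    static final int XLANG_BIT = 1;            // bit 0
    static final int OUT_OF_BAND_BIT = 1 << 1; // bit 1
    static final int RESERVED_MASK = 0xFC;     // bits 7..2

    static boolean isXlang(byte header) {
        return (header & XLANG_BIT) != 0;
    }

    static boolean isOutOfBand(byte header) {
        return (header & OUT_OF_BAND_BIT) != 0;
    }

    // Rejects headers that cannot start a Java native stream.
    static void checkJavaNative(byte header) {
        if ((header & RESERVED_MASK) != 0) {
            throw new IllegalArgumentException("reserved header bits must be 0");
        }
        if (isXlang(header)) {
            throw new IllegalArgumentException("xlang bit must be 0 in Java native mode");
        }
    }
}
```

A reader would call `checkJavaNative` on the first byte and then consult `isOutOfBand` before decoding root objects.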
+
+## Reference Slots
+
+Objects, nullable fields, and reference-tracked fields use the standard Fory
+reference slot. The first byte is signed:
+
+| Flag | Byte | Payload that follows |
+| --------------------- | ---- | ---------------------------------------------------------------- |
+| `NULL_FLAG` | `-3` | No payload. The slot value is `null`. |
+| `REF_FLAG` | `-2` | `varuint32` reference ID of an earlier object. |
+| `NOT_NULL_VALUE_FLAG` | `-1` | Value payload. No reference ID is assigned for this occurrence. |
+| `REF_VALUE_FLAG` | `0` | Value payload. Assign the next reference ID before reading data. |
+
+When reference tracking is disabled for a slot, writers use only `NULL_FLAG`
+and `NOT_NULL_VALUE_FLAG`.
+
+Primitive field fast paths do not wrap non-null primitive values in a reference
+slot. Boxed primitives and other nullable values use the slot selected by field
+metadata.
+
+## Type Metadata
+
+Dynamic object slots write type metadata before the value payload. Type metadata
+identifies the serializer and, when needed, carries class names or ClassDef
+metadata.
+
+```text
+| varuint32 type_id | [type-specific metadata] |
+```
+
+Registered Java classes, Java native built-ins, and Fory internal serializers
+use numeric type IDs. Unregistered classes or classes registered by name carry
+name metadata. Schema-evolution classes may carry a ClassDef.
+
+### Native Type ID Ranges
+
+| Range | Meaning |
+| -------- | ------------------------------------------------------------------ |
+| `0` | `UNKNOWN`, used in metadata for dynamic or object-typed positions. |
+| `1..21` | Shared scalar IDs from `BOOL` through `STRING`. |
+| `22..63` | Reserved in Java native mode for the xlang internal ID range. |
+| `64..68` | Reserved for future Java native internal IDs. |
+| `69..98` | Java native built-ins listed below. |
+| `99+` | User and runtime class IDs assigned by the Java `ClassResolver`.
|
+
+The shared scalar IDs are:
+
+| ID | Name | Java value domain |
+| --- | --------------- | --------------------------------------- |
+| 1 | `BOOL` | boolean values in xlang metadata |
+| 2 | `INT8` | signed 8-bit integer metadata |
+| 3 | `INT16` | signed 16-bit integer metadata |
+| 4 | `INT32` | fixed-width signed 32-bit metadata |
+| 5 | `VARINT32` | variable-width signed 32-bit metadata |
+| 6 | `INT64` | fixed-width signed 64-bit metadata |
+| 7 | `VARINT64` | variable-width signed 64-bit metadata |
+| 8 | `TAGGED_INT64` | tagged signed 64-bit metadata |
+| 9 | `UINT8` | unsigned 8-bit metadata |
+| 10 | `UINT16` | unsigned 16-bit metadata |
+| 11 | `UINT32` | fixed-width unsigned 32-bit metadata |
+| 12 | `VAR_UINT32` | variable-width unsigned 32-bit metadata |
+| 13 | `UINT64` | fixed-width unsigned 64-bit metadata |
+| 14 | `VAR_UINT64` | variable-width unsigned 64-bit metadata |
+| 15 | `TAGGED_UINT64` | tagged unsigned 64-bit metadata |
+| 16 | `FLOAT8` | reserved 8-bit float metadata |
+| 17 | `FLOAT16` | half precision float metadata |
+| 18 | `BFLOAT16` | bfloat16 metadata |
+| 19 | `FLOAT32` | 32-bit floating point metadata |
+| 20 | `FLOAT64` | 64-bit floating point metadata |
+| 21 | `STRING` | Java `String` |
+
+Java native built-ins start at ID `69`:
-## Reference meta
+| ID | Name | Java type or serializer owner |
+| --- | ---------------------------- | ---------------------------------------- |
+| 69 | `VOID_ID` | `java.lang.Void` |
+| 70 | `CHAR_ID` | `java.lang.Character` |
+| 71 | `PRIMITIVE_VOID_ID` | `void` |
+| 72 | `PRIMITIVE_BOOL_ID` | `boolean` |
+| 73 | `PRIMITIVE_INT8_ID` | `byte` |
+| 74 | `PRIMITIVE_CHAR_ID` | `char` |
+| 75 | `PRIMITIVE_INT16_ID` | `short` |
+| 76 | `PRIMITIVE_INT32_ID` | `int` |
+| 77 | `PRIMITIVE_FLOAT32_ID` | `float` |
+| 78 | `PRIMITIVE_INT64_ID` | `long` |
+| 79 | `PRIMITIVE_FLOAT64_ID` | `double` |
+| 80 | `PRIMITIVE_BOOLEAN_ARRAY_ID` | `boolean[]` |
+| 81 | `PRIMITIVE_BYTE_ARRAY_ID` | `byte[]` |
+| 82 |
`PRIMITIVE_CHAR_ARRAY_ID` | `char[]` |
+| 83 | `PRIMITIVE_SHORT_ARRAY_ID` | `short[]` |
+| 84 | `PRIMITIVE_INT_ARRAY_ID` | `int[]` |
+| 85 | `PRIMITIVE_FLOAT_ARRAY_ID` | `float[]` |
+| 86 | `PRIMITIVE_LONG_ARRAY_ID` | `long[]` |
+| 87 | `PRIMITIVE_DOUBLE_ARRAY_ID` | `double[]` |
+| 88 | `STRING_ARRAY_ID` | `String[]` |
+| 89 | `OBJECT_ARRAY_ID` | `Object[]` and object array serializers |
+| 90 | `ARRAYLIST_ID` | `java.util.ArrayList` |
+| 91 | `HASHMAP_ID` | `java.util.HashMap` |
+| 92 | `HASHSET_ID` | `java.util.HashSet` |
+| 93 | `CLASS_ID` | `java.lang.Class` |
+| 94 | `EMPTY_OBJECT_ID` | Empty-object serializer |
+| 95 | `LAMBDA_STUB_ID` | Lambda replacement stub |
+| 96 | `JDK_PROXY_STUB_ID` | JDK proxy replacement stub |
+| 97 | `REPLACE_STUB_ID` | `writeReplace`/`readResolve` replacement |
+| 98 | `NONEXISTENT_META_SHARED_ID` | Unknown class placeholder |
+
+### Registered, Named, and Unregistered Classes
+
+Java native mode supports three class identity forms:
+
+- ID registration: the type ID is the registered numeric class ID.
+- Name registration: the type metadata carries namespace and type name strings.
+- Unregistered class: the type metadata carries the package name as namespace
+  and the simple Java class name as type name.
+
+Class registration is the fastest and most compact form. Name-based forms are
+used when stable names are required or class registration is disabled.
+
+### Meta Sharing
+
+When meta sharing is enabled, class metadata is written once and referenced by a
+stream-local index:
+
+```text
+| varuint32 marker | [class definition bytes if new] |
+
+marker = (index << 1) | flag
+flag = 0: new definition, class definition bytes follow
+flag = 1: reference to an earlier definition
+```
+
+Indexes are assigned in first-use order.
+
+## Schema Modes
+
+Java native mode has two object schema modes.
+
+### Schema-Consistent Mode
+
+Schema-consistent mode is used when compatible mode is disabled.
The writer and
+reader must have matching fields and field order. No per-object ClassDef is
+required for ordinary registered classes. Field values are written directly in
+protocol order.
+
+### Compatible Mode
+
+Compatible mode writes ClassDef metadata for struct-like classes. Readers match
+local fields against remote ClassDef fields by identifier, read matching fields,
+and skip unknown fields using the remote field type metadata. Compatible mode is
+the Java native schema-evolution path.
+
+## Field Order
+
+Java native object serializers use the same deterministic field-order
+categories as the current xlang protocol:
+
+1. Primitive non-nullable numeric and boolean scalar fields.
+2. Primitive nullable numeric and boolean scalar fields, including boxed Java
+   primitive wrappers.
+3. Non-primitive fields.
+
+Primitive groups keep the primitive comparator:
+
+1. Fixed-width primitive encodings before compressed or variable-width
+   primitive encodings.
+2. Larger primitive width before smaller primitive width.
+3. Internal primitive type ID ascending.
+4. Field identifier.
+
+Non-primitive fields sort directly by field identifier. Non-primitive type ID,
+serializer kind, collection kind, map kind, and Java implementation class do not
+participate in field order.
+
+Field identifiers are selected as follows:
+
+- If a field has an explicit non-negative `@ForyField(id = ...)`, that numeric
+  ID is the field identifier.
+- Otherwise, the Java field name converted to snake_case is the field
+  identifier.
+- Negative annotation values are not valid field IDs. The annotation default
+  value `-1` means no explicit ID and is ignored for identifier selection.
-Reference tracking uses the same flags as the xlang specification.
+Identifier comparison is:
-| Flag | Byte Value | Description |
-| ------------------- | ---------- | -------------------------------------------------------------------------------------------------------- |
-| NULL FLAG | `-3` | Object is null.
No further bytes are written for this object. |
-| REF FLAG | `-2` | Object was already serialized. Followed by unsigned varint32 reference ID. |
-| NOT_NULL VALUE FLAG | `-1` | Object is non-null but reference tracking is disabled for this type. Object data follows immediately. |
-| REF VALUE FLAG | `0` | Object is referencable and this is its first occurrence. Object data follows. Assigns next reference ID. |
+1. If both fields have explicit IDs, compare IDs numerically.
+2. If only one field has an explicit ID, the ID-based field sorts before the
+   name-based field.
+3. If neither field has an explicit ID, compare snake_case names
+   lexicographically.
+4. If identifiers are equal, use deterministic tie-breakers such as declaring
+   class and original field name. Untagged fields with the same snake_case
+   identifier in the same class are invalid. A child field that hides an
+   inherited field with the same Java field name keeps only the nearest field in
+   xlang TypeDef metadata because the inherited field has no distinct untagged
+   identifier.
-When reference tracking is disabled globally or for a specific field/type, only `NULL FLAG` and
-`NOT_NULL VALUE FLAG` are used.
+Generated serializers may keep separate internal descriptor groups for
+primitive, collection, map, built-in, and user-defined serializers so they can
+emit specialized fast paths. Those internal groups are an implementation detail
+and must not change wire field order.
-## Type system and type IDs
+## ClassDef Encoding
-Java native serialization uses the unified type ID layout shared with xlang:
+Compatible mode and meta sharing encode Java class definitions as TypeDef
+records. A TypeDef has an 8-byte header followed by class metadata bytes:
-```
-full_type_id = (user_type_id << 8) | internal_type_id
+```text
+| 8-byte header | [varuint32 extra_size] | class metadata bytes |
 ```
-- `internal_type_id` is the low 8 bits describing the kind (enum/struct/ext, named variants, or a
-  built-in type).
-- `user_type_id` is the numeric registration ID (0-based) for user-defined enum/struct/ext types.
-- Named types use `NAMED_*` internal IDs and carry names in metadata rather than embedding a user
-  ID.
-
-### Shared internal type IDs (0-63)
-
-Java native mode shares the xlang internal IDs for all values below 64. IDs `0~56` are defined by
-the xlang spec, while `57~63` are reserved for future internal use. These IDs are stable across
-languages.
-
-See the internal type ID table in
-[Xlang Serialization Format](xlang_serialization_spec.md#internal-type-id-table).
-Java shares all IDs `< 64`, with `57~63` reserved for future internal use.
-
-### Java native built-in type IDs
-
-Java native serialization assigns Java-specific built-ins starting at
-`Types.BOUND + 5` (`Types.BOUND` is 64; 5 IDs are reserved for future use).
-Type IDs in `0~56` are shared with xlang; `57~63` are reserved; `64+` are only
-valid in Java native mode.
-
-| Type ID | Name | Description |
-| ------- | -------------------------- | ------------------------------ |
-| 69 | VOID_ID | java.lang.Void |
-| 70 | CHAR_ID | java.lang.Character |
-| 71 | PRIMITIVE_VOID_ID | void |
-| 72 | PRIMITIVE_BOOL_ID | boolean |
-| 73 | PRIMITIVE_INT8_ID | byte |
-| 74 | PRIMITIVE_CHAR_ID | char |
-| 75 | PRIMITIVE_INT16_ID | short |
-| 76 | PRIMITIVE_INT32_ID | int |
-| 77 | PRIMITIVE_FLOAT32_ID | float |
-| 78 | PRIMITIVE_INT64_ID | long |
-| 79 | PRIMITIVE_FLOAT64_ID | double |
-| 80 | PRIMITIVE_BOOLEAN_ARRAY_ID | boolean[] |
-| 81 | PRIMITIVE_BYTE_ARRAY_ID | byte[] |
-| 82 | PRIMITIVE_CHAR_ARRAY_ID | char[] |
-| 83 | PRIMITIVE_SHORT_ARRAY_ID | short[] |
-| 84 | PRIMITIVE_INT_ARRAY_ID | int[] |
-| 85 | PRIMITIVE_FLOAT_ARRAY_ID | float[] |
-| 86 | PRIMITIVE_LONG_ARRAY_ID | long[] |
-| 87 | PRIMITIVE_DOUBLE_ARRAY_ID | double[] |
-| 88 | STRING_ARRAY_ID | String[] |
-| 89 | OBJECT_ARRAY_ID | Object[] |
-| 90 | ARRAYLIST_ID | java.util.ArrayList |
-| 91 | HASHMAP_ID | java.util.HashMap |
-| 92 | HASHSET_ID |
java.util.HashSet |
-| 93 | CLASS_ID | java.lang.Class |
-| 94 | EMPTY_OBJECT_ID | empty object stub |
-| 95 | LAMBDA_STUB_ID | lambda stub |
-| 96 | JDK_PROXY_STUB_ID | JDK proxy stub |
-| 97 | REPLACE_STUB_ID | writeReplace/readResolve stub |
-| 98 | NONEXISTENT_META_SHARED_ID | meta-shared unknown class stub |
-
-### Registration and named types
-
-User-defined enum/struct/ext types can be registered by numeric ID or by name.
-
-- Numeric registration: `full_type_id = (user_id << 8) | internal_type_id`.
-- Name registration: type meta uses namespace and type name (see below).
-- Unregistered types are encoded as named types using namespace = package name and type name =
-  simple class name.
-
-Named type selection rules for unregistered types:
-
-- enum -> NAMED_ENUM
-- struct-like serializers -> NAMED_STRUCT (or NAMED_COMPATIBLE_STRUCT in compatible mode)
-- all other custom serializers -> NAMED_EXT
-
-## Type meta encoding
-
-Every value is written with a type ID followed by optional type metadata:
-
-1. Write `type_id` using varuint32 small7 encoding.
-2. For `NAMED_ENUM`, `NAMED_STRUCT`, `NAMED_EXT`, `NAMED_COMPATIBLE_STRUCT`:
-   - If meta share is enabled: write shared class meta (streaming format).
-   - Otherwise: write namespace and type name as meta strings.
-3. For `COMPATIBLE_STRUCT`:
-   - If meta share is enabled: write shared class meta (streaming format).
-   - Otherwise: no extra meta (type ID is sufficient).
-4. All other types: no extra meta.
-
-### Shared class meta (streaming)
-
-When meta share is enabled, Java uses the streaming shared meta protocol and writes TypeDef
-bytes inline on first use.
+Header bits:
+```text
+| 52-bit hash | 3 reserved bits | 1 compress bit | 8 size bits |
 ```
-| varuint32: index_marker | [class def bytes if new] |
-index_marker = (index << 1) | flag
-flag = 1 -> reference
-flag = 0 -> new type
-```
+- `size`: the lower 8 bits. If the value is `0xff`, read `extra_size` as
+  `varuint32` and add it to `0xff`.
+- `compress`: set when class metadata bytes are compressed by the configured
+  meta compressor.
+- `reserved`: must be zero.
+- `hash`: 52 bits derived from MurmurHash3 x64_128 seed 47 over
+  `class_metadata_bytes || header_low12_le`. `header_low12_le` is the low 12
+  header bits encoded as two little-endian bytes with the upper four bits of the
+  second byte clear. Take lane 0 of the MurmurHash3 result, left-shift it by 12
+  with signed 64-bit wraparound, apply signed absolute value, and mask with
+  `0xfffffffffffff000`.
-- If `flag == 1`, this is a reference to a previously written type. No class def bytes follow.
-- If `flag == 0`, this is a new type definition and class def bytes are written inline.
+### Class Metadata Body
-The index is assigned sequentially in the order types are first encountered.
+```text
+| root_kind_and_layer_count | class_layer_0 | class_layer_1 | ... |
-## Schema modes
+class_layer:
+| varuint32 class_header | [registered type IDs or names] | field_info... |
+```
-Java native serialization supports two schema modes:
+`root_kind_and_layer_count` stores the root TypeDef kind in the high four bits
+and `(num_layers - 1)` in the low four bits. If the low four bits are `0b1111`,
+read an extra `varuint32` and add it to `15`.
-- Schema consistent (compatible mode disabled): fields are serialized in a fixed order and no
-  ClassDef is required. Type meta uses `STRUCT` or `NAMED_STRUCT` for user-defined classes.
-- Schema evolution (compatible mode enabled): fields are serialized with schema evolution metadata
-  (ClassDef). Type meta uses `COMPATIBLE_STRUCT` or `NAMED_COMPATIBLE_STRUCT`.
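The `root_kind_and_layer_count` rule added in this hunk can be illustrated with a small decoder. This is a sketch under stated assumptions: the field is treated as a single unsigned byte value, the extra `varuint32` is assumed to be decoded elsewhere and passed in as `extra`, and `RootKindSketch` is a hypothetical name rather than a Fory class.

```java
// Hypothetical sketch: not part of the Fory API.
final class RootKindSketch {
    // Root TypeDef kind lives in the high four bits.
    static int rootKind(int b) {
        return (b >>> 4) & 0x0F;
    }

    // True when the low four bits are 0b1111 and an extra varuint32 follows.
    static boolean hasExtraLayerCount(int b) {
        return (b & 0x0F) == 0x0F;
    }

    // 'extra' is the already-decoded varuint32; it is ignored when no escape is present.
    static int numLayers(int b, int extra) {
        int low = b & 0x0F;
        int layersMinusOne = (low == 0x0F) ? 15 + extra : low;
        return layersMinusOne + 1;
    }
}
```

For example, the byte `0x32` would describe a `NAMED_COMPATIBLE_STRUCT` root (kind 3) with three class layers.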
+Root kind codes:
-## ClassDef format (compatible mode)
+| Code | Kind |
+| ----- | -------------------------------------------- |
+| 0 | `STRUCT` |
+| 1 | `COMPATIBLE_STRUCT` |
+| 2 | `NAMED_STRUCT` |
+| 3 | `NAMED_COMPATIBLE_STRUCT` |
+| 4 | `ENUM` |
+| 5 | `NAMED_ENUM` |
+| 6 | `EXT` |
+| 7 | `NAMED_EXT` |
+| 8 | `TYPED_UNION` |
+| 9 | `NAMED_UNION` |
+| 10-14 | Reserved |
+| 15 | Extended-kind escape, rejected until defined |
-ClassDef is the schema evolution metadata encoded for compatible structs. It is written inline
-when shared meta is enabled, or referenced by index when already seen.
+`class_header = (num_fields << 1) | registered_flag`.
-### Binary layout
+- If `registered_flag == 1`, write the class type ID as one byte. For
+  user-registered `ENUM`, `STRUCT`, `COMPATIBLE_STRUCT`, `EXT`, and
+  `TYPED_UNION`, write the user type ID as `varuint32`.
+- If `registered_flag == 0`, write namespace and type name as meta strings.
-```
-| 8 bytes header | [varuint32 extra size] | class meta bytes |
-```
+Class layers are encoded from parent to leaf. Field lists inside each layer use
+the field order defined above.
-Header layout (lower bits on the right):
+### Field Info
-```
-| 52-bit hash | 3 bits reserved | 1 bit compress | 8-bit size |
+Each field is encoded as:
+
+```text
+| field_header | [extended_name_or_id_size] | [field name bytes] | field_type |
 ```
-- size: lower 8 bits. If size equals the mask (0xFF), write extra size as varuint32 and add it.
-- compress: bit 8, set when class meta bytes are compressed.
-- reserved: bits 9-11 are reserved for future use and must be zero.
-- hash: 52 stored hash bits derived from MurmurHash3 x64_128 seed 47 over
-  `class meta bytes || header_low12_le`. `header_low12_le` is two little-endian bytes containing
-  the low 12 header bits (size, compress, and reserved bits); the upper four bits of the second
-  byte are zero.
Take lane 0 of the 128-bit MurmurHash3 result as a signed int64, left-shift it by
-  12 with two's-complement 64-bit wraparound, apply signed absolute value (leaving `INT64_MIN`
-  unchanged), then mask with `0xfffffffffffff000`. The final header is the masked hash bits OR-ed
-  with the low 12 header bits.
+`field_header` bits:
-### Class meta bytes
+| Bits | Meaning |
+| ---- | ------------------------------------------------ |
+| 0 | `trackingRef` |
+| 1 | `nullable` |
+| 2..3 | field name encoding |
+| 4..6 | encoded name length minus one, or compact tag ID |
+| 7 | reserved, must be zero |
-Class meta encodes a linearized class hierarchy (from parent to leaf) and field metadata:
+Field name encodings:
-```
-| root_kind_and_num_classes | class_layer_0 | class_layer_1 | ... |
+| Code | Encoding |
+| ---- | ------------------------------------ |
+| 0 | UTF-8 |
+| 1 | all-to-lower special encoding |
+| 2 | lower/upper/digit special encoding |
+| 3 | tag ID; field name bytes are omitted |
-class_layer:
-| num_fields << 1 | registered_flag | [type_id if registered] |
-| namespace | type_name | field_infos |
-```
+For name encodings, bits `4..6` store `encoded_length - 1` when it is less than
+`7`. If the value is `7`, read an extra `varuint32` and add it to `7`.
-- `root_kind_and_num_classes` stores the root TypeDef kind in the high four bits and
-  `(num_layers - 1)` in the low four bits.
-  - Root kind codes are `STRUCT=0`, `COMPATIBLE_STRUCT=1`, `NAMED_STRUCT=2`,
-    `NAMED_COMPATIBLE_STRUCT=3`, `ENUM=4`, `NAMED_ENUM=5`, `EXT=6`, `NAMED_EXT=7`,
-    `TYPED_UNION=8`, and `NAMED_UNION=9`.
-  - Kind codes `10-14` are reserved and `15` is an extended-kind escape rejected until defined.
-  - If the low four bits equal `0b1111`, read an extra varuint32 small7 and add it.
-  - The actual number of layers is `num_classes + 1`.
-- `registered_flag` is 1 if the class is registered by numeric ID.
-- If registered by ID, the one-byte class type ID follows.
For user-registered ID kinds, the
-  user type ID follows as varuint32.
-- If registered by name or unregistered, namespace and type name are written as meta strings.
+For tag ID encoding, bits `4..6` store the numeric field ID when it is less than
+`7`. If the value is `7`, read an extra `varuint32` and add it to `7`. Field IDs
+must be non-negative. Duplicate field IDs in one TypeDef are invalid.
-### Field info
+### Field Type
-Each field uses a compact header followed by its name bytes (omitted when TAG_ID is used) and its
-type info:
+Field types describe how compatible readers read or skip the field payload.
+Top-level field types write only the type tag. Nested field types store
+`nullable` and `trackingRef` in the low bits:
-```
-| field_header | [field_name_bytes] | field_type |
+```text
+nested_field_type_header = (type_tag << 2) | (nullable << 1) | trackingRef
 ```
-`field_header` bits:
+Type tags:
-- bit 0: trackingRef
-- bit 1: nullable
-- bits 2-3: field name encoding
-- bits 4-6: name length (len-1), or tag ID when TAG_ID is used; value 7 indicates extended length
-- bit 7: reserved (0)
+| Tag | Field type | Payload |
+| --- | --------------------------- | -------------------------------- |
+| 0 | Object/dynamic | none |
+| 1 | Map | key field type, value field type |
+| 2 | Collection/List/Set | element field type |
+| 3 | Java array | dimensions, component field type |
+| 4 | Enum | none |
+| 5+ | Registered or built-in type | `tag - 5` is the type ID |
-Field name encoding:
+## Meta Strings
-- 0: UTF8
-- 1: ALL_TO_LOWER_SPECIAL
-- 2: LOWER_UPPER_DIGIT_SPECIAL
-- 3: TAG_ID (field name omitted, tag ID stored in size field)
+Namespaces, type names, and field names use the meta string encodings defined
+by the xlang specification. A meta string header stores the byte length and
+encoding kind; extended lengths are written as `varuint32`.
-If length is extended (size==7), an extra varuint32 small7 storing `(len-1) - 7` follows.
+Package and namespace names use UTF-8, all-to-lower special encoding, or
+lower/upper/digit special encoding. Type names use UTF-8,
+lower/upper/digit special encoding, first-to-lower special encoding, or
+all-to-lower special encoding. Field names use the field-info encoding table
+above.
-### Field type encoding
+## Primitive Values
-Field types are encoded with a type tag and optional nested type info. For nested types, the header
-includes nullable/trackingRef flags in the low bits.
-Top-level field types use the tag only (no flags).
+Primitive values are written without type metadata when the field serializer is
+known statically:
-Type tags:
+| Java type | Payload |
+| --------- | --------------------------------------------------------------------------- |
+| `boolean` | one byte: `0` or `1` |
+| `byte` | one signed byte |
+| `char` | two-byte UTF-16 code unit, little endian |
+| `short` | two-byte signed integer, little endian |
+| `int` | fixed int32 little endian, or ZigZag varint32 when configured |
+| `long` | fixed int64 little endian, ZigZag varint64, or tagged int64 when configured |
+| `float` | IEEE 754 binary32, little endian |
+| `double` | IEEE 754 binary64, little endian |
-| Tag | Field type |
-| --- | ----------------------------------------- |
-| 0 | Object (ObjectFieldType) |
-| 1 | Map (MapFieldType) |
-| 2 | Collection/List/Set (CollectionFieldType) |
-| 3 | Array (ArrayFieldType) |
-| 4 | Enum (EnumFieldType) |
-| 5+ | Registered type (RegisteredFieldType) |
+Boxed primitives use the same value payload after the selected null/reference
+slot.
-Encoding rules:
+## String Values
-- ObjectFieldType: write tag 0.
-- MapFieldType: write tag 1, then key type, then value type.
-- CollectionFieldType: write tag 2, then element type.
-- ArrayFieldType: write tag 3, then dimensions, then component type.
-- EnumFieldType: write tag 4.
-- RegisteredFieldType: write tag `5 + type_id`.
+Java strings are encoded as:
-For nested types, nullable/trackingRef flags are stored in the low bits of the header as
-`(type_tag << 2) | (nullable << 1) | tracking_ref`.
+```text
+| varuint36_small7 header | bytes |
-## Meta string encoding
+header = (num_bytes << 2) | coder
+```
-Namespace, type names, and field names use the same meta string encodings as the xlang spec.
+`coder` values:
-### Package and type names
+| Value | Encoding |
+| ----- | -------------------- |
+| 0 | Latin-1 |
+| 1 | UTF-16 little endian |
+| 2 | UTF-8 |
-Header format:
+`num_bytes` is the byte length of the encoded payload.
-```
-| 6 bits size | 2 bits encoding |
-```
+## Enum Values
-- size is the byte length of the encoded name.
-- if size == 63, write extra length `(size - 63)` as varuint32 small7.
+Enum value payload depends on configuration:
-Encodings:
+- Ordinal mode writes the enum ordinal as `varuint32`.
+- `@ForyEnumId` mode writes the configured non-negative enum tag as
+  `varuint32`.
+- Name mode writes the enum constant name as a meta string.
-- Package name: UTF8, ALL_TO_LOWER_SPECIAL, LOWER_UPPER_DIGIT_SPECIAL
-- Type name: UTF8, LOWER_UPPER_DIGIT_SPECIAL, FIRST_TO_LOWER_SPECIAL, ALL_TO_LOWER_SPECIAL
+`@ForyEnumId` may be declared on enum constants, on one integer field, or on one
+zero-argument integer getter, according to the Java API contract. Duplicate or
+negative enum tags are invalid.
-### Field names
+## Arrays
-Field name encoding is described in the ClassDef field header section. When using TAG_ID, the
-field name bytes are omitted and the tag ID is stored in the size field.
+### Primitive Arrays
-### Encoding algorithms
+Primitive arrays write a length prefix and contiguous little-endian element
+payload:
-See the xlang specification for encoding algorithms and tables:
-`docs/specification/xlang_serialization_spec.md#meta-string`.
-
-## Value encodings
+```text
+| varuint32 byte_length | raw element bytes |
+```
-This section describes the byte layouts for common built-in serializers used in Java native
-serialization. Custom serializers (EXT) may define additional formats but must still follow the
-reference and type meta rules described above.
+Compressed `int[]` and `long[]` arrays use element count followed by compressed
+elements:
-### Primitives
+```text
+int[] compressed:
+| varuint32 length | varint32... |
-- boolean: 1 byte (0x00 or 0x01).
-- byte: 1 byte.
-- short: 2 bytes little endian.
-- char: 2 bytes little endian (UTF-16 code unit).
-- int:
-  - fixed: 4 bytes little endian.
-  - varint: signed varint32 (ZigZag) when `compressInt` is enabled.
-- long:
-  - fixed: 8 bytes little endian.
-  - varint: signed varint64 (ZigZag) when `longEncoding=VARINT`.
-  - tagged: tagged int64 when `longEncoding=TAGGED`.
-- float: IEEE 754 float32, little endian.
-- double: IEEE 754 float64, little endian.
+long[] compressed:
+| varuint32 length | varint64 or tagged_int64... |
+```
-Varint encodings follow the xlang spec:
-`docs/specification/xlang_serialization_spec.md#unsigned-varint32`.
+`byte[]` is the binary serializer and writes `varuint32 length` followed by raw
+bytes.
-### String
+### Object Arrays
-Strings are encoded as:
+Object arrays write the array length and an element type mode:
+```text
+| varuint32_small7 (length << 1 | monomorphic_flag) |
+| [shared element class metadata] |
+| element slots... |
 ```
-| varuint36_small: (num_bytes << 2) | coder | string bytes |
-```
-
-- coder: 2-bit value
-  - 0: LATIN1
-  - 1: UTF16
-  - 2: UTF8
-- num_bytes: byte length of the encoded string payload.
-
-UTF16 is encoded as little endian 2-byte code units.
-### Enum
+- If `monomorphic_flag == 1`, all non-null elements use the same element
+  serializer. The shared element class metadata is written once.
+- If `monomorphic_flag == 0`, each non-null element writes its own type
+  metadata.
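The object array header `(length << 1 | monomorphic_flag)` can be packed and unpacked as shown below. This is a minimal sketch that operates on the integer value after `varuint32_small7` decoding, which is assumed to happen elsewhere; `ObjectArrayHeaderSketch` is a hypothetical name, not a Fory class.

```java
// Hypothetical sketch: operates on the decoded header value, not on raw bytes.
final class ObjectArrayHeaderSketch {
    // Combines the array length and the monomorphic flag into one header value.
    static int pack(int length, boolean monomorphic) {
        return (length << 1) | (monomorphic ? 1 : 0);
    }

    // Array length is stored above the flag bit.
    static int length(int header) {
        return header >>> 1;
    }

    // Bit 0 set means the shared element class metadata is written once.
    static boolean isMonomorphic(int header) {
        return (header & 1) == 1;
    }
}
```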
-- If `serializeEnumByName` is enabled: write enum name as a meta string.
-- Otherwise: write an enum tag as varuint32 small7.
-  - By default the tag is the declaration ordinal.
-  - If the enum configures `@ForyEnumId`, write the configured stable id instead. Java supports
-    annotating exactly one id field, exactly one zero-argument id getter, or every enum constant
-    with explicit tag values.
+Each nullable or reference-tracked element is still represented by a reference
+slot before its element payload.
-### Binary (byte[])
+## Collections
-Primitive byte arrays are encoded as:
+Java collection serializers write collection size, element flags, optional
+shared element type metadata, and element payloads:
+```text
+| varuint32_small7 size | elements_header | [element type metadata] | elements... |
 ```
-| varuint32: num_bytes | raw bytes |
-```
-
-### Primitive arrays
-Primitive arrays write a byte-length prefix followed by the little-endian primitive payload unless
-compression is enabled:
+`elements_header` bits:
-```
-| varuint32: byte_length | raw bytes |
-```
+| Bit | Meaning |
+| --- | ------------------------------------- |
+| 0 | Element reference tracking is enabled |
+| 1 | At least one element may be null |
+| 2 | Declared element type is used |
+| 3 | All non-null elements share one type |
-- `compressIntArray`: int[] encoded as `| varuint32: length | varint32... |`.
-- `compressLongArray`: long[] encoded as `| varuint32: length | varint64/tagged... |`.
+When all non-null elements share a type and the declared element type is not
+used, the shared element type metadata is written once before element payloads.
+Otherwise each non-null element writes its own type metadata. Null and reference
+flags follow the reference-slot rules.
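The `elements_header` bits and the "written once" rule can be checked with a few masks. This is a sketch only: the constant names are borrowed loosely from the removed `CollectionFlags` text and are assumptions here, and `ElementsHeaderSketch` is a hypothetical class.

```java
// Hypothetical sketch: constant names are illustrative, bit positions come from the table above.
final class ElementsHeaderSketch {
    static final int TRACKING_REF = 1;               // bit 0
    static final int HAS_NULL = 1 << 1;              // bit 1
    static final int DECLARED_ELEMENT_TYPE = 1 << 2; // bit 2
    static final int SAME_TYPE = 1 << 3;             // bit 3

    static boolean tracksRefs(int header) {
        return (header & TRACKING_REF) != 0;
    }

    static boolean mayContainNull(int header) {
        return (header & HAS_NULL) != 0;
    }

    // Shared element type metadata is written once only when all non-null
    // elements share a type and the declared element type is not used.
    static boolean writesSharedTypeOnce(int header) {
        return (header & SAME_TYPE) != 0 && (header & DECLARED_ELEMENT_TYPE) == 0;
    }
}
```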
-### Object arrays +### Collection Subclasses -Object arrays encode length and a monomorphic flag: +Specialized serializers for supported JDK collection subclasses write +subclass-owned field layers before the element payload: -``` -| varuint32_small7: (length << 1) | mono_flag | +```text +| varuint32_small7 size | +| [comparator reference for sorted/priority collections] | +| varuint32_small7 num_class_layers | +| class_layer_fields... | +| elements_header | [element type metadata] | elements... | ``` -- If `mono_flag == 1`, all elements share a known component serializer. Each element uses ref - flags and the component serializer writes the value. -- If `mono_flag == 0`, each element uses ref flags and writes its own class info and data. +`num_class_layers` is the exact number of subclass field layers encoded in the +payload. Readers must reject a payload whose layer count does not match the +local serializer because the value payload does not carry enough layer identity +to skip a mismatched subclass layout. -### Collections (List/Set) +## Maps -Collections encode length and a one-byte elements header: +Maps write entry count followed by one or more chunks. Each chunk groups entries +with compatible key and value metadata: -``` -| varuint32_small7: length | elements_header | [elem_class_info] | elements... | +```text +| varuint32_small7 size | chunk... | ``` -`elements_header` bits (see `CollectionFlags`): +Non-null chunks: -- bit 0: TRACKING_REF -- bit 1: HAS_NULL -- bit 2: IS_DECL_ELEMENT_TYPE -- bit 3: IS_SAME_TYPE +```text +| header | uint8 chunk_size | [key type metadata] | [value type metadata] | entries... | +``` -If `IS_SAME_TYPE` is set and `IS_DECL_ELEMENT_TYPE` is not set, the element class info is written -once before the elements. Element values then follow with either ref flags (if TRACKING_REF) or -per-element null flags (if HAS_NULL). +`chunk_size` is in `1..255`. 
-If `IS_SAME_TYPE` is not set, each element is written with its own class info and data (and -optionally ref flags). +`header` bits: -#### Child collection subclasses +| Bit | Meaning | +| --- | ----------------------------------- | +| 0 | Key reference tracking is enabled | +| 1 | Chunk may contain null keys | +| 2 | Declared key type is used | +| 3 | Value reference tracking is enabled | +| 4 | Chunk may contain null values | +| 5 | Declared value type is used | -Optimized serializers for subclasses of supported JDK collection implementations write subclass -field layers before element payloads: +Null key or null value entries are encoded as single-entry special chunks +without a `chunk_size` byte: -``` -| varuint32_small7: length | [comparator_ref] | varuint32_small7: num_class_layers | -| class_layer_fields... | [elements_header | elem_class_info | elements...] | -``` +- null key and non-null value: special null-key header, then value payload. +- non-null key and null value: special null-value header, then key payload. +- null key and null value: `KV_NULL` header only. -- `comparator_ref` is present only for sorted-set and priority-queue subclasses. -- `num_class_layers` is the exact number of subclass-owned field layers written after the collection - header and before the element payload. -- Readers must reject a payload whose `num_class_layers` does not match the local serializer's layer - count. These serializers do not carry per-layer class identity in the value payload, so mismatched - layers cannot be skipped safely. +`EnumMap` writes one serializer-owned payload mode byte before its normal map +payload: -### Maps +- `0`: normal payload follows. +- `1`: Java-serialized empty `EnumMap` payload. -Maps encode entry count and then a sequence of chunks. Each chunk groups entries that share key -and value types. +### Map Subclasses -``` -| varuint32_small7: size | chunk_1 | chunk_2 | ... 
| +Specialized serializers for supported JDK map subclasses write subclass-owned +field layers before entry chunks: -chunk (non-null entries): -| header | chunk_size | [key_class_info] | [value_class_info] | entries... | +```text +| varuint32_small7 size | +| [comparator reference for sorted maps] | +| varuint32_small7 num_class_layers | +| class_layer_fields... | +| chunk... | ``` -`header` bits (see `MapFlags`): +Readers must reject mismatched `num_class_layers` for the same reason as +collection subclasses. -- bit 0: TRACKING_KEY_REF -- bit 1: KEY_HAS_NULL -- bit 2: KEY_DECL_TYPE -- bit 3: TRACKING_VALUE_REF -- bit 4: VALUE_HAS_NULL -- bit 5: VALUE_DECL_TYPE +## JDK Wrappers and Views -If `KEY_DECL_TYPE` or `VALUE_DECL_TYPE` is unset, the corresponding class info is written once at -the start of the chunk. `chunk_size` is a single byte (1..255) and `MAX_CHUNK_SIZE` is 255. +Java native mode has serializers for selected JDK wrappers and views: -#### Null key/value entries +- Unmodifiable and synchronized collection/map wrappers keep the wrapper type + metadata and write the wrapped source collection or map as a normal object + payload. +- Recognized sublist views keep the sublist type metadata and write one + serializer-owned mode byte. Mode `0` writes visible elements as a collection + payload. Mode `1` writes view offset, size, and source list reference. +- `Collections.newSetFromMap` writes the backing map payload. +- Immutable JDK collection serializers keep list, set, or map payload + semantics and materialize an equivalent immutable or unmodifiable container + on read. -Entries with null key or null value are encoded as special single-entry chunks without a -`chunk_size` byte: +Android and JVM implementations may choose different concrete public backing +types for wrapper payloads, but the serializer-owned payload modes above define +the wire shape. 
-- null key, non-null value: `NULL_KEY_VALUE_DECL_TYPE*` flags, then value payload -- null value, non-null key: `NULL_VALUE_KEY_DECL_TYPE*` flags, then key payload -- null key and null value: `KV_NULL` header only +## Struct and Object Payloads -These chunks always represent exactly one entry. +Struct-like object payloads contain field values in protocol field order. The +selected serializer owns the exact field fast path: -`EnumMap` has an EnumMap-owned one-byte payload mode before its map payload: +```text +| field_0 payload | field_1 payload | ... | +``` -- `0`: normal payload, then `varuint32_small7` size, key enum class info, and the map chunks above. -- `1`: Java-serialized empty `EnumMap` payload. Android uses this mode when an empty map has no - public key from which to derive the enum class. Readers on Android and JVM must accept both modes. +For each field, field metadata decides whether the field writes a primitive +payload directly, a nullable slot, a reference-tracked slot, type metadata, or a +specialized collection/map/array payload. -#### Child map subclasses +Compatible-mode readers use the remote ClassDef field list to map fields by +identifier. Unknown fields are skipped using their remote field type metadata. -Optimized serializers for subclasses of supported JDK map implementations write subclass field -layers before map entry chunks: +Generated serializers may split large generated methods and hoist serializers, +field offsets, collection metadata, or map metadata. Those generated-code +decisions must preserve the same object payload order. -``` -| varuint32_small7: size | [comparator_ref] | varuint32_small7: num_class_layers | -| class_layer_fields... | [chunk_1 | chunk_2 | ...] | -``` +## Throwable Payloads -- `comparator_ref` is present only for sorted-map subclasses. -- `num_class_layers` is the exact number of subclass-owned field layers written after the map header - and before the entry chunks. 
-- Readers must reject a payload whose `num_class_layers` does not match the local serializer's layer - count. These serializers do not carry per-layer class identity in the value payload, so mismatched - layers cannot be skipped safely. - -### JDK collection/map wrappers and views - -Java native mode may use specialized serializers for JDK collection/map wrappers and views. These -serializers do not introduce a new collection/map protocol branch; they write ordinary object, -collection, or map payloads in serializer-owned value slots. - -- Unmodifiable and synchronized wrappers keep the outer wrapper type metadata. The wrapper value - payload is the wrapped source collection or map written as a normal referencable object. Android - writers use public source implementations for that payload: `ArrayList`, `HashSet`, `TreeSet`, - `HashMap`, or `TreeMap`. Readers rewrap the source through `Collections.unmodifiable*` or - `Collections.synchronized*`. -- Recognized sublist view classes keep their outer sublist type metadata and use a - serializer-local one-byte payload mode. Mode `0` writes visible elements as a normal collection - payload. Mode `1` writes view metadata as `offset`, `size`, and source list reference. Android - writers use mode `0`; JVM writers may use mode `1` when the view fields match the supported JDK - shape. Readers on Android and JVM must accept both modes. -- `Collections.newSetFromMap` writes a backing-map payload. Android writers use `HashMap` backing - type metadata. -- Immutable JDK collection serializers keep ordinary list/set/map payload semantics. Android readers - materialize public unmodifiable containers when JDK internal immutable constructors are not - available. - -Xlang mode uses the xlang collection/map protocol and does not encode Java wrapper or view internals. 
- -### Objects and structs - -Object values are encoded as: +`Throwable` serializers preserve standard Java throwable state and +subclass-owned fields: +```text +| stack_trace_ref | cause_ref | message_ref | +| varuint32 suppressed_count | suppressed_ref... | +| varuint32 extra_field_count | extra_field_name/value... | +| varuint32_small7 num_class_layers | +| class_layer_fields... | ``` -| ref meta | type meta | field data | -``` - -Field data is written by the serializer selected by the class info. For standard object -serialization: -- Fields are sorted deterministically using `DescriptorGrouper` order: - primitives, boxed primitives, built-ins, collections, maps, then other fields, with names sorted - within each category. -- For compatible mode, `MetaSharedSerializer` uses ClassDef field metadata to read and skip - unknown fields. -- For each field, the serializer uses field metadata (nullable, trackingRef, polymorphic) to decide - whether to write ref flags and/or type meta before the field value. +`extra_field_count` is reserved for serializer-owned extension fields and is +currently written as zero. `num_class_layers` must match the local throwable +serializer layout on read. -### Throwable values +## Replacement and Java Serialization Hooks -`Throwable` subclasses use a specialized payload that preserves stack trace, cause, message, -suppressed exceptions, and subclass-owned fields: +Java native mode supports serializer-owned handling for Java object replacement +and Java serialization hooks: -``` -| stack_trace_ref | cause_ref | message_string_ref | -| varuint32: suppressed_count | suppressed_ref... | -| varuint32: extra_field_count | extra_field_name/value... | -| varuint32_small7: num_class_layers | class_layer_fields... | -``` +- `writeReplace`/`readResolve` values use replacement metadata and payloads + owned by the replacement serializer. +- JDK proxy and lambda stubs use their registered native stub IDs. 
+- Types that require Java Object Serialization compatibility may be delegated to + serializers that reproduce the required Java semantics inside a Fory object + slot. -- `extra_field_count` is reserved for serializer-owned extension fields and is currently written as - zero. -- `num_class_layers` is the exact number of `Throwable` subclass field layers written after the - built-in Throwable state. -- Readers must reject a payload whose `num_class_layers` does not match the local serializer's layer - count. The Throwable value payload does not carry per-layer class identity, so mismatched layers - cannot be skipped safely. +These serializers still obey the stream header, reference slot, and type +metadata rules in this document. -### Extensions (EXT) +## Unknown Classes -Extension types are encoded by their registered serializer. Type meta is still written before the -value as described above. The serializer is responsible for the value layout. +When meta sharing is enabled and a reader does not have a local class for a +remote ClassDef, Java may materialize an unknown-class placeholder using +`NONEXISTENT_META_SHARED_ID`. The placeholder stores enough field data to +preserve and copy the unknown value according to the unknown-class serializer. +It does not make the unknown Java class available to user code. -## Out-of-band buffers +## Out-of-Band Buffers -When a `BufferCallback` is provided, the oob flag is set in the header and serializers may emit -buffer references instead of inline bytes (for example, large primitive arrays). The out-of-band -buffer protocol is specific to the callback implementation; the main stream only contains -references to those buffers. +When the header out-of-band bit is set, serializers may write references to +external buffers instead of writing all bytes inline. The callback defines the +external buffer transport. 
The main stream remains a valid Fory stream +containing references to those buffers at serializer-owned payload positions. diff --git a/docs/specification/xlang_serialization_spec.md b/docs/specification/xlang_serialization_spec.md index 1c9a424931..bb5b2c1598 100644 --- a/docs/specification/xlang_serialization_spec.md +++ b/docs/specification/xlang_serialization_spec.md @@ -646,8 +646,8 @@ Field names: Field order: -Field order is implementation-defined. Decoders must match fields by name or tag ID rather than -position. Fory uses a stable grouping and sorting order to produce deterministic TypeDefs. +TypeDef field lists use the same ordering defined in [Field order](#field-order). Compatible +decoders must still match fields by name or tag ID rather than relying only on position. ## Meta String @@ -1469,10 +1469,21 @@ language-specific helper classes. For every field, compute a stable identifier used for ordering: -- If a tag ID is configured (e.g., `@ForyField(id=...)`), use the tag ID as a decimal string. +- If a non-negative tag ID is configured (e.g., `@ForyField(id=...)`), use the tag ID. - Otherwise, use the field name converted to `snake_case`. -Tag IDs must be unique within a type; duplicate tag IDs are invalid. +Configured tag IDs must be non-negative. A negative configured tag ID is invalid; languages may +use a negative value only as a default or internal sentinel for "no tag ID configured", which falls +back to the `snake_case` field name and is not a tag ID. Tag IDs must be unique within a type; +duplicate tag IDs are invalid. + +Field identifiers compare as follows: + +1. If both fields have tag IDs, compare the IDs numerically. +2. If only one field has a tag ID, the tagged field sorts first. +3. If neither field has a tag ID, compare the `snake_case` names lexicographically. +4. If fields still compare equal, use deterministic language-local tie-breakers such as declaring + class name, original field name, or original field index. 
##### Step 2: Group assignment @@ -1480,13 +1491,9 @@ Assign each field to exactly one group in the following order: 1. **Primitive (non-nullable)**: primitive or boxed numeric/boolean types with `nullable=false`. 2. **Primitive (nullable)**: primitive or boxed numeric/boolean types with `nullable=true`. -3. **Built-in (non-container)**: internal type IDs that are not user-defined and not UNKNOWN, - excluding collections and maps (for example: STRING, TIME types, UNION/TYPED_UNION/NAMED_UNION, - primitive arrays). -4. **Collection**: list/set/object-array fields. Non-primitive arrays are treated as LIST for - ordering purposes. -5. **Map**: map fields. -6. **Other**: user-defined enum/struct/ext and UNKNOWN types. +3. **Non-primitive**: every other field, including strings, time/date/duration/decimal/binary + values, unions, primitive arrays, collections, maps, enums, structs, ext/user-defined types, + UNKNOWN fields, object arrays, and all other non-primitive schemas. ##### Step 3: Intra-group ordering @@ -1498,16 +1505,11 @@ Within each group, apply the following sort keys in order until a difference is types (`VARINT32`, `VAR_UINT32`, `VARINT64`, `VAR_UINT64`, `TAGGED_INT64`, `TAGGED_UINT64`). 2. **Primitive size** (descending): 8-byte > 4-byte > 2-byte > 1-byte. 3. **Internal type ID** (ascending) as a tie-breaker for equal sizes. -4. **Field identifier** (lexicographic ascending). - -**Built-in / Collection / Map groups (3-5):** - -1. **Internal type ID** (ascending). -2. **Field identifier** (lexicographic ascending). +4. **Field identifier** using the comparator from Step 1. -**Other group (6):** +**Non-primitive group (3):** -1. **Field identifier** (lexicographic ascending). +1. **Field identifier** using the comparator from Step 1. If two fields still compare equal after the rules above, preserve a deterministic order by comparing declaring class name and then the original field name. 
This tie-breaker should be @@ -1517,8 +1519,9 @@ reachable only in invalid schemas (e.g., duplicate tag IDs). - The ordering above is used for serialization order and TypeDef field lists. Schema hashes use the field identifier ordering described in the schema hash section. -- Collection/map normalization is required so peers with different concrete types (e.g., - `List` vs `Collection`) still agree on ordering. +- Non-primitive type IDs and codec categories must not affect field order. Implementations may keep + internal categories to preserve optimized serializers and generated code paths, but the categories + are not ordering keys. - The compressed numeric rule is critical for cross-language consistency: compressed integer fields are always placed after all fixed-width integer fields. @@ -1535,7 +1538,7 @@ MurmurHash3 x64_128 of the struct fingerprint string: - For each field, build `<field_id_or_name>,<field_type_fingerprint>;`. - Field identifier is the tag ID if present, otherwise the snake_case field name. -- Sort by field identifier lexicographically before concatenation. +- Sort by the field identifier comparator from [Field order](#field-order) before concatenation. - `field_type_fingerprint` is recursive: - Leaf: `<type_id>,<ref>,<nullable>` - `LIST` / `SET`: `<type_id>,<ref>,<nullable>[<element_fingerprint>]`
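
Illustrative note on the `xlang_serialization_spec.md` hunk above: the new field identifier comparison rules (tagged fields first, tag IDs compared numerically, untagged fields compared by `snake_case` name) can be sketched roughly as follows. This is a minimal sketch, not Fory code: the `Field` type and `compare_field_ids` function are hypothetical names, names are assumed to be already converted to `snake_case`, and Python's code-point string comparison stands in for "lexicographic". The language-local tie-breakers from rule 4 are omitted.

```python
from functools import cmp_to_key
from typing import NamedTuple, Optional


class Field(NamedTuple):
    """Hypothetical field descriptor used only for this sketch."""
    name: str               # assumed to be in snake_case already
    tag_id: Optional[int]   # None stands for "no tag ID configured"


def compare_field_ids(a: Field, b: Field) -> int:
    """Return a negative, zero, or positive value per the rules in the hunk."""
    a_tagged = a.tag_id is not None
    b_tagged = b.tag_id is not None
    if a_tagged and b_tagged:
        # Rule 1: both tagged, compare the IDs numerically (not as strings).
        return (a.tag_id > b.tag_id) - (a.tag_id < b.tag_id)
    if a_tagged != b_tagged:
        # Rule 2: exactly one is tagged, the tagged field sorts first.
        return -1 if a_tagged else 1
    # Rule 3: neither is tagged, compare the snake_case names.
    return (a.name > b.name) - (a.name < b.name)


fields = [
    Field("user_name", None),
    Field("age", 10),
    Field("email", None),
    Field("id", 2),
]
ordered = sorted(fields, key=cmp_to_key(compare_field_ids))
names = [f.name for f in ordered]
```

With numeric comparison, tag ID `2` sorts before tag ID `10`. Under the removed wording ("use the tag ID as a decimal string"), a string comparison would have placed `"10"` before `"2"`.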
