[PATCH v4 2/4] x86-disas: add x86-mini metadata documentation

Michael Clark Wed, 14 May 2025 16:50:14 -0700

add detailed information on the instruction opcode encoding
format for LEX/VEX/EVEX prefix, map and opcode encoding, the
operand encoding format, the field order encoding format and
notes on instruction synthesis for parameterized opcodes.


these docs are from an external project with an MIT license:

- https://github.com/michaeljclark/x86

Signed-off-by: Michael Clark <mich...@anarch128.org>
---
 docs/x86-metadata.txt | 303 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 303 insertions(+)
 create mode 100644 docs/x86-metadata.txt
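
The naming rule given in the Table Generation note of the added document
(a type prefix is added, dots and brackets are omitted, forward slashes
become underscores) can be illustrated with a minimal Python sketch. The
helper name `to_enum_name`, the `x86_opr_`/`x86_enc_` prefixes and the
treatment of braces as "brackets" are assumptions made for illustration.
The exceptions listed in `operand_map` and `opcode_map` are not modelled,
so the results may not match the identifiers in `disas/x86.h`.

```python
import re


def to_enum_name(type_prefix, mnemonic):
    """Hypothetical helper: apply the generic CSV-to-C naming rule."""
    name = mnemonic.replace("/", "_")       # forward slashes become underscores
    name = re.sub(r"[.{}\[\]]", "", name)   # dots and brackets are omitted
    return type_prefix + name               # a type prefix is added


# Prefixes below are assumed from the enum names x86_opr and x86_enc.
for prefix, mnemonic in [("x86_opr_", "r/m8"),
                         ("x86_opr_", "{sae}"),
                         ("x86_enc_", ".wx")]:
    print(mnemonic, "->", to_enum_name(prefix, mnemonic))
```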

diff --git a/docs/x86-metadata.txt b/docs/x86-metadata.txt
new file mode 100644
index 000000000000..ce96cedfe100
--- /dev/null
+++ b/docs/x86-metadata.txt
@@ -0,0 +1,303 @@
+x86 Instruction Set Metadata
+============================
+
+Legacy x86 instructions have been parameterized in the instruction set
+metadata using a new LEX prefix for instruction encoding with abstract width
+suffix codes that synthesize multiple instruction widths using combinations
+of operand size prefixes and `REX.W` bits. This new LEX format makes legacy
+instruction encodings consistent with VEX and EVEX encodings as well as
+eliminating some redundancy in the metadata.
+
+There are a small number of special cases for legacy instructions that need
+mode-dependent overrides, for example when a different opcode is used in
+different modes, or when an instruction has a quirk where the operand size
+does not follow the default rules for instruction word and address sizes:
+
+- `.wx` is used to specify 64-bit instructions that default to 32-bit
+  operands in 64-bit mode.
+- `.ww` is used to specify 64-bit instructions that default to 64-bit
+  operands in 64-bit mode.
+- `.o16` is used to specify an instruction override specific to 16-bit mode.
+- `.o32` is used to specify an instruction override specific to 32-bit mode.
+- `.o64` is used to specify an instruction override specific to 64-bit mode.
+
+CSV File Format
+===============
+
+The instruction set metadata in the `disas/x86-data` directory has the
+following fields which map to instruction encoding tables in the Intel
+Architecture Software Developer's Manual:
+
+- _Instruction_: opcode and operands from Opcode/Instruction column.
+- _Opcode_: instruction encoding from Opcode/Instruction column.
+- _Valid 64-bit_: 64-bit valid field from 64/32 bit Mode Support column.
+- _Valid 32-bit_: 32-bit valid field from 64/32 bit Mode Support column.
+- _Valid 16-bit_: 16-bit valid field from Compat/Legacy Mode column.
+- _Feature Flags_: extension name from CPUID Feature Flag column.
+- _Operand 1_: Operand 1 column from Instruction Operand Encoding table.
+- _Operand 2_: Operand 2 column from Instruction Operand Encoding table.
+- _Operand 3_: Operand 3 column from Instruction Operand Encoding table.
+- _Operand 4_: Operand 4 column from Instruction Operand Encoding table.
+- _Tuple Type_: Tuple Type column from Instruction Operand Encoding table.
+
+The instruction set metadata in the `disas/x86-data` directory is derived
+from [x86-csv](https://github.com/GregoryComer/x86-csv), although it has had
+extensive modifications to fix transcription errors, to revise legacy
+instruction encodings to conform to the new LEX format, and to add missing
+details such as missing operands, recently added AVX-512 instruction
+encodings and various other instruction set extensions.
+
+Table Generation
+================
+
+The appendices outline the printable form of the mnemonics used in the
+generated tables to describe operands, instruction encodings and field order.
+The mnemonics are referenced in the instruction set metadata files which are
+translated to enums and arrays by `scripts/x86-tablegen.py` which then map to
+the enum type and set definitions in `disas/x86.h`:
+
+- _enum x86_opr_ - operand encoding enum type and set attributes.
+- _enum x86_enc_ - instruction encoding enum type and set attributes.
+- _enum x86_ord_ - operand to instruction encoding field map set attributes.
+
+The enum values are combined using _logical or_ to form the primary
+metadata tables used by the encoder and decoder library:
+
+- _struct x86_opc_data_ - table type for instruction opcode encodings.
+- _struct x86_opr_data_ - table type for unique sets of instruction operands.
+- _struct x86_ord_data_ - table type for unique sets of instruction field orders.
+
+***Note***: There are some differences between the mnemonics used in the
+CSV metadata and the C enums. Exceptions are described in `operand_map` and
+`opcode_map` within `scripts/x86-tablegen.py`. The primary differences are
+in the names used in the operand columns to indicate operand field order,
+otherwise a type prefix is added, dots and brackets are omitted, and forward
+slashes are translated to underscores.
+
+Appendices
+==========
+
+This section describes the mnemonics used in the primary data structures:
+
+- _Appendix A - Operand Encoding_ - describes instruction operands.
+- _Appendix B - Operand Order_ - describes instruction field ordering.
+- _Appendix C - Instruction Encoding Prefixes_ - describes encoding prefixes.
+- _Appendix D - Instruction Encoding Suffixes_ - describes encoding suffixes.
+- _Appendix E - Instruction Synthesis Notes_ - notes on prefix synthesis.
+
+Appendix A - Operand Encoding
+=============================
+
+This table outlines the operand mnemonics used in instruction operands
+_(enum x86_opr)_.
+
+| operand            | description                                           |
+|:-------------------|:------------------------------------------------------|
+| `r`                | integer register                                      |
+| `v`                | vector register                                       |
+| `k`                | mask register                                         |
+| `seg`              | segment register                                      |
+| `creg`             | control register                                      |
+| `dreg`             | debug register                                        |
+| `bnd`              | bound register                                        |
+| `mem`              | memory reference                                      |
+| `rw`               | integer register word-sized (16/32/64 bit)            |
+| `ra`               | integer register addr-sized (16/32/64 bit)            |
+| `mw`               | memory reference word-sized (16/32/64 bit)            |
+| `mm`               | vector register 64-bit                                |
+| `xmm`              | vector register 128-bit                               |
+| `ymm`              | vector register 256-bit                               |
+| `zmm`              | vector register 512-bit                               |
+| `r8`               | register 8-bit                                        |
+| `r16`              | register 16-bit                                       |
+| `r32`              | register 32-bit                                       |
+| `r64`              | register 64-bit                                       |
+| `m8`               | memory reference 8-bit byte                           |
+| `m16`              | memory reference 16-bit word                          |
+| `m32`              | memory reference 32-bit dword                         |
+| `m64`              | memory reference 64-bit qword                         |
+| `m128`             | memory reference 128-bit oword/xmmword                |
+| `m256`             | memory reference 256-bit ymmword                      |
+| `m512`             | memory reference 512-bit zmmword                      |
+| `m80`              | memory reference 80-bit tword/tbyte                   |
+| `m384`             | memory reference 384-bit key locker handle            |
+| `mib`              | memory reference bound                                |
+| `m16bcst`          | memory reference 16-bit word broadcast                |
+| `m32bcst`          | memory reference 32-bit dword broadcast               |
+| `m64bcst`          | memory reference 64-bit qword broadcast               |
+| `vm32`             | vector memory 32-bit                                  |
+| `vm64`             | vector memory 64-bit                                  |
+| `{er}`             | operand suffix - embedded rounding control            |
+| `{k}`              | operand suffix - apply mask register                  |
+| `{sae}`            | operand suffix - suppress all exceptions              |
+| `{z}`              | operand suffix - zero instead of merge                |
+| `{rs2}`            | operand suffix - register stride 2                    |
+| `{rs4}`            | operand suffix - register stride 4                    |
+| `r/m8`             | register unsized memory 8-bit                         |
+| `r/m16`            | register unsized memory 16-bit                        |
+| `r/m32`            | register unsized memory 32-bit                        |
+| `r/m64`            | register unsized memory 64-bit                        |
+| `k/m8`             | mask register memory 8-bit                            |
+| `k/m16`            | mask register memory 16-bit                           |
+| `k/m32`            | mask register memory 32-bit                           |
+| `k/m64`            | mask register memory 64-bit                           |
+| `bnd/m64`          | bound register memory 64-bit                          |
+| `bnd/m128`         | bound register memory 128-bit                         |
+| `rw/mw`            | register or memory 16/32/64-bit (word size)           |
+| `r8/m8`            | 8-bit register 8-bit memory                           |
+| `r?/m?`            | N-bit register N-bit memory                           |
+| `mm/m?`            | 64-bit vector N-bit memory                            |
+| `xmm/m?`           | 128-bit vector N-bit memory                           |
+| `ymm/m?`           | 256-bit vector N-bit memory                           |
+| `zmm/m?`           | 512-bit vector N-bit memory                           |
+| `xmm/m?/m?bcst`    | 128-bit vector N-bit memory N-bit broadcast           |
+| `ymm/m?/m?bcst`    | 256-bit vector N-bit memory N-bit broadcast           |
+| `zmm/m?/m?bcst`    | 512-bit vector N-bit memory N-bit broadcast           |
+| `vm32x`            | 32-bit vector memory in xmm                           |
+| `vm32y`            | 32-bit vector memory in ymm                           |
+| `vm32z`            | 32-bit vector memory in zmm                           |
+| `vm64x`            | 64-bit vector memory in xmm                           |
+| `vm64y`            | 64-bit vector memory in ymm                           |
+| `vm64z`            | 64-bit vector memory in zmm                           |
+| `st0`              | implicit register st0                                 |
+| `st1`              | implicit register st1                                 |
+| `es`               | implicit segment es                                   |
+| `cs`               | implicit segment cs                                   |
+| `ss`               | implicit segment ss                                   |
+| `ds`               | implicit segment ds                                   |
+| `fs`               | implicit segment fs                                   |
+| `gs`               | implicit segment gs                                   |
+| `aw`               | implicit register (ax/eax/rax)                        |
+| `cw`               | implicit register (cx/ecx/rcx)                        |
+| `dw`               | implicit register (dx/edx/rdx)                        |
+| `bw`               | implicit register (bx/ebx/rbx)                        |
+| `pa`               | implicit indirect register (ax/eax/rax)               |
+| `pc`               | implicit indirect register (cx/ecx/rcx)               |
+| `pd`               | implicit indirect register (dx/edx/rdx)               |
+| `pb`               | implicit indirect register (bx/ebx/rbx)               |
+| `psi`              | implicit indirect register (si/esi/rsi)               |
+| `pdi`              | implicit indirect register (di/edi/rdi)               |
+| `xmm0`             | implicit register xmm0                                |
+| `xmm0_7`           | implicit registers xmm0-xmm7                          |
+| `1`                | constant 1                                            |
+| `ib`               | 8-bit immediate                                       |
+| `iw`               | 16-bit or 32-bit immediate (mode + operand size)      |
+| `iwd`              | 16-bit or 32-bit immediate (mode)                     |
+| `id`               | 32-bit immediate                                      |
+| `iq`               | 64-bit immediate                                      |
+| `rel8`             | 8-bit displacement                                    |
+| `relw`             | 16-bit or 32-bit displacement (mode + operand size)   |
+| `moffs`            | indirect memory offset                                |
+| `far16/16`         | 16-bit seg 16-bit far displacement                    |
+| `far16/32`         | 16-bit seg 32-bit far displacement                    |
+| `memfar16/16`      | indirect 16-bit seg 16-bit far displacement           |
+| `memfar16/32`      | indirect 16-bit seg 32-bit far displacement           |
+| `memfar16/64`      | indirect 16-bit seg 64-bit far displacement           |
+
+Appendix B - Operand Order
+==========================
+
+This table outlines the mnemonics used to map operand field order
+_(enum x86_ord)_.
+
+| mnemonic | description                                                     |
+|:---------|:----------------------------------------------------------------|
+| `imm`    | ib, iw, i16, i32, i64                                           |
+| `reg`    | modrm.reg                                                       |
+| `mrm`    | modrm.r/m                                                       |
+| `sib`    | modrm.r/m sib                                                   |
+| `is4`    | register from ib                                                |
+| `ime`    | i8, i16 (special case for CALLF/JMPF/ENTER)                     |
+| `vec`    | VEX.vvvv                                                        |
+| `opr`    | opcode +r                                                       |
+| `one`    | constant 1                                                      |
+| `rax`    | constant al/ax/eax/rax                                          |
+| `rcx`    | constant cl/cx/ecx/rcx                                          |
+| `rdx`    | constant dl/dx/edx/rdx                                          |
+| `rbx`    | constant bl/bx/ebx/rbx                                          |
+| `rsp`    | constant sp/esp/rsp                                             |
+| `rbp`    | constant bp/ebp/rbp                                             |
+| `rsi`    | constant si/esi/rsi                                             |
+| `rdi`    | constant di/edi/rdi                                             |
+| `st0`    | constant st(0)                                                  |
+| `stx`    | constant st(i)                                                  |
+| `seg`    | constant segment                                                |
+| `xmm0`   | constant xmm0                                                   |
+| `xmm0_7` | constant xmm0-xmm7                                              |
+| `mxcsr`  | constant mxcsr                                                  |
+| `rflags` | constant rflags                                                 |
+
+Appendix C - Instruction Encoding Prefixes
+==========================================
+
+This table outlines the mnemonic prefixes used in instruction encodings
+_(enum x86_enc)_.
+
+| mnemonic | description                                                     |
+|:---------|:----------------------------------------------------------------|
+| `lex`    | legacy instruction                                              |
+| `vex`    | VEX encoded instruction                                         |
+| `evex`   | EVEX encoded instruction                                        |
+| `.lz`    | VEX encoding L=0 and L=1 is unassigned                          |
+| `.l0`    | VEX encoding L=0                                                |
+| `.l1`    | VEX encoding L=1                                                |
+| `.lig`   | VEX/EVEX encoding ignores length L=any                          |
+| `.128`   | VEX/EVEX encoding uses 128-bit vector L=0                       |
+| `.256`   | VEX/EVEX encoding uses 256-bit vector L=1                       |
+| `.512`   | EVEX encoding uses 512-bit vector L=2                           |
+| `.66`    | prefix byte 66 is used for opcode mapping                       |
+| `.f2`    | prefix byte f2 is used for opcode mapping                       |
+| `.f3`    | prefix byte f3 is used for opcode mapping                       |
+| `.9b`    | prefix byte 9b is used for opcode mapping (x87 only)            |
+| `.0f`    | map 0f is used in opcode                                        |
+| `.0f38`  | map 0f38 is used in opcode                                      |
+| `.0f3a`  | map 0f3a is used in opcode                                      |
+| `.wn`    | no register extension, fixed operand size                       |
+| `.wb`    | register extension, fixed operand size                          |
+| `.wx`    | REX and/or operand size extension, optional 66 or REX.W0/W1     |
+| `.ww`    | REX and/or operand size extension, optional 66 and REX.WIG      |
+| `.w0`    | LEX/VEX/EVEX optional REX W0 with operand size used in opcode   |
+| `.w1`    | LEX/VEX/EVEX mandatory REX W1 with operand size used in opcode  |
+| `.wig`   | VEX/EVEX encoding width ignored                                 |
+
+Appendix D - Instruction Encoding Suffixes
+==========================================
+
+This table outlines the mnemonic suffixes used in instruction encodings
+_(enum x86_enc)_.
+
+| mnemonic | description                                                     |
+|:---------|:----------------------------------------------------------------|
+| `/r`     | ModRM byte                                                      |
+| `/0../7` | ModRM byte with 'r' field used for functions 0 to 7             |
+| `XX+r`   | opcode byte with 3-bit register added to the opcode             |
+| `XX`     | opcode byte                                                     |
+| `ib`     | 8-bit immediate                                                 |
+| `iw`     | 16-bit or 32-bit immediate (real mode XOR operand size)         |
+| `i16`    | 16-bit immediate                                                |
+| `i32`    | 32-bit immediate                                                |
+| `i64`    | 64-bit immediate                                                |
+| `o16`    | encoding uses prefix 66 in 32-bit and 64-bit modes              |
+| `o32`    | encoding uses prefix 66 in 16-bit mode                          |
+| `o64`    | encoding is used exclusively in 64-bit mode with REX.W=1        |
+| `a16`    | encoding uses prefix 67 in 32-bit and 64-bit modes              |
+| `a32`    | encoding uses prefix 67 in 16-bit mode                          |
+| `a64`    | encoding is used exclusively in 64-bit mode                     |
+| `lock`   | memory operand encodings can be used with the LOCK prefix       |
+| `rep`    | string instructions that can be used with the REP prefix        |
+
+Appendix E - Instruction Synthesis Notes
+========================================
+
+The `.wx` and `.ww` mnemonics are used to synthesize prefix combinations:
+
+- `.wx` labels opcodes with _default 32-bit operand size in 64-bit mode_
+  to synthesize 16/32/64-bit versions using REX and operand size prefix,
+  or in 16/32-bit modes synthesizes 16/32-bit versions using only the
+  operand size prefix. REX is used for register extension on opcodes
+  with `rw` or `rw/mw` operands or fixed register operands like `aw`.
+- `.ww` labels opcodes with _default 64-bit operand size in 64-bit mode_
+  to synthesize 16/64-bit versions using only the operand size prefix,
+  or in 16/32-bit modes synthesizes 16/32-bit versions using only the
+  operand size prefix. REX is used for register extension on opcodes
+  with `rw` or `rw/mw` operands or fixed register operands like `aw`.
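
The `.wx` and `.ww` rules in Appendix E can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not the
code in this series: the function name `synthesize_widths` and the shape of
the returned dictionary are hypothetical, the prefix labels `66` and `REX.W`
follow the `o16`/`o32`/`o64` rows of Appendix D, and REX register extension
is not modelled.

```python
def synthesize_widths(width_code, mode):
    """Map each synthesized operand size to the prefixes that select it.

    width_code is ".wx" or ".ww"; mode is 16, 32 or 64 (processor mode).
    Hypothetical helper for illustration only.
    """
    if width_code not in (".wx", ".ww"):
        raise ValueError("expected .wx or .ww, got %r" % width_code)
    if mode == 16:
        # Default is 16-bit; prefix 66 selects the 32-bit version.
        return {16: [], 32: ["66"]}
    if mode == 32:
        # Default is 32-bit; prefix 66 selects the 16-bit version.
        return {32: [], 16: ["66"]}
    if mode == 64:
        if width_code == ".wx":
            # Default 32-bit; 66 selects 16-bit, REX.W selects 64-bit.
            return {32: [], 16: ["66"], 64: ["REX.W"]}
        # .ww: default 64-bit; only the operand size prefix is used.
        return {64: [], 16: ["66"]}
    raise ValueError("mode must be 16, 32 or 64")


for code in (".wx", ".ww"):
    for mode in (16, 32, 64):
        print(code, mode, synthesize_widths(code, mode))
```

Under these assumptions, `synthesize_widths(".ww", 64)` has no 32-bit entry,
which matches the 16/64-bit wording for `.ww` in 64-bit mode.
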
-- 
2.43.0
