On Mon, Jan 12, 2015 at 3:40 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 11/01/2015 04:57, sfel...@gmail.com wrote:
>> +PCI Configuration Space
>> +-----------------------
>> +
>> +Each switch instance registers as a PCI device with PCI configuration space:
>> +
>> +    offset  width   description             value
>> +    ---------------------------------------------
>> +    0x0     2       Vendor ID               0x1b36
>> +    0x2     2       Device ID               0x0006
>> +    0x4     4       Command/Status
>> +    0x8     1       Revision ID             0x01
>> +    0x9     3       Class code              0x2800
>> +    0xC     1       Cache line size
>> +    0xD     1       Latency timer
>> +    0xE     1       Header type
>> +    0xF     1       Built-in self test
>> +    0x10    4       Base address low
>> +    0x14    4       Base address high
>> +    0x18-28         Reserved
>> +    0x2C    2       Subsystem vendor ID     0x0000
>> +    0x2E    2       Subsystem ID            0x0000
>
> This should not be guaranteed to 0, should it?

You're right. Added a note that the subsystem implementation will fill this in.

>
>> +    0x30-38         Reserved
>> +    0x3C    1       Interrupt line
>> +    0x3D    1       Interrupt pin           0x00
>> +    0x3E    1       Min grant               0x00
>> +    0x3D    1       Max latency             0x00
>> +    0x40    1       TRDY timeout
>> +    0x41    1       Retry count
>> +    0x42    2       Reserved
>> +
>> +
>> +SECTION 3: Memory-Mapped Register Space
>> +=======================================
>> +
>> +There are two memory-mapped BARs. BAR0 maps device register space and is
>> +0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
>> +size, allowing for 256 MSI-X vectors. The host BIOS will assign the base
>> +address location. The host driver/OS will map the base address to host memory,
>> +giving the driver mmio access to the device register space.
>
> No need for the bits after "The host BIOS..." since that's just normal PCI.

Gone.

>> +All registers are 4 or 8 bytes long. It is assumed host software will access 4
>> +byte registers with one 4-byte access, and 8 byte registers with either two
>> +4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses,
>> +access must be lower and then upper 4-bytes, in that order.
>
> Double 4-byte accesses are not implemented, are they?

They are now :) Tested on i386. I'll include changes with v4.

>> +Interrupt credits
>> +^^^^^^^^^^^^^^^^^
>> +
>> +MSI-X vectors used for descriptor ring completions use a credit mechanism for
>> +efficient device, PCIe bus, OS and driver operations. Each descriptor ring has
>> +a credit count which represent the number of outstanding descriptors to be
>> +processed by the driver. As the device marks descriptors complete, the credit
>> +count is incremented. As the driver processes those outstanding descriptors,
>> +it returns credits back to the device. This way, the device knows the driver's
>> +progress and can make decisions about when to fire the next interrupt or not.
>> +
>> +When the credit count is zero, and the first descriptors are posted for the
>> +driver, a single interrupt is fired. Once the interrupt is fired, the
>> +interrupt is disabled (auto-masked). In response to the interrupt, the driver
>> +will process descriptors and PIO write a returned credit value for that
>> +descriptor ring. If the driver returns all credits (the driver caught up with
>> +the device and there is no outstanding work), then the interrupt is unmasked,
>> +but not fired. If only partial credits are returned, the interrupt remains
>> +masked but the device generates an interrupt, signaling the driver that more
>> +outstanding work is available.
>
> Perhaps mention that this masking is unrelated to the MSI-X interrupt
> mask register?

Done.

>> +SECTION 5: Test Registers
>> +=========================
>> +
>> +Rocker switch has several test registers to support troubleshooting register
>
> s/Rocker switch/Rocker/

Done.

>> +access, interrupt generation, and DMA operations:
>> +
>> +    TEST_REG, offset 0x0010, 32-bit (R/W)
>> +    TEST_REG64, offset 0x0018, 64-bit (R/W)
>> +    TEST_IRQ, offset 0x0020, 32-bit (R/W)
>> +    TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
>> +    TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
>> +    TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
>> +
>> +Reads to TEST_REG and TEST_REG64 will read a value 2x the last value written to
>
> s/2x/equal to twice/

Done.

>> +the register. The 32-bit and 64-bit versions are for testing 32-bit and 64-bit
>> +host accesses.
>
> Right now, as mentioned above, 64-bit registers must be accessed with a
> single 32-bit host access.

Fixed in implementation.

> In the case of 32-bit host accesses, should TEST_REG64's value be
> latched until the upper half is written? If so, please mention it and
> describe that this behavior is shared with the other 64-bit Rocker
> registers.
>
>> +Bits written to TEST_IRQ will cause same (unmasked) bits to be written to
>> +IRQ_STAT and an interrupt generated. Use IRQ_MASK to mask and unmask
>> +particular bits.
>
> It looks like actually TEST_IRQ will generate a single interrupt, not
> many of them. So writing 1 sets bits 1 in the PBA, not bit 0. Writing
> 3 sets bits 3, not bits 0 and 1.

Good catch...updated doc.

> Please do not use "IRQ_STAT", call it the PBA instead. Also remove the
> reference to IRQ_MASK, it's uninteresting.
>
>> +SECTION 7: Switch Control
>> +=========================
>> +
>> +This section covers switch-wide register settings.
>> +
>> +Control
>> +-------
>> +
>> +This register is used for low level control of the switch.
>> +
>> +    CONTROL: offset 0x0300, 32-bit, (W)
>> +
>> +    bit     name            description
>> +    ------------------------------------------------------------------------
>> +    [0]     CONTROL_RESET   If set, device will perform reset (same
>> +                            as pci reset)
>
> It's not the same as PCI reset, as it will not reset BARs for example.

Fixed.

>> +
>> +SECTION 8: CPU Packet Processing
>> +================================
>> +
>> +For packets ingressing on switch ports that are not forwarded by the switch but
>> +rather directed to the host CPU for further processing are delivered in the
>> +DMA RX ring. Likewise, for host CPU originating packets destined to egress on
>> +switch ports onto the network are scheduled by software using the DMA TX ring.
>
> Ingress packets for ports that are not forwarded by the switch are
> directed to the host CPU for further processing, and delivered in the
> DMA RX ring. Likewise, the host CPU can use the DMA TX ring to schedule
> packets that will egress onto the network.

Fixed by simplifying.

>> +
>> +Tx Packet Processing
>> +--------------------
>> +
>> +Software schedules packets for egress on switch ports using the DMA TX ring. A
>> +TX descriptor buffer describes the packet location and size in host DMA-able
>> +memory, the destination port, and any hardware-offload functions (such as L3
>> +payload checksum offload). Software then bumps the descriptor head to signal
>> +hardware of new Tx work. In response, hardware will DMA read Tx descriptors up
>> +to head, DMA read descriptor buffer and packet data, perform offloading
>> +functions, and finally frame packet on wire (network). Once packet processing
>> +is complete, hardware will writeback status to descriptor(s) to signal to
>> +software that Tx is complete and software resources (e.g. skb) backing packet
>> +can be released.
>> +
>> +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A
>> +TLV is used for each packet fragment.
>> +
>> +    [ASCII diagram not recoverable from the archive. Labels: Tx ring
>> +     (desc 0, desc 1, head), desc buf (TLVs), pkt frag 1, pkt frag 2,
>> +     pkt frag 3, bracketed together as one pkt]
>> +
>> +    fig 2.
>> +
>> +The TLVs for Tx descriptor buffer are:
>> +
>> +    field                   width   description
>> +    ---------------------------------------------------------------------
>> +    PPORT                   4       Destination physical port #
>> +    TX_OFFLOAD              1       Hardware offload modes:
>> +                                      0: no offload
>> +                                      1: insert IP csum (ipv4 only)
>> +                                      2: insert TCP/UDP csum
>> +                                      3: L3 csum calc and insert
>> +                                         into csum offset (TX_L3_CSUM_OFF)
>> +                                         16-bit 1's complement csum value.
>> +                                         IPv4 pseudo-header and IP
>> +                                         already calculated by OS
>> +                                         and inserted.
>> +                                      4: TSO (TCP Segmentation Offload)
>> +    TX_L3_CSUM_OFF          2       For L3 csum offload mode, the offset,
>> +                                    from the beginning of the packet,
>> +                                    of the csum field in the L3 header
>> +    TX_TSO_MSS              2       For TSO offload mode, the
>> +                                    Maximum Segment Size in bytes
>> +    TX_TSO_HDR_LEN          2       For TSO offload mode, the
>> +                                    length of ethernet, IP, and
>> +                                    TCP/UDP headers, including IP
>> +                                    and TCP options.
>> +    TX_FRAGS                <array> Packet fragments
>> +      TX_FRAG               <nest>  Packet fragment
>> +        TX_FRAG_ADDR        8       DMA address of packet fragment
>> +        TX_FRAG_LEN         2       Packet fragment length
>> +
>> +Possible status return codes in descriptor on completion are:
>> +
>> +    DESC_COMP_ERR   reason
>> +    --------------------------------------------------------------------
>> +    0               OK
>> +    ENXIO           address or data read err on desc buf or packet
>> +                    fragment
>
> This is more like EFAULT actually.
>
>> +    EINVAL          bad pport or TSO or csum offloading error
>> +    ENOMEM          no memory for internal staging tx fragment
>
> QEMU is portable and these values are not, unfortunately. So please
> hardcode them to be 6/22/12 respectively.
>
> Or even better, to avoid the temptation, make them 1/2/3 and create new
> constants ROCKER_OK, ROCKER_ERR_FAULT, ROCKER_ERR_INVAL, ROCKER_ERR_NOMEM.

Since the Linux driver is already out there in 3.18, we're stuck with the
values defined in errno.h for x86_64. But, no problem, I've hard-coded those
values for ROCKER_EINVAL, ROCKER_ENOMEM, etc. I'll switch the Linux driver
over to these constants when it's touched again.

> In any case, since you are at it, sort them in either numeric order or
> alphabetic order (apart from OK which can remain first).
>
>> +Rx Packet Processing
>> +--------------------
>> +
>> +For packets ingressing on switch ports that are not forwarded by the switch but
>> +rather directed to the host CPU for further processing are delivered in the
>> +DMA RX ring. Rx descriptor buffers are allocated by software and placed on the
>> +ring. Hardware will fill Rx descriptor buffers with packet data, write the
>> +completion, and signal to software that a new packet is ready. Since Rx packet
>> +size is not known a-priori, the Rx descriptor buffer must be allocated for
>> +worst-case packet size. A single Rx descriptor will contain the entire Rx
>> +packet data in one RX_PACKET TLV. Other Rx TLVs describe and hardware offloads
>> +performed on the packet, such as checksum validation.
>> +
>> +The TLVs for Rx descriptor buffer are:
>> +
>> +    field           width   description
>> +    ---------------------------------------------------
>> +    PPORT           4       Source physical port #
>> +    RX_FLAGS        2       Packet parsing flags:
>> +                              (1 << 0): IPv4 packet
>> +                              (1 << 1): IPv6 packet
>> +                              (1 << 2): csum calculated
>> +                              (1 << 3): IPv4 csum good
>> +                              (1 << 4): IP fragment
>> +                              (1 << 5): TCP packet
>> +                              (1 << 6): UDP packet
>> +                              (1 << 7): TCP/UDP csum good
>> +    RX_CSUM         2       IP calculated checksum:
>> +                              IPv4: IP payload csum
>> +                              IPv6: header and payload csum
>> +                              (Only valid is RX_FLAGS:csum calc is set)
>> +    RX_PACKET (N)   <var>   Packet data
>> +
>> +Possible status return codes in descriptor on completion are:
>> +
>> +    DESC_COMP_ERR   reason
>> +    --------------------------------------------------------------------
>> +    0               OK
>> +    ENXIO           address or data read err on desc buf
>> +    ENOMEM          no memory for internal staging desc buf
>> +    EMSGSIZE        Rx descriptor buffer wasn't big enough to contain
>> +                    pactet data TLV and other TLVs.
>
> EMSGSIZE in fact doesn't exist on Windows even. So make this
> ROCKER_ERR_MSGSIZE==4.
>
>
>> +    field           width   description
>> +    ----------------------------------------------------
>> +    OF_DPA_CMD      2       CMD_[ADD|MOD]
>> +    OF_DPA_TBL      2       Flow table ID
>> +                              0: ingress port
>> +                              10: vlan
>> +                              20: termination mac
>> +                              30: unicast routing
>> +                              40: multicast routing
>> +                              50: bridging
>> +                              60: ACL policy
>
> Decimal, I guess. Better mention it, if only for completeness.
>
>> +Possible status return codes in descriptor on completion are:
>> +
>> +    DESC_COMP_ERR   command             reason
>> +    --------------------------------------------------------------------
>> +    0               all                 OK
>> +    EFAULT          all                 head or tail index outside
>> +                                        of ring
>> +    ENXIO           all                 address or data read err on
>> +                                        desc buf
>> +    ENOSPC          GET_STATS           cmd descriptor buffer wasn't
>> +                                        big enough to contain write-back
>> +                                        TLVs
>> +    EINVAL          ADD|MOD             invalid parameters passed in
>> +    EEXIST          ADD                 entry already exists
>> +    ENOSPC          ADD                 no space left in flow table
>> +    ENOENT          MOD|DEL|GET_STATS   group ID invalid
>> +    EBUSY           DEL                 group reference count non-zero
>> +    ENODEV          ADD                 next group ID doesn't exist
>
> Same as above, please add decimal values instead of overloading errno.

Updated doc with new ROCKER_Exxx return codes.

>
> Paolo
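
To illustrate what hard-coding the return codes means, here is a minimal C sketch, not the actual patch. The numbers are the Linux x86_64 errno.h values (6/22/12 for ENXIO/EINVAL/ENOMEM, as noted above). Only ROCKER_EINVAL and ROCKER_ENOMEM are named in this thread; the other ROCKER_Exxx names, the enum name, and the rocker_err_name() helper are assumptions extrapolated from that pattern. Entries are in numeric order, with OK first.

```c
#include <assert.h>   /* static_assert (C11) */

/*
 * Sketch only: device return codes pinned to fixed numbers so they do
 * not depend on the build host's errno.h.  Values mirror Linux x86_64.
 * Names other than ROCKER_EINVAL and ROCKER_ENOMEM are hypothetical.
 */
enum rocker_err {
    ROCKER_OK       = 0,
    ROCKER_ENOENT   = 2,
    ROCKER_ENXIO    = 6,
    ROCKER_ENOMEM   = 12,
    ROCKER_EFAULT   = 14,
    ROCKER_EBUSY    = 16,
    ROCKER_EEXIST   = 17,
    ROCKER_ENODEV   = 19,
    ROCKER_EINVAL   = 22,
    ROCKER_ENOSPC   = 28,
    ROCKER_EMSGSIZE = 90,
};

/* Compile-time check against the 6/22/12 values quoted in the review. */
static_assert(ROCKER_ENXIO == 6 && ROCKER_EINVAL == 22 && ROCKER_ENOMEM == 12,
              "rocker return codes must not follow the host errno.h");

/* Hypothetical helper: map a completion code to its symbolic name. */
const char *rocker_err_name(int err)
{
    switch (err) {
    case ROCKER_OK:       return "ROCKER_OK";
    case ROCKER_ENOENT:   return "ROCKER_ENOENT";
    case ROCKER_ENXIO:    return "ROCKER_ENXIO";
    case ROCKER_ENOMEM:   return "ROCKER_ENOMEM";
    case ROCKER_EFAULT:   return "ROCKER_EFAULT";
    case ROCKER_EBUSY:    return "ROCKER_EBUSY";
    case ROCKER_EEXIST:   return "ROCKER_EEXIST";
    case ROCKER_ENODEV:   return "ROCKER_ENODEV";
    case ROCKER_EINVAL:   return "ROCKER_EINVAL";
    case ROCKER_ENOSPC:   return "ROCKER_ENOSPC";
    case ROCKER_EMSGSIZE: return "ROCKER_EMSGSIZE";
    default:              return "unknown";
    }
}
```

With fixed constants like these, the device model compiles to the same wire values on any host, including hosts whose errno.h lacks EMSGSIZE or numbers it differently.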