On 30-Apr-20 6:36 PM, David Christensen wrote:
On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
On 30-Apr-20 12:29 AM, David Christensen wrote:
Current SPAPR IOMMU support code dynamically modifies the DMA window
size in response to every new memory allocation. This is potentially
dangerous because all existing mappings need to be unmapped/remapped in
order to resize the DMA window, leaving hardware holding IOVA addresses
that are not properly prepared for DMA. The new SPAPR code statically
assigns the DMA window size on first use, using the largest physical
memory address when IOVA=PA and the base_virtaddr + physical memory size
when IOVA=VA. As a result, memory will only be unmapped when
specifically requested.
Signed-off-by: David Christensen <d...@linux.vnet.ibm.com>
---
Hi David,
I haven't yet looked at the code in detail (will do so later), but
some general comments and questions below.
+		/*
+		 * Read "System RAM" in /proc/iomem:
+		 * 00000000-1fffffffff : System RAM
+		 * 200000000000-201fffffffff : System RAM
+		 */
+		FILE *fd = fopen(proc_iomem, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
+			return -1;
+		}
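(The hunk is cut off here; per the commit message, the rest of this branch finds the largest "System RAM" end address and sizes the window from it. For illustration only, one way that scan could look; the function name, the getline() usage, and the final rte_align64pow2(max + 1) step are my assumptions, not necessarily the patch's actual code:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* sketch: return the highest "System RAM" end address, or 0 on error */
static uint64_t
system_ram_max_addr(void)
{
	uint64_t start, end, max = 0;
	char *line = NULL;
	size_t line_len = 0;
	FILE *fp = fopen("/proc/iomem", "r");

	if (fp == NULL)
		return 0;
	while (getline(&line, &line_len, fp) != -1) {
		/* entries look like "200000000000-201fffffffff : System RAM" */
		if (strstr(line, "System RAM") == NULL)
			continue;
		if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) == 2 &&
				end > max)
			max = end;
	}
	free(line);
	fclose(fp);
	return max;
}

The window length would then be rte_align64pow2(max + 1), so that the window covers every physical address up to and including max.)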
A quick check on my machines shows that when cat'ing /proc/iomem as
non-root, you get zeroes everywhere, which leads me to believe that
you have to be root to get anything useful out of /proc/iomem. Since
one of the major selling points of VFIO is the ability to run as
non-root, depending on iomem kind of defeats the purpose a bit.
I observed the same thing on my system during development. I didn't see
anything that precluded support for RTE_IOVA_PA in the VFIO code. Are
you suggesting that I should explicitly not support that configuration?
If you're attempting to use RTE_IOVA_PA then you're already required to
run as root, so there shouldn't be an issue accessing this file.
Oh, right, forgot about that. That's OK then.
+		return 0;
+
+	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
+		/* Set the DMA window to base_virtaddr + system memory size */
+		const char proc_meminfo[] = "/proc/meminfo";
+		const char str_memtotal[] = "MemTotal:";
+		int memtotal_len = sizeof(str_memtotal) - 1;
+		char buffer[256];
+		uint64_t size = 0;
+
+		FILE *fd = fopen(proc_meminfo, "r");
+		if (fd == NULL) {
+			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
+			return -1;
+		}
+		while (fgets(buffer, sizeof(buffer), fd)) {
+			if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
+				size = rte_str_to_size(&buffer[memtotal_len]);
+				break;
+			}
+		}
+		fclose(fd);
+
+		if (size == 0) {
+			RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
+				"in file %s\n", proc_meminfo);
+			return -1;
+		}
+
+		RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
+		/* if no base virtual address is configured use 4GB */
+		spapr_dma_win_len = rte_align64pow2(size +
+			(internal_config.base_virtaddr > 0 ?
+			(uint64_t)internal_config.base_virtaddr : 1ULL << 32));
+		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
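(As a worked example of that computation: with MemTotal = 16GB and base_virtaddr = 4GB, spapr_dma_win_len = rte_align64pow2(0x400000000 + 0x100000000) = 0x800000000, i.e. a 32GB window; with base_virtaddr unset, 4GB (1ULL << 32) is added in its place before rounding.)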
I'm not sure of the algorithm for "memory size" here.
Technically, DPDK can reserve memory segments anywhere in the VA space
allocated by memseg lists. That space may be far bigger than system
memory (on a typical Intel server board you'd see 128GB of VA space
preallocated even though the machine itself might only have, say, 16GB
of RAM installed). The same applies to any other arch running on
Linux, so the window needs to cover at least RTE_MIN(base_virtaddr,
lowest memseglist VA address) and up to highest memseglist VA address.
That's not even mentioning the fact that the user may register
external memory for DMA which may cause the window to be of
insufficient size to cover said external memory.
I also think that, in general, a "system memory" metric is ill-suited
for measuring VA space: unlike system memory, the VA space is sparse
and can therefore span *a lot* of address space even though it may
actually use very little physical memory.
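For reference, the memseg-list bounds mentioned above can be computed with rte_memseg_list_walk(); a minimal sketch, where the va_bounds bookkeeping is illustrative and not existing EAL code:

#include <stdint.h>

#include <rte_log.h>
#include <rte_memory.h>

struct va_bounds {
	uintptr_t lo;
	uintptr_t hi;
};

/* callback: widen [lo, hi) to cover this memseg list's VA range */
static int
msl_bounds(const struct rte_memseg_list *msl, void *arg)
{
	struct va_bounds *b = arg;
	uintptr_t start = (uintptr_t)msl->base_va;
	uintptr_t end = start + msl->len;

	if (start < b->lo)
		b->lo = start;
	if (end > b->hi)
		b->hi = end;
	return 0;
}

static void
log_memseg_va_bounds(void)
{
	struct va_bounds b = { .lo = UINTPTR_MAX, .hi = 0 };

	rte_memseg_list_walk(msl_bounds, &b);
	/* per the comment above, a static window must span at least
	 * [RTE_MIN(base_virtaddr, b.lo), b.hi) */
	RTE_LOG(DEBUG, EAL, "memseg VA bounds: [%p, %p)\n",
		(void *)b.lo, (void *)b.hi);
}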
I'm open to suggestions here. Perhaps an alternative in /proc/meminfo:
VmallocTotal: 549755813888 kB
I tested it with 1GB hugepages and it works; I still need to check with
2MB as well. If there's no alternative for sizing the window based on
available system parameters then I have another option which creates a
new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X
where X is configured on the EAL command-line (--iova-base, --iova-len).
I use these command-line values to create a static window.
A whole new IOVA mode, while being a cleaner solution, would require a
lot of testing, and it doesn't really solve the external memory problem,
because we're still reliant on the user to provide IOVA addresses.
Perhaps something akin to VA/IOVA address reservation would solve the
problem, but again, lots of changes and testing, all for a comparatively
narrow use case.
The vmalloc area seems big enough (512 terabytes on your machine, 32
terabytes on mine), so it'll probably be OK. I'd settle for:
1) start at base_virtaddr OR lowest memseg list address, whichever is lower
2) end at lowest addr + VmallocTotal OR highest memseglist addr,
whichever is higher
3) a check in user DMA map function that would warn/throw an error
whenever there is an attempt to map an address for DMA that doesn't fit
into the DMA window
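For point 3, a guard along these lines in the SPAPR user DMA map path would do; this is a sketch, and the function name and a window base of IOVA 0 are assumptions on my part:

#include <inttypes.h>

#include <rte_log.h>

/* length of the static DMA window, set when the window is created */
extern uint64_t spapr_dma_win_len;

/* sketch: reject user DMA maps that fall outside the static window */
static int
check_dma_window(uint64_t iova, uint64_t len)
{
	/* overflow-safe form of: iova + len > spapr_dma_win_len */
	if (len > spapr_dma_win_len || iova > spapr_dma_win_len - len) {
		RTE_LOG(ERR, EAL,
			"DMA map outside DMA window [0, 0x%" PRIx64 ")\n",
			spapr_dma_win_len);
		return -1;
	}
	return 0;
}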
I think that would be the best approach. Thoughts?
Dave
--
Thanks,
Anatoly