On 6/23/21 12:39 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 10:37:38 +0100
> Joao Martins <joao.m.mart...@oracle.com> wrote:
>
>> On 6/23/21 8:11 AM, Igor Mammedov wrote:
>>> On Tue, 22 Jun 2021 16:49:00 +0100
>>> Joao Martins <joao.m.mart...@oracle.com> wrote:
>>>
>>>> It is assumed that the whole GPA space is available to be
>>>> DMA addressable, within a given address space limit. Since
>>>> Linux v5.4 that is no longer true, and VFIO will validate
>>>> whether the selected IOVA is indeed valid, i.e. not reserved
>>>> by the IOMMU on behalf of some specific devices or
>>>> platform-defined ranges.
>>>>
>>>> AMD systems with an IOMMU are examples of such platforms, and
>>>> in particular may export only these ranges as allowed:
>>>>
>>>> 0000000000000000 - 00000000fedfffff (0 .. 3.982G)
>>>> 00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>>> 0000010000000000 - ffffffffffffffff (1Tb .. 16Eb)
>>>>
>>>> We already account for the 4G hole, but if the guest is big
>>>> enough we will fail to allocate more than ~1010G, given the
>>>> ~12G hole at the 1Tb boundary, reserved for HyperTransport.
>>>>
>>>> When creating the region above 4G, take into account which
>>>> IOVAs are allowed by defining the known allowed ranges
>>>> and searching for the next free IOVA range. When we find an
>>>> invalid IOVA, we mark it as reserved and proceed to the
>>>> next allowed IOVA region.
>>>>
>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
>>>> look like:
>>>>
>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>>>> alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
>>>> alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff
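For context, the bookkeeping described above boils down to a table of the
allowed IOVA ranges plus a helper that skips over the reserved holes (the
MSI window at 0xfee00000 and the HyperTransport hole just below 1Tb).
Roughly, with illustrative names rather than the exact ones in the patch:

#include <stdint.h>
#include <stddef.h>

struct iova_range {
    uint64_t start;
    uint64_t end;    /* inclusive */
};

/* Allowed DMA ranges on AMD hosts, as listed above; everything
 * in between is reserved. */
static const struct iova_range amd_allowed_ranges[] = {
    { 0x0000000000000000ULL, 0x00000000fedfffffULL },
    { 0x00000000fef00000ULL, 0x000000fcffffffffULL },
    { 0x0000010000000000ULL, 0xffffffffffffffffULL },
};

/* Return the first valid IOVA at or above @addr, skipping over a
 * reserved hole if @addr happens to fall inside one. */
static uint64_t next_valid_iova(uint64_t addr)
{
    for (size_t i = 0;
         i < sizeof(amd_allowed_ranges) / sizeof(amd_allowed_ranges[0]);
         i++) {
        if (addr <= amd_allowed_ranges[i].end) {
            return addr > amd_allowed_ranges[i].start
                   ? addr : amd_allowed_ranges[i].start;
        }
    }
    return addr; /* past all known ranges; nothing to skip */
}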
>>>
>>> why not push the whole ram-above-4g region above the 1Tb mark
>>> when RAM is sufficiently large (regardless of the host used),
>>> instead of creating yet another hole and all the complexity it brings along?
>>>
>>
>> There's the problem with CMOS, which describes memory above 4G; that's part
>> of the reason I cap it to 1TB minus the reserved range, i.e. for AMD, CMOS
>> would only describe up to 1T.
>>
>> But should we not care about that, then it's an option, I suppose.
> we probably do not care about CMOS with such large RAM,
> as long as QEMU generates a correct E820 (CMOS mattered only with old
> SeaBIOS, which used it for generating the memory map)
>
OK, good to know.
Any extension of CMOS would probably also be out of spec.
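For reference, the 1T ceiling comes from the encoding itself: if I read
pc_cmos_init() right, the above-4G RAM size is stored in three CMOS bytes
(0x5b..0x5d) in 64KiB units, so 24 bits of 64KiB units tops out just below
1TiB. A quick standalone check:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Three CMOS bytes (0x5b..0x5d) of above-4G RAM size,
     * expressed in 64KiB units. */
    uint64_t max_units = (1ULL << 24) - 1;
    uint64_t max_bytes = max_units << 16;   /* * 64KiB */
    printf("max above-4G RAM describable via CMOS: ~%llu GiB\n",
           (unsigned long long)(max_bytes >> 30));
    return 0;
}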
>> We would waste 1Tb of address space because of a 12G hole, and btw the
>> logic here is not so different from the 4G hole; in fact, we could probably
>> share it.
> the main reason I'm looking for an alternative is the complexity
> that making a hole brings in. At this point, we can't do anything
> about the 4G hole as it's already there, but we can try to avoid that
> for high RAM and keep the rules there as simple as they are now.
>
Right. But for what it's worth, that complexity is split across two parts:
1) dealing with a sparse RAM model (with more than one hole)
2) offsetting everything else that assumes a linear RAM map.
I don't think that even if we shift the start of RAM to after the 1TB boundary
we would get away without solving item 2 -- which personally is where I find
this a tad more hairy. So it would probably make this patch's complexity
smaller, but not vary much in how spread out the changes get.
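To make 1) concrete, its shape is basically one extra alias into pc.ram
(simplified from the patch; below_4g_mem_size and the below_1t_size /
above_1t_size split against the HyperTransport hole are computed earlier):

/* Above-4G RAM, up to where the reserved range starts... */
MemoryRegion *ram_above_4g = g_new(MemoryRegion, 1);
memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g", ram,
                         below_4g_mem_size, below_1t_size);
memory_region_add_subregion(system_memory, 0x100000000ULL, ram_above_4g);

/* ...and whatever is left gets aliased again past the 1Tb boundary,
 * which is what produces the two mtree entries quoted above. */
if (above_1t_size) {
    MemoryRegion *ram_above_1t = g_new(MemoryRegion, 1);
    memory_region_init_alias(ram_above_1t, NULL, "ram-above-1t", ram,
                             below_4g_mem_size + below_1t_size,
                             above_1t_size);
    memory_region_add_subregion(system_memory, 0x10000000000ULL,
                                ram_above_1t);
}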
> Also, partitioning/splitting main RAM is one of the things that
> gets in the way of converting it to the PC-DIMMs model.
>
Can you expand on that? (a link to a series is enough)
> Losing 1Tb of address space might be acceptable on a host
> that can handle such amounts of RAM
>