GIPC: Introduce G-stage page table In Process Context (GIPC) capability
The current G-stage page table resides within the device context and is
only bound to one PCI-e function. Consequently, this setup limits the
function's capability to serve only a single virtual machine.

PCI-e introduces SR-IOV to address this issue, yet it comes with
constraints: SR-IOV consumes PCI configuration address space and must be
configured during the PCI device enumeration process. Its finite
capacity and inflexibility make it ill-suited for fulfilling the needs of
higher-density I/O environments.

The number of use cases for virtualizing DMA devices that do not have
built-in SR-IOV capability is increasing. The Linux VFIO driver framework
provides device-agnostic and unified APIs for direct device access. This
framework is known as Virtual Function I/O (VFIO) Mediated devices.

With GIPC, the RISC-V IOMMU can bind any mediated device to any virtual
machine without SR-IOV.

Here is a summary of this approach:
 - Use PC.iohgatp instead of DC.iohgatp.
 - Use PC.msi* instead of DC.msi*.
 - Use SPA instead of GPA for DC.fsc.pdtp and PDT. (The host maintains
   the PDT instead of the guest. For example, the guest uses the
   virtio-IOMMU API to configure a portion of the PDT, not the entire PDT.
   Every PDT entry could be a software-defined device for any VM.)
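
The selection rule in the list above can be sketched roughly as follows.
This is only an illustration, not the reference model: the function names
are hypothetical, and the MSI rule assumes that PC.msi* is available only
when the 64-byte PC format (GIPC = 1 and MSI_FLAT = 1) is in use, which is
a reading of the patch rather than something it states outright.

```python
def second_stage_source(pdtv: bool, gipc: bool) -> str:
    """Return which context supplies iohgatp ("PC" or "DC").

    Per the patch: when DC.tc.PDTV is set and capabilities.GIPC = 1,
    DC.iohgatp is ignored and PC.iohgatp is used; otherwise DC.iohgatp.
    """
    return "PC" if (pdtv and gipc) else "DC"


def msi_source(pdtv: bool, gipc: bool, msi_flat: bool) -> str:
    """Return which context supplies msiptp / msi_addr_mask / msi_addr_pattern.

    Assumption (not stated in the patch): the PC carries MSI fields only
    in the 64-byte format, so every other case falls back to the DC.
    """
    return "PC" if (pdtv and gipc and msi_flat) else "DC"
```

With GIPC = 0 both helpers return "DC", which matches the behavior before
this change.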

TODO:
 - iommu_ref_model

Signed-off-by: Guo Ren <[email protected]>
Signed-off-by: Hao Ziyi <[email protected]>
Signed-off-by: Dust Li <[email protected]>
Signed-off-by: Shuai Xue <[email protected]>
Signed-off-by: Feng Guanghui <[email protected]>
guoren83 committed Aug 29, 2024
1 parent 6fc5fe9 commit 984fda1
Showing 4 changed files with 124 additions and 30 deletions.
118 changes: 104 additions & 14 deletions src/iommu_data_structures.adoc
@@ -19,7 +19,7 @@ then only the first-stage suffices to perform necessary address translations and
protections; the second-stage scheme may be effectively disabled for the device by
programming the second-stage address translation scheme to be `Bare`.

When second-stage address translation is not Bare, the `DC` holds the PPN of the
When second-stage address translation is not Bare, the `DC` or `PC` holds the PPN of the
root second-stage page table; a guest-soft-context-ID (`GSCID`), which
facilitates invalidation of cached address translations on a per-virtual-machine
basis; and the second-stage address translation scheme.
@@ -40,6 +40,9 @@ a data structure called the Process Context (`PC`).
When a PDT is active, the controls for first-stage address translation are held
in the (`PC`).

When a PDT is active with `capabilities.GIPC = 1`, the controls for first-stage
and second-stage address translation are held in the (`PC`).

When a PDT is not active, the controls for first-stage address translation are
held in the `DC` itself.

@@ -104,12 +107,21 @@ traverse the DDT radix-tree are as follows:
], config:{lanes: 1, hspace:1024, fontsize: 16}}
....

Three formats of the process-context structure are supported:
* *Base Format* - A 16-byte process context is used when
`capabilities.GIPC = 0` and `capabilities.MSI_FLAT = 0`.

* *Extended Format* - A 32-byte process context is used when
`capabilities.GIPC = 1` and `capabilities.MSI_FLAT = 0`.

* *Extended Format with MSI page table* - A 64-byte process context is used
when `capabilities.GIPC = 1` and `capabilities.MSI_FLAT = 1`.

The PDT may be configured to be a 1, 2, or 3 level radix-tree depending on the
maximum width of the `process_id` supported by that device. The partitioning
of the `process_id` to obtain the process directory indices (PDI) to traverse
the PDT radix-tree are as follows:

.`process_id` partitioning for PDT radix-tree traversal
.Base format `process_id` partitioning for PDT radix-tree traversal

[wavedrom, , ]
....
@@ -119,6 +131,29 @@ the PDT radix-tree are as follows:
{bits: 3, name: 'PDI[2]'},
], config:{lanes: 1, hspace:1024, fontsize: 16}}
....

.Extended format `process_id` partitioning for PDT radix-tree traversal

[wavedrom, , ]
....
{reg: [
{bits: 7, name: 'PDI[0]'},
{bits: 9, name: 'PDI[1]'},
{bits: 4, name: 'PDI[2]'},
], config:{lanes: 1, hspace:1024, fontsize: 16}}
....

.Extended format with MSI page table `process_id` partitioning for PDT radix-tree traversal

[wavedrom, , ]
....
{reg: [
{bits: 6, name: 'PDI[0]'},
{bits: 9, name: 'PDI[1]'},
{bits: 5, name: 'PDI[2]'},
], config:{lanes: 1, hspace:1024, fontsize: 16}}
....
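
As a rough illustration of the three partitionings, the sketch below derives
the `PDI[0]` width from the process-context size, assuming each leaf PDT page
is 4 KiB and fully packed with process contexts, that `PDI[1]` is always 9
bits, and that `process_id` is at most 20 bits wide. The helper names are
hypothetical, and the combination `GIPC = 0` with `MSI_FLAT = 1` is rejected
only because the three formats listed here do not cover it.

```python
from typing import Tuple

PAGE_SIZE = 4096          # bytes in one PDT page (assumed)
PROCESS_ID_BITS = 20      # widest process_id (PD20)
PDI1_BITS = 9             # middle index width in all three formats


def pc_size_bytes(gipc: bool, msi_flat: bool) -> int:
    """Process-context size for the capability combination."""
    if not gipc and not msi_flat:
        return 16   # base format
    if gipc and not msi_flat:
        return 32   # extended format
    if gipc and msi_flat:
        return 64   # extended format with MSI page table
    raise ValueError("combination not covered by the three listed formats")


def split_process_id(process_id: int, pc_size: int) -> Tuple[int, int, int]:
    """Split a 20-bit process_id into (PDI[0], PDI[1], PDI[2])."""
    # Number of PCs per leaf page is a power of two; its log2 is the PDI[0] width.
    pdi0_bits = (PAGE_SIZE // pc_size).bit_length() - 1
    pdi2_bits = PROCESS_ID_BITS - PDI1_BITS - pdi0_bits
    pdi0 = process_id & ((1 << pdi0_bits) - 1)
    pdi1 = (process_id >> pdi0_bits) & ((1 << PDI1_BITS) - 1)
    pdi2 = (process_id >> (pdi0_bits + PDI1_BITS)) & ((1 << pdi2_bits) - 1)
    return pdi0, pdi1, pdi2
```

For the all-ones `process_id` `0xFFFFF`, this yields index widths of 8/9/3,
7/9/4, and 6/9/5 bits for the 16-, 32-, and 64-byte formats, matching the
three diagrams.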

[NOTE]
====
The `process_id` partitioning is designed to require a maximum of 4 KiB, a
@@ -421,6 +456,8 @@ When `SXL` is 1, the following rules apply:
], config:{lanes: 2, hspace: 1024, fontsize: 16}}
....
When `DC.tc.PDTV` is set and `capabilities.GIPC = 1`, the `DC.iohgatp` field is
ignored, and `PC.iohgatp` is used instead. Otherwise, `DC.iohgatp` is used.
The `iohgatp` field holds the PPN of the root second-stage page table and a
virtual machine identified by a guest soft-context ID (`GSCID`), to facilitate
address-translation fences on a per-virtual-machine basis. If multiple devices
@@ -562,14 +599,20 @@ determines the number of levels of the PDT.
], config:{lanes: 2, hspace: 1024, fontsize: 16}}
....
When second-stage address translation is not Bare, the `pdtp.PPN` field holds a
When second-stage address translation is not Bare and `capabilities.GIPC = 0`, the `pdtp.PPN` field holds a
guest PPN. The GPA of the root PDT is then converted by guest physical address
translation process, as controlled by the `iohgatp`, into a supervisor physical
address. Translating addresses of PDT using a second-stage page table, allows the
PDT to be held in memory allocated by the guest OS and allows the guest OS to
directly edit the PDT to associate a virtual-address space identified by a
first-stage page table with a `process_id`.
When second-stage address translation is not Bare and `capabilities.GIPC = 1`,
the `pdtp.PPN` field holds a supervisor PPN. Because the root PDT page is
addressed by a supervisor physical address, the PDT resides in the supervisor
physical address space and is maintained by the hypervisor. The guest OS uses
the virtio-IOMMU API to ask the hypervisor to edit the PDT and associate a
virtual-address space identified by a VS-stage page table with a `process_id`.
[[PDTP_MODE_ENC]]
.Encoding of `pdtp.MODE` field
[width=75%]
@@ -775,9 +818,9 @@ A valid (`V==1`) non-leaf PDT entry holds the PPN of the next-level PDT.
==== Leaf PDT entry
The leaf PDT page is indexed by `PDI[0]` and holds the 16-byte process-context
(`PC`).
(`PC`) when `capabilities.GIPC = 0` and `capabilities.MSI_FLAT = 0`.
.Process-context
.Base-format process-context
[wavedrom, , ]
....
@@ -787,7 +830,41 @@ The leaf PDT page is indexed by `PDI[0]` and holds the 16-byte process-context
], config:{lanes: 2, hspace: 1024, fontsize: 16}}
....
The `PC` is interpreted as two 64-bit doublewords. The byte order of each of the
The leaf PDT page is indexed by `PDI[0]` and holds the 32-byte process-context
(`PC`) when `capabilities.GIPC = 1` and `capabilities.MSI_FLAT = 0`.
.Extended-format process-context
[wavedrom, , ]
....
{reg: [
{bits: 64, name: 'reserved'},
{bits: 64, name: 'IO Hyp. guest addr. translation and prot. (iohgatp)'},
{bits: 64, name: 'Translation-attributes (ta)'},
{bits: 64, name: 'First-stage-context (fsc)'},
], config:{lanes: 2, hspace: 1024, fontsize: 16}}
....
The leaf PDT page is indexed by `PDI[0]` and holds the 64-byte process-context
(`PC`) when `capabilities.GIPC = 1` and `capabilities.MSI_FLAT = 1`.
.Extended-format process-context with MSI-page-table
[wavedrom, , ]
....
{reg: [
{bits: 64, name: 'reserved'},
{bits: 64, name: 'MSI-page-table pointer (msiptp)'},
{bits: 64, name: 'MSI-address-mask (msi_addr_mask)'},
{bits: 64, name: 'MSI-address-pattern (msi_addr_pattern)'},
{bits: 64, name: 'reserved'},
{bits: 64, name: 'IO Hyp. guest addr. translation and prot. (iohgatp)'},
{bits: 64, name: 'Translation-attributes (ta)'},
{bits: 64, name: 'First-stage-context (fsc)'},
], config:{lanes: 2, hspace: 1024, fontsize: 16}}
....
The `PC` is interpreted as multiple 64-bit doublewords. The byte order of each of the
doublewords in memory, little-endian or big-endian, is the endianness as
determined by `DC.tc.SBE`. The IOMMU may read the `PC` fields in any order.
@@ -849,14 +926,18 @@ field controls the supported paged virtual-memory schemes. When `PC.fsc.MODE` is
not `Bare`, the `PC.fsc.PPN` field holds the PPN of the root page of a
first-stage page table.
When second-stage address translation is not Bare, the `PC.fsc.PPN` field holds
When second-stage address translation is not Bare and `capabilities.GIPC = 0`, the `PC.fsc.PPN` field holds
a guest PPN of the root of a first-stage page table. Addresses of the first-stage
page table entries are then converted by guest physical address translation
process, as controlled by the `DC.iohgatp`, into a supervisor physical address.
A guest OS may thus directly edit the first-stage page table to limit access by
the device to a subset of its memory and specify permissions for the device
accesses.
When second-stage address translation is not Bare and `capabilities.GIPC = 1`, the `PC.fsc.PPN` field holds
a supervisor PPN of the root of a first-stage page table. A guest OS may edit
the first-stage page table with the help of the hypervisor.
[NOTE]
====
The `PC.ta.PSCID` identifies an address space. If an identical
@@ -866,6 +947,15 @@ the first page table or the second page table. These are the only expected
behaviors.
====
===== IO hypervisor guest address translation and protection (`iohgatp`)
The same as `DC.iohgatp`.
===== MSI page table pointer (`msiptp`)
The same as `DC.msiptp`.
===== MSI address mask (`msi_addr_mask`) and pattern (`msi_addr_pattern`)
The same as `DC.msi_addr_mask` and `DC.msi_addr_pattern`.
[[PC_MISCONFIG]]
==== Process-context configuration checks
@@ -1039,9 +1129,9 @@ is as follows:
==== Process to locate the Process-context
The device-context provides the PDT root page PPN (`pdtp.ppn`). When
`DC.iohgatp.mode` is not `Bare`, `pdtp.PPN` as well as `pdte.PPN` are Guest
`DC/PC.iohgatp.mode` is not `Bare`, `pdtp.PPN` as well as `pdte.PPN` are Guest
Physical Addresses (GPA) which must be translated into Supervisor Physical
Addresses (SPA) using the second-stage page table pointed to by `DC.iohgatp`.
Addresses (SPA) using the second-stage page table pointed to by `DC/PC.iohgatp`.
The memory accesses to the PDT are treated as implicit read memory accesses
by the second-stage.
@@ -1051,7 +1141,7 @@ The process to locate the Process-context for a transaction using its
. Let `a` be `pdtp.PPN x 2^12^` and let `i = LEVELS - 1`. When
`pdtp.MODE` is `PD20`, `LEVELS` is three. When `pdtp.MODE` is
`PD17`, `LEVELS` is two. When `pdtp.MODE` is `PD8`, `LEVELS` is one.
. If `DC.iohgatp.mode != Bare`, then `a` is a GPA. Invoke the process
. If `DC/PC.iohgatp.mode != Bare`, then `a` is a GPA. Invoke the process
to translate `a` to a SPA as an implicit memory access. If faults occur during
second-stage address translation of `a` then stop and report the fault detected
by the second-stage address translation process. The translated `a` is used in
@@ -1066,7 +1156,7 @@ The process to locate the Process-context for a transaction using its
. If any bits or encoding that are reserved for future standard use are
set within `pdte`, stop and report "PDT entry misconfigured" (cause = 267).
. Let `i = i - 1` and let `a = pdte.PPN x 2^12^`. Go to step 2.
. Let `PC` be the value of the 16-bytes at address `a + PDI[0] x 16`. If accessing `PC`
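. Let `PC` be the value of the 16, 32, or 64 bytes (the process-context size for the format in use) at address `a + PDI[0] x 16`, `a + PDI[0] x 32`, or `a + PDI[0] x 64`, respectively. If accessing `PC`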
violates a PMA or PMP check, then stop and report "PDT entry load access
fault" (cause = 265). If `PC` access detects a data corruption
(a.k.a. poisoned data), then stop and report "PDT data corruption"
@@ -1120,7 +1210,7 @@ file and translating the address using the MSI page table is as follows:
** `y = 1 0 1 0 0 1 1 0`
** then the value of `extract(x, y)` has bits `0 0 0 0 a c f g`.
. Let `m` be `(DC.msiptp.PPN x 2^12^)`.
. Let `m` be `(DC/PC.msiptp.PPN x 2^12^)`.
. Let `msipte` be the value of sixteen bytes at address `(m | (I x 16))`. If
accessing `msipte` violates a PMA or PMP check, then stop and report
"MSI PTE load access fault" (cause = 261).
@@ -1169,8 +1259,8 @@ operations may be performed by the I/O bridge.
When `capabilities.AMO_HWAD` is 1, the IOMMU supports updating the A and D bits in
PTEs atomically. When updating of A and D bits in second-stage PTEs is enabled
(`DC.tc.GADE=1`) and/or updating of A and D bits in first-stage PTEs is enabled
(`DC.tc.SADE=1`) the following rules apply:
(`DC/PC.tc.GADE=1`) and/or updating of A and D bits in first-stage PTEs is enabled
(`DC/PC.tc.SADE=1`) the following rules apply:
. The A and/or D bit updates by the IOMMU must follow the rules specified by the
Privileged specification for validity, permission checking, and atomicity.
7 changes: 4 additions & 3 deletions src/iommu_intro.adoc
@@ -103,7 +103,7 @@ accesses. The IOMMU may employ similar address translation caches, referred as
IOMMU Address Translation Cache (IOATC). The IOMMU provides mechanisms for
software to synchronize the IOATC with the memory resident data structures used
for address translation when they are modified. Software may configure the
device context with a software defined context identifier called guest
device/process context with a software defined context identifier called guest
soft-context identifier (`GSCID`) to indicate that a collection of devices are
assigned to the same VM and thus access a common virtual address space.
Software may configure the process context with a software defined context
@@ -139,10 +139,10 @@ would naturally be subject to the same address translation that an IOMMU
applies to other memory writes. However, the RISC-V Advanced Interrupt
Architecture cite:[AIA] requires that IOMMUs treat MSIs directed to virtual
machines specially, in part to simplify software, and in part to allow optional
support for memory-resident interrupt files. The device context is configured by
support for memory-resident interrupt files. The device/process context is configured by
software with parameters to identify memory accesses to a virtual interrupt file
and to be translated using a MSI address translation table configured by software
in the device context.
in the device/process context.

=== Glossary
.Terms and definitions
@@ -170,6 +170,7 @@ in the device context.
| DMA | Direct Memory Access.
| GPA | Guest Physical Address: An address in the virtualized
physical memory space of a virtual machine.
| GIPC | G-stage page table In Process Context.
| GSCID | Guest soft-context identifier: An identification number used
by software to uniquely identify a collection of devices
assigned to a virtual machine. An IOMMU may tag IOATC
6 changes: 4 additions & 2 deletions src/iommu_registers.adoc
@@ -151,7 +151,8 @@ the IOMMU. At reset, the register shall contain the IOMMU supported features.
{bits: 1, name: 'PD8'},
{bits: 1, name: 'PD17'},
{bits: 1, name: 'PD20'},
{bits: 15, name: 'reserved'},
{bits: 1, name: 'GIPC'},
{bits: 14, name: 'reserved'},
{bits: 8, name: 'custom'},
], config:{lanes: 8, hspace:1024}}
....
@@ -221,7 +222,8 @@ the IOMMU. At reset, the register shall contain the IOMMU supported features.
|38 |`PD8` |RO | One level PDT with 8-bit process_id supported.
|39 |`PD17` |RO | Two level PDT with 17-bit process_id supported.
|40 |`PD20` |RO | Three level PDT with 20-bit process_id supported.
|55:41 | reserved |RO | Reserved for standard use
|41 |`GIPC` |RO | G-stage page table In Process Context.
|55:42 | reserved |RO | Reserved for standard use
|63:56 |_custom_ |RO | _Designated for custom use_
|===

23 changes: 12 additions & 11 deletions src/iommu_sw_guidelines.adoc
@@ -146,9 +146,9 @@ device with `device_id = D`) then the following invalidations must be performed:
* `IODIR.INVAL_DDT` with `DV=1` and `DID=D`
* If `DC.tc.PDTV==1` then `IODIR.INVAL_PDT` with `DV=1`, `PV=0`, and `DID=D`

* If `DC.iohgatp.MODE != Bare`
** `IOTINVAL.VMA` with `GV=1`, `AV=PSCV=0`, and `GSCID=DC.iohgatp.GSCID`
** `IOTINVAL.GVMA` with `GV=1`, `AV=0`, and `GSCID=DC.iohgatp.GSCID`
* If `DC/PC.iohgatp.MODE != Bare`
** `IOTINVAL.VMA` with `GV=1`, `AV=PSCV=0`, and `GSCID=DC/PC.iohgatp.GSCID`
** `IOTINVAL.GVMA` with `GV=1`, `AV=0`, and `GSCID=DC/PC.iohgatp.GSCID`
* else
** If `DC.tc.PDTV==1 || DC.tc.PDTV == 0 && DC.fsc.MODE == Bare`
*** `IOTINVAL.VMA` with `GV=AV=PSCV=0`
@@ -171,8 +171,8 @@ If software changes a leaf-level PDT entry (i.e, a process context (`PC`), for
performed:

* `IODIR.INVAL_PDT` with `DV=1`, `PV=1`, `DID=D` and `PID=P`
* If `DC.iohgatp.MODE != Bare`
** `IOTINVAL.VMA` with `GV=1`, `AV=0`, `PV=1`, `GSCID=DC.iohgatp.GSCID`,
* If `DC/PC.iohgatp.MODE != Bare`
** `IOTINVAL.VMA` with `GV=1`, `AV=0`, `PV=1`, `GSCID=DC/PC.iohgatp.GSCID`,
and `PSCID=PC.PSCID`
* else
** `IOTINVAL.VMA` with `GV=0`, `AV=0`, `PV=1`, and `PSCID=PC.PSCID`
@@ -188,12 +188,12 @@ number `I` that corresponds to an untranslated MSI address `A` then the followin
invalidations must be performed:

* `IOTINVAL.GVMA` with `GV=AV=1`, `ADDR[63:12]=A[63:12]` and
`GSCID=DC.iohgatp.GSCID`
`GSCID=DC/PC.iohgatp.GSCID`

To invalidate all cache entries from a MSI page table the following
invalidations must be performed:

* `IOTINVAL.GVMA` with `GV=1`, `AV=0`, and `GSCID=DC.iohgatp.GSCID`
* `IOTINVAL.GVMA` with `GV=1`, `AV=0`, and `GSCID=DC/PC.iohgatp.GSCID`

Between a change to the MSI PTE and when an invalidation command to invalidate
the cached PTE is processed by the IOMMU, the IOMMU may use the old PTE value
@@ -207,19 +207,20 @@ If software changes a leaf second-stage page-table entry of a VM where the chang
affects translation for a guest-PPN `G` then the following invalidations must be
performed:

* `IOTINVAL.GVMA` with `GV=AV=1`, `GSCID=DC.iohgatp.GSCID`, and `ADDR[63:12]=G`
* `IOTINVAL.GVMA` with `GV=AV=1`, `GSCID=DC/PC.iohgatp.GSCID`, and `ADDR[63:12]=G`

If software changes a non-leaf second-stage page-table entry of a VM
then the following invalidations must be performed:

* `IOTINVAL.GVMA` with `GV=1`, `AV=0`, `GSCID=DC.iohgatp.GSCID`
* `IOTINVAL.GVMA` with `GV=1`, `AV=0`, `GSCID=DC/PC.iohgatp.GSCID`

The `DC` has fields that hold a guest-PPN. An implementation may translate such
The `DC` has fields that hold a guest-PPN when `capabilities.GIPC = 0`. An implementation may translate such
fields to a supervisor-PPN as part of caching the `DC`. If the second-stage page
table update affects translation of guest-PPN held in the `DC` then software
must invalidate all such cached `DC` using `IODIR.INVAL_DDT` with `DV=1` and
`DID` set to the corresponding `device_id`. Alternatively, an
`IODIR.INVAL_DDT` with `DV=0` may be used to invalidate all cached `DC`.
The `DC` has no fields that hold a guest-PPN when `capabilities.GIPC = 1`.

Between a change to the second-stage PTE and when an invalidation command to
invalidate the cached PTE is processed by the IOMMU, the IOMMU may use the
@@ -238,7 +239,7 @@ specified in <<IVMA>>.

When a change is made to a first-stage page table, and the second-stage is
not Bare, then software must perform invalidations using `IOTINVAL.VMA` with
`GV=1`, `GSCID=DC.iohgatp.GSCID` and `AV` and `PSCV` operands appropriate for
`GV=1`, `GSCID=DC/PC.iohgatp.GSCID` and `AV` and `PSCV` operands appropriate for
the modification as specified in <<IVMA>>.

Between a change to the first-stage PTE and when an invalidation command to