Merge pull request #61 from e0ne/doca-driver
Rename MOFED to NVIDIA DOCA Driver
rollandf authored Jul 9, 2024
2 parents 508108d + fe5c25e commit 08b0e54
Showing 6 changed files with 33 additions and 33 deletions.
42 changes: 21 additions & 21 deletions docs/customizations/helm.rst
@@ -224,9 +224,9 @@ For example:
memory: "300Mi"
- ================
- MLNX_OFED Driver
- ================
+ ===================
+ NVIDIA DOCA Driver
+ ===================

.. list-table::
:header-rows: 1
@@ -238,19 +238,19 @@ MLNX_OFED Driver
* - ofedDriver.deploy
- Bool
- false
- - Deploy the MLNX_OFED driver container
+ - Deploy the NVIDIA DOCA driver container
* - ofedDriver.repository
- String
- nvcr.io/nvidia/mellanox
- - MLNX_OFED driver image repository
+ - NVIDIA DOCA Driver image repository
* - ofedDriver.image
- String
- doca-driver
- - MLNX_OFED driver image name
+ - NVIDIA DOCA Driver image name
* - ofedDriver.version
- String
- |mofed-version|
- - MLNX_OFED driver version
+ - NVIDIA DOCA Driver version
* - ofedDriver.initContainer.enable
- Bool
- true
@@ -282,35 +282,35 @@ MLNX_OFED Driver
* - ofedDriver.imagePullSecrets
- List
- []
- - An optional list of references to secrets to use for pulling any of the MLNX_OFED driver images
+ - An optional list of references to secrets to use for pulling any of the NVIDIA DOCA Driver images
* - ofedDriver.env
- List
- []
- An optional list of `environment variables <https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#envvar-v1-core>`_ passed to the NVIDIA OFED driver image
* - ofedDriver.startupProbe.initialDelaySeconds
- Int
- 10
- - MLNX_OFED startup probe initial delay
+ - NVIDIA DOCA Driver startup probe initial delay
* - ofedDriver.startupProbe.periodSeconds
- Int
- 20
- - MLNX_OFED startup probe interval
+ - NVIDIA DOCA Driver startup probe interval
* - ofedDriver.livenessProbe.initialDelaySeconds
- Int
- 30
- - MLNX_OFED liveness probe initial delay
+ - NVIDIA DOCA Driver liveness probe initial delay
* - ofedDriver.livenessProbe.periodSeconds
- Int
- 30
- - MLNX_OFED liveness probe interval
+ - NVIDIA DOCA Driver liveness probe interval
* - ofedDriver.readinessProbe.initialDelaySeconds
- Int
- 10
- - MLNX_OFED readiness probe initial delay
+ - NVIDIA DOCA Driver readiness probe initial delay
* - ofedDriver.readinessProbe.periodSeconds
- Int
- 30
- - MLNX_OFED readiness probe interval
+ - NVIDIA DOCA Driver readiness probe interval
* - ofedDriver.upgradePolicy.autoUpgrade
- Bool
- true
@@ -360,11 +360,11 @@ MLNX_OFED Driver
- false
- Fail Mellanox OFED deployment if the precompiled OFED driver container image does not exist

- ======================================
- MLNX_OFED Driver Environment Variables
- ======================================
+ ========================================
+ NVIDIA DOCA Driver Environment Variables
+ ========================================

- The following are special environment variables supported by the MLNX_OFED container to configure its behavior:
+ The following are special environment variables supported by the NVIDIA DOCA Driver container to configure its behavior:

.. list-table::
:header-rows: 1
@@ -378,7 +378,7 @@ The following are special environment variables supported by the MLNX_OFED container
- Create a udev rule to preserve "old-style" path-based netdev names, e.g. enp3s0f0
* - UNLOAD_STORAGE_MODULES
- "false"
- - | Unload host storage modules prior to loading MLNX_OFED modules:
+ - | Unload host storage modules prior to loading NVIDIA DOCA Driver modules:
| * ib_isert
| * nvme_rdma
| * nvmet_rdma
@@ -387,12 +387,12 @@ The following are special environment variables supported by the MLNX_OFED container
| * ib_srpt
* - ENABLE_NFSRDMA
- "false"
- - Enable loading of NFS related storage modules from a MLNX_OFED container
+ - Enable loading of NFS-related storage modules from an NVIDIA DOCA Driver container
* - RESTORE_DRIVER_ON_POD_TERMINATION
- "true"
- Restore host drivers when the container is terminated

- In addition, it is possible to specify any environment variables to be exposed to the MLNX_OFED container, such as the standard "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY".
+ In addition, it is possible to specify any environment variables to be exposed to the NVIDIA DOCA Driver container, such as the standard "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY".

.. warning::
CREATE_IFNAMES_UDEV is set automatically by the Network Operator, depending on the operating system of the worker nodes in the cluster (the cluster is assumed to be homogeneous).
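
For reference, the renamed ``ofedDriver`` parameters and environment variables documented above might be combined in a Helm ``values.yaml`` as in the following sketch. The probe timings mirror the defaults listed in the table; the proxy endpoint and the choice of env vars are illustrative assumptions, not defaults:

.. code-block:: yaml

   ofedDriver:
     deploy: true
     repository: nvcr.io/nvidia/mellanox
     image: doca-driver
     # version is omitted here; the chart default (|mofed-version|) applies
     startupProbe:
       initialDelaySeconds: 10
       periodSeconds: 20
     livenessProbe:
       initialDelaySeconds: 30
       periodSeconds: 30
     readinessProbe:
       initialDelaySeconds: 10
       periodSeconds: 30
     env:
       # Standard proxy variables are passed through to the driver container
       - name: HTTPS_PROXY
         value: http://proxy.example.com:3128   # illustrative proxy endpoint
       # Unload host storage modules before loading the DOCA driver modules
       - name: UNLOAD_STORAGE_MODULES
         value: "true"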
4 changes: 2 additions & 2 deletions docs/files/RHEL_Dockerfile
@@ -86,7 +86,7 @@ ARG D_KERNEL_VER
ARG OFED_SRC_LOCAL_DIR

RUN set -x && \
- # MOFED installation requirements
+ # NVIDIA DOCA Driver installation requirements
dnf install -y autoconf gcc make rpm-build

# Build driver
@@ -123,7 +123,7 @@ RUN set -x && \
./mlnx-tools-*.rpm

RUN set -x && \
- # MOFED functional requirements
+ # NVIDIA DOCA Driver functional requirements
dnf install -y pciutils hostname udev ethtool \
# Container functional requirements
jq iproute kmod procps-ng udev
8 changes: 4 additions & 4 deletions docs/getting-started-kubernetes.rst
@@ -622,7 +622,7 @@ Network Operator Deployment for GPUDirect Workloads

GPUDirect requires the following:

- * MLNX_OFED v5.5-1.0.3.2 or newer
+ * NVIDIA DOCA Driver v5.5-1.0.3.2 or newer
* GPU Operator v1.9.0 or newer
* NVIDIA GPU and driver supporting GPUDirect, e.g. Quadro RTX 6000/8000 or NVIDIA T4/NVIDIA V100/NVIDIA A100

@@ -1003,7 +1003,7 @@ Network Operator Deployment with an SR-IOV InfiniBand Network

Network Operator deployment with InfiniBand network requires the following:

- * MLNX_OFED and OpenSM running. OpenSM runs on top of the MLNX_OFED stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to `this article <https://docs.mellanox.com/display/MLNXOFEDv51258060/OpenSM>`_.
+ * NVIDIA DOCA Driver and OpenSM running. OpenSM runs on top of the NVIDIA DOCA Driver stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to `this article <https://docs.mellanox.com/display/MLNXOFEDv51258060/OpenSM>`_.
* InfiniBand device – Both the host device and switch ports must be enabled in InfiniBand mode.
* rdma-core package should be installed when an inbox driver is used.

@@ -1122,7 +1122,7 @@ Network Operator Deployment with an SR-IOV InfiniBand Network with PKey Management

Network Operator deployment with InfiniBand network requires the following:

- * MLNX_OFED and OpenSM running. OpenSM runs on top of the MLNX_OFED stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to `this article`_.
+ * NVIDIA DOCA Driver and OpenSM running. OpenSM runs on top of the NVIDIA DOCA Driver stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to `this article`_.
* NVIDIA UFM running on top of OpenSM. For more details, please refer to `the project documentation`_.
* InfiniBand device – Both the host device and the switch ports must be enabled in InfiniBand mode.
* rdma-core package should be installed when an inbox driver is used.
@@ -1186,7 +1186,7 @@ Current limitations:
ipamPlugin:
deploy: true
- Wait for MLNX_OFED to install and apply the following CRs:
+ Wait for NVIDIA DOCA Driver to install and apply the following CRs:

``sriov-ib-network-node-policy.yaml``

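
A representative sketch of the ``sriov-ib-network-node-policy.yaml`` CR referenced above, assuming the sriov-network-operator API; the policy name, namespace, VF count, and resource name are illustrative, not values mandated by the operator:

.. code-block:: yaml

   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: infiniband-sriov          # illustrative name
     namespace: nvidia-network-operator
   spec:
     deviceType: netdevice
     nodeSelector:
       feature.node.kubernetes.io/network-sriov.capable: "true"
     nicSelector:
       vendor: "15b3"                # Mellanox/NVIDIA PCI vendor ID
     linkType: ib                    # InfiniBand link type
     isRdma: true
     numVfs: 8                       # illustrative VF count
     resourceName: mlnxnics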
4 changes: 2 additions & 2 deletions docs/getting-started-openshift.rst
@@ -109,7 +109,7 @@ If you are planning to use SR-IOV, follow these `instructions <https://docs.open
.. warning::
The SR-IOV resources created will have the `openshift.io` prefix.

- For the default SriovOperatorConfig CR to work with the MLNX_OFED container, please run this command to update the following values:
+ For the default SriovOperatorConfig CR to work with the NVIDIA DOCA Driver container, please run this command to update the following values:

.. code-block:: bash
@@ -319,7 +319,7 @@ The `pod.yaml` configuration file for such a deployment:
restartPolicy: OnFailure
containers:
- image: <rdma image>
- name: mofed-test-ctr
+ name: doca-test-ctr
securityContext:
capabilities:
add: [ "IPC_LOCK" ]
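
Expanded into a complete manifest, the ``pod.yaml`` fragment above might look like the following sketch; the pod name, command, and the ``openshift.io/mlnxnics`` resource name are assumptions (recall that SR-IOV resources on OpenShift carry the ``openshift.io`` prefix):

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: doca-test-pod             # illustrative name
   spec:
     restartPolicy: OnFailure
     containers:
       - image: <rdma image>
         name: doca-test-ctr
         securityContext:
           capabilities:
             add: [ "IPC_LOCK" ]     # needed for RDMA memory registration
         resources:
           requests:
             openshift.io/mlnxnics: "1"   # assumed SR-IOV resource name
           limits:
             openshift.io/mlnxnics: "1"
         command: [ "sh", "-c", "sleep inf" ]   # keep the pod running for testing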
6 changes: 3 additions & 3 deletions docs/life-cycle-management.rst
@@ -367,16 +367,16 @@ Troubleshooting
- Required Action
* - The node is in upgrade-failed state.
- * Drain the node manually by running kubectl drain <node name> --ignore-daemonsets.
- * Delete the MLNX_OFED pod on the node manually, by running the following command: ``kubectl delete pod -n `kubectl get pods -A --field-selector spec.nodeName=<node name> -l nvidia.com/ofed-driver --no-headers | awk '{print $1 " "$2}'```.
+ * Delete the NVIDIA DOCA Driver pod on the node manually, by running the following command: ``kubectl delete pod -n `kubectl get pods -A --field-selector spec.nodeName=<node name> -l nvidia.com/ofed-driver --no-headers | awk '{print $1 " "$2}'```.

**NOTE:** If the "Safe driver loading" feature is enabled, you may also need to remove the ``nvidia.com/ofed-driver-upgrade.driver-wait-for-safe-load`` annotation from the node object to unblock the loading of the driver:
``kubectl annotate node <node_name> nvidia.com/ofed-driver-upgrade.driver-wait-for-safe-load-``

* Wait for the node to complete the upgrade.

- * - The updated MLNX_OFED pod failed to start / a new version of MLNX_OFED cannot be installed on the node.
+ * - The updated NVIDIA DOCA Driver pod failed to start / a new version of NVIDIA DOCA Driver cannot be installed on the node.
- Manually delete the pod by using ``kubectl delete pod -n <Network Operator Namespace> <pod name>``.
- If following the restart the pod still fails, change the MLNX_OFED version in the NicClusterPolicy to the previous version or to another working version.
+ If the pod still fails after the restart, change the NVIDIA DOCA Driver version in the NicClusterPolicy to the previous version or to another working version.

=================================
Uninstalling the Network Operator
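
Rolling back, as described in the last row above, amounts to editing the ``ofedDriver`` section of the NicClusterPolicy. A minimal sketch, where the version placeholder and upgrade settings are illustrative:

.. code-block:: yaml

   apiVersion: mellanox.com/v1alpha1
   kind: NicClusterPolicy
   metadata:
     name: nic-cluster-policy
   spec:
     ofedDriver:
       image: doca-driver
       repository: nvcr.io/nvidia/mellanox
       version: <previous working version>   # set back to a known-good tag
       upgradePolicy:
         autoUpgrade: true                   # assumed to remain enabled
         maxParallelUpgrades: 1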
2 changes: 1 addition & 1 deletion docs/platform-support.rst
@@ -60,7 +60,7 @@ The following component versions are deployed by the Network Operator:
* - Node Feature Discovery
- |node-feature-discovery-version|
- Optionally deployed. May already be present in the cluster with proper configuration.
- * - NVIDIA MLNX_OFED driver container
+ * - NVIDIA DOCA Driver container
- |mofed-version|
-
* - k8s-rdma-shared-device-plugin
