From d6c4d8dd07968d0cf2e43d4619252b572282dcde Mon Sep 17 00:00:00 2001
From: Bernhard Manfred Gruber
Date: Mon, 9 Dec 2024 10:13:15 +0100
Subject: [PATCH] Improve CUB tuning documentation (#3058)

Co-authored-by: Michael Schellenberger Costa
---
 docs/cub/benchmarking.rst |   7 ++
 docs/cub/tuning.rst       | 169 +++++++++++++++++++++++++++-----------
 2 files changed, 126 insertions(+), 50 deletions(-)

diff --git a/docs/cub/benchmarking.rst b/docs/cub/benchmarking.rst
index 9441e5de4d0..53326d49cb5 100644
--- a/docs/cub/benchmarking.rst
+++ b/docs/cub/benchmarking.rst
@@ -326,3 +326,10 @@ before viewing the report using `ncu-ui`:
     ncu-ui base.ncu-rep
 
 The version of `ncu-ui` needs to be at least as high as the version of `ncu` used to create the report.
+
+Authoring benchmarks
+--------------------------------------------------------------------------------
+
+CUB's benchmarks serve a dual purpose.
+They are used to measure and compare the performance of CUB and to tune CUB's algorithms.
+More information on how to create new benchmarks is provided in the :ref:`CUB tuning guide`.

diff --git a/docs/cub/tuning.rst b/docs/cub/tuning.rst
index 3646e9f56bc..977efd2e8bf 100644
--- a/docs/cub/tuning.rst
+++ b/docs/cub/tuning.rst
@@ -4,18 +4,26 @@ CUB Tuning Infrastructure
 ================================================================================
 
 Device-scope algorithms in CUB have many knobs that do not affect the algorithms' correctness but can significantly impact performance.
 For instance, the number of threads per block and items per thread can be tuned to maximize performance for a given device and data type.
-This document describes CUB Tuning Infrastructure, a set of tools facilitating the process of
+This document describes CUB's tuning infrastructure, a set of tools facilitating the process of
 selecting optimal tuning parameters for a given device and data type.
 
 Definitions
 --------------------------------------------------------------------------------
 
-Terms might be ambiguous in a generic context. Below, we omit the word "tuning" but assume it in all definitions.
-Algorithms are tuned for different workloads. For instance, radix sort can be tuned for different key types, different number of keys, and different distribution of keys. We separate tuning parameters into two categories:
+We omit the word "tuning" but assume it in the definitions for all terms below,
+so those terms may mean something else in a more generic context.
 
-* **Compile-time (ct) Workload** - a workload that can be recognized at compile time. For instance, the combination of key type and offset type is a compile-time workload for radix sort.
+Algorithms are tuned for different workloads, which are subspaces of all benchmark versions defined by NVBench via a benchmark's axes.
+For instance, radix sort can be tuned for different key types, different numbers of keys, and different distributions of keys.
+We partition the space spanned by a benchmark's axes into two categories:
 
-* **Runtime (rt) Workload** - a workload that can be recognized only at runtime. For instance, the number of keys along with their distribution is a runtime workload for radix sort.
+* **Compile-time (ct) Workload** - a workload that can be recognized at compile time. For instance, the combination of key type and offset type is a compile-time workload for radix sort. A compile-time workload is a point in the space spanned by the Cartesian product of the compile-time type axes of NVBench.
+
+* **Runtime (rt) Workload** - a workload that can be recognized only at runtime. For instance, the number of keys along with their distribution is a runtime workload for radix sort. A runtime workload is a point in the space spanned by the Cartesian product of the non-compile-time axes of NVBench.
+
+The tuning infrastructure can optimize algorithms only for specific compile-time workloads,
+aggregating results across all runtime workloads:
+it searches through a space of parameters to find, for a given compile-time workload, the combination with the highest score:
 
 * **Parameter** - a parameter that can be tuned to maximize performance for a given device and data type. For instance, the number of threads per block and items per thread are tuning parameters.
 
@@ -25,35 +33,40 @@ Algorithms are tuned for different workloads. For instance, radix sort can be tu
 
 * **Search Space** - Cartesian product of parameter spaces. For instance, the search space for an algorithm with tunable items per thread and threads per block might look like :math:`\{(ipt, tpb) \mid ipt \in \{1, \dots, 25\} \text{ and } tpb \in \{32, 64, 96, 128, \dots, 1024\}\}`.
 
-* **Variant** - a point from corresponding search space.
+* **Variant** - a point in the corresponding search space.
 
-* **Base** - a variant that CUB uses by default.
+* **Base** - the variant that CUB uses by default.
 
-* **Score** - a single number representing the performance for a given compile-time workload and all runtime workloads. For instance, a weighted-sum of speedups of a given variant compared to its base for all runtime workloads is a score.
+* **Score** - a single number representing the performance for a given compile-time workload across all runtime workloads. For instance, a weighted sum of the speedups of a given variant compared to its base for all runtime workloads is a score.
 
 * **Search** - a process consisting of covering all variants for all compile-time workloads to find a variant with maximal score.
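+
+For instance, with an importance weight :math:`\alpha_w` for each runtime workload :math:`w` and
+elapsed time :math:`t_v(w)` of variant :math:`v`, a score could be formalized as the weighted sum
+of speedups over the base (an illustrative formalization, not necessarily the exact formula used
+by the infrastructure):
+
+.. math::
+
+   S(v) = \sum_{w} \alpha_w \cdot \frac{t_{base}(w)}{t_v(w)}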
 
 
-Contributing Benchmarks
+Authoring Benchmarks
 --------------------------------------------------------------------------------
 
-There are a few constraints on benchmarks. First of all, all benchmarks in a single
-file should share type axes. Only alphabetical characters, numbers and underscore are allowed in the
-benchmark name. The name of the file represents the name of the algorithm.
-For instance, the :code:`benchmarks/bench/radix_sort/keys.cu` file name is going to be transformed
-into :code:`cub.bench.radix_sort.keys` that's further used in the infrastructure.
-
-You start writing a benchmark by including :code:`nvbench_helper.cuh` file. It contains all
+CUB benchmarks are split into multiple files based on the algorithm they are testing,
+and potentially further into compile-time flavors that are tuned for individually
+(e.g. sorting only keys vs. key-value pairs, or reducing using sum vs. using min).
+The name of the directory represents the name of the algorithm.
+The filename corresponds to the flavor.
+For instance, the benchmark :code:`benchmarks/bench/radix_sort/keys.cu` tests the radix sort implementation sorting only keys.
+The file name is going to be transformed into :code:`cub.bench.radix_sort.keys...`,
+which is the benchmark name reported by the infrastructure.
+
+Benchmarks are based on NVBench.
+You start writing a benchmark by including :code:`nvbench_helper.cuh`. It contains all
 necessary includes and definitions.
 
 .. code:: c++
 
     #include <nvbench_helper.cuh>
 
-The next step is to define a search space. The search space is represented by a number of comments.
-The format consists of :code:`%RANGE%` keyword, a parameter name, and a range. The range is
-represented by three numbers: start, end, and step. For instance, the following code defines a search
-space for the number of threads per block and items per thread.
+The next step is to define a search space. The search space is represented by a number of C++ comments.
+The format consists of the :code:`%RANGE%` keyword, a parameter macro, a short parameter name, and a range.
+The range is represented by three numbers: :code:`start:end:step`.
+Start and end are included.
+For instance, the following code defines a search space for two parameters, the number of threads per block and items per thread.
 
 .. code:: c++
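 
     // illustrative ranges, chosen to match the example search space from the Definitions section
     // %RANGE% TUNE_ITEMS_PER_THREAD ipt 1:25:1
     // %RANGE% TUNE_THREADS_PER_BLOCK tpb 32:1024:32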
 
@@ -62,7 +75,7 @@ space for the number of threads per block and items per thread.
 
 Next, you need to define a benchmark function. The function accepts :code:`nvbench::state &state`
 and a :code:`nvbench::type_list`. For more details on the benchmark signature, take a look at the
-nvbench docs.
+`NVBench documentation <https://github.com/NVIDIA/nvbench>`_.
 
 .. code:: c++
 
     template <typename T, typename OffsetT>
     void algname(nvbench::state &state, nvbench::type_list<T, OffsetT>)
     {
 
-Now we have to specialize the dispatch layer. The tuning infrastructure will use `TUNE_BASE` macro
-to distinguish between the base and the variant. When base is used, do not specify the policy, so
-that the default one is used. If the macro is not defined, specify custom policy using macro
-names defined at the search space specification step.
+Tuning relies on CUB's device algorithms to expose a dispatch layer which can be parameterized by a policy hub.
+CUB usually provides a default policy hub, but when tuning we want to overwrite it, so we have to specialize the dispatch layer.
+The tuning infrastructure will use the :code:`TUNE_BASE` macro to distinguish between compiling the base version (i.e., the baseline) of a benchmark
+and compiling a variant for a given set of tuning parameters.
+When the base is compiled, no policy is specified, so that the default one CUB provides is used.
+If :code:`TUNE_BASE` is not defined, we specify a custom policy
+using the parameter macros defined in the :code:`%RANGE%` comments which specify the search space.
 
 .. code:: c++
 
     #if TUNE_BASE
-      using dispatch_t = cub::DispatchReduce<...>;
+      using dispatch_t = cub::DispatchReduce<...>; // uses default policy hub
     #else
-      using policy_t = policy_hub_t<...>;
-      using dispatch_t = cub::DispatchReduce<..., policy_t>;
+      template <...>
+      struct policy_hub_t {
+        struct MaxPolicy : cub::ChainedPolicy<300, MaxPolicy, MaxPolicy> {
+          static constexpr int threads_per_block = TUNE_THREADS_PER_BLOCK;
+          static constexpr int items_per_thread = TUNE_ITEMS_PER_THREAD;
+          ...
+        };
+      };
+
+      using dispatch_t = cub::DispatchReduce<..., policy_hub_t<...>>;
     #endif
 
-If possible, do not initialize the input data in the benchmark function. Instead, use the
-:code:`gen` function. This function will fill the input vector with random data on GPU with no
-compile-time overhead.
+The custom policy hub used for tuning should only expose a single :code:`MaxPolicy`, so CUB will use it.
+It must contain all parameters from the search space.
+
+The :code:`state` passed into the benchmark function allows access to runtime workload axes,
+for example the number of elements to process.
+When creating containers for the input, avoid initializing the data yourself.
+Instead, use the :code:`gen` function,
+which will fill the input vector with random data on the GPU with no compile-time overhead.
 
 .. code:: c++
 
@@ -96,7 +125,8 @@ compile-time overhead.
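     // a plausible setup (variable names assumed): query the runtime axis and
     // allocate the input and output data
    const auto elements = static_cast<std::size_t>(state.get_int64("Elements{io}"));
    thrust::device_vector<T> in(elements);
    thrust::device_vector<T> out(1);
 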
    gen(seed_t{}, in);
 
-You can optionally add memory usage to the state:
+In addition to the benchmark runtime, NVBench can also report information on the achieved memory bandwidth.
+For this, you can optionally provide information on the memory reads and writes of the algorithm to the :code:`state`:
 
 .. code:: c++
 
@@ -104,7 +134,12 @@ You can optionally add memory usage to the state:
     state.add_element_count(elements, "Elements");
     state.add_global_memory_reads<T>(elements, "Size");
     state.add_global_memory_writes<T>(1);
 
-Now we are ready to allocate temporary storage:
+Most CUB algorithms need to be called twice:
+
+1. once to query the amount of temporary storage needed,
+2. once to run the actual algorithm.
+
+We perform the first call now and allocate temporary storage:
 
 .. code:: c++
 
@@ -116,10 +151,11 @@ Now we are ready to allocate temporary storage:
                          static_cast<OffsetT>(elements),
                          0 /* stream */);
 
-    thrust::device_vector<char> temp(temp_size);
+    thrust::device_vector<nvbench::uint8_t> temp(temp_size);
     auto *temp_storage = thrust::raw_pointer_cast(temp.data());
 
-Finally, we can run the algorithm:
+Finally, we can execute the timed region of the benchmark,
+which contains the second call to the CUB algorithm and performs the actual work we want to benchmark:
 
 .. code:: c++
 
@@ -133,11 +169,8 @@ Finally, we can run the algorithm:
     });
 }
 
-Having the benchmark function, we can tell nvbench about it. A few things to note here. First of all,
-compile-time axes should be annotated as :code:`{ct}`. The runtime axes might be optionally annotated
-as :code:`{io}` which stands for importance-ordered. This will tell the tuning infrastructure that
-the later values on the axis are more important. If the axis is not annotated, each value will be
-treated as equally important.
+This concludes defining the benchmark function.
+Now we need to tell NVBench about it:
 
 .. code:: c++
 
     NVBENCH_BENCH_TYPES(algname, NVBENCH_TYPE_AXES(all_types, offset_types))
       .set_name("base")
       .set_type_axes_names({"T{ct}", "OffsetT{ct}"})
       .add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4));
 
@@ -146,21 +179,30 @@ treated as equally important.
+:code:`NVBENCH_BENCH_TYPES` registers the benchmark as one with multiple compile-time workloads,
+which are defined by the Cartesian product of the type lists in :code:`NVBENCH_TYPE_AXES`.
+:code:`set_name(...)` sets the name of the benchmark.
+Only alphabetical characters, numbers and underscores are allowed in the benchmark name.
 
-When you define a type axis that's annotated as :code:`{ct}`, you might want to consider optimizing
+Furthermore, compile-time axes should be suffixed with :code:`{ct}`. Runtime axes may optionally be annotated
+as :code:`{io}`, which stands for importance-ordered. This will tell the tuning infrastructure that
+the later values on the axis are more important. If the axis is not annotated, each value will be
+treated as equally important.
+
+When you define a type axis annotated with :code:`{ct}`, you should consider optimizing
 the build time. Many variants are going to be built, but the search considers one compile-time
-use case at a time. This means, that if you have many types to tune for, you'll end up having
-many specializations that you don't need. To avoid this, for each compile time axis, you can
-expect a `TUNE_AxisName` macro with the type that's currently being tuned. For instance, if you
-have a type axes :code:`T{ct}` and :code:`OffsetT` (as shown above), you can use the following
-construct:
+use case at a time. This means that if you have many types to tune for, you'll end up having
+many template specializations that you don't need. To avoid this, for each compile-time axis, the tuning framework will predefine
+a :code:`TUNE_AxisName` macro with the type that's currently being tuned. For instance, if you
+have the type axes :code:`T{ct}` and :code:`OffsetT` (as shown above), you can use the following
+pattern to narrow down the types you compile for:
 
 .. code:: c++
 
     #ifdef TUNE_T
-    using types = nvbench::type_list<TUNE_T>;
+    using all_types = nvbench::type_list<TUNE_T>;
     #else
-    using types = all_types;
+    using all_types = nvbench::type_list<...>;
     #endif
 
     #ifdef TUNE_OffsetT
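    // analogous logic for the offset type axis (a sketch; the predefined
    // offset type list int32_t/int64_t is mentioned below)
    using offset_types = nvbench::type_list<TUNE_OffsetT>;
    #else
    using offset_types = nvbench::type_list<int32_t, int64_t>;
    #endif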
 
@@ -192,7 +234,14 @@ This logic is already implemented if you use any of the following predefined typ
 
 - :code:`int32_t, int64_t`
 
-But you are free to define your own axis names and use the logic above for them (see sort pairs example).
+But you are free to define your own axis names and use the logic above for them (see the sort pairs example).
+
+A single benchmark file can define multiple benchmarks (multiple benchmark functions registered with :code:`NVBENCH_BENCH_TYPES`).
+All benchmarks in a single file must share the same compile-time axes.
+The tuning infrastructure will run all benchmarks in a single file together for the same compile-time workload
+and compute a common score across all benchmarks and runtime workloads.
+This is useful to tune an algorithm at once for multiple runtime use cases
+for which we don't intend to provide separate tuning policies.
 
 
 Search Process
 --------------------------------------------------------------------------------
 
@@ -211,8 +260,12 @@ Both :code:`-a` and :code:`-R` options are optional. The first one is used to sp
 for. The second one is used to specify benchmarks to be tuned. If not specified, all benchmarks
 are going to be tuned.
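+
+For instance, a search over 8- and 16-bit key types restricted to the radix sort benchmarks could
+look like this (a sketch; the script name, axis values, and benchmark filter are illustrative):
+
+.. code:: bash
+
+    $ ../benchmarks/scripts/search.py -a "T{ct}=[I8,I16]" -R ".*radix_sort.*"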
+
+Analyzing the results
+--------------------------------------------------------------------------------
+
 The result of the search is stored in the :code:`build/cccl_meta_bench.db` file. To analyze the
-result you can use the :code:`analyze.py` script:
+result you can use the :code:`analyze.py` script.
+The :code:`--coverage` flag will show the number of variants that were covered per compile-time workload:
 
 .. code:: bash
 
     $ ../benchmarks/scripts/analyze.py --coverage
     cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I32] coverage: 167 / 522 (31.9923%)
     cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I64] coverage: 152 / 522 (29.1188%)
 
+The :code:`--top N` flag will list the best :code:`N` variants for each compile-time workload:
+
+.. code:: bash
+
     $ ../benchmarks/scripts/analyze.py --top=5
     cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I32]:
                 variant     score      mins     means      maxs
     97   ipt_19.tpb_512  1.141015  1.039052  1.243448  1.679558
     98   ipt_21.tpb_448  1.231045  1.152798  1.298332  1.621110
     85   ipt_20.tpb_480  1.229382  1.135447  1.294937  1.631225
 
+The name of the variant contains the short parameter names and the values used for the variant.
+For each variant, a score is reported. The base has a score of 1.0, so each score higher than 1.0 is an improvement over the base.
+However, because a single variant covers multiple runtime workloads, the minimum, mean, and maximum scores are also reported.
+If all three of those values are larger than 1.0, the variant is strictly better than the base.
+If only the mean or max is larger than 1.0, the variant may perform better in most runtime workloads but regress in others.
+This information can be used to change the existing tuning policies in CUB.
+
+..
+   TODO(bgruber): the following is outdated:
+
+.. code:: bash
+
     $ ../benchmarks/scripts/analyze.py --variant='ipt_(18|19).tpb_512'
 
 The last command plots the distribution of the elapsed times for the specified variants.