Update README.md for v1.0.0 #1100

Merged Dec 9, 2024 (33 commits)

Changes shown from 6 commits.

Commits
cfad81f  Fix wrong note in README.md (bo3z, Oct 29, 2024)
d422659  Merge branch 'main' into update-readme (jmitrevs, Nov 5, 2024)
fabcf8c  update the project status (jmitrevs, Nov 5, 2024)
b844acf  restructure of existing documentation (jmitrevs, Nov 5, 2024)
88e84f3  add an internal layers section, and auto precision (jmitrevs, Nov 5, 2024)
6abc8ad  pre-commit fixes (jmitrevs, Nov 5, 2024)
7570c11  Merge remote-tracking branch 'upstream/main' into update-docs (vloncar, Nov 26, 2024)
09bbefb  Typo fixes (vloncar, Nov 26, 2024)
42cb368  Add video tutorial link (bo3z, Dec 3, 2024)
26f4eb2  Merge branch 'main' into update-readme (jmitrevs, Dec 4, 2024)
fedf790  respond to some review comments and update some descriptions (jmitrevs, Dec 4, 2024)
f28f364  fix documentation of channels_last conversion for pytorch (JanFSchulte, Dec 5, 2024)
e55b29c  slightly expand discussion of channels_last in pytorch (JanFSchulte, Dec 5, 2024)
99e3be0  update requirements (jmduarte, Dec 5, 2024)
96b530f  add pointwise documentation (jmduarte, Dec 5, 2024)
a7b6f79  update pointwise description (jmduarte, Dec 5, 2024)
135eaa2  Merge remote-tracking branch 'upstream/main' into update-readme (vloncar, Dec 6, 2024)
6af7fef  Add FAQ to docs and readme (vloncar, Dec 6, 2024)
eac61dd  Nicer link to the tutorial (vloncar, Dec 6, 2024)
c65e915  add doc strings to pytorch-specific padding calculation functions (JanFSchulte, Dec 6, 2024)
7cf4134  Merge branch 'update-readme' of https://github.com/fastmachinelearnin… (JanFSchulte, Dec 6, 2024)
4fc1ea9  clarify default for channels last conversion in pytorch (JanFSchulte, Dec 6, 2024)
548c462  Restructure documentation (vloncar, Dec 6, 2024)
4da52a4  bump version to 1.0.0 (jmduarte, Dec 6, 2024)
6959c71  remove obsolete file references (jmitrevs, Dec 6, 2024)
47d7435  add a touch of text on the backends (jmitrevs, Dec 6, 2024)
05f8a45  expand pytorch frontend documentation (JanFSchulte, Dec 8, 2024)
6f971eb  Merge branch 'main' into update-readme (JanFSchulte, Dec 9, 2024)
536c069  [pre-commit.ci] auto fixes from pre-commit hooks (pre-commit-ci[bot], Dec 9, 2024)
d9d09e0  typos in pytorch frontend documentation (JanFSchulte, Dec 9, 2024)
16c4055  Merge branch 'update-readme' of https://github.com/fastmachinelearnin… (JanFSchulte, Dec 9, 2024)
e69a392  improve description of brevtias -> QONNX -> hlsm4l workflow (JanFSchulte, Dec 9, 2024)
896951a  Add docs on BramFactor (vloncar, Dec 9, 2024)
4 changes: 2 additions & 2 deletions README.md
@@ -49,8 +49,8 @@ hls_model = hls4ml.converters.keras_to_hls(config)
hls4ml.utils.fetch_example_list()
```

-### Building a project with Xilinx Vivado HLS (after downloading and installing from [here](https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html))
-Note: Vitis HLS is not yet supported. Vivado HLS versions between 2018.2 and 2020.1 are recommended.
+### Building a project
+We will build the project using Xilinx Vivado HLS, which can be downloaded and installed from [here](https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html). Alongside Vivado HLS, hls4ml also supports Vitis HLS, Intel HLS, and Catapult HLS, and has experimental support for Intel oneAPI. The target backend can be changed using the `backend` argument when building the model.

```Python
# Use Vivado HLS to synthesize the model
```
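
A minimal sketch of switching backends with the current Python API (the model file name and output directory here are hypothetical):

```Python
import hls4ml
from tensorflow.keras.models import load_model

model = load_model('my_model.h5')  # hypothetical model file

# Pick the target backend; 'Vivado' can be swapped for 'Vitis',
# 'Quartus', 'Catapult', or 'oneAPI'.
config = hls4ml.utils.config_from_keras_model(model, granularity='name', backend='Vivado')
hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Vivado', output_dir='my-hls-test'
)
hls_model.compile()  # C simulation of the generated HLS code
hls_model.build()    # run the HLS synthesis flow
```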
2 changes: 1 addition & 1 deletion docs/command.rst → docs/advanced/command.rst
@@ -50,7 +50,7 @@ hls4ml config

hls4ml config [-h] [-m MODEL] [-w WEIGHTS] [-o OUTPUT]

-This creates a conversion configuration file. Visit Configuration section of the :doc:`Setup <setup>` page for more details on how to write a configuration file.
+This creates a conversion configuration file. Visit the Configuration section of the :doc:`Setup <../setup>` page for more details on how to write a configuration file.

**Arguments**

File renamed without changes.
16 changes: 8 additions & 8 deletions docs/advanced/oneapi.rst
@@ -3,18 +3,17 @@ oneAPI Backend
==============

The ``oneAPI`` backend of hls4ml is designed for deploying NNs on Intel/Altera FPGAs. It will eventually
-replace the ``Quartus`` backend, which should really have been called the Intel HLS backend. (The actual Quartus
-program continues to be used with IP produced by the ``oneAPI`` backend.)
-This section discusses details of the ``oneAPI`` backend.
+replace the ``Quartus`` backend, which targeted Intel HLS. (Quartus continues to be used with IP produced by the
+``oneAPI`` backend.) This section discusses details of the ``oneAPI`` backend.

The ``oneAPI`` code uses SYCL kernels to implement the logic that is deployed on FPGAs. It naturally leads to the
-accelerator style of programming. In the IP Component flow, which is currently the only flow supported, the
+accelerator style of programming. In the SYCL HLS (IP Component) flow, which is currently the only flow supported, the
kernel becomes the IP, and the "host code" becomes the testbench. An accelerator flow, with easier deployment on
PCIe accelerator boards, is planned to be added in the future.

The produced work areas use cmake to build the projects in a style based on
`oneAPI-samples <https://github.com/oneapi-src/oneAPI-samples/tree/main/DirectProgramming/C%2B%2BSYCL_FPGA>`_.
-The standard ``fpga_emu``, ``report``, ``fpga_sim``, and ``fpga`` are supported. Additionally, ``make lib``
+The standard ``fpga_emu``, ``report``, ``fpga_sim``, and ``fpga`` make targets are supported. Additionally, ``make lib``
produces the library used for calling the ``predict`` function from hls4ml. The ``compile`` and ``build`` commands
in hls4ml interact with the cmake system, so one does not need to manually use the build system, but it is there
if desired.
@@ -30,6 +29,7 @@ io_parallel and io_stream
As mentioned in the :ref:`I/O Types` section, ``io_parallel`` is for small models, while ``io_stream`` is for
larger models. In ``oneAPI``, there is an additional difference: ``io_stream`` implements each layer on its
own ``task_sequence``. Thus, the layers run in parallel, with pipes connecting the inputs and outputs. This
-is similar in style to the `dataflow` implementation on Vitis, but more explicit. On the other hand, ``io_parallel``
-always uses a single task, relying on pipelining within the task for good performance. In contrast, the Vitis
-backend sometimes uses dataflow with ``io_parallel``.
+is similar in style to the `dataflow` implementation on Vitis HLS, but more explicit. It is also a change
+relative to the Intel HLS-based ``Quartus`` backend. On the other hand, ``io_parallel`` always uses a single task,
+relying on pipelining within the task for good performance. In contrast, the Vitis backend sometimes uses dataflow
+with ``io_parallel``.
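
As a sketch, targeting this backend from Python (assuming ``model`` and ``config`` objects from the usual conversion flow):

.. code-block:: python

   # io_stream places each layer in its own task_sequence, connected by pipes
   hls_model = hls4ml.converters.convert_from_keras_model(
       model,
       hls_config=config,
       backend='oneAPI',
       io_type='io_stream',
       output_dir='my-oneapi-prj',
   )
   hls_model.compile()  # builds the library used by predict() (make lib)
   hls_model.build()    # drives the cmake-based targets described above
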
16 changes: 16 additions & 0 deletions docs/api/auto.rst
@@ -0,0 +1,16 @@
=============================
Automatic precision inference
=============================

The automatic precision inference (implemented in :py:class:`~hls4ml.model.optimizer.passes.infer_precision.InferPrecisionTypes`) attempts to infer the appropriate widths for a given precision.
It is initiated by configuring a precision in the configuration as 'auto'. Functions like :py:class:`~hls4ml.utils.config.config_from_keras_model` and :py:class:`~hls4ml.utils.config.config_from_onnx_model`
automatically set most precisions to 'auto' if the ``'name'`` granularity is used.

.. note::
   It is recommended to pass the backend to the ``config_from_*`` functions so that they can properly extract all the configurable precisions.

The approach taken by the precision inference is to set accumulator and other precisions to never truncate, using only the bitwidths of the inputs (not the values). This is quite conservative,
especially in cases where post-training quantization is used, or if the bit widths were set fairly loosely. The recommended action in that case is to edit the configuration and explicitly set
some widths in it, potentially in an iterative process after seeing what precisions are automatically set. Another option, currently implemented in :py:class:`~hls4ml.utils.config.config_from_keras_model`,
is to pass a maximum bitwidth using the ``max_precision`` option. Then the automatic precision inference will never set a bitwidth larger than the bitwidth, or an integer part larger than the integer part, of
the ``max_precision`` that is passed. (The bitwidth and integer parts are treated separately.)

[Review thread]
Contributor: I think this can be updated now since we added it to QONNX and pytorch as well?
Contributor: I updated this to not say that it's only for keras models.
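
A minimal sketch of capping the inferred precision (the cap value here is arbitrary):

.. code-block:: python

   # 'name' granularity sets most precisions to 'auto'; max_precision caps the inference
   config = hls4ml.utils.config_from_keras_model(
       model,
       granularity='name',
       backend='Vitis',
       max_precision='ap_fixed<24,8>',
   )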
4 changes: 2 additions & 2 deletions docs/api/configuration.rst
@@ -45,8 +45,8 @@ This python dictionary can be edited as needed. A more advanced configuration ca
default_precision='fixed<16,6>',
backend='Vitis')

-This will include per-layer configuration based on the model. Including the backend is recommended because some configation options depend on the backend. Note, the precisions at the
-higher granularites usually default to 'auto', which means that ``hls4ml`` will try to set it automatically. Note that higher granularity settings take precendence
+This will include per-layer configuration based on the model. Including the backend is recommended because some configuration options depend on the backend. Note, the precisions at the
+higher granularities usually default to 'auto', which means that ``hls4ml`` will try to set it automatically (see :ref:`Automatic precision inference`). Note that higher granularity settings take precedence
over model-level settings. See :py:class:`~hls4ml.utils.config.config_from_keras_model` for more information on the various options.
[Review thread]
Contributor: Should we generalize this part of the documentation to not just feature the keras parser as an example?
Contributor: I attempted to do this, though feel free to edit.

One can override specific values before using the configuration:
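For instance, a minimal sketch (the layer name ``fc1`` is hypothetical):

.. code-block:: python

   # Override one layer's precision and reuse factor before conversion
   config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<8,2>'
   config['LayerName']['fc1']['ReuseFactor'] = 4
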
File renamed without changes.
20 changes: 16 additions & 4 deletions docs/index.rst
@@ -6,28 +6,40 @@
status
setup
release_notes
details
flows
command
reference

.. toctree::
:hidden:
:glob:
:caption: Quick API Reference

api/*
api/configuration
api/auto
api/details
api/hls-model
api/profiling

.. toctree::
:hidden:
:glob:
:caption: Internal Layers

ir/dense
ir/activations
ir/conv

.. toctree::
:hidden:
:caption: Advanced Features

advanced/flows
advanced/qonnx
advanced/fifo_depth
advanced/extension
advanced/oneapi
advanced/accelerator
advanced/model_optimization
advanced/command

.. toctree::
:hidden:
14 changes: 14 additions & 0 deletions docs/ir/activations.rst
@@ -0,0 +1,14 @@
===========
Activations
===========

Most activations without extra parameters are represented with the ``Activation`` layer, and those with single parameters (leaky ReLU, thresholded ReLU, ELU) as ``ParametrizedActivation``.
``PReLU`` has its own class because it has a parameter matrix (stored as a weight). The hard (piecewise linear) sigmoid and tanh functions are implemented in a ``HardActivation`` layer,
and ``Softmax`` has its own layer class.

Softmax has four implementations that the user can choose from by setting the ``implementation`` parameter:

* **latency**: Good latency, but somewhat high resource usage. It does not work well if there are many output classes.
* **stable**: Slower but with better accuracy, useful in scenarios where higher accuracy is needed.
* **legacy**: An older implementation with poor accuracy, but good performance. Usually the latency implementation is preferred.
* **argmax**: If you don't care about normalized outputs and only care about which one has the highest value, using argmax saves a lot of resources. This sets the highest value to 1, the others to 0.
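
A sketch of selecting an implementation through the layer configuration (the layer name ``softmax`` is hypothetical, and the exact key spelling is an assumption):

.. code-block:: python

   # Choose the softmax implementation for one layer
   config['LayerName']['softmax']['Implementation'] = 'stable'  # or 'latency', 'legacy', 'argmax'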
32 changes: 32 additions & 0 deletions docs/ir/conv.rst
@@ -0,0 +1,32 @@
==================
Convolution Layers
==================

Standard convolutions
=====================

These are the standard 1D and 2D convolutions currently supported by hls4ml, and the fallback if there is no special pointwise implementation.

io_parallel
-----------

Parallel convolutions are for cases where the model needs to be small and fast, though synthesizability limits can be quickly reached. Also note that skip connections
are not supported in io_parallel.

For the Xilinx backends and Catapult, there is a very direct convolution implementation when using the ``Latency`` strategy. This is only for very small models because of the
high number of nested loops. The ``Resource`` strategy in all cases defaults to an algorithm using the *im2col* transformation. This generally supports larger models. The ``Quartus``,
``oneAPI``, and ``Catapult`` backends also implement a ``Winograd`` algorithm, selectable by setting the ``implementation`` to ``Winograd`` or ``combination``. Note that
the Winograd implementation is available for only a handful of filter size configurations, and it is less concerned about bit accuracy and overflow, but it can be faster.

io_stream
---------

There are two main classes of io_stream implementations, ``LineBuffer`` and ``Encoded``. ``LineBuffer`` is always the default, and generally produces marginally better results,
while ``Catapult`` and ``Vivado`` also implement ``Encoded``, selectable with the ``convImplementation`` configuration option. In all cases, the data is processed serially, one pixel
at a time, with a pixel containing an array of all the channel values for the pixel.

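A sketch of how these options might be set in the layer config (the layer name ``conv1`` and the exact key spellings are assumptions based on the option names above):

.. code-block:: python

   # io_parallel: choose the algorithm
   config['LayerName']['conv1']['Strategy'] = 'Resource'        # im2col-based, larger models
   config['LayerName']['conv1']['Implementation'] = 'Winograd'  # Quartus/oneAPI/Catapult only

   # io_stream: choose the streaming implementation
   config['LayerName']['conv1']['ConvImplementation'] = 'Encoded'  # Vivado/Catapult; 'LineBuffer' is the default
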
Depthwise convolutions
======================

Pointwise convolutions
======================
25 changes: 25 additions & 0 deletions docs/ir/dense.rst
@@ -0,0 +1,25 @@
============
Dense Layers
============

One-dimensional Dense Layers
============================

One-dimensional dense layers implement a matrix multiply and bias add. The produced code is also used by other layers to implement the matrix multiplication.


io_parallel
-----------

All the backends implement a ``Resource`` implementation, which explicitly iterates over the reuse factor. There are different implementations depending on whether the reuse factor is
smaller or bigger than the input size. The two Xilinx backends and Catapult also implement a ``Latency`` implementation, which only uses the reuse factor in pragmas.

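A minimal sketch of the relevant configuration (``fc1`` is a hypothetical layer name):

.. code-block:: python

   # Reuse each multiplier 8 times, trading latency for resources
   config['LayerName']['fc1']['Strategy'] = 'Resource'
   config['LayerName']['fc1']['ReuseFactor'] = 8
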
io_stream
---------

The io_stream implementation only wraps the io_parallel implementation with streams or pipes for communication. The data is still transferred in parallel.

Multi-dimensional Dense Layers
==============================

Multi-dimensional Dense layers are converted to pointwise convolutions, and do not directly use the above implementation.
94 changes: 51 additions & 43 deletions docs/setup.rst
@@ -43,23 +43,30 @@ version can be installed directly from ``git``:
Dependencies
============

-The ``hls4ml`` library depends on a number of Python packages and external tools for synthesis and simulation. Python dependencies are automatically managed
+The ``hls4ml`` library requires Python 3.10 or later, and depends on a number of Python packages and external tools for synthesis and simulation. Python dependencies are automatically managed
by ``pip`` or ``conda``.

-* `TensorFlow <https://pypi.org/project/tensorflow/>`_ (version 2.4 and newer) and `QKeras <https://pypi.org/project/qkeras/>`_ are required by the Keras converter.
+* `TensorFlow <https://pypi.org/project/tensorflow/>`_ (version 2.8 to 2.14) and `QKeras <https://pypi.org/project/qkeras/>`_ are required by the Keras converter. One may want to install newer versions of QKeras from GitHub. Newer versions of TensorFlow can be used, but QKeras and hls4ml do not currently support Keras v3.

* `ONNX <https://pypi.org/project/onnx/>`_ (version 1.4.0 and newer) is required by the ONNX converter.

* `PyTorch <https://pytorch.org/get-started>`_ package is optional. If not installed, the PyTorch converter will not be available.

Running C simulation from Python requires a C++11-compatible compiler. On Linux, a GCC C++ compiler ``g++`` is required. Any version from a recent
-Linux should work. On MacOS, the *clang*-based ``g++`` is enough.
+Linux should work. On MacOS, the *clang*-based ``g++`` is enough. For the oneAPI backend, one must have oneAPI installed, along with the FPGA compiler,
+to run C/SYCL simulations.

To run FPGA synthesis, installation of the following tools is required:

-* Xilinx Vivado HLS 2018.2 to 2020.1 for synthesis for Xilinx FPGAs
+* Xilinx Vivado HLS 2018.2 to 2020.1 for synthesis for Xilinx FPGAs using the ``Vivado`` backend.

* Vitis HLS 2022.2 or newer is required for synthesis for Xilinx FPGAs using the ``Vitis`` backend.

-* Intel Quartus 20.1 to 21.4 for the synthesis for Intel FPGAs
+* Intel Quartus 20.1 to 21.4 for the synthesis for Intel/Altera FPGAs using the ``Quartus`` backend.

+* oneAPI 2024.1 to 2025.0 with the FPGA compiler and recent Intel/Altara Quartus for Intel/Altera FPGAs using the ``oneAPI`` backend.
[Review thread]
Contributor Author: Typo: Altara
Contributor: Fixed.


Catapult HLS 2024.1_1 or 2024.2 can be used to synthesize both for ASICs and FPGAs.


Quick Start
@@ -100,76 +107,77 @@ Done! You've built your first project using ``hls4ml``! To learn more about our

If you want to configure your model further, check out our :doc:`Configuration <api/configuration>` page.

..
   Apart from our main API, we also support model conversion using a command line interface, check out our next section to find out more:

   Getting started with hls4ml CLI (deprecated)
   --------------------------------------------

   As an alternative to the recommended Python API, the command-line interface is provided via the ``hls4ml`` command.

   To follow this tutorial, you must first download our ``example-models`` repository:

   .. code-block:: bash

      git clone https://github.com/fastmachinelearning/example-models

   Alternatively, you can clone the ``hls4ml`` repository with submodules:

   .. code-block:: bash

      git clone --recurse-submodules https://github.com/fastmachinelearning/hls4ml

   The model files, along with other configuration parameters, are defined in the ``.yml`` files.
   Further information about ``.yml`` files can be found in the :doc:`Configuration <api/configuration>` page.

   In order to create an example HLS project, first go to ``example-models/`` from the main directory:

   .. code-block:: bash

      cd example-models/

   And use this command to translate a Keras model:

   .. code-block:: bash

      hls4ml convert -c keras-config.yml

   This will create a new HLS project directory with an implementation of a model from the ``example-models/keras/`` directory.
   To build the HLS project, do:

   .. code-block:: bash

      hls4ml build -p my-hls-test -a

   This will create a Vivado HLS project with your model implementation!

   **NOTE:** For the last step, you can alternatively do the following to build the HLS project:

   .. code-block:: bash

      cd my-hls-test
      vivado_hls -f build_prj.tcl

   ``vivado_hls`` can be controlled with:

   .. code-block:: bash

      vivado_hls -f build_prj.tcl "csim=1 synth=1 cosim=1 export=1 vsynth=1"

   Setting the additional parameters from ``1`` to ``0`` disables that step, but disabling ``synth`` also disables ``cosim`` and ``export``.

   Further help
   ^^^^^^^^^^^^

   * For further information about how to use ``hls4ml``\ , do: ``hls4ml --help`` or ``hls4ml -h``
   * If you need help for a particular ``command``\ , ``hls4ml command -h`` will show help for the requested ``command``
   * We provide detailed documentation for each command in the :doc:`Command Help <advanced/command>` section

Existing examples
-----------------

* Examples of model files and weights can be found in `example_models <https://github.com/fastmachinelearning/example-models>`_ directory.
* Training codes and examples of resources needed to train the models can be found in the `tutorial <https://github.com/fastmachinelearning/hls4ml-tutorial>`__.

Uninstalling
------------