FADO


Floorplan-Aware Directive Optimization for HLS designs on Multi-die FPGAs



Important Notice (Under Artifact Evaluation!)

Thanks for using our FADO framework! FADO is developed by the Reconfiguration Computing System Lab @ HKUST and will appear as a regular paper (oral) at the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA 2023).

For personal use only (not for redistribution), you can refer to the pre-print...

Linfeng Du, Tingyuan Liang, Sharad Sinha, Zhiyao Xie, and Wei Zhang. 2022. FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 11 pages.


Environment Requirements

  • Step 0: System Checking (only verified versions are listed; other versions are not guaranteed to work)
    • Ubuntu OS: 20.04.4 LTS / 20.04.5 LTS
    • Linux: 5.4.0-050400-generic / 5.14.0-1054-oem
    • Vitis/Vitis_HLS/Vivado 2020.2
    • $\geq$ 64GB DDR4 for back-end implementation using Vitis, as suggested by Xilinx document UG1301
  • Step 1: Apt: a single command: bash step1-install-apt-packages.sh, or separate commands: sudo apt install <the following packages>
    • faketime
    • iverilog
    • swig (prerequisite for pip install oapackage)
  • Step 2: Python 3.9: a single command: pip install -r step2-pip-requirements.txt, or separate commands: pip install <the following packages>
    • OApackage==2.7.1 for plotting the Pareto front
    • matplotlib==3.5.1
    • defaultlist==1.0.0
    • graphviz==0.20
    • anytree==2.8.0
    • pyverilog==1.3.0
    • mip==1.14.0
  • Step 3: Packages for Alveo U250 Board (Please note that (2) and (3) in our environment are older releases that are no longer available on the current Xilinx website, so please use our archive to reproduce the same experiment environment.)
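
Before running the experiments, it can help to confirm that the Python packages pinned in Step 2 are the ones actually installed. The following is a minimal sketch, not part of the FADO repository: the helper name `check_pins` is hypothetical, and the version table is simply copied from the Step 2 list above.

```python
from importlib import metadata

# Pinned versions copied from the Step 2 list above.
PINNED = {
    "OApackage": "2.7.1",
    "matplotlib": "3.5.1",
    "defaultlist": "1.0.0",
    "graphviz": "0.20",
    "anytree": "2.8.0",
    "pyverilog": "1.3.0",
    "mip": "1.14.0",
}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package matches. The `get_version` parameter exists only so the check can be exercised without touching the real environment.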


Artifact Evaluation

For artifact evaluation, if you encounter any difficulty with the environment or experiments, or if you need us to provide a remote environment, please do not hesitate to contact Linfeng Du @ [email protected]. We will get back to you as soon as possible (most likely within 24 hours).

To reproduce the results shown in the FADO paper, mainly the last two rows of Table 5 and the whole of Table 6, we designed the following three experiments. For each experiment, please find below:

  1. working directory and the corresponding data entry in Table 5 and/or Table 6
  2. command used in terminal
  3. an explanation of the generated results and output log
  4. the uncertainty analysis: whether you can reproduce results identical or very close to those shown in the paper. Some experiments reproduce the same results, while others can vary because of the uncertainties shown in the workflow figure below.
    • Uncertainty 1: the initial "AutoBridge Floorplanner", which uses an MILP solver, can give different initial solutions
    • Uncertainty 2: iteratively calling "AutoBridge Floorplanner" can add further uncertainty to the resulting QoR output
    • Uncertainty 3: randomness during back-end placement and routing (P&R)
    • About runtime:
      • CPU performance difference
      • Operating system process scheduling
      • random convergence time of the MILP solver
        • Note: although we could fix random seeds to keep the solver's behavior stable, doing so could limit the optimality of the generated results. Instead, we ran the experiments multiple times and reported the most commonly observed latency and resource in our paper.
      • ...
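
The "most common observation" policy described in the note above can be sketched in a few lines. This is only an illustration: the function name `most_common_result` and the five run outcomes are hypothetical and do not come from the paper or the repository.

```python
from collections import Counter


def most_common_result(runs):
    """Return the (latency, resource) pair seen most often, plus its count."""
    counts = Counter(runs)
    result, count = counts.most_common(1)[0]
    return result, count


# Hypothetical outcomes of five repeated runs:
# (latency in thousand cycles, resource utilization in percent).
runs = [
    (120.0, 50.0),
    (135.5, 49.0),
    (120.0, 50.0),
    (120.0, 50.0),
    (135.5, 49.0),
]

result, count = most_common_result(runs)
print(f"most common: {result} ({count} of {len(runs)} runs)")
```

Counting whole (latency, resource) pairs rather than each metric separately keeps the reported numbers tied to a single run that was actually observed.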

[Figure: Workflow]


Experiment 1: Latency, Resource and DSE Runtime

  • Command:

    python main.py 2 9
    • "2" for AutoBridge Floorplanner
    • "9" for various choices of directive optimization (and iterative floorplan legalization)
  • Working directories:

    • ./benchmarks/.*/latency_fp_do:
      • Corresponding data entry in the paper
        • Table 5: "Initial FP -> Iterative DO" (the second line)
      • Uncertainty analysis: (factor: Uncertainty 1)
        • latency: almost always the same
        • resource: almost always the same
        • runtime: could vary
    • ./benchmarks/.*/latency_ab:
      • Corresponding data entry in the paper
        • Table 5: "Iterative (DO + AutoBridge FP)" (the third line)
      • Uncertainty analysis: (factors: Uncertainty 1, Uncertainty 2)
        • latency: almost always the same
        • resource: almost always the same
        • runtime: could vary, especially because of the MILP solver's convergence randomness
    • ./benchmarks/.*/latency_fado:
      • Corresponding data entry in the paper
        • Table 5: "Original (no directive)" (the first line)
        • Table 5: "Iterative (DO + Incr FP) (Ours)" (the fourth line)
        • Table 6 (the whole table)
      • Uncertainty analysis: (factor: Uncertainty 1)
        • latency: almost always the same
        • resource: almost always the same
        • runtime: could vary
        • In particular, the "mttkrp_cov" benchmark could show larger randomness because its final utilization is very close to the upper limit of the available resource on the FPGA. Besides the most common results reported in our paper, other common results include:

          ======== DSE Stages (Table 6) MTTKRP*2+COV*2 ========
          Stage 0: Online
            Resource: 57.10%, Latency (thousand cycles): 160062.3
          Stage 1: Online+Offline
            Resource: 57.10%, Latency (thousand cycles): 160062.3
          Stage 2: Online+Offline+Ahead
            Resource: 63.45%, Latency (thousand cycles): 101763.6
          Stage 3: Online+Offline+Ahead+Back
            Resource: 64.39%, Latency (thousand cycles): 101755.4
          or

          ======== DSE Stages (Table 6) MTTKRP*2+COV*2 ========
          Stage 0: Online
            Resource: 63.15%, Latency (thousand cycles): 163241.1
          Stage 1: Online+Offline
            Resource: 64.67%, Latency (thousand cycles): 153927.2
          Stage 2: Online+Offline+Ahead
            Resource: 63.26%, Latency (thousand cycles): 129184.0
          Stage 3: Online+Offline+Ahead+Back
            Resource: 63.25%, Latency (thousand cycles): 128104.0

  • Output log:

    • in ./benchmarks/.*/output/latency_resource_runtime.log
    • Example log of the test ./benchmarks/cnn_2mm/latency_fado:

      Iterative (DO + Incr FP) (Our FADO) directive search result (Table 5):
      Runtime (s): 1.7685
      Latency (thousand cycles): 91.164
      Resource: 55%
      ============ DSE Stages (Table 6) ============
      Original (no directive):
        Resource: 28.27%, Latency (thousand cycles): 8933.0
      Stage 0: Online
        Resource: 28.27%, Latency (thousand cycles): 734.6
      Stage 1: Online+Offline
        Resource: 40.12%, Latency (thousand cycles): 131.8
      Stage 2: Online+Offline+Ahead
        Resource: 55.01%, Latency (thousand cycles): 91.4
      Stage 3: Online+Offline+Ahead+Back
        Resource: 54.56%, Latency (thousand cycles): 91.2

  • Explanation:

    • Experiment 1 is designed to let you obtain almost the same latency and resource, and a proportional runtime, for every test case as reported in our paper.
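
To compare a reproduced run against the paper, the per-stage lines of latency_resource_runtime.log can be parsed programmatically. The sketch below assumes the log follows exactly the layout of the example log above (a label line followed by a "Resource: ..., Latency ..." line); the function name `parse_stages` is hypothetical and the sample text is an excerpt of that example.

```python
import re

# Matches lines such as "Resource: 28.27%, Latency (thousand cycles): 734.6".
STAGE_RE = re.compile(
    r"Resource:\s*([\d.]+)%,\s*Latency \(thousand cycles\):\s*([\d.]+)"
)


def parse_stages(log_text):
    """Map each stage label to (resource_percent, latency_thousand_cycles)."""
    stages = {}
    label = None
    for raw in log_text.splitlines():
        line = raw.strip()
        match = STAGE_RE.fullmatch(line)
        if match and label is not None:
            stages[label] = (float(match.group(1)), float(match.group(2)))
        elif line:
            # Remember the most recent non-empty line as the pending label.
            label = line.rstrip(":")
    return stages


# Excerpt of the cnn_2mm example log shown above.
SAMPLE_LOG = """\
============ DSE Stages (Table 6) ============
Original (no directive):
  Resource: 28.27%, Latency (thousand cycles): 8933.0
Stage 0: Online
  Resource: 28.27%, Latency (thousand cycles): 734.6
Stage 3: Online+Offline+Ahead+Back
  Resource: 54.56%, Latency (thousand cycles): 91.2
"""

stages = parse_stages(SAMPLE_LOG)
baseline = stages["Original (no directive)"][1]
final = stages["Stage 3: Online+Offline+Ahead+Back"][1]
print(f"Latency reduction: {baseline / final:.1f}x")
```

For a real run, the same function could be applied to the contents of the log file under ./benchmarks/.*/output/, provided the file keeps this layout.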

Experiment 2: Frequency Test Only

  • Command:

    python main.py 3 4
    • "3" for exporting RTL design, and packing XO
    • "4" for running Vitis flow (v++)
  • Working directories:

    • ./benchmarks/.*/freq_fp_do:
      • Corresponding data entry in the paper
        • Table 5: "Initial FP -> Iterative DO" (the second line)
      • Uncertainty analysis: (factor: Uncertainty 3)
        • frequency: almost always the same
    • ./benchmarks/.*/freq_ab:
      • Corresponding data entry in the paper
        • Table 5: "Iterative (DO + AutoBridge FP)" (the third line)
      • Uncertainty analysis: (factor: Uncertainty 3)
        • frequency: almost always the same
    • ./benchmarks/.*/freq_fado:
      • Corresponding data entry in the paper
        • Table 5: "Iterative (DO + Incr FP) (Ours)" (the fourth line)
      • Uncertainty analysis: (factor: Uncertainty 3)
        • frequency: almost always the same
  • Output:

    • Please check the post-implementation Fmax using the script ./script/get_freq.py, e.g., starting from the current base directory:
      cd ./benchmarks/cnn_2mm/freq_fado/
      python ../../../script/get_freq.py .
    • Example output in the terminal:

      Usage: python get_freq.py $(realpath [benchmark base])
      Relative path: ./vitis_run/top_xilinx_u250_xdma_201830_2.temp/reports
      Full vitis report path: ./vitis_run/top_xilinx_u250_xdma_201830_2.temp/reports/link/imp
      Timing report found: ./vitis_run/top_xilinx_u250_xdma_201830_2.temp/reports/link/imp/impl_1_xilinx_u250_xdma_201830_2_bb_locked_timing_summary_postroute_physopted.rpt

      Fmax: 274.10

  • Explanation:

    • Experiment 2 is designed to let you obtain almost the same frequency for every test case as reported in our paper.
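
For readers unfamiliar with how a post-route Fmax relates to a timing summary, the usual relation is Fmax = 1 / (T - WNS), where T is the target clock period and WNS is the worst negative slack. The sketch below illustrates only that standard relation; it is an assumption that ./script/get_freq.py computes Fmax in a comparable way, and the function name `fmax_mhz` and the sample numbers are hypothetical.

```python
def fmax_mhz(target_period_ns, wns_ns):
    """Estimate Fmax in MHz from the target period and worst slack (both in ns)."""
    # A negative slack lengthens the achievable period; a positive one shortens it.
    achieved_period_ns = target_period_ns - wns_ns
    if achieved_period_ns <= 0:
        raise ValueError("slack must be smaller than the clock period")
    return 1000.0 / achieved_period_ns


# Hypothetical example: a 4.0 ns target (250 MHz) that misses timing by 1.0 ns.
print(f"Fmax: {fmax_mhz(4.0, -1.0):.2f}")
```

With these hypothetical inputs the achievable period is 5.0 ns, so the design would close timing at a lower frequency than the 250 MHz target.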

Experiment 3: Whole Flow of FADO

  • Command:

    python main.py 2 9
    python main.py 3 4
  • Working directories:

    • ./benchmarks/.*/all_ab:
      • Corresponding data entry in the paper
        • Table 5: "Iterative (DO + AutoBridge FP)" (the third line)
    • ./benchmarks/.*/freq_fado:
      • Corresponding data entry in the paper
        • Table 5: "Iterative (DO + Incr FP) (Ours)" (the fourth line)
  • Output:

    • Latency, Resource, and Runtime in ./benchmarks/.*/output/latency_resource_runtime.log.
    • Fmax using the script ./script/get_freq.py.
  • Explanation:

    • Experiment 3 is designed to let you test the functionality of FADO's whole workflow.
    • Since all the uncertainties mentioned above are included in this test, the QoR output could vary a little more than in the previous experiments.