v1.2.1 Update
v1.2.1
Added Zynq Ultrascale Plus Whole App examples
Updated U50 XRT and shell to Xilinx-u50-gen3x4-xdma-2-202010.1-2902115
Updated docker launch instructions
Updated TRD makefile instructions
kamranjk authored Jul 30, 2020
1 parent 6a699e3 commit 0353caa
Showing 55 changed files with 4,122 additions and 348 deletions.
8 changes: 4 additions & 4 deletions AI-Model-Zoo/README.md
Original file line number Diff line number Diff line change
@@ -638,10 +638,10 @@ The following table lists the performance number including end-to-end throughput
| 11 | ssd_traffic_pruned_0_9 | cf_ssdtraffic_360_480_0.9_11.6G | 247.5 | 570.84 |
| 12 | VPGnet_pruned_0_99 | cf_VPGnet_caltechlane_480_640_0.99_2.5G | 275 | 658.99 |
| 13 | FPN | cf_fpn_cityscapes_256_512_8.9G | 247.5 | 552.17 |
| 14 | SP_net | cf_SPnet_aichallenger_224_128_0.54G | 275 | 1706.95 |
| 15 | Openpose_pruned_0_3 | cf_openpose_aichallenger_368_368_0.3_189.7G | 220 | 39.68 |
| 16 | densebox_320_320 | cf_densebox_wider_320_320_0.49G | 275 | 2572.69 |
| 17 | densebox_640_360 | cf_densebox_wider_360_640_1.11G | 275 | 1125.09 |
| 18 | face_landmark | cf_landmark_celeba_96_72_0.14G | 275 | 12917.20 |
| 19 | reid | cf_reid_market1501_160_80_0.95G | 275 | 5548.10 |
| 20 | multi_task | cf_multitask_bdd_288_512_14.8G | 247.5 | 176.96 |
1 change: 1 addition & 0 deletions DPU-TRD/README.md
@@ -56,6 +56,7 @@ There are three dimensions of parallelism in the DPU convolution architecture -

[DPU TRD Vitis Flow ](./prj/Vitis/README.md)


[DPU TRD Vivado Flow](./prj/Vivado/README.md)

****
4 changes: 2 additions & 2 deletions DPU-TRD/prj/Vivado/README.md
@@ -58,8 +58,8 @@ Required:
Required:
- Vivado 2020.1 [Vivado Design Tools](https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools.html)
- Serial terminal emulator e.g. [teraterm](http://logmett.com/tera-term-the-latest-version)
- [Vitis AI 1.1](https://github.com/Xilinx/Vitis-AI) to run models other than Resnet50, Optional
- [Vitis AI Library 1.1](https://github.com/Xilinx/Vitis-AI/tree/master/Vitis-AI-Library) to configure DPU in Vitis AI Library
- [Vitis AI 1.2](https://github.com/Xilinx/Vitis-AI) to run models other than Resnet50, Optional
- [Vitis AI Library 1.2](https://github.com/Xilinx/Vitis-AI/tree/master/Vitis-AI-Library) to configure DPU in Vitis AI Library

------

1 change: 1 addition & 0 deletions DPU-TRD/prj/Vivado/constrs/misc.xdc
@@ -15,5 +15,6 @@
# */



# compress bitstream
set_property BITSTREAM.GENERAL.COMPRESS TRUE [current_design]
2 changes: 2 additions & 0 deletions README.md
@@ -204,3 +204,5 @@ For more information, please refer to [Vitis AI User Guide](https://www.xilinx.c
[Models]: https://www.xilinx.com/products/boards-and-kits/alveo/applications/xilinx-machine-learning-suite.html#gettingStartedCloud
[whitepaper here]: https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
[Performance Whitepaper]: https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
```
126 changes: 126 additions & 0 deletions VART/Whole-App-Acceleration/README.md
@@ -0,0 +1,126 @@
# Whole Application Acceleration: Accelerating ML Preprocessing for Classification and Detection networks

## Introduction

This application demonstrates how Xilinx® [Vitis Vision library](https://github.com/Xilinx/Vitis_Libraries/tree/master/vision) functions can be integrated with a deep neural network (DNN) accelerator to achieve complete application acceleration, focusing on accelerating the pre-processing involved in the inference of classification and object detection networks.

## Background

Input images are pre-processed before being fed to a deep neural network for inference, and the pre-processing steps vary from network to network. For example, for classification networks like Resnet-50, the input image is resized to 224 x 224 and channel-wise mean subtraction is performed before the data is fed to the DNN accelerator. For detection networks like YOLO v3, the input image is resized to 256 x 512 using a letterbox before the data is fed to the DNN accelerator.
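As an illustration, the two pre-processing styles can be sketched in plain NumPy. The mean values here are illustrative assumptions (not the exact DPU settings), and a nearest-neighbour resize stands in for the OpenCV/Vitis Vision resize:

```python
# Sketch of the two pre-processing styles described above, using NumPy only.
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (stand-in for the OpenCV/Vitis Vision resize)."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows][:, cols]

def classification_preprocess(img, mean=(104.0, 107.0, 123.0)):
    """Resnet-50 style: resize to 224 x 224, then channel-wise mean subtraction.
    The mean triple is an illustrative assumption."""
    out = resize_nearest(img, 224, 224).astype(np.float32)
    return out - np.asarray(mean, dtype=np.float32)

def letterbox_preprocess(img, out_h=256, out_w=512, pad_value=128):
    """YOLO v3 style: scale preserving aspect ratio, pad the rest (letterbox)."""
    in_h, in_w = img.shape[:2]
    scale = min(out_h / in_h, out_w / in_w)
    new_h, new_w = int(in_h * scale), int(in_w * scale)
    resized = resize_nearest(img, new_h, new_w)
    canvas = np.full((out_h, out_w, img.shape[2]), pad_value, dtype=img.dtype)
    top, left = (out_h - new_h) // 2, (out_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

img = np.random.randint(0, 256, (360, 480, 3), dtype=np.uint8)
print(classification_preprocess(img).shape)  # (224, 224, 3)
print(letterbox_preprocess(img).shape)       # (256, 512, 3)
```

The hardware kernels implement equivalent transforms on the FPGA; this sketch only shows the arithmetic the pipeline performs.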


[Vitis Vision library](https://github.com/Xilinx/Vitis_Libraries/tree/master/vision) provides functions optimized for FPGA devices that are drop-in replacements for standard OpenCV library functions. This application demonstrates how Vitis Vision library functions can be used to accelerate pre-processing.

## Resnet50

Currently, an application accelerating pre-processing for a classification network (Resnet-50) is provided, and it can only run on the ZCU102 board (device part xczu9eg-ffvb1156-2-e). In this application, a software JPEG decoder is used to load the input images. Three processes are created: one for image loading, one for running the pre-processing kernel, and one for running the ML accelerator. The JPEG decoder transfers input image data to the pre-processing kernel over a queue, and the pre-processed data is transferred to the ML accelerator over another queue. The image below shows the inference pipeline.


<div align="center">
<img width="75%" height="75%" src="./doc_images/block_dia_classification.PNG">
</div>
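The three-stage, queue-connected pipeline described above can be sketched with ordinary Python queues. Threads stand in for the separate processes here, and the JPEG decoder, pre-processing kernel, and DPU are stubbed out:

```python
# Minimal sketch of the decode -> pre-process -> accelerate pipeline,
# modelled with threads and queues; the real application uses separate
# processes and FPGA kernels.
import queue
import threading

def decoder(images, out_q):
    for img in images:
        out_q.put(img)            # stand-in for software JPEG decode
    out_q.put(None)               # sentinel: no more images

def preprocessor(in_q, out_q):
    while (img := in_q.get()) is not None:
        out_q.put(("preprocessed", img))   # stand-in for the pre-proc kernel
    out_q.put(None)

def accelerator(in_q, results):
    while (item := in_q.get()) is not None:
        results.append(("inferred", item[1]))  # stand-in for DPU inference

q1, q2, results = queue.Queue(maxsize=4), queue.Queue(maxsize=4), []
stages = [
    threading.Thread(target=decoder, args=(range(8), q1)),
    threading.Thread(target=preprocessor, args=(q1, q2)),
    threading.Thread(target=accelerator, args=(q2, results)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
print(len(results))  # 8
```

Because each stage runs concurrently and hands off work through a bounded queue, decode, pre-processing, and inference overlap in time, which is what makes the pipelined design faster than running the stages back to back.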

## ADAS detection

The ADAS (Advanced Driver Assistance Systems) application, which uses the YOLO v3 network model, is an example of object detection. An application accelerating pre-processing for YOLO v3 is provided, and it can only run on the ZCU102 board (device part xczu9eg-ffvb1156-2-e). As in the classification example, a software JPEG decoder is used to load the input images, and three processes are created: one for image loading, one for running the pre-processing kernel, and one for running the ML accelerator, connected by queues. The image below shows the inference pipeline.

<div align="center">
<img width="75%" height="75%" src="./doc_images/block_dia_adasdetection.PNG">
</div>


## Running the Application
### Setting Up the Target
**To improve the user experience, the Vitis AI Runtime packages have been built into the board image, so users do not need to install them on the board separately.**

1. Install a board image.
* Download the SD card system image file from the following link:

[ZCU102](https://www.xilinx.com/bin/public/openDownload?filename=xilinx-zcu102-dpu-v2020.1-v1.2.0.img.gz)

Note: The version of the board image should be 2020.1 or above.
* Use Etcher software to burn the image file onto the SD card.
* Insert the SD card with the image into the destination board.
* Plug in the power and boot the board, using the serial port to operate the system.
* Set up the IP information of the board using the serial port.
You can now operate on the board using SSH.

2. Update the system image files.
* Download the [waa_system_v1.2.0.tar.gz](https://www.xilinx.com/bin/public/openDownload?filename=waa_system_v1.2.0.tar.gz).
* Copy the `waa_system_v1.2.0.tar.gz` to the board using scp.
```
scp waa_system_v1.2.0.tar.gz root@IP_OF_BOARD:~/
```
* Update the system image files on the target side
```
cd ~
tar -xzvf waa_system_v1.2.0.tar.gz
cp waa_system_v1.2.0/sd_card/* /mnt/sd-mmcblk0p1/
cp /mnt/sd-mmcblk0p1/dpu.xclbin /usr/lib/
ln -s /usr/lib/dpu.xclbin /mnt/dpu.xclbin
cp waa_system_v1.2.0/lib/* /usr/lib/
reboot
```
**Note that `waa_system_v1.2.0.tar.gz` can only be used for ZCU102.**

### Running The Examples
Before running the examples on the target, please copy the examples and images to the target.

1. Copy the examples to the board using scp.
```
scp -r Vitis-AI/VART/Whole-App-Acceleration root@IP_OF_BOARD:~/
```
2. Prepare the images for the test

For resnet50_mt_py_waa example, download the images at http://image-net.org/download-images and copy 1000 images to `Vitis-AI/VART/Whole-App-Acceleration/resnet50_mt_py_waa/images`

For adas_detection_waa example, download the images at https://cocodataset.org/#download and copy the images to `Vitis-AI/VART/Whole-App-Acceleration/adas_detection_waa/data`

3. Compile and run the program on the target

For resnet50_mt_py_waa example, please refer to [resnet50_mt_py_waa readme](./resnet50_mt_py_waa/readme)

For adas_detection_waa example, please refer to [adas_detection_waa readme](./adas_detection_waa/readme)

### Performance
The table below compares the throughput achieved with software and hardware pre-processing pipelines on the FPGA.
For `Resnet-50`, the performance numbers are obtained by running 1K images randomly picked from the ImageNet dataset.
For `YOLO v3`, the performance numbers are obtained by running 5K images randomly picked from the COCO dataset.

FPGA: ZCU102


<table style="undefined;table-layout: fixed; width: 534px">
<colgroup>
<col style="width: 119px">
<col style="width: 136px">
<col style="width: 145px">
<col style="width: 134px">
</colgroup>
<tr>
<th rowspan="2">Network</th>
<th colspan="2">E2E Throughput (fps)</th>
<th rowspan="2"><span style="font-weight:bold">Percentage improvement in throughput</span></th>
</tr>
<tr>
<td>with software Pre-processing</td>
<td>with hardware Pre-processing</td>
</tr>

<tr>
<td>Resnet-50</td>
<td>52.60</td>
<td>62.94</td>
<td>19.66%</td>
</tr>

<tr>
<td>YOLO v3</td>
<td>7.6</td>
<td>14.9</td>
<td>96.05%</td>
</tr>
</table>
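The percentage-improvement column follows directly from the two throughput columns, as a quick check shows:

```python
# Percentage improvement in throughput, using the numbers from the table above.
def improvement(sw_fps, hw_fps):
    """Relative throughput gain of hardware over software pre-processing, in %."""
    return (hw_fps - sw_fps) / sw_fps * 100

print(round(improvement(52.60, 62.94), 2))  # Resnet-50 -> 19.66
print(round(improvement(7.6, 14.9), 2))     # YOLO v3   -> 96.05
```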
37 changes: 37 additions & 0 deletions VART/Whole-App-Acceleration/adas_detection_waa/build.sh
@@ -0,0 +1,37 @@
#
# Copyright 2019 Xilinx Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

CXX=${CXX:-g++}
name=$(basename "$PWD")
$CXX -O2 -w \
-fno-inline \
-I. \
-o $name \
-std=c++17 \
src/main.cc \
src/common.cpp \
src/xcl2.cpp \
-lvart-runner \
-lopencv_videoio \
-lopencv_imgcodecs \
-lopencv_highgui \
-lopencv_imgproc \
-lopencv_core \
-lpthread \
-lxilinxopencl \
-lglog \
-lunilog \
-lxir
37 changes: 37 additions & 0 deletions VART/Whole-App-Acceleration/adas_detection_waa/readme
@@ -0,0 +1,37 @@
/*
* Copyright 2019 Xilinx Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

1. Build & run adas_detection_waa

./build.sh
export XILINX_XRT=/usr
mkdir output    # processed output images are written to this directory

# Usage: ./adas_detection_waa yolov3_adas_pruned_0_9.elf <mode>
# where <mode> is 0 (software pre-processing) or 1 (hardware pre-processing)
# e.g.

sample : ./adas_detection_waa yolov3_adas_pruned_0_9.elf 0
output :
Performance:7.6 FPS

sample : ./adas_detection_waa yolov3_adas_pruned_0_9.elf 1
output :
Found Platform
Platform Name: Xilinx
INFO: Reading /usr/lib/dpu.xclbin
Loading: '/usr/lib/dpu.xclbin'
Performance:14.9 FPS

91 changes: 91 additions & 0 deletions VART/Whole-App-Acceleration/adas_detection_waa/src/common.cpp
@@ -0,0 +1,91 @@

/*
* Copyright 2019 Xilinx Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "common.h"

#include <cassert>
#include <numeric>
int getTensorShape(vart::Runner* runner, GraphInfo* shapes, int cntin,
int cntout) {
auto outputTensors = runner->get_output_tensors();
auto inputTensors = runner->get_input_tensors();
if (shapes->output_mapping.empty()) {
shapes->output_mapping.resize((unsigned)cntout);
std::iota(shapes->output_mapping.begin(), shapes->output_mapping.end(), 0);
}
for (int i = 0; i < cntin; i++) {
auto dim_num = inputTensors[i]->get_dim_num();
if (dim_num == 4) {
shapes->inTensorList[i].channel = inputTensors[i]->get_dim_size(3);
shapes->inTensorList[i].width = inputTensors[i]->get_dim_size(2);
shapes->inTensorList[i].height = inputTensors[i]->get_dim_size(1);
shapes->inTensorList[i].size =
inputTensors[i]->get_element_num() / inputTensors[0]->get_dim_size(0);
} else if (dim_num == 2) {
shapes->inTensorList[i].channel = inputTensors[i]->get_dim_size(1);
shapes->inTensorList[i].width = 1;
shapes->inTensorList[i].height = 1;
shapes->inTensorList[i].size =
inputTensors[i]->get_element_num() / inputTensors[0]->get_dim_size(0);
}
}
for (int i = 0; i < cntout; i++) {
auto dim_num = outputTensors[shapes->output_mapping[i]]->get_dim_num();
if (dim_num == 4) {
shapes->outTensorList[i].channel =
outputTensors[shapes->output_mapping[i]]->get_dim_size(3);
shapes->outTensorList[i].width =
outputTensors[shapes->output_mapping[i]]->get_dim_size(2);
shapes->outTensorList[i].height =
outputTensors[shapes->output_mapping[i]]->get_dim_size(1);
shapes->outTensorList[i].size =
outputTensors[shapes->output_mapping[i]]->get_element_num() /
outputTensors[shapes->output_mapping[0]]->get_dim_size(0);
} else if (dim_num == 2) {
shapes->outTensorList[i].channel =
outputTensors[shapes->output_mapping[i]]->get_dim_size(1);
shapes->outTensorList[i].width = 1;
shapes->outTensorList[i].height = 1;
shapes->outTensorList[i].size =
outputTensors[shapes->output_mapping[i]]->get_element_num() /
outputTensors[shapes->output_mapping[0]]->get_dim_size(0);
}
}
return 0;
}

static int find_tensor(std::vector<const xir::Tensor*> tensors,
const std::string& name) {
int ret = -1;
for (auto i = 0u; i < tensors.size(); ++i) {
if (tensors[i]->get_name().find(name) != std::string::npos) {
ret = (int)i;
break;
}
}
assert(ret != -1);
return ret;
}
int getTensorShape(vart::Runner* runner, GraphInfo* shapes, int cntin,
std::vector<std::string> output_names) {
for (auto i = 0u; i < output_names.size(); ++i) {
auto idx = find_tensor(runner->get_output_tensors(), output_names[i]);
shapes->output_mapping.push_back(idx);
}
getTensorShape(runner, shapes, cntin, (int)output_names.size());
return 0;
}