v1.2.1 Update
v1.2.1
Added Zynq Ultrascale Plus Whole App examples
Updated U50 XRT and shell to Xilinx-u50-gen3x4-xdma-2-202010.1-2902115
Updated docker launch instructions
Updated TRD makefile instructions
kamranjk authored Jul 30, 2020
1 parent 6a699e3 commit 0353caa
Showing 55 changed files with 4,122 additions and 348 deletions.
8 changes: 4 additions & 4 deletions AI-Model-Zoo/README.md
Original file line number Diff line number Diff line change
@@ -638,10 +638,10 @@ The following table lists the performance number including end-to-end throughput
| 11 | ssd_traffic_pruned_0_9 | cf_ssdtraffic_360_480_0.9_11.6G | 247.5 | 570.84 |
| 12 | VPGnet_pruned_0_99 | cf_VPGnet_caltechlane_480_640_0.99_2.5G | 275 | 658.99 |
| 13 | FPN | cf_fpn_cityscapes_256_512_8.9G | 247.5 | 552.17 |
| 14 | SP_net | cf_SPnet_aichallenger_224_128_0.54G | 275 | 1706.95 |
| 15 | Openpose_pruned_0_3 | cf_openpose_aichallenger_368_368_0.3_189.7G | 220 | 39.68 |
| 16 | densebox_320_320 | cf_densebox_wider_320_320_0.49G | 275 | 2572.69 |
| 17 | densebox_640_360 | cf_densebox_wider_360_640_1.11G | 275 | 1125.09 |
| 18 | face_landmark | cf_landmark_celeba_96_72_0.14G | 275 | 12917.20 |
| 19 | reid | cf_reid_market1501_160_80_0.95G | 275 | 5548.10 |
| 20 | multi_task | cf_multitask_bdd_288_512_14.8G | 247.5 | 176.96 |
1 change: 1 addition & 0 deletions DPU-TRD/README.md
@@ -56,6 +56,7 @@ There are three dimensions of parallelism in the DPU convolution architecture -

[DPU TRD Vitis Flow ](./prj/Vitis/README.md)


[DPU TRD Vivado Flow](./prj/Vivado/README.md)

****
4 changes: 2 additions & 2 deletions DPU-TRD/prj/Vivado/README.md
@@ -58,8 +58,8 @@ Required:
Required:
- Vivado 2020.1 [Vivado Design Tools](https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools.html)
- Serial terminal emulator e.g. [teraterm](http://logmett.com/tera-term-the-latest-version)
- [Vitis AI 1.1](https://github.com/Xilinx/Vitis-AI) to run models other than Resnet50, Optional
- [Vitis AI Library 1.1](https://github.com/Xilinx/Vitis-AI/tree/master/Vitis-AI-Library) to configure DPU in Vitis AI Library
- [Vitis AI 1.2](https://github.com/Xilinx/Vitis-AI) to run models other than Resnet50, Optional
- [Vitis AI Library 1.2](https://github.com/Xilinx/Vitis-AI/tree/master/Vitis-AI-Library) to configure DPU in Vitis AI Library

------

1 change: 1 addition & 0 deletions DPU-TRD/prj/Vivado/constrs/misc.xdc
@@ -15,5 +15,6 @@
# */



# compress bitstream
set_property BITSTREAM.GENERAL.COMPRESS TRUE [current_design]
2 changes: 2 additions & 0 deletions README.md
@@ -204,3 +204,5 @@ For more information, please refer to [Vitis AI User Guide](https://www.xilinx.c
[Models]: https://www.xilinx.com/products/boards-and-kits/alveo/applications/xilinx-machine-learning-suite.html#gettingStartedCloud
[whitepaper here]: https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
[Performance Whitepaper]: https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
```
126 changes: 126 additions & 0 deletions VART/Whole-App-Acceleration/README.md
@@ -0,0 +1,126 @@
# Whole Application Acceleration: Accelerating ML Preprocessing for Classification and Detection networks

## Introduction

This application demonstrates how Xilinx® [Vitis Vision library](https://github.com/Xilinx/Vitis_Libraries/tree/master/vision) functions can be integrated with a deep neural network (DNN) accelerator to achieve complete application acceleration, focusing on accelerating the pre-processing involved in the inference of classification and object detection networks.

## Background

Input images are pre-processed before being fed to a deep neural network for inference, and the pre-processing steps vary from network to network. For example, for classification networks like Resnet-50, the input image is resized to 224 x 224 and channel-wise mean subtraction is performed before the data is fed to the DNN accelerator. For detection networks like YOLO v3, the input image is resized to 256 x 512 using a letterbox before the data is fed to the DNN accelerator.
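As an illustration, the two pre-processing styles can be sketched in plain NumPy. The mean values here are illustrative assumptions (not the exact DPU settings), and a nearest-neighbour resize stands in for the OpenCV/Vitis Vision resize:

```python
# Sketch of the two pre-processing styles described above, using NumPy only.
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize (stand-in for the OpenCV/Vitis Vision resize)."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows][:, cols]

def classification_preprocess(img, mean=(104.0, 107.0, 123.0)):
    """Resnet-50 style: resize to 224 x 224, then channel-wise mean subtraction.
    The mean triple is an illustrative assumption."""
    out = resize_nearest(img, 224, 224).astype(np.float32)
    return out - np.asarray(mean, dtype=np.float32)

def letterbox_preprocess(img, out_h=256, out_w=512, pad_value=128):
    """YOLO v3 style: scale preserving aspect ratio, pad the rest (letterbox)."""
    in_h, in_w = img.shape[:2]
    scale = min(out_h / in_h, out_w / in_w)
    new_h, new_w = int(in_h * scale), int(in_w * scale)
    resized = resize_nearest(img, new_h, new_w)
    canvas = np.full((out_h, out_w, img.shape[2]), pad_value, dtype=img.dtype)
    top, left = (out_h - new_h) // 2, (out_w - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

img = np.random.randint(0, 256, (360, 480, 3), dtype=np.uint8)
print(classification_preprocess(img).shape)  # (224, 224, 3)
print(letterbox_preprocess(img).shape)       # (256, 512, 3)
```

The hardware kernels implement equivalent transforms on the FPGA; this sketch only shows the arithmetic the pipeline performs.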


[Vitis Vision library](https://github.com/Xilinx/Vitis_Libraries/tree/master/vision) provides functions optimized for FPGA devices that are drop-in replacements for standard OpenCV library functions. This application demonstrates how Vitis Vision library functions can be used to accelerate pre-processing.

## Resnet50

Currently, an application accelerating pre-processing for a classification network (Resnet-50) is provided, and it can only run on the ZCU102 board (device part xczu9eg-ffvb1156-2-e). In this application, a software JPEG decoder is used to load the input images. Three processes are created: one for image loading, one for running the pre-processing kernel, and one for running the ML accelerator. The JPEG decoder transfers input image data to the pre-processing kernel over a queue, and the pre-processed data is transferred to the ML accelerator over another queue. The image below shows the inference pipeline.


<div align="center">
<img width="75%" height="75%" src="./doc_images/block_dia_classification.PNG">
</div>
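The three-stage, queue-connected pipeline described above can be sketched with ordinary Python queues. Threads stand in for the separate processes here, and the JPEG decoder, pre-processing kernel, and DPU are stubbed out:

```python
# Minimal sketch of the decode -> pre-process -> accelerate pipeline,
# modelled with threads and queues; the real application uses separate
# processes and FPGA kernels.
import queue
import threading

def decoder(images, out_q):
    for img in images:
        out_q.put(img)            # stand-in for software JPEG decode
    out_q.put(None)               # sentinel: no more images

def preprocessor(in_q, out_q):
    while (img := in_q.get()) is not None:
        out_q.put(("preprocessed", img))   # stand-in for the pre-proc kernel
    out_q.put(None)

def accelerator(in_q, results):
    while (item := in_q.get()) is not None:
        results.append(("inferred", item[1]))  # stand-in for DPU inference

q1, q2, results = queue.Queue(maxsize=4), queue.Queue(maxsize=4), []
stages = [
    threading.Thread(target=decoder, args=(range(8), q1)),
    threading.Thread(target=preprocessor, args=(q1, q2)),
    threading.Thread(target=accelerator, args=(q2, results)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
print(len(results))  # 8
```

Because each stage runs concurrently and hands off work through a bounded queue, decode, pre-processing, and inference overlap in time, which is what makes the pipelined design faster than running the stages back to back.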

## ADAS detection

The ADAS (Advanced Driver Assistance Systems) application, which uses the YOLO v3 network model, is an example of object detection. An application accelerating pre-processing for YOLO v3 is provided, and it can only run on the ZCU102 board (device part xczu9eg-ffvb1156-2-e). As in the classification example, a software JPEG decoder is used to load the input images, and three processes are created: one for image loading, one for running the pre-processing kernel, and one for running the ML accelerator, connected by queues. The image below shows the inference pipeline.

<div align="center">
<img width="75%" height="75%" src="./doc_images/block_dia_adasdetection.PNG">
</div>


## Running the Application
### Setting Up the Target
**To improve the user experience, the Vitis AI Runtime packages have been built into the board image, so users do not need to install them on the board separately.**

1. Install a board image.
* Download the SD card system image file from the following link:

[ZCU102](https://www.xilinx.com/bin/public/openDownload?filename=xilinx-zcu102-dpu-v2020.1-v1.2.0.img.gz)

Note: The version of the board image should be 2020.1 or above.
* Use Etcher software to burn the image file onto the SD card.
* Insert the SD card with the image into the destination board.
* Plug in the power and boot the board, using the serial port to operate the system.
* Set up the IP information of the board using the serial port.
You can now operate on the board using SSH.

2. Update the system image files.
* Download the [waa_system_v1.2.0.tar.gz](https://www.xilinx.com/bin/public/openDownload?filename=waa_system_v1.2.0.tar.gz).
* Copy the `waa_system_v1.2.0.tar.gz` to the board using scp.
```
scp waa_system_v1.2.0.tar.gz root@IP_OF_BOARD:~/
```
* Update the system image files on the target side
```
cd ~
tar -xzvf waa_system_v1.2.0.tar.gz
cp waa_system_v1.2.0/sd_card/* /mnt/sd-mmcblk0p1/
cp /mnt/sd-mmcblk0p1/dpu.xclbin /usr/lib/
ln -s /usr/lib/dpu.xclbin /mnt/dpu.xclbin
cp waa_system_v1.2.0/lib/* /usr/lib/
reboot
```
**Note that `waa_system_v1.2.0.tar.gz` can only be used for ZCU102.**

### Running The Examples
Before running the examples on the target, please copy the examples and images to the target.

1. Copy the examples to the board using scp.
```
scp -r Vitis-AI/VART/Whole-App-Acceleration root@IP_OF_BOARD:~/
```
2. Prepare the images for the test

For resnet50_mt_py_waa example, download the images at http://image-net.org/download-images and copy 1000 images to `Vitis-AI/VART/Whole-App-Acceleration/resnet50_mt_py_waa/images`

For adas_detection_waa example, download the images at https://cocodataset.org/#download and copy the images to `Vitis-AI/VART/Whole-App-Acceleration/adas_detection_waa/data`

3. Compile and run the program on the target

For resnet50_mt_py_waa example, please refer to [resnet50_mt_py_waa readme](./resnet50_mt_py_waa/readme)

For adas_detection_waa example, please refer to [adas_detection_waa readme](./adas_detection_waa/readme)

### Performance
The table below compares the throughput achieved with software and hardware pre-processing pipelines on the FPGA.
For `Resnet-50`, the performance numbers are obtained by running 1K images randomly picked from the ImageNet dataset.
For `YOLO v3`, the performance numbers are obtained by running 5K images randomly picked from the COCO dataset.

FPGA: ZCU102


<table style="undefined;table-layout: fixed; width: 534px">
<colgroup>
<col style="width: 119px">
<col style="width: 136px">
<col style="width: 145px">
<col style="width: 134px">
</colgroup>
<tr>
<th rowspan="2">Network</th>
<th colspan="2">E2E Throughput (fps)</th>
<th rowspan="2"><span style="font-weight:bold">Percentage improvement in throughput</span></th>
</tr>
<tr>
<td>with software Pre-processing</td>
<td>with hardware Pre-processing</td>
</tr>

<tr>
<td>Resnet-50</td>
<td>52.60</td>
<td>62.94</td>
<td>19.66%</td>
</tr>

<tr>
<td>YOLO v3</td>
<td>7.6</td>
<td>14.9</td>
<td>96.05%</td>
</tr>
</table>
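The percentage-improvement column follows directly from the two throughput columns, as a quick check shows:

```python
# Percentage improvement in throughput, using the numbers from the table above.
def improvement(sw_fps, hw_fps):
    """Relative throughput gain of hardware over software pre-processing, in %."""
    return (hw_fps - sw_fps) / sw_fps * 100

print(round(improvement(52.60, 62.94), 2))  # Resnet-50 -> 19.66
print(round(improvement(7.6, 14.9), 2))     # YOLO v3   -> 96.05
```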
37 changes: 37 additions & 0 deletions VART/Whole-App-Acceleration/adas_detection_waa/build.sh
@@ -0,0 +1,37 @@
#
# Copyright 2019 Xilinx Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

CXX=${CXX:-g++}
name=$(basename "$PWD")
$CXX -O2 -w \
-fno-inline \
-I. \
-o $name \
-std=c++17 \
src/main.cc \
src/common.cpp \
src/xcl2.cpp \
-lvart-runner \
-lopencv_videoio \
-lopencv_imgcodecs \
-lopencv_highgui \
-lopencv_imgproc \
-lopencv_core \
-lpthread \
-lxilinxopencl \
-lglog \
-lunilog \
-lxir
37 changes: 37 additions & 0 deletions VART/Whole-App-Acceleration/adas_detection_waa/readme
@@ -0,0 +1,37 @@
/*
* Copyright 2019 Xilinx Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

1. Build & run adas_detection_waa

./build.sh
export XILINX_XRT=/usr
mkdir output    # processed output images are written to this directory

# Usage: ./adas_detection_waa yolov3_adas_pruned_0_9.elf <mode>
# where <mode> is 0 (software pre-processing) or 1 (hardware pre-processing)
# e.g.

sample : ./adas_detection_waa yolov3_adas_pruned_0_9.elf 0
output :
Performance:7.6 FPS

sample : ./adas_detection_waa yolov3_adas_pruned_0_9.elf 1
output :
Found Platform
Platform Name: Xilinx
INFO: Reading /usr/lib/dpu.xclbin
Loading: '/usr/lib/dpu.xclbin'
Performance:14.9 FPS

91 changes: 91 additions & 0 deletions VART/Whole-App-Acceleration/adas_detection_waa/src/common.cpp
@@ -0,0 +1,91 @@

/*
* Copyright 2019 Xilinx Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "common.h"

#include <cassert>
#include <numeric>
int getTensorShape(vart::Runner* runner, GraphInfo* shapes, int cntin,
int cntout) {
auto outputTensors = runner->get_output_tensors();
auto inputTensors = runner->get_input_tensors();
if (shapes->output_mapping.empty()) {
shapes->output_mapping.resize((unsigned)cntout);
std::iota(shapes->output_mapping.begin(), shapes->output_mapping.end(), 0);
}
for (int i = 0; i < cntin; i++) {
auto dim_num = inputTensors[i]->get_dim_num();
if (dim_num == 4) {
shapes->inTensorList[i].channel = inputTensors[i]->get_dim_size(3);
shapes->inTensorList[i].width = inputTensors[i]->get_dim_size(2);
shapes->inTensorList[i].height = inputTensors[i]->get_dim_size(1);
shapes->inTensorList[i].size =
inputTensors[i]->get_element_num() / inputTensors[0]->get_dim_size(0);
} else if (dim_num == 2) {
shapes->inTensorList[i].channel = inputTensors[i]->get_dim_size(1);
shapes->inTensorList[i].width = 1;
shapes->inTensorList[i].height = 1;
shapes->inTensorList[i].size =
inputTensors[i]->get_element_num() / inputTensors[0]->get_dim_size(0);
}
}
for (int i = 0; i < cntout; i++) {
auto dim_num = outputTensors[shapes->output_mapping[i]]->get_dim_num();
if (dim_num == 4) {
shapes->outTensorList[i].channel =
outputTensors[shapes->output_mapping[i]]->get_dim_size(3);
shapes->outTensorList[i].width =
outputTensors[shapes->output_mapping[i]]->get_dim_size(2);
shapes->outTensorList[i].height =
outputTensors[shapes->output_mapping[i]]->get_dim_size(1);
shapes->outTensorList[i].size =
outputTensors[shapes->output_mapping[i]]->get_element_num() /
outputTensors[shapes->output_mapping[0]]->get_dim_size(0);
} else if (dim_num == 2) {
shapes->outTensorList[i].channel =
outputTensors[shapes->output_mapping[i]]->get_dim_size(1);
shapes->outTensorList[i].width = 1;
shapes->outTensorList[i].height = 1;
shapes->outTensorList[i].size =
outputTensors[shapes->output_mapping[i]]->get_element_num() /
outputTensors[shapes->output_mapping[0]]->get_dim_size(0);
}
}
return 0;
}

static int find_tensor(std::vector<const xir::Tensor*> tensors,
const std::string& name) {
int ret = -1;
for (auto i = 0u; i < tensors.size(); ++i) {
if (tensors[i]->get_name().find(name) != std::string::npos) {
ret = (int)i;
break;
}
}
assert(ret != -1);
return ret;
}
int getTensorShape(vart::Runner* runner, GraphInfo* shapes, int cntin,
std::vector<std::string> output_names) {
for (auto i = 0u; i < output_names.size(); ++i) {
auto idx = find_tensor(runner->get_output_tensors(), output_names[i]);
shapes->output_mapping.push_back(idx);
}
getTensorShape(runner, shapes, cntin, (int)output_names.size());
return 0;
}