Each model in the model repository must include a model configuration that provides required and optional information about the model. The configuration is typically written in ModelConfig protobuf text format in a file named config.pbtxt.
Please refer to the official Triton documentation for the full set of options: model_configuration. A minimal Triton model configuration must specify the platform or backend attribute, the max_batch_size attribute, and the model's inputs and outputs.
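For orientation, config.pbtxt sits next to a numbered version directory inside the model repository. The layout below is only a sketch for a hypothetical Paddle model; the directory and file names are placeholders, not part of any shipped example:
models
└── example_model            # model directory (also the default model name)
    ├── config.pbtxt         # the model configuration described in this document
    └── 1                    # version directory
        ├── model.pdmodel    # Paddle model structure file (placeholder name)
        └── model.pdiparams  # Paddle model weights file (placeholder name)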
For example, the minimal configuration for a Paddle model with two inputs (input0 and input1) and one output (output0), where all inputs and outputs are float32 tensors and the maximum batch size is 8, is:
backend: "fastdeploy"
max_batch_size: 8
input [
{
name: "input0"
data_type: TYPE_FP32
dims: [ 16 ]
},
{
name: "input1"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
The instance_group attribute lets you configure which hardware the model runs on and how many inference instances are created.
Here's an example of CPU deployment:
instance_group [
{
# Create two CPU instances
count: 2
# Use CPU for deployment
kind: KIND_CPU
}
]
Another example deploys two instances on GPU 0, and one instance each on GPU 1 and GPU 2:
instance_group [
{
# Create two GPU instances
count: 2
# Use GPU for inference
kind: KIND_GPU
# Deploy on GPU 0
gpus: [ 0 ]
},
{
count: 1
kind: KIND_GPU
# Deploy one instance on each of GPU 1 and GPU 2
gpus: [ 1, 2 ]
}
]
The name attribute is optional. If the name is not specified in the configuration, it defaults to the model's directory name. When it is specified, it must match the directory name.
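For instance, for a model stored in a directory named example_model (a placeholder name), the configuration may either omit name or set it to the same value:
# models/example_model/config.pbtxt
name: "example_model"
backend: "fastdeploy"
max_batch_size: 8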
Setting the FastDeploy backend: do not configure the platform attribute; instead, set the backend attribute to fastdeploy.
backend: "fastdeploy"
Currently the FastDeploy backend supports both CPU and GPU inference. On CPU, the Paddle, ONNX Runtime, and OpenVINO inference engines are supported; on GPU, the Paddle, ONNX Runtime, and TensorRT engines are supported.
In addition to the instance_group setting, which determines whether the model runs on CPU or GPU, the Paddle engine can be configured as follows. See the PP-OCRv3 example for a complete Runtime configuration.
optimization {
execution_accelerators {
# CPU inference configuration, used with KIND_CPU.
cpu_execution_accelerator : [
{
name : "paddle"
# Number of CPU threads used for inference
parameters { key: "cpu_threads" value: "4" }
# Enable MKL-DNN acceleration; set the value to 0 to disable it
parameters { key: "use_mkldnn" value: "1" }
}
],
# GPU inference configuration, used with KIND_GPU.
gpu_execution_accelerator : [
{
name : "paddle"
# Number of CPU threads used for inference
parameters { key: "cpu_threads" value: "4" }
# Enable MKL-DNN acceleration; set the value to 0 to disable it
parameters { key: "use_mkldnn" value: "1" }
}
]
}
}
In addition to the instance_group setting, which determines whether the model runs on CPU or GPU, the ONNX Runtime engine can be configured as follows. See the YOLOv5 example for a complete Runtime configuration.
optimization {
execution_accelerators {
cpu_execution_accelerator : [
{
name : "onnxruntime"
# Number of CPU threads used for inference
parameters { key: "cpu_threads" value: "4" }
}
],
gpu_execution_accelerator : [
{
name : "onnxruntime"
}
]
}
}
The OpenVINO engine supports inference on CPU only. It can be configured as follows:
optimization {
execution_accelerators {
cpu_execution_accelerator : [
{
name : "openvino"
# Number of CPU threads used for inference (total across all instances of the model)
parameters { key: "cpu_threads" value: "4" }
# Set the OpenVINO num_streams option (usually equal to the number of instances)
parameters { key: "num_streams" value: "1" }
}
]
}
}
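Because num_streams is usually aligned with the number of model instances, a two-instance CPU deployment would typically pair the instance_group block with num_streams set to 2. The sketch below simply combines the two settings already shown above; the thread count is an illustrative value:
instance_group [
  {
    # Create two CPU instances
    count: 2
    kind: KIND_CPU
  }
]
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      {
        name : "openvino"
        # Total CPU threads shared by all instances (illustrative value)
        parameters { key: "cpu_threads" value: "8" }
        # One OpenVINO stream per instance
        parameters { key: "num_streams" value: "2" }
      }
    ]
  }
}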
The TensorRT engine supports inference on GPU only. It can be configured as follows:
optimization {
execution_accelerators {
gpu_execution_accelerator : [
{
name : "tensorrt"
# Use FP16 inference in TensorRT; trt_fp32 is also available
# If the loaded model is a quantized model, the precision switches to INT8 automatically
parameters { key: "precision" value: "trt_fp16" }
}
]
}
}
TensorRT dynamic shapes can be configured in the following format. See the PaddleCls example for a complete Runtime configuration:
optimization {
execution_accelerators {
gpu_execution_accelerator : [ {
# Use the TensorRT engine
name: "tensorrt",
# Use FP16 precision in TensorRT
parameters { key: "precision" value: "trt_fp16" }
},
{
# Configure the minimum bound of the dynamic shapes
name: "min_shape"
# Minimum shape for each input, keyed by input name
parameters { key: "input1" value: "1 3 224 224" }
parameters { key: "input2" value: "1 10" }
},
{
# Configure the optimal (preferred) dynamic shapes
name: "opt_shape"
# Optimal shape for each input, keyed by input name
parameters { key: "input1" value: "2 3 224 224" }
parameters { key: "input2" value: "2 20" }
},
{
# Configure the maximum bound of the dynamic shapes
name: "max_shape"
# Maximum shape for each input, keyed by input name
parameters { key: "input1" value: "8 3 224 224" }
parameters { key: "input2" value: "8 30" }
}
]
}
}
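Putting the pieces together, a complete config.pbtxt for the hypothetical two-input model used at the beginning of this document, deployed as a single instance on GPU 0 with TensorRT FP16 acceleration, might look like the sketch below (all names and values are illustrative):
name: "example_model"
backend: "fastdeploy"
max_batch_size: 8
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  },
  {
    name: "input1"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
instance_group [
  {
    # One instance on GPU 0
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        # FP16 inference in TensorRT
        parameters { key: "precision" value: "trt_fp16" }
      }
    ]
  }
}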