Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton | NVIDIA Technical Blog
First convert the model into FasterTransformer format; then use TensorRT-LLM to compile it into a TensorRT engine; after that, inference can be run directly with TensorRT-LLM (or the model can be placed in a Triton model repository with TensorRT-LLM specified as the backend).
Tokenizing the input and de-tokenizing the output are treated as pre- and post-processing and implemented as "Python models"; the whole flow is expressed as one "ensemble model" that contains those two models plus the actual GPT model.
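A minimal sketch of what querying such a deployment might look like from the client side, assuming (hypothetically) that the ensemble is exposed under the model name "ensemble" with a BYTES input "text_input", an INT32 input "max_tokens", and a BYTES output "text_output"; the real model and tensor names depend on the deployment:
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical tensor names; check the deployed ensemble's config.pbtxt for the real ones.
prompt = np.array([["Write a Python function that reverses a string"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", prompt.shape, "BYTES"),
    httpclient.InferInput("max_tokens", max_tokens.shape, "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))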
https://github.com/triton-inference-server/tutorials/blob/main/Conceptual_Guide/Part_1-model_deployment/README.md
1. We want to use ONNX Runtime as the inference backend, so the models first have to be exported to ONNX format;
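For example, exporting a PyTorch model to ONNX could look roughly like this (a sketch with an illustrative torchvision model and input shape; the tutorial exports its own text detection/recognition models instead):
import torch
from torchvision.models import resnet18

# Illustrative stand-in for the tutorial's own models.
model = resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # allow variable batch size
)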
2. Model repository: create a directory (a local path, a remote path, or Azure Blob storage all work) that holds each model's name (text_detection, text_recognition), versions (1, 2), configuration file (config.pbtxt), and model file (model.onnx). For example:
model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   ├── 2
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
3. config.pbtxt format:
name: "text_detection"
backend: "onnxruntime"
max_batch_size : 256
input [
{
name: "input_images:0"
data_type: TYPE_FP32
dims: [ -1, -1, -1, 3 ]
}
]
output [
{
name: "feature_fusion/Conv_7/Sigmoid:0"
data_type: TYPE_FP32
dims: [ -1, -1, -1, 1 ]
}
]
output [
{
name: "feature_fusion/concat_3:0"
data_type: TYPE_FP32
dims: [ -1, -1, -1, 5 ]
}
]
backend and max_batch_size must be specified; input and output can usually be derived automatically by Triton from the model file (when strict model config is disabled), so they may be omitted.
4. Pull and start the Triton Server image from nvcr.io:
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:<yy.mm>-py3
5. Launch Triton Server inside the container:
tritonserver --model-repository=/models
After a successful start, it prints information like the following: which models are READY; the version and the pinned/CUDA memory pool settings; the two inference ports (HTTP 8000, gRPC 8001) and the one metrics port (8002):
I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| text_detection | 1 | READY |
| text_recognition | 1 | READY |
+------------------+---------+--------+
I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.23.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0712 16:37:18.269464 128 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0712 16:37:18.269956 128 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0712 16:37:18.311686 128 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
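Once the server is up, a quick sanity check from Python (a sketch, assuming the tritonclient package is installed and the default ports above) might look like:
import tritonclient.http as httpclient

# HTTP inference endpoint started above (port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

print(client.is_server_live())                      # liveness
print(client.is_server_ready())                     # readiness
print(client.is_model_ready("text_detection"))      # per-model readiness
print(client.get_model_metadata("text_detection"))  # inputs/outputs as reported by the server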
6. Inference requests can be sent with raw curl or through the client libraries' wrapper objects;
For example, using the httpclient module from Triton's own Python package tritonclient (install it first with pip install tritonclient[http]):
import cv2
import tritonclient.http as httpclient

# Connect to the server's HTTP endpoint
client = httpclient.InferenceServerClient(url="localhost:8000")

# Read and preprocess the input image (detection_preprocessing is a helper from the tutorial's client code)
raw_image = cv2.imread("./img2.jpg")
preprocessed_image = detection_preprocessing(raw_image)

# Build the input tensor and query the text_detection model
detection_input = httpclient.InferInput("input_images:0", preprocessed_image.shape, datatype="FP32")
detection_input.set_data_from_numpy(preprocessed_image, binary_data=True)
detection_response = client.infer(model_name="text_detection", inputs=[detection_input])

# Read both outputs and post-process them into cropped text regions (detection_postprocessing is also a tutorial helper)
scores = detection_response.as_numpy('feature_fusion/Conv_7/Sigmoid:0')
geometry = detection_response.as_numpy('feature_fusion/concat_3:0')
cropped_images = detection_postprocessing(scores, geometry, preprocessed_image)
7. Then feed the cropped_images produced by the first model into the second model:
# Create input object for recognition model
recognition_input = httpclient.InferInput("input.1", cropped_images.shape, datatype="FP32")
recognition_input.set_data_from_numpy(cropped_images, binary_data=True)
# Query the server
recognition_response = client.infer(model_name="text_recognition", inputs=[recognition_input])
# Process response from recognition model
text = recognition_postprocessing(recognition_response.as_numpy('308'))
print(text)
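For comparison, the "raw" route mentioned above goes through the KServe v2 REST endpoint directly; a sketch with Python's requests (the tiny dummy tensor only illustrates the wire format, a real call must send a properly preprocessed image of a size the model accepts):
import requests

payload = {
    "inputs": [
        {
            "name": "input_images:0",
            "shape": [1, 4, 4, 3],        # dummy shape, for illustration only
            "datatype": "FP32",
            "data": [0.0] * (4 * 4 * 3),  # flattened row-major tensor contents
        }
    ]
}
resp = requests.post("http://localhost:8000/v2/models/text_detection/infer", json=payload)
print(resp.status_code, resp.json())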
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization
Dynamic batching and multiple model instances can both be enabled just by modifying config.pbtxt;
1. Dynamic batching
Merging small batches into larger ones can improve both throughput and latency;
You can cap how long Triton waits before executing the requests already queued:
dynamic_batching {
  max_queue_delay_microseconds: 100
}
2. Multiple model instances
Start instances on GPUs 0 and 1, two instances per GPU:
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration
Model Analyzer
Essentially a profiler;
The user specifies which variables to sweep; the tool grid-searches over every configuration and does trial runs on Triton;
The results are rendered as charts and tables, which the user analyzes to pick the optimal configuration for the product's throughput, latency, and hardware-resource requirements.
Main parameters swept: the dynamic-batching queue-delay cap and the number of model instances;
With 4 model instances on the GPU, latency and throughput were optimal (per the tutorial's measurements).
PerfAnalyzer:
perf_analyzer -m densenet_onnx --concurrency-range 1:4
...
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 265.147 infer/sec, latency 3769 usec
Concurrency: 2, throughput: 890.793 infer/sec, latency 2243 usec
Concurrency: 3, throughput: 937.036 infer/sec, latency 3199 usec
Concurrency: 4, throughput: 965.21 infer/sec, latency 4142 usec
Difference between PerfAnalyzer and Model Analyzer:
PerfAnalyzer is used after the model is already deployed on Triton; it sweeps the number of concurrent client requests and measures throughput and latency;
Model Analyzer searches for the optimal configuration (mainly the dynamic-batching delay and the number of model instances per GPU) that satisfies the user's latency, throughput, and hardware-resource requirements.
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_4-inference_acceleration
Torch-TensorRT vs. TensorRT: TensorRT wins on performance;
TensorRT takes an ONNX-format model as its input;
TensorRT's main tricks:
1. Layer fusion (fusing layers in the compute graph)
2. quantization (INT8)
1. The best-performing path: model --> ONNX format --> TensorRT
trtexec --onnx=model.onnx \
        --saveEngine=model.plan \
        --explicitBatch
Put model.plan into Triton's model repository directory, and specify tensorrt as the backend in Triton's config.pbtxt;
If your model uses some operators that TensorRT does not currently support, the options are:
2. Use Torch-TensorRT or TensorFlow-TensorRT, the TensorRT integrations tightly coupled with each framework;
3. Use ONNX Runtime, which integrates TensorRT;
4. Write a TensorRT plugin: implement the operator code yourself;
2. Torch-TensorRT:
After compiling the PyTorch model with Torch-TensorRT and saving it to a file, just put that model file into Triton's model repository and set the file name and platform: "pytorch_libtorch" in config.pbtxt:
import torch
import torch_tensorrt

# Compile with Torch-TensorRT (model is an existing torch.nn.Module)
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # run in FP16
)

# Save the compiled module as TorchScript
torch.jit.save(trt_model, "model.pt")
3. ONNX Runtime
GPU execution providers: TensorRT (fast) and CUDA (slower);
CPU execution provider: OpenVINO;
At build time, operators supported by TensorRT run through TensorRT, and unsupported ones fall back to CUDA;
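The same preference/fallback behavior can be seen with standalone ONNX Runtime (outside Triton); a sketch assuming a local model.onnx and an illustrative input shape:
import numpy as np
import onnxruntime as ort

# Provider order expresses preference: TensorRT first, then CUDA, then CPU.
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative shape
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])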
Shallow (tree-based) models: Triton provides the Forest Inference Library (FIL) backend;
LLMs: Triton's FasterTransformer backend is dead (no longer maintained or updated); TensorRT-LLM is now the main path.
The torch2trt library:
No need to convert to ONNX first; it converts directly from PyTorch to TensorRT in one step:
import torch
from torch2trt import torch2trt
from torchvision.models.alexnet import alexnet
# create some regular pytorch model...
model = alexnet(pretrained=True).eval().cuda()
# create example data
x = torch.ones((1, 3, 224, 224)).cuda()
# convert to TensorRT feeding sample data as input
model_trt = torch2trt(model, [x])
y_trt = model_trt(x)
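To reuse the converted model later, a sketch following torch2trt's documented save/load pattern (model_trt is the converted module from the snippet above):
import torch
from torch2trt import TRTModule

# Persist the converted engine's weights...
torch.save(model_trt.state_dict(), "alexnet_trt.pth")

# ...and reload them later without re-running the conversion.
model_trt_loaded = TRTModule()
model_trt_loaded.load_state_dict(torch.load("alexnet_trt.pth"))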
Ensemble Model
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_5-Model_Ensembles
Connects multiple models into a pipeline; all the pre- and post-processing scripts run inside Triton Server;
Benefit: it removes the round trips of shipping each model's intermediate results between Triton Server and the client!
Each Python pre- or post-processing step is treated as a Python model (there is no actual model, just code) and is placed into the Triton model repository in the usual layout (a minimal model.py sketch follows the layout below):
my_python_model/
├── 1
│   └── model.py
└── config.pbtxt
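A minimal model.py sketch for such a Python model, using the tensor names of the detection_preprocessing step from the ensemble config below (the actual preprocessing logic is elided):
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Tensor names must match config.pbtxt / the ensemble's input_map and output_map.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "detection_preprocessing_input")
            image = in_tensor.as_numpy()

            processed = image.astype(np.float32)  # placeholder for the real preprocessing

            out_tensor = pb_utils.Tensor("detection_preprocessing_output", processed)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses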
The whole pipeline is then represented as a single "ensemble model":
ensemble_model/
├── 1
└── config.pbtxt
config.pbtxt:
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 256
input [
{
name: "input_image"
data_type: TYPE_UINT8
dims: [ -1 ]
}
]
output [
{
name: "recognized_text"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
ensemble_scheduling {
step [
{
model_name: "detection_preprocessing"
model_version: -1
input_map {
key: "detection_preprocessing_input"
value: "input_image"
}
output_map {
key: "detection_preprocessing_output"
value: "preprocessed_image"
}
},
{
model_name: "text_detection"
model_version: -1
input_map {
key: "input_images:0"
value: "preprocessed_image"
}
output_map {
key: "feature_fusion/Conv_7/Sigmoid:0"
value: "Sigmoid:0"
},
output_map {
key: "feature_fusion/concat_3:0"
value: "concat_3:0"
}
},
{
model_name: "detection_postprocessing"
model_version: -1
input_map {
key: "detection_postprocessing_input_1"
value: "Sigmoid:0"
}
input_map {
key: "detection_postprocessing_input_2"
value: "concat_3:0"
}
input_map {
key: "detection_postprocessing_input_3"
value: "preprocessed_image"
}
output_map {
key: "detection_postprocessing_output"
value: "cropped_images"
}
},
{
model_name: "text_recognition"
model_version: -1
input_map {
key: "INPUT__0"
value: "cropped_images"
}
output_map {
key: "OUTPUT__0"
value: "recognition_output"
}
},
{
model_name: "recognition_postprocessing"
model_version: -1
input_map {
key: "recognition_postprocessing_input"
value: "recognition_output"
}
output_map {
key: "recognition_postprocessing_output"
value: "recognized_text"
}
}
]
}
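Calling the ensemble then takes a single round trip: the client sends the raw encoded image bytes and gets the recognized text back. A sketch using the gRPC client (port 8001), matching the input/output names above:
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Raw encoded image bytes; every pre/post-processing step runs server-side.
image_data = np.fromfile("img2.jpg", dtype=np.uint8)
image_data = np.expand_dims(image_data, axis=0)  # add the batch dimension

input_tensor = grpcclient.InferInput("input_image", image_data.shape, "UINT8")
input_tensor.set_data_from_numpy(image_data)

response = client.infer(model_name="ensemble_model", inputs=[input_tensor])
print(response.as_numpy("recognized_text"))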
Iterative Sequence / Iterative Scheduling
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling
Iterative scheduling is a technique that allows the Triton Inference Server to schedule the same request multiple times with the same input. This is useful for models that have an auto-regressive loop. Iterative scheduling enables Triton Server to implement inflight batching for your models and gives you the ability to combine new sequences as they are arriving with inflight sequences.
Interesting and effective: when a new prompt arrives, it can be put into the same batch as older prompts that are already in flight (even though they have already produced some output tokens); there is no need to wait for the old prompts to finish before running the new one;
Detailed reference:
Model Configuration — NVIDIA Triton Inference Server