TensorRT and the Jetson Nano
This blog post is a continuation of my previous article on Self Driving RC Cars.
After setting up my RC car with a Jetson Nano, I figured that I should go beyond the standard ML models. I wanted to try some transfer learning based on Resnet 18.
Preliminary Benchmarks
The first step was benchmarking the existing ML model to see how much headroom I had. Resnet 18 was going to be a more complicated model, and I needed to make sure that the Jetson Nano was capable enough.
The standard Donkey Car model is based on the Nvidia End to End Learning for Self-Driving Cars paper.
I was using a camera resolution of 224 x 224 px, and upon profiling my ML model I determined that the standard Donkey Car model was doing inferences at a rate of 25 Hz, or about 25 inferences per second. It's important to remember that the RC car needs to do inferences at 20 Hz at the very least for it to be capable of self-driving. I was only ~5 Hz above the required minimum. Not great.
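A simple way to arrive at such a number (not necessarily how the original profiling was done) is to time a loop of predict calls on a dummy frame. This is a minimal sketch, assuming a Keras model saved as mymodel.h5 and the 224 x 224 camera resolution; the warm-up and iteration counts are arbitrary:

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('mymodel.h5', compile=False)
# Stand-in for a single 224 x 224 RGB camera frame.
frame = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Warm up so one-time initialization costs don't skew the measurement.
for _ in range(10):
    model.predict(frame)

iterations = 200
start = time.time()
for _ in range(iterations):
    model.predict(frame)
elapsed = time.time() - start
print('Inference frequency: %.1f Hz' % (iterations / elapsed))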
Performance Mode
The Jetson Nano is also capable of running in performance mode. To run the Nano in performance mode, you need to run the following script, which locks the Nano to its maximum clock speeds. I also set the fan to full speed, since I had a Noctua fan installed; this ensured that my Nano would not get thermally throttled.
#!/bin/bash
sleep 2
echo 'Maximum performance.'
sudo /usr/bin/jetson_clocks
sudo sh -c 'echo 255 > /sys/devices/pwm-fan/target_pwm'
Once I set the Nvidia Jetson Nano to run in performance mode, my inference rate went up to 40-42 Hz. Much better!
Hello TensorRT
TensorRT is a framework from Nvidia for high-performance inference. The Nvidia Jetson Nano supports TensorRT via the Jetpack SDK. To make inferences faster, I realized that I was going to have to convert my Keras model to a TensorRT model.
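Before doing any conversion work, it is worth confirming that the Jetpack-provided TensorRT Python bindings are visible to the interpreter you plan to use. A quick sanity check (the versions printed will vary with your Jetpack release) looks like this:

import tensorrt as trt
import tensorflow as tf

print('TensorRT version:', trt.__version__)
print('TensorFlow version:', tf.__version__)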
Freezing the Keras Model
The first step in converting a Keras model to a TensorRT model is freezing the model.
'''
Usage:
    freeze_model.py --model="mymodel.h5" --output="frozen_model.pb"

Note:
    This requires that TensorRT is set up correctly. For more instructions, take a look at
    https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html
'''
import os
from docopt import docopt
import json
from pathlib import Path
import tensorflow as tf
args = docopt(__doc__)
in_model = os.path.expanduser(args['--model'])
output = os.path.expanduser(args['--output'])
output_path = Path(output)
output_meta = Path('%s/%s.metadata' % (output_path.parent.as_posix(), output_path.stem))
# Reset session
tf.keras.backend.clear_session()
tf.keras.backend.set_learning_phase(0)
model = tf.keras.models.load_model(in_model, compile=False)
session = tf.keras.backend.get_session()
input_names = sorted([layer.op.name for layer in model.inputs])
output_names = sorted([layer.op.name for layer in model.outputs])
# Store additional information in metadata, useful when creating a TensorRT network
meta = {'input_names': input_names, 'output_names': output_names}
graph = session.graph
# Freeze Graph
with graph.as_default():
    # Convert variables to constants
    graph_frozen = tf.compat.v1.graph_util.convert_variables_to_constants(session, graph.as_graph_def(), output_names)
    # Remove training nodes
    graph_frozen = tf.compat.v1.graph_util.remove_training_nodes(graph_frozen)

    with open(output, 'wb') as output_file, open(output_meta.as_posix(), 'w') as meta_file:
        output_file.write(graph_frozen.SerializeToString())
        meta_file.write(json.dumps(meta))

    print('Inputs = [%s], Outputs = [%s]' % (input_names, output_names))
    print('Writing metadata to %s' % output_meta.as_posix())
    print('To convert use: \n    `convert-to-uff %s`' % (output))
The two main steps are:
- Converting variables to constants in the Tensorflow graph definition.
- Removing training nodes.
Here is the full source code on GitHub.
The above script stores the frozen graph definition as a protobuf. I could now use convert-to-uff to convert the frozen graph definition to UFF format.
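Before running the conversion, you can optionally double-check the frozen graph by loading the protobuf and listing a few of its nodes; the input and output names should match what the freeze script printed. The file name below is just an example:

import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with open('frozen_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Print the first few node names; the inputs usually appear near the top.
for node in graph_def.node[:10]:
    print(node.name, node.op)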
# Linear.pb is a frozen Tensorflow graph.
# Converts it and saves the result in Linear.uff
convert-to-uff ../../models/Linear.pb
Inference
Now that I had the UFF model, I needed to use the tensorrt Python API and pycuda to run inferences.
from collections import namedtuple
from donkeycar.parts.keras import KerasPilot
import json
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pathlib import Path
import tensorflow as tf
import tensorrt as trt

HostDeviceMemory = namedtuple('HostDeviceMemory', 'host_memory device_memory')

class TensorRTLinear(KerasPilot):
    '''
    Uses TensorRT to do the inference.
    '''
    def __init__(self, cfg, *args, **kwargs):
        super(TensorRTLinear, self).__init__(*args, **kwargs)
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.cfg = cfg
        self.engine = None
        self.inputs = None
        self.outputs = None
        self.bindings = None
        self.stream = None

    def compile(self):
        print('Nothing to compile')

    def load(self, model_path):
        uff_model = Path(model_path)
        metadata_path = Path('%s/%s.metadata' % (uff_model.parent.as_posix(), uff_model.stem))
        with open(metadata_path.as_posix(), 'r') as metadata, trt.Builder(self.logger) as builder, builder.create_network() as network, trt.UffParser() as parser:
            metadata = json.loads(metadata.read())
            # Configure inputs and outputs
            print('Configuring I/O')
            input_names = metadata['input_names']
            output_names = metadata['output_names']
            for name in input_names:
                parser.register_input(name, (self.cfg.TARGET_D, self.cfg.TARGET_H, self.cfg.TARGET_W))
            for name in output_names:
                parser.register_output(name)
            # Parse network
            print('Parsing TensorRT Network')
            parser.parse(uff_model.as_posix(), network)
            print('Building CUDA Engine')
            self.engine = builder.build_cuda_engine(network)
            # Allocate buffers
            print('Allocating Buffers')
            self.inputs, self.outputs, self.bindings, self.stream = TensorRTLinear.allocate_buffers(self.engine)
            print('Ready')

    def run(self, image):
        # Channel first image format
        image = image.transpose((2, 0, 1))
        # Flatten it to a 1D array.
        image = image.ravel()
        # The first input is the image. Copy to host memory.
        image_input = self.inputs[0]
        np.copyto(image_input.host_memory, image)
        with self.engine.create_execution_context() as context:
            [throttle, steering] = TensorRTLinear.infer(context=context, bindings=self.bindings, inputs=self.inputs, outputs=self.outputs, stream=self.stream)
        return steering[0], throttle[0]

    @classmethod
    def allocate_buffers(cls, engine):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_memory = cuda.pagelocked_empty(size, dtype)
            device_memory = cuda.mem_alloc(host_memory.nbytes)
            bindings.append(int(device_memory))
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMemory(host_memory, device_memory))
            else:
                outputs.append(HostDeviceMemory(host_memory, device_memory))
        return inputs, outputs, bindings, stream

    @classmethod
    def infer(cls, context, bindings, inputs, outputs, stream, batch_size=1):
        # Transfer input data to the GPU.
        [cuda.memcpy_htod_async(inp.device_memory, inp.host_memory, stream) for inp in inputs]
        # Run inference.
        context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host_memory, out.device_memory, stream) for out in outputs]
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        return [out.host_memory for out in outputs]
Here is the full source code on GitHub.
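For completeness, here is roughly how this part could be exercised on its own, assuming a config object exposing TARGET_D, TARGET_H and TARGET_W and a UFF model sitting next to its .metadata file. The config class, paths and dummy frame below are illustrative, not from the original project, where the part is driven by the Donkey Car vehicle loop:

import numpy as np

class DummyConfig:
    # Channel-first dimensions matching what the UFF parser registers above.
    TARGET_D = 3
    TARGET_H = 224
    TARGET_W = 224

pilot = TensorRTLinear(cfg=DummyConfig())
pilot.load('models/Linear.uff')  # expects models/Linear.metadata next to the .uff file

# Feed a single HWC camera frame; run() returns (steering, throttle).
frame = np.random.rand(224, 224, 3).astype(np.float32)
steering, throttle = pilot.run(frame)
print('steering=%.3f throttle=%.3f' % (steering, throttle))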
Benchmarks
When using the TensorRT based UFF model on the Jetson Nano, I could do inferences at a frequency of 100-105 Hz.
This meant that I had the headroom to build a Resnet 18 based model.
Final Comparison
Here is the final comparison of all the techniques used to speed up ML inference. Using TensorRT, I had improved the inference rate to roughly 2.5x the performance-mode result, and about 4x the original Keras baseline.
| Model | Inference Frequency |
| --- | --- |
| Keras [TF-GPU] | 25 Hz |
| Keras [Performance Mode] | 40-42 Hz |
| TensorRT | 100-105 Hz |