Serving with GPU#

Most popular deep learning frameworks (TensorFlow, PyTorch, ONNX, etc.) support GPUs for both training and inference. This guide demonstrates how to serve models with BentoML on GPU.
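For reference, a GPU-enabled service definition looks the same as a CPU one. Below is a minimal sketch using the BentoML 1.x runner API; the model tag my_torch_model:latest and the service name are placeholders, and it assumes a PyTorch model was already saved to the local model store whose runner places it on GPU when a CUDA device is visible:

# service.py -- minimal sketch with placeholder names
import numpy as np

import bentoml
from bentoml.io import NumpyNdarray

# Load the saved model and wrap it in a runner; the framework runner
# moves the model to GPU when one is available to the process.
runner = bentoml.pytorch.get("my_torch_model:latest").to_runner()

svc = bentoml.Service("gpu_service", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_arr: np.ndarray) -> np.ndarray:
    return runner.run(input_arr)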

Docker Image Options#

See Docker options for all GPU-related docker image configuration. Here’s a sample bentofile.yaml config for serving with GPU:

service: "service:svc"
include:
- "*.py"
python:
    packages:
    - torch
    - torchvision
    - torchaudio
    extra_index_url:
    - "https://download.pytorch.org/whl/cu113"
docker:
    distro: debian
    python_version: "3.8.12"
    cuda_version: "11.6.2"
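With this bentofile.yaml in place, build the Bento from the project directory as usual:

$ bentoml build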

When containerizing a saved Bento with a cuda_version configured, BentoML will install the corresponding CUDA version onto the generated docker image:

$ bentoml containerize MyTFService:latest -t tf_svc

If the desired cuda_version is not natively supported by BentoML, users can still customize the installation of the CUDA driver and libraries via the system_packages, setup_script, or base_image options under the Bento build docker options.
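For example, a sketch of a docker section that starts from an official NVIDIA CUDA base image instead of cuda_version; the base image tag is only one example, and setup_cuda.sh is a hypothetical script you would provide for any extra driver or library setup:

docker:
    base_image: "nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04"
    setup_script: "./setup_cuda.sh"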

Running Docker with GPU#

The NVIDIA Container Toolkit is required for running docker containers with an NVIDIA GPU. NVIDIA provides detailed instructions for installing both Docker CE and nvidia-docker.
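Once the toolkit is installed, a quick way to verify that docker can see the GPU is to run nvidia-smi inside a CUDA base image (the image tag below is only one example):

$ docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi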

Start the generated Bento image and check for GPU usage:

$ docker run --gpus all ${DEVICE_ARGS} -p 3000:3000 tf_svc:latest --workers=2

See also

For more information, check out the nvidia-docker wiki.

Note

It is recommended to pass the device locations explicitly with --device when running the container:

$ docker run --gpus all --device /dev/nvidia0 \
               --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
               --device /dev/nvidia-modeset --device /dev/nvidiactl <docker-args>

Tip

To check GPU usage, run nvidia-smi and verify that the BentoService process is using the GPU, e.g.:

» nvidia-smi
Thu Jun 10 15:30:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8     6W /  N/A |    753MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    179346      C   /opt/conda/bin/python             745MiB |
+-----------------------------------------------------------------------------+