Performance Guide#

This guide is intended to aid advanced BentoML users with a better understanding of the costs and performance overhead of their model serving workload. This guide will also demonstrate BentoML’s architecture and provide insights into how users can fine-tune its performance.

Todo

Performance Guide Todo items:

  • basic load testing with locust

  • load testing tips:
    • production mode vs development mode

    • enable/disable logging

    • always run locust client on a separate machine

  • performance best practices:
    • bentoml serve options: –api-worker, –backlog, –timeout

    • configure runner resources

    • configure adaptive batching (max_latency, max_batch_size)

    • embedded runner

  • existing benchmark results and comparisons

  • advanced topics:
    • alternative load testing with grafana k6

    • setup tracing and dashboard

    • setup tracing for Yatai and distributed Runner

    • instrument tracing for user service and runner code

Help us improve the project!

Found an issue or a TODO item? You’re always welcome to make contributions to the project and its documentation. Check out the BentoML development guide and documentation guide to get started.