
Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve, a library built to scale API services on a Ray cluster, to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.

The central API of the Ray Serve integration in BentoML is bentoml.ray.deployment, which converts a Bento into a Ray Serve Deployment. In its simplest form, only a Bento tag is required to create a Deployment.

import bentoml

classifier = bentoml.ray.deployment('iris_classifier:latest')

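The Ray Serve Deployment can then be deployed locally or to a Ray cluster using Ray Serve’s serve run command. The target is given as module:variable, so the command below assumes the code above is saved as bento_ray.py.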

serve run bento_ray:classifier
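The module:variable form of the serve run target can be illustrated in plain Python. This is a minimal sketch of how such a path maps to an object, not Ray Serve's actual implementation; resolve_import_path is a hypothetical helper introduced only for illustration.

```python
import importlib
from typing import Any


def resolve_import_path(import_path: str) -> Any:
    """Resolve a "module:attribute" string to the Python object it names."""
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# "bento_ray:classifier" would import bento_ray.py and return its
# `classifier` variable. A standard-library target is used here so the
# sketch runs without BentoML or Ray installed.
pi = resolve_import_path("math:pi")
```

Under this reading, bento_ray must be importable from the directory where serve run is invoked, and classifier must be a module-level variable.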

Scaling Resources and Autoscaling

The bentoml.ray.deployment API also supports configuring resource allocation and autoscaling behavior. In addition to the Bento tag, the service_deployment_config and runner_deployment_config arguments configure the Deployments of the API Server and the Runners, respectively. Both accept any parameter allowed in a Ray Serve Deployment. In runner_deployment_config, each Runner's settings are keyed by the Runner name.

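import bentoml

classifier = bentoml.ray.deployment(
    "iris_classifier:latest",
    # service_deployment_config: options for the API Server Deployment
    {
        "route_prefix": "/classify",
        "num_replicas": 3,
        "ray_actor_options": {"num_cpus": 1},
    },
    # runner_deployment_config: options per Runner, keyed by Runner name
    {
        "iris_clf": {
            "num_replicas": 1,
            "ray_actor_options": {"num_cpus": 5},
        }
    },
)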


Arguments in the service_deployment_config and runner_deployment_config dictionaries are passed through directly to Deployment. Please refer to Resource Allocation and Ray Serve Autoscaling for the full list of supported arguments.
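The autoscaling side of these options can be sketched as a small helper that builds one Runner's entry. This is a minimal sketch, not part of BentoML: runner_autoscaling_entry is a hypothetical helper, and it assumes Ray Serve's autoscaling_config Deployment option with its min_replicas and max_replicas keys. It also assumes that a fixed num_replicas and autoscaling_config are alternatives, which has been the case in Ray Serve releases that reject a Deployment setting both, so only the latter is emitted.

```python
from typing import Any, Dict


def runner_autoscaling_entry(
    min_replicas: int, max_replicas: int, num_cpus: float = 1
) -> Dict[str, Any]:
    """Build one Runner's Deployment options with autoscaling enabled.

    Hypothetical helper for illustration. The keys are Ray Serve
    Deployment options, which are passed through unchanged.
    """
    if min_replicas < 0 or max_replicas < max(min_replicas, 1):
        raise ValueError(
            "expected 0 <= min_replicas <= max_replicas and max_replicas >= 1"
        )
    return {
        # No "num_replicas" key: this sketch treats a fixed replica
        # count and autoscaling as alternative ways to size a Deployment.
        "autoscaling_config": {
            "min_replicas": min_replicas,
            "max_replicas": max_replicas,
        },
        "ray_actor_options": {"num_cpus": num_cpus},
    }


# Keyed by Runner name, as the Runner configuration expects.
runner_config = {"iris_clf": runner_autoscaling_entry(1, 4, num_cpus=2)}
```

The resulting runner_config dictionary has the same shape as the Runner configuration in the example above and could be passed to bentoml.ray.deployment in its place.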


Batching behavior can be configured through the enable_batching and batching_config arguments. In batching_config, settings are keyed first by Runner name and then by Runner method name, so max_batch_size and batch_wait_timeout_s can be configured for each Runner independently.

import bentoml

deploy = bentoml.ray.deployment(
    "iris_classifier:latest",
    enable_batching=True,
    batching_config={
        "iris_clf": {
            "predict": {
                "max_batch_size": 1024,
                "batch_wait_timeout_s": 0.2
            }
        }
    }
)


Arguments in the batching_config dictionary are passed through directly to Ray Serve. Please refer to Ray Serve Batching for the full list of supported arguments.
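Because the batching settings are nested three levels deep, a mistake in the nesting is easy to make. The following is a minimal sketch of a pre-flight check for that shape; check_batching_config is a hypothetical helper, not something BentoML or Ray Serve provides, and it only inspects the two settings named above while leaving any other keys untouched.

```python
from typing import Any, Dict


def check_batching_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Check the Runner name -> method name -> settings nesting.

    Hypothetical helper for illustration. Returns the config unchanged
    when it passes, and raises ValueError otherwise.
    """
    for runner_name, methods in config.items():
        if not isinstance(methods, dict):
            raise ValueError(f"{runner_name}: expected a dict of methods")
        for method_name, settings in methods.items():
            where = f"{runner_name}.{method_name}"
            if not isinstance(settings, dict):
                raise ValueError(f"{where}: expected a dict of settings")
            # Only the two documented settings are range-checked.
            if settings.get("max_batch_size", 1) < 1:
                raise ValueError(f"{where}: max_batch_size must be >= 1")
            if settings.get("batch_wait_timeout_s", 0) < 0:
                raise ValueError(f"{where}: batch_wait_timeout_s must be >= 0")
    return config


batching_config = check_batching_config(
    {"iris_clf": {"predict": {"max_batch_size": 1024, "batch_wait_timeout_s": 0.2}}}
)
```

A config that skips the method level, such as {"iris_clf": {"max_batch_size": 1024}}, fails this check with a ValueError instead of being passed on in the wrong shape.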


See the API references to learn more about the Ray Serve integration in BentoML.