TensorFlow#

TensorFlow is an open source machine learning library with a focus on deep neural networks. BentoML provides native support for serving and deploying models trained with TensorFlow.

Preface#

Even though bentoml.tensorflow supports Keras models, we recommend using bentoml.keras for a better development experience.

If you must use bentoml.tensorflow with your Keras model, make sure that the model's inference callback (such as predict) is decorated with tf.function, as sketched below.
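A minimal sketch of this pattern (MyKerasModel is a hypothetical model; only the placement of the tf.function decorator matters):

import tensorflow as tf
from tensorflow import keras

class MyKerasModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = keras.layers.Dense(1)

    # decorate the inference callback with tf.function and an explicit signature
    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 5], dtype=tf.float64)])
    def predict(self, inputs):
        return self.dense(inputs)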

Note

  • Keras is not optimized for production inference. There were known reports of memory leaks during serving at the time of the BentoML 1.0 release. The same issue applies to bentoml.keras, as it relies heavily on the Keras API.

  • Running inference with bentoml.tensorflow usually takes about half the time compared with bentoml.keras.

  • bentoml.keras performs input casting that mirrors the original Keras model's input signatures.

Note

We recommend applying model optimization techniques such as distillation or quantization. Alternatively, a Keras model can be converted to an ONNX model and served with a different runtime.
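As an illustration, a trained Keras model could be converted with the tf2onnx package and saved through bentoml.onnx. This is a sketch only, assuming tf2onnx and the ONNX runtime dependencies are installed, and that model is a trained Keras model such as the ones shown below:

import bentoml
import tensorflow as tf
import tf2onnx

# convert the trained Keras model to an ONNX ModelProto
onnx_model, _ = tf2onnx.convert.from_keras(
    model,
    input_signature=[tf.TensorSpec([None, 5], tf.float64, name="inputs")],
)

# save to the BentoML model store and serve through the ONNX runner instead
bentoml.onnx.save_model("my_onnx_model", onnx_model)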

Compatibility#

BentoML requires TensorFlow version 2.0 or higher. For TensorFlow 1.x models, consider writing a Custom Runner.
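A custom Runner for a TensorFlow 1.x graph might look roughly like the following sketch; frozen_graph.pb and the inputs:0/outputs:0 tensor names are placeholders for your own model:

import bentoml
import numpy as np
import tensorflow.compat.v1 as tf1

class TF1Runnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        # load a frozen TF1 graph and keep a long-lived session around
        graph_def = tf1.GraphDef()
        with open("frozen_graph.pb", "rb") as f:
            graph_def.ParseFromString(f.read())
        self.graph = tf1.Graph()
        with self.graph.as_default():
            tf1.import_graph_def(graph_def, name="")
        self.sess = tf1.Session(graph=self.graph)

    @bentoml.Runnable.method(batchable=True, batch_dim=0)
    def predict(self, inputs: np.ndarray) -> np.ndarray:
        # tensor names are placeholders; substitute your graph's own names
        return self.sess.run("outputs:0", feed_dict={"inputs:0": inputs})

runner = bentoml.Runner(TF1Runnable, name="tf1_runner")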

Saving a Trained Model#

bentoml.tensorflow supports saving tf.Module, keras.models.Sequential, and keras.Model.

train.py#
import bentoml
import numpy as np
import tensorflow as tf

# a model created with the TF native API
class NativeModel(tf.Module):
    def __init__(self):
        super().__init__()
        # use a tf.Variable so the weights are tracked as trainable variables
        self.weights = tf.Variable(np.ones((5, 1), dtype=np.float64))
        self.dense = lambda inputs: tf.matmul(inputs, self.weights)

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=[None, 5], dtype=tf.float64, name="inputs")
        ]
    )
    def __call__(self, inputs):
        return self.dense(inputs)

model = NativeModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

# train_x and train_y are assumed to be prepared elsewhere
EPOCHS = 10
for epoch in range(EPOCHS):
    with tf.GradientTape() as tape:
        predictions = model(train_x)
        loss = loss_object(train_y, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

bentoml.tensorflow.save_model(
    "my_tf_model",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
train.py#
import bentoml
import tensorflow as tf
from tensorflow import keras

# a model subclassing keras.Model
class Model(keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = keras.layers.Dense(1)

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=[None, 5], dtype=tf.float64, name="inputs")
        ]
    )
    def call(self, inputs):
        return self.dense(inputs)

model = Model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# train_x and train_y are assumed to be prepared elsewhere
model.fit(train_x, train_y, epochs=10)

bentoml.tensorflow.save_model(
    "my_keras_model",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
train.py#
import bentoml
import tensorflow as tf
from tensorflow import keras

# a keras Sequential model
model = keras.models.Sequential(
    (
        keras.layers.Dense(
            units=1,
            input_shape=(5,),
            dtype=tf.float64,
            use_bias=False,
            kernel_initializer=keras.initializers.Ones(),
        ),
    )
)
opt = keras.optimizers.Adam(0.002, 0.5)
model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
# train_x and train_y are assumed to be prepared elsewhere
model.fit(train_x, train_y, epochs=10)

bentoml.tensorflow.save_model(
    "my_keras_model",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
train.py#
import bentoml
import tensorflow as tf
from tensorflow import keras

# a model built with the keras functional API
x = keras.layers.Input((5,), dtype=tf.float64, name="x")
y = keras.layers.Dense(
    6,
    name="out",
    kernel_initializer=keras.initializers.Ones(),
)(x)
model = keras.Model(inputs=x, outputs=y)
opt = keras.optimizers.Adam(0.002, 0.5)
model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
# train_x and train_y are assumed to be prepared elsewhere
model.fit(train_x, train_y, epochs=10)

bentoml.tensorflow.save_model(
    "my_keras_model",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
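To verify a saved model, load it back from the model store and call it directly:

import bentoml
import numpy as np

# load the saved model back into the current process
loaded = bentoml.tensorflow.load_model("my_tf_model")
print(loaded(np.asarray([[1.0, 2.0, 3.0, 4.0, 5.0]])))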

bentoml.tensorflow also supports saving models that take multiple tensors as input:

train.py#
import bentoml
import tensorflow as tf

class MultiInputModel(tf.Module):
    def __init__(self):
        ...

    @tf.function(
        input_signature=[
            tf.TensorSpec(shape=[None, 5], dtype=tf.float64, name="x1"),
            tf.TensorSpec(shape=[None, 5], dtype=tf.float64, name="x2"),
            tf.TensorSpec(shape=(), dtype=tf.float64, name="factor"),
        ]
    )
    def __call__(self, x1: tf.Tensor, x2: tf.Tensor, factor: tf.Tensor):
        ...

model = MultiInputModel()
... # training

bentoml.tensorflow.save_model(
    "my_tf_model",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
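A runner created from this model takes the same positional inputs, in the order declared in the tf.function input signature. For example, a local debugging sketch with dummy data:

import bentoml
import numpy as np

runner = bentoml.tensorflow.get("my_tf_model").to_runner()
runner.init_local()  # for local debugging only; in a Service, BentoML manages runners

# positional arguments follow the tf.function input signature order
runner.run(np.ones((1, 5)), np.ones((1, 5)), np.asarray(2.0))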

Note

save_model accepts two signature-related parameters: tf_signatures and signatures.

Use these arguments to define the model signatures and ensure consistent model behavior, both within a Python session and when loading from the BentoML model store:

  • tf_signatures is an alias for the signatures field of tf.saved_model.save. This optional argument controls which methods of the given object are available to programs that consume the SavedModel, such as serving APIs. Read more about TensorFlow's signatures behavior in the TensorFlow API documentation.

  • signatures refers to the general model signatures that dictate which methods can be used for inference in the Runner context. This signatures dictionary is used when creating a Runner instance.

The default signatures used when creating a Runner is {"__call__": {"batchable": False}}.

This means BentoML's Adaptive Batching is disabled by default when using save_model().

If you want to utilize adaptive batching and know your model's dynamic batching dimension, make sure to pass in signatures as follows:

bentoml.tensorflow.save_model("my_model", model, signatures={"__call__": {"batch_dim": 0, "batchable": True}})
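For example, tf_signatures can pin a concrete serving function in the underlying SavedModel, independently of the Runner-facing signatures. This is a sketch only; the TensorSpec must match your model's input signature:

import bentoml
import tensorflow as tf

bentoml.tensorflow.save_model(
    "my_model",
    model,
    tf_signatures={
        "serving_default": model.__call__.get_concrete_function(
            tf.TensorSpec(shape=[None, 5], dtype=tf.float64, name="inputs")
        )
    },
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)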

Building a Service#

Create a BentoML service with the previously saved my_tf_model model using the tensorflow framework APIs.

service.py#
import bentoml
from bentoml.io import JSON

runner = bentoml.tensorflow.get("my_tf_model").to_runner()

svc = bentoml.Service(name="test_service", runners=[runner])

@svc.api(input=JSON(), output=JSON())
async def predict(json_obj):
    # wrap the input in a batch of one, run inference, then unwrap the result
    batch_ret = await runner.async_run([json_obj])
    return batch_ret[0]
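Start a development server with bentoml serve service.py:svc; the endpoint path defaults to the API function name, so the service above can be tested with any HTTP client. For example, a sketch using requests:

import requests

# the endpoint path defaults to the API function name, here /predict
res = requests.post(
    "http://127.0.0.1:3000/predict",
    json=[1.0, 2.0, 3.0, 4.0, 5.0],
)
print(res.json())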


Adaptive Batching#

Most TensorFlow models accept batched data as input. If your model supports batch inference, we recommend enabling adaptive batching to improve the model's throughput and efficiency.

Enable adaptive batching by overriding the signatures argument with the method name and the batchable and batch_dim configurations when saving the model to the model store:

batch.diff#
diff --git a/train.py b/train_batched.py
index 3b4bf11f..2d0ea09c 100644
--- a/train.py
+++ b/train_batched.py
@@ -3,15 +3,24 @@ import bentoml
class NativeModel(tf.Module):
    @tf.function(
        input_signature=[
-            tf.TensorSpec(shape=[1, 5], dtype=tf.int64, name="inputs")
+            tf.TensorSpec(shape=[None, 5], dtype=tf.float64, name="inputs")
        ]
    )
    def __call__(self, inputs):
        ...

model = NativeModel()
-bentoml.tensorflow.save_model("test_model", model)
+bentoml.tensorflow.save_model(
+    "test_model",
+    model,
+    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
+)

runner = bentoml.tensorflow.get("test_model").to_runner()
runner.init_local()
+
+#client 1
runner.run([[1,2,3,4,5]])
+
+#client 2
+runner.run([[6,7,8,9,0]])

From the diff above: when multiple clients send requests to a server running this model, BentoML automatically batches the inbound requests and invokes model([[1,2,3,4,5], [6,7,8,9,0]]) in a single call.

See also

See Adaptive Batching to learn more about the adaptive batching feature in BentoML.

Note

You can find more examples for TensorFlow in our bentoml/examples directory.