Plug & Play Deployments of Machine Learning Models Using TensorFlow Serving

In previous posts we have already explored several ways to bring your machine learning models into production (serverless with Algorithmia, or using FastAPI). Each approach has its advantages and disadvantages and the applicability strongly depends on your use case. In this post we will look at another tool to wrap (TensorFlow) models as an API: TensorFlow Serving.

TensorFlow Serving is directly developed by the TensorFlow team which describes it as a “[…] flexible, high-performance serving system for machine learning models, designed for production environments”. Sounds pretty impressive, so where does it fit into the landscape of model serving options? Compared to a serverless solution like Algorithmia and a FastAPI based custom implementation of API endpoints, TensorFlow Serving sits somewhere in between. It abstracts away the actual implementation of an API for your models, but the infrastructure is still your responsibility. A typical use case could be a microservice application where your model will be used by multiple other downstream services for many different purposes. In this case you probably just want to provide a direct, standardized and high-performance interface to your model without bothering about writing customized endpoints by yourself. With TensorFlow Serving you get just that while maintaining the flexibility to deploy your models to any server infrastructure you like. So, let’s try it out.

Train a TensorFlow Model

To understand the functionality of TensorFlow Serving we will implement a minimal viable product based on a simple TensorFlow model which predicts heart disease from input variables like the age, sex or chest pain type of a person. Since the purpose of this post is to demonstrate TensorFlow Serving, we use this repository as our starting point to train a small feed-forward neural network on a data set provided by the Cleveland Clinic Foundation for Heart Disease. We follow this tutorial up to the model training part. Let's suppose we are not quite sure yet how many epochs we need to train our model until we get reliable results, so we start with just one epoch for now…

# Train the model
model.fit(train_ds, epochs=1, validation_data=val_ds)


8/8 [==============================] - 1s 55ms/step - loss: 0.7170 - accuracy: 0.5688 - val_loss: 0.5961 - val_accuracy: 0.6885
<tensorflow.python.keras.callbacks.History at 0x14bdd4760>

Not bad for the moment. Since our stakeholders (probably another team which depends on our results) are really impatient, we decide to bring this first version into production right away. To realize this, we simply export the current state in the SavedModel format:

model.save("/Path/to/local/project-root-folder/models/tf_model/1/")

The resulting folder basically contains all necessary information about your model and can be consumed by TensorFlow Serving out of the box. You can find more information about the SavedModel format here. In our case we make sure to choose a folder structure which allows us to version the models, since we already have a subconscious feeling that our first approach can be improved in the future. But for now, we are ready to go. Your folder structure should look like this:

tf_model
|____ 1
      |____ assets
      |____ variables
            |____ variables.data-00000-of-00001
            |____ variables.index
      |____ saved_model.pb
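TensorFlow Serving treats each numeric subdirectory as one model version and, by default, serves only the highest number it finds. The selection logic can be sketched in plain Python (an illustrative helper of our own, not TensorFlow Serving code):

```python
def latest_version(subdir_names):
    """Given the names of the entries inside a model folder (e.g. '1', '2',
    'some_file'), return the highest numeric version. This mirrors
    TensorFlow Serving's default policy of serving only the latest model.
    Illustrative sketch only, not part of TensorFlow Serving."""
    versions = [int(name) for name in subdir_names if name.isdigit()]
    return max(versions) if versions else None
```

So with the folder structure above, TensorFlow Serving would pick version 1, and later automatically switch to version 2 once that subdirectory appears.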

TensorFlow Serving Docker Version

To make our model available with the help of TensorFlow Serving, the easiest way is to use a Docker container with everything preconfigured for us. Fortunately, such an image already exists, and we just have to mount our model folder in the container to make it work. Just open a terminal and execute the following docker command:

docker run -p 8501:8501 --mount type=bind,source=/Path/to/local/project-root-folder/models/tf_model/,target=/models/tf_model/ -e MODEL_NAME=tf_model -t tensorflow/serving

That’s it. The container automatically detects your model in version 1, loads it and exposes a standardized POST endpoint. You can check the status of your model by browsing to http://localhost:8501/v1/models/tf_model. If everything works fine you should see something like this:

TensorFlow Serving Model Status
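The status endpoint returns a JSON document with a `model_version_status` list, where each entry carries a `version` and a `state` field. A small helper (our own sketch, assuming that response shape) can pull out the versions that are currently being served:

```python
def available_versions(status_json):
    """Extract the versions reported as AVAILABLE from the JSON returned by
    GET /v1/models/<model_name>. Assumes the documented response shape: a
    'model_version_status' list with 'version' and 'state' fields.
    Illustrative helper, not part of any library."""
    return sorted(
        int(entry["version"])
        for entry in status_json.get("model_version_status", [])
        if entry["state"] == "AVAILABLE"
    )
```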

Request your Model

To get predictions from your model you have to follow the data format expected by TensorFlow Serving. Let's say we want to use the same sample data as in the provided example.

sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}

To achieve this, we bring the data into the right format and send a POST request with a corresponding JSON object to our model.

input_dict = {"instances": [{name: [value] for name, value in sample.items()}]}

input_dict looks like this:

{'instances': [{'age': [60],
   'sex': [1],
   'cp': [1],
   'trestbps': [145],
   'chol': [233],
   'fbs': [1],
   'restecg': [2],
   'thalach': [150],
   'exang': [0],
   'oldpeak': [2.3],
   'slope': [3],
   'ca': [0],
   'thal': ['fixed']}]}
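The dictionary comprehension above generalizes naturally to more than one sample. A small convenience function (a hypothetical helper of our own, not part of TensorFlow Serving) might look like this:

```python
def to_serving_payload(*samples):
    """Wrap one or more feature dicts in the {'instances': [...]} format
    expected by the TensorFlow Serving REST predict endpoint. Each value is
    wrapped in a single-element list because our model expects inputs of
    shape (1,) per feature. Hypothetical helper for illustration."""
    return {
        "instances": [
            {name: [value] for name, value in s.items()} for s in samples
        ]
    }
```

Calling `to_serving_payload(sample)` reproduces exactly the `input_dict` shown above, and passing several samples yields one entry per instance.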

Now we make a request:

import requests

res = requests.post("http://localhost:8501/v1/models/tf_model:predict", json=input_dict)
res.json()


{'predictions': [[0.302037973]]}
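Note that the response nests one inner list per instance. A tiny helper (again our own, hypothetical) can flatten it into plain probabilities:

```python
def extract_probabilities(response_json):
    """Flatten the {'predictions': [[p1], [p2], ...]} structure returned by
    the :predict endpoint into a plain list of floats, one per instance.
    Assumes a single-output model; illustrative helper only."""
    return [row[0] for row in response_json["predictions"]]
```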

And there is our result: roughly a 30% probability of heart disease. But since this was our first try, we really should invest some time to train a better model. So, let's train a new version with 20 epochs and save it as version 2.

model.fit(train_ds, epochs=20, validation_data=val_ds)
model.save("/Path/to/local/project-root-folder/models/tf_model/2/")


Epoch 20/20
8/8 [==============================] - 0s 8ms/step - loss: 0.3758 - accuracy: 0.8192 - val_loss: 0.3910 - val_accuracy: 0.8361
<tensorflow.python.keras.callbacks.History at 0x14bd1cd00>

We can see that the validation accuracy increased considerably, which indicates that it was a good idea to train the model a bit longer. If you look at the terminal where you executed the Docker container, you will see that the new model is automatically detected and the endpoint is updated to version 2. If you send the same request again, you now get an increased probability of about 40%. Let's have a look at the status of our model.

TensorFlow Serving Two Versions of a Model

As you can see, we have two versions now, but the state of version 1 is "END", which means the endpoint only serves the newest model; this is the default behavior of TensorFlow Serving. But what about a situation where a downstream service depends on a certain version of our model? For example, if you change the number of input parameters, every service which uses your endpoint has to migrate its implementation to your new model structure. In such situations you typically want to deploy a new version with breaking changes while keeping the old version online as well, until every consumer of your API has migrated.

To realize this, we have to adjust the default configuration of TensorFlow Serving by mounting a config file to the container as well. We name the file models.config and save it in our ./project-root-folder/models directory. To serve both versions, the content of models.config looks like this:

model_config_list {
  config {
    name: 'tf_model'
    base_path: '/models/tf_model/'
    model_platform: 'tensorflow'
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
}
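As a side note: if you always want to keep the most recent N versions online without listing them explicitly, TensorFlow Serving also supports a `latest` version policy. The fragment below would take the place of the `specific` block:

```
model_version_policy {
  latest {
    num_versions: 2
  }
}
```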

Now we restart the container with the new settings. Unfortunately, stopping TensorFlow Serving gracefully is a bit inconvenient. The easiest way is to just kill the current container from another terminal and execute a new docker run command:

docker kill <CONTAINER ID>
docker run -p 8501:8501 --mount type=bind,source=/Path/to/local/project-root-folder/models/tf_model/,target=/models/tf_model/ --mount type=bind,source=/Path/to/local/project-root-folder/models/models.config,target=/models/models.config -t tensorflow/serving --model_config_file=/models/models.config

The status at http://localhost:8501/v1/models/tf_model now shows that versions 1 and 2 are available. In this setup we can send requests to the endpoint http://localhost:8501/v1/models/tf_model/versions/1:predict and will still get the results of model version 1. Requests to the base URL are still served by the newest version, so you should think about the expected behavior of your endpoints before you distribute the links to the developers of the downstream services.
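The URL scheme is regular enough that a downstream team could centralize it in one place. A sketch (hypothetical helper, names are ours) of building pinned and unpinned predict URLs:

```python
def predict_url(host, model_name, version=None):
    """Build a TensorFlow Serving REST predict URL. Pinning a version keeps
    a downstream service on a known model, while the unpinned URL always
    tracks the newest available version. Hypothetical helper for
    illustration."""
    url = f"{host}/v1/models/{model_name}"
    if version is not None:
        url += f"/versions/{version}"
    return url + ":predict"
```

A consumer that has not migrated yet would call `predict_url("http://localhost:8501", "tf_model", 1)`, while new consumers can omit the version and follow the latest model.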


In this post we have discussed another option to bring your machine learning model into production with the help of TensorFlow Serving. In future posts we will implement more professional setups with monitoring services or load balancers. Monitoring, for example, can be useful to know when it is safe to turn off an old version of your model (no traffic has been detected on its endpoint for a certain time), and load balancers can distribute incoming requests across multiple instances of your endpoints. It is also worth mentioning that TensorFlow models work out of the box with TensorFlow Serving, but other model types should work as well according to the official website. This will also be tested in a future post.
