In Part I of the series, we converted a Keras model into Tensorflow's servable saved_model format and served and tested the model locally using tensorflow_model_server. Now we will put it in a Docker container and launch it into outer space: AWS Sagemaker. The process will be propelled by lots of Bash scripts and config files, and a successful launch will be confirmed by verification tests written in Python.

Contain It

Inference on AWS Sagemaker runs through two endpoints on port 8080: /invocations, which invokes the model, and /ping, which reports the health status of the endpoint. Tensorflow Serving, however, uses port 8500 for gRPC and port 8501 for its REST API. To bridge the gap, we use NGINX to proxy requests arriving on external port 8080 to the internal REST API port 8501. Here's the nginx.conf to be copied onto our Docker image.
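The exact contents will depend on your model name, but a minimal sketch of such a config, with <<your model name>> as a placeholder and the standard Tensorflow Serving REST paths, looks roughly like this:

events {}

http {
    server {
        listen 8080;

        # Sagemaker health check -> Tensorflow Serving model status
        location /ping {
            proxy_pass http://localhost:8501/v1/models/<<your model name>>;
        }

        # Sagemaker inference request -> Tensorflow Serving REST predict API
        location /invocations {
            proxy_pass http://localhost:8501/v1/models/<<your model name>>:predict;
        }
    }
}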

With that, we can construct a Dockerfile that extends the reference tensorflow/serving image, installs NGINX and git, copies our model directory and nginx.conf into the image, and starts the NGINX server before launching the tensorflow_model_server command. At runtime, the Docker container will run tensorflow_model_server on localhost:8501, and NGINX will proxy the REST API port to external port 8080 as specified in the nginx.conf file above.
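A rough sketch of such a Dockerfile, assuming the model directory sits next to it and reusing the <<your model name>> placeholder (version subdirectories and exact paths will vary with your model), might look like this:

FROM tensorflow/serving

# Install NGINX and git on top of the reference serving image
RUN apt-get update && \
    apt-get install -y --no-install-recommends nginx git && \
    rm -rf /var/lib/apt/lists/*

# Copy the servable model directory and the proxy config into the image
COPY models /models/<<your model name>>
COPY nginx.conf /etc/nginx/nginx.conf

EXPOSE 8080

# Start NGINX, then run tensorflow_model_server on the internal REST port
ENTRYPOINT service nginx start && \
    tensorflow_model_server --rest_api_port=8501 \
        --model_name=<<your model name>> \
        --model_base_path=/models/<<your model name>>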

Now we can build the Docker image using command line:

docker build -t <<name your image here>> .

After building the image, it can be run using this command:

docker run --rm -p 8080:8080 <<name your image here>>

Similar to the first part of the tutorial, it would be prudent to write test code to verify that the model inside the Docker container behaves the same way as the local model.
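A minimal sketch of such a test in Python, assuming a single-input Keras model saved as a hypothetical model.h5 and a random sample input (substitute your own model file and data), might look like this:

import json

import numpy as np
import requests
from tensorflow import keras

# Hypothetical path -- substitute your own model file from Part I
LOCAL_MODEL_PATH = 'model.h5'
DOCKER_URL = 'http://localhost:8080/invocations'

local_model = keras.models.load_model(LOCAL_MODEL_PATH)
sample = np.random.rand(1, *local_model.input_shape[1:]).astype('float32')

# Prediction from the local Keras model
local_pred = local_model.predict(sample)

# Prediction from the model served inside the Docker container
payload = json.dumps({'instances': sample.tolist()})
response = requests.post(DOCKER_URL, data=payload,
                         headers={'Content-Type': 'application/json'})
docker_pred = np.array(response.json()['predictions'])

# The two sets of predictions should agree within floating point tolerance
assert np.allclose(local_pred, docker_pred, atol=1e-5)
print('Docker container predictions match the local model')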

Push It

In order to deploy a Docker image on AWS Sagemaker, ECS, EKS, etc., we have to store the Docker image on AWS Elastic Container Registry (ECR). We have previously written a post on how to push Docker images to ECR. If you have the AWS CLI installed, we have created a convenience script to automate the process of pushing a local Docker image to ECR.
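A rough sketch of such a script, assuming us-west-2 as the region, an ECR repository named after the image, and an AWS CLI recent enough to have get-login-password (older v1 installs use aws ecr get-login instead):

#!/bin/bash
# Placeholder names -- substitute your own image name and region
IMAGE_NAME=<<name your image here>>
AWS_REGION=us-west-2
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
ECR_REPO=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/${IMAGE_NAME}

# Create the repository if it does not exist yet
aws ecr describe-repositories --repository-names ${IMAGE_NAME} --region ${AWS_REGION} || \
    aws ecr create-repository --repository-name ${IMAGE_NAME} --region ${AWS_REGION}

# Log in to ECR, tag the local image, and push it
aws ecr get-login-password --region ${AWS_REGION} | \
    docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
docker tag ${IMAGE_NAME}:latest ${ECR_REPO}:latest
docker push ${ECR_REPO}:latest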

Serve It

In order for AWS Sagemaker to successfully deploy a model, we need to create an AmazonSageMaker-ExecutionRole from the AWS IAM console and attach both the AmazonSageMakerFullAccess and AmazonS3FullAccess policies to it. The role allows the container to have full access to both Sagemaker for inferencing and S3 for storing model artifacts.
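If you prefer the command line over the console, a rough CLI equivalent (the role name below mirrors the one above and is otherwise an assumption) might look like this:

#!/bin/bash
ROLE_NAME=AmazonSageMaker-ExecutionRole

# Trust policy that lets Sagemaker assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "sagemaker.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role --role-name ${ROLE_NAME} \
    --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name ${ROLE_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name ${ROLE_NAME} \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess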

After that’s done, here’s a bash script to create a Sagemaker model using the Docker image we pushed to ECR.
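A minimal sketch of that step with aws sagemaker create-model (the model name, region, and role ARN below are placeholders):

#!/bin/bash
MODEL_NAME=<<your model name>>
AWS_REGION=us-west-2
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
IMAGE_URI=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/<<name your image here>>:latest
ROLE_ARN=arn:aws:iam::${AWS_ACCOUNT_ID}:role/AmazonSageMaker-ExecutionRole

# Register the Docker image on ECR as a Sagemaker model
aws sagemaker create-model \
    --model-name ${MODEL_NAME} \
    --primary-container Image=${IMAGE_URI} \
    --execution-role-arn ${ROLE_ARN} \
    --region ${AWS_REGION}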

Now all that's left is to create the actual Sagemaker inference endpoint. Here's a bash script to create it. We will be using the el cheapo ml.c4.large in the example; as of 2019-12-20, it costs $0.14 per hour in us-west-2.
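A sketch of that script using aws sagemaker create-endpoint-config and create-endpoint (names below are placeholders; the endpoint takes several minutes to provision):

#!/bin/bash
MODEL_NAME=<<your model name>>
ENDPOINT_CONFIG_NAME=${MODEL_NAME}-config
ENDPOINT_NAME=${MODEL_NAME}-endpoint
AWS_REGION=us-west-2

# Endpoint config: a single ml.c4.large instance serving all traffic
aws sagemaker create-endpoint-config \
    --endpoint-config-name ${ENDPOINT_CONFIG_NAME} \
    --production-variants VariantName=AllTraffic,ModelName=${MODEL_NAME},InitialInstanceCount=1,InstanceType=ml.c4.large \
    --region ${AWS_REGION}

# Create the actual inference endpoint from the config
aws sagemaker create-endpoint \
    --endpoint-name ${ENDPOINT_NAME} \
    --endpoint-config-name ${ENDPOINT_CONFIG_NAME} \
    --region ${AWS_REGION}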

Your model is now deployed on AWS Sagemaker Inference and online! You can quickly verify it from the AWS Sagemaker console -> Inference -> Endpoints.

Use It

WARNING: AWS Sagemaker Inference has a JSON payload size limit of 5MB. However, a typical 100kB image, when converted to an image array string, will be ~5.5MB. AWS Sagemaker will elegantly throw a ConnectionResetError: [Errno 104] Connection reset by peer and tell you to go away. Due to this payload limit, Sagemaker or Elastic Inference is not a suitable choice at the moment for deploying anything that involves images or other large inputs without preprocessing steps. At the same time, including the preprocessing steps tends to slow down the service and blocks the potential of using either the Sagemaker Inference API or Tensorflow Serving for batch inferencing.

There are two options to send test requests to the newly airborne endpoint. First, there's the always useful AWS CLI. The following bash script invokes the endpoint with a bogus payload request of [1.0,2.0,3.0].
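A sketch of that invocation with a placeholder endpoint name (AWS CLI v2 additionally needs --cli-binary-format raw-in-base64-out for the raw JSON body):

#!/bin/bash
ENDPOINT_NAME=<<your endpoint name>>

# Invoke the endpoint with the bogus payload and print the prediction
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name ${ENDPOINT_NAME} \
    --content-type application/json \
    --body '{"instances": [1.0, 2.0, 3.0]}' \
    output.json

cat output.json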

We can also invoke the endpoint programmatically, e.g., from Python via the boto3 library or from Node via the AWS SDK for JavaScript. Since Python is the weapon of choice for machine learning engineers, here's a short snippet on how to run the same bogus request as above using Python.

import json
import boto3

ENDPOINT_NAME = '<<your endpoint name>>'  # placeholder -- substitute your own
client = boto3.client('sagemaker-runtime')

# Serialize the payload once; Body takes the JSON string directly
predict_request = json.dumps({"instances": [1.0, 2.0, 3.0]})
response = client.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                  ContentType='application/json',
                                  Body=predict_request)
prediction = response['Body'].read()

That's it. We showed how to convert Keras models into Tensorflow servable format in our last post, and in this post we have demonstrated a quick workflow to package the model into a Docker image, push it to AWS ECR, create the model on Sagemaker, and deploy it using Sagemaker Inference. As a reference, we have implemented this architecture in our Memefly Project REST API Endpoint.


To cite this content, please use:

@article{
    leehanchung,
    author = {Lee, Hanchung},
    title = {Serve Keras Model with Tensorflow Serving On AWS Sagemaker},
    year = {2019},
    howpublished = {\url{https://leehanchung.github.io}},
    url = {https://leehanchung.github.io/blogs/2019/12/20/TFServing-Sagemaker/}
}