We recently trained a series of deep learning models for image classification and decided to host them on AWS Lambda. In this 2-part series, we’ll look at why and how we went Serverless and try to impart some lessons we learned.
This article should be useful to anyone considering deploying larger models like CNNs in a Serverless environment. It assumes a basic understanding of Serverless computing and Deep Learning.
Our customers have told us they want to find influencers by gender and age. In researching the problem of age and gender classification, we came across an excellent paper called DEX: Deep EXpectation of apparent age from a single image which proposes a deep learning approach to the problem. They take the popular VGG-16 CNN architecture with weights pretrained on ImageNet and repurpose it by training the last layers on a labelled set of cropped and aligned faces. They used a HOG face detector to extract faces from images. This process of repurposing models for new datasets is commonly referred to as transfer learning.
Fig 1. Image from DEX: Deep EXpectation of apparent age from a single image by Rasmus Rothe and Radu Timofte and Luc Van Gool, 2015
Though the DEX paper uses VGG-16, our research found newer CNN architectures like Google’s Inception or similar architectures like Xception – which we also liked because it’s small and fast – were more accurate and faster to train on our internal dataset. We used the models and pretrained weights available in the amazing Keras library, with Tensorflow as the backend engine.
To productionise the models, we needed a solution that could fit into our existing RabbitMQ-backed data pipeline, which consists of inbound servers that push blog and social media data into a queue, and worker servers that process the data and store results into Elasticsearch.
Fig 2. Simplified version of Scrunch’s data pipeline.
Our worker servers are quite light and don’t have capacity to host 3+ new CNN models.
We considered just building bigger worker servers and hosting the models alongside our other code, or spinning up new servers and creating an interface between them and our other code, but we instead chose to offload the compute to Serverless. A number of things were appealing to us about Serverless:
- The promise of less Ops, especially since we don’t have a dedicated team and each dev shares the burden.
- Though we mainly run predictable batch processing jobs, we do have some ad-hoc requests, so we liked the idea of not paying for compute when nothing’s happening but having it quickly available.
- We thought common issues with Serverless, like dealing with cold requests, weren’t such a problem for us since the majority of our requests were expected to be backend generated – we can handle the odd slow request.
We chose AWS Lambda over other Serverless providers because our queue and worker servers are already hosted in AWS.
Lambda Architecture Considerations
The first draft of our Lambda architecture was a single Lambda function that would:
- Accept an image url and some metadata over HTTP.
- Download the image.
- Find faces using the HOG face detector in Dlib and crop and align with Pillow.
- Load and call our pretrained models to return the final set of results.
We had used services like Kairos in the past, and it seemed sensible to build a drop-in replacement. However, we quickly learned that this isn’t the Lambda way for a number of reasons:
- A Lambda function only has 500MB of disk space to play with including any code dependencies. Since our trained Xception model was around 88MB and Tensorflow itself is 150MB, it didn’t give us much room for multiple libraries and models.
- The maximum package size you can upload is 50MB. To work around this limitation, you upload your large dependencies to S3, then download them to the temporary directory (/tmp) on first request, where they are cached for future invocations. Since downloading lots of stuff from S3 can be quite slow, you want to do as little of this as possible.
- Since you pay per 100ms of execution on Lambda, it would be preferable not to pay for network IO-bound stuff like downloading images in Lambda.
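The S3 workaround in the list above can be sketched roughly as follows. This is a minimal illustration rather than our production code: `ensure_local`, the `fetch` callable and the `models/xception.h5` key are hypothetical names, and the downloader is passed in so the sketch runs without AWS access (in a real function it could wrap an S3 client download call).

```python
import os
import tempfile

# Module-level state survives between invocations while a Lambda container stays warm.
_local_paths = {}


def ensure_local(key, fetch, cache_dir="/tmp"):
    """Return a local path for `key`, downloading it only if it is missing.

    `fetch(key, destination)` is a hypothetical downloader supplied by the caller.
    """
    cached = _local_paths.get(key)
    if cached and os.path.exists(cached):
        return cached
    destination = os.path.join(cache_dir, os.path.basename(key))
    if not os.path.exists(destination):
        fetch(key, destination)  # the slow step, paid only on a cold request
    _local_paths[key] = destination
    return destination


# Usage sketch with a fake downloader, so it runs without AWS access.
downloads = []


def fake_fetch(key, destination):
    downloads.append(key)
    with open(destination, "wb") as handle:
        handle.write(b"placeholder weights")


with tempfile.TemporaryDirectory() as scratch:
    first = ensure_local("models/xception.h5", fake_fetch, cache_dir=scratch)
    second = ensure_local("models/xception.h5", fake_fetch, cache_dir=scratch)

print(len(downloads))  # the second call reuses the cached file
```

Keeping the cache at module level is what makes warm requests cheap: only the first request in a fresh container pays for the download.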
Instead, we built a number of small Lambda functions that each serve just one model: one finds and crops faces from an image and returns them as base64-encoded strings, and another set takes the face strings as input and outputs age or gender predictions. We then used our worker servers to coordinate calls to the various Lambda functions and to do network IO-bound work like downloading images.
Fig 3. Scrunch’s proposed Serverless pipeline
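On a worker server, the coordination described above might look roughly like the sketch below. The function names (`face-detector`, `age-classifier`, `gender-classifier`), the payload keys and the `invoke` callable are all assumptions made for illustration; `invoke` stands in for whatever client actually calls Lambda and returns the decoded JSON response.

```python
import base64


def classify_faces(image_bytes, invoke):
    """Coordinate the per-model functions for one already-downloaded image.

    `invoke(function_name, payload)` is a hypothetical callable that sends a
    JSON-serialisable payload to one Lambda function and returns its response.
    """
    encoded_image = base64.b64encode(image_bytes).decode("ascii")
    # Step 1: one function finds and crops faces, returning base64 strings.
    faces = invoke("face-detector", {"image": encoded_image})["faces"]
    # Step 2: each face string is sent to the single-model classifiers.
    results = []
    for face in faces:
        results.append({
            "age": invoke("age-classifier", {"face": face})["prediction"],
            "gender": invoke("gender-classifier", {"face": face})["prediction"],
        })
    return results


# Usage sketch with a fake invoker, so it runs without AWS access.
def fake_invoke(function_name, payload):
    if function_name == "face-detector":
        return {"faces": ["ZmFjZTE=", "ZmFjZTI="]}
    if function_name == "age-classifier":
        return {"prediction": "25-34"}
    return {"prediction": "female"}


predictions = classify_faces(b"raw image bytes", fake_invoke)
print(len(predictions))  # one result per detected face
```

Because the image download happens before `classify_faces` is called, none of the network IO-bound waiting is billed as Lambda execution time.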
It turns out there are a number of advantages to doing it this way, which may be familiar to engineers who build microservices:
- We can separately improve upon each model and very easily deploy replacements – new ML engineers can train and ship models without being across our entire code base.
- We can pick whatever technology we want to solve each problem without the choices stepping on each other’s toes. This means we can use Dlib for the face detector, Tensorflow/Keras for inference and potentially Pytorch or any other ML toolkit down the track.
- Components can be tested and monitored in isolation and failures are isolated to a single part of the pipeline.
Key takeaway: Lambda functions should be small and usually just do one thing.
Performance and Costs
To justify introducing Lambda, we felt the costs had to be somewhat on par with just running EC2 instances. That said, we were willing to let Lambda be a little more expensive if it meant less future Ops overhead.
We expect to process between 6 and 8 million images per model per month and figured we could do it on a single c3.2xlarge. Since we do occasionally need to process results on demand for customers, we need to ensure the models are always available, so an additional server would be required for redundancy.
I’m sure we could do it much cheaper using autoscaling, spot pricing, reserved instances, and even testing way smaller instances, but it means infrastructure testing and management.
To evaluate our cost estimates on Lambda, we wrote a script that called our function 150 times after warming it with a single request, then another script to parse our Cloudwatch logs and determine exactly how long AWS billed each request for. We then tried a number of experiments to reduce the request time.
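A log-parsing script along those lines could be sketched as below. It assumes the standard `REPORT` line that Lambda writes at the end of each invocation; the sample lines and request IDs are made up for illustration, and `billed_durations` is a hypothetical helper name.

```python
import re
import statistics

# Matches the REPORT line Lambda emits once per invocation.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>\d+) ms"
)


def billed_durations(log_lines):
    """Extract the billed duration in milliseconds from each REPORT log line."""
    durations = []
    for line in log_lines:
        match = REPORT_PATTERN.search(line)
        if match:
            durations.append(int(match.group("billed")))
    return durations


# Made-up sample lines in the REPORT layout, for illustration only.
sample_lines = [
    "START RequestId: 1a2b Version: $LATEST",
    "REPORT RequestId: 1a2b\tDuration: 2043.17 ms\tBilled Duration: 2100 ms \t"
    "Memory Size: 1536 MB\tMax Memory Used: 412 MB",
    "REPORT RequestId: 3c4d\tDuration: 2110.02 ms\tBilled Duration: 2200 ms \t"
    "Memory Size: 1536 MB\tMax Memory Used: 412 MB",
]
durations = billed_durations(sample_lines)
print(statistics.mean(durations), max(durations))
```

Working from the billed duration rather than the measured duration matters because billing rounds up to the next 100ms.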
Memory size testing
We first tried to find the optimal memory size/cost ratio. Adjusting memory size in Lambda is said to cause “proportional CPU power and other resources” to be allocated, so more memory means faster inference. The cost per 100ms of execution time scales linearly with memory size. However, the monthly free tier is measured in GB-seconds, so the smaller the function, the more free seconds you get, and there’s a definite incentive to keep Lambda functions small. We found the best cost/performance ratio was a 2.1 second warm request on a 1536MB Lambda function, which would cost us around $413 for around 8M requests and around $1239 for requests to all 3 endpoints. In the chart below, we show the cost per 175k predictions alongside the execution time. We chose 175k arbitrarily because it fits within the scale of the chart. You can see that smaller sizes are generally cheaper, even though inference is slower, due to the larger volume of free seconds.
Fig 4. Lambda execution time for an Xception model with the cost per 175k requests in cyan.
Inspiration for plot format came from the paper Serving deep learning models in a serverless platform
At more than 2x the price of EC2, it was going to be hard to justify Serverless to our CFO.
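The arithmetic behind the $413 figure can be reproduced with a short sketch. The rate and free-tier constants are assumptions based on the Lambda pricing published in early 2018 (check current pricing before reusing them), the per-request fee is ignored, and `monthly_compute_cost` is a hypothetical helper name.

```python
import math

# Assumed early-2018 Lambda rates; per-request fees are not included.
PRICE_PER_GB_SECOND = 0.00001667
FREE_GB_SECONDS = 400_000
BILLING_INCREMENT_MS = 100


def monthly_compute_cost(requests, duration_ms, memory_mb):
    """Estimate monthly Lambda compute cost in USD for one function."""
    # Execution time is billed in 100ms increments, rounded up.
    billed_ms = math.ceil(duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS
    gb_seconds = requests * (billed_ms / 1000) * (memory_mb / 1024)
    # The free tier is a fixed number of GB-seconds, so it covers more
    # execution seconds on a smaller function.
    return max(0.0, gb_seconds - FREE_GB_SECONDS) * PRICE_PER_GB_SECOND


print(round(monthly_compute_cost(8_000_000, 2100, 1536)))  # 413
```

Multiplying by three for the three endpoints gives the roughly $1239 total, which assumes the free allowance is counted once per model.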
Custom CPU extensions
When Tensorflow starts, it provides a list of advanced vector extensions that your CPU supports but which Tensorflow wasn’t compiled to support:
2018-01-23 14:54:21.061552: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
According to our logs, Lambda CPUs have support for SSE4.2, SSE4.1 and AVX, so we experimented with compiling Tensorflow to support them, as follows:
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git reset --hard v1.4.0
# Assumed step: in Tensorflow 1.x these variables are read by ./configure, not by bazel itself
CC_OPT_FLAGS="-msse4.2 -msse4.1 -mavx" \
TF_NEED_JEMALLOC=1 TF_NEED_GCP=0 TF_NEED_HDFS=0 TF_ENABLE_XLA=0 \
TF_NEED_OPENCL=0 TF_NEED_CUDA=0 TF_NEED_VERBS=0 TF_NEED_GDR=0 TF_NEED_S3=0 TF_NEED_MPI=0 \
./configure
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package --local_resources 2048,.5,1.0
This significantly improved performance, shaving off almost 2 seconds on the smaller instances and 800ms on our target size, down to 1.3 billable seconds. With a 1536MB instance, 8M requests cost about $253 per month for a single model or $760 for all 3 models. Still much more expensive than the EC2 baseline, but definitely heading in the right direction.
Fig 5. Lambda execution time for an Xception model with Tensorflow compiled using CPU extensions
Warning: this could break things badly
There is a huge caveat to this approach. AWS do not guarantee that any CPU extensions will be available on the underlying hardware and they can change any time. See this thread for an idea of what can go wrong. We are willing to wear the risk of an occasional outage to these functions, since it’s mostly backend generated. We plan to have a few different builds of Tensorflow on hand if the extensions change. Obviously this isn’t going to be for everyone.
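One way to choose between builds at start-up is to inspect the CPU flags the container reports. The sketch below is an assumption about how that could be done, not something we have battle-tested: it parses `/proc/cpuinfo`-style text, and the build names are hypothetical placeholders.

```python
# Flag names as they appear in Linux /proc/cpuinfo for SSE4.1, SSE4.2 and AVX.
REQUIRED_FLAGS = {"sse4_1", "sse4_2", "avx"}


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            return set(value.split())
    return set()


def choose_build(cpuinfo_text):
    """Pick the optimised build only when every required extension is present."""
    if REQUIRED_FLAGS <= cpu_flags(cpuinfo_text):
        return "tensorflow-sse4-avx"  # hypothetical name for the optimised build
    return "tensorflow-generic"       # hypothetical name for the fallback build


# Made-up cpuinfo excerpts for illustration; a real function would read
# the contents of /proc/cpuinfo instead.
with_extensions = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_1 sse4_2 avx\n"
without_extensions = "processor\t: 0\nflags\t\t: fpu sse sse2\n"
print(choose_build(with_extensions), choose_build(without_extensions))
```

Falling back to a generic build costs inference time, but it avoids an outage if the underlying hardware changes.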
Midway through last year, Google released their MobileNet architecture, which targets mobile devices and other environments with limited computing resources. Since Lambda is a limited-compute environment, we thought it’d be worth giving it a shot.
We found that MobileNet was able to achieve 93.4% accuracy in training, which is only 0.6% lower than our best results on Xception. We determined this to be a reasonable trade-off for our use case. After deploying our model to Lambda, we got warm inference time down to 400ms on the 1536MB instance. The lowest recorded time was 194ms on a 3008MB instance. Since the model is much smaller, at around 25MB, our cold request time was also significantly lower.
Fig 6. Lambda execution time for a MobileNet model with Tensorflow compiled using CPU extensions
Total cost for 8M requests with this configuration is $73, or $220 for 24M requests across 3 endpoints. Better than our EC2 baseline, but of course, we could probably now use significantly cheaper EC2 instances.
Smaller input sizes
We also considered reducing our image sizes down from 224×224 to speed up inference. However, we found that the performance penalties were significant enough during training to not warrant further investigation, though I will admit we didn’t have time to explore this thoroughly.
We show the cost comparison for 24M requests (8M across 3 models) in Fig 7. As you can see, Lambda can be cheaper than EC2 at this scale, but it requires a sacrifice. I have also included the Reserved Instance price for these 2 instances, which is still cheaper than the cheapest Lambda. Again, we’re willing to pay a little extra if it means less Ops. Lastly, I have included 8 million predictions on 2 popular face-recognition-as-a-service tools, which combine face detection and predictions in a single request, for cost comparison.
Fig 7. Cost comparison for 24M predictions (or 8 million requests on Kairos and AWS Rekognition). Kairos is calculated as 2x the business plan at $3000 per month, which provides 5M predictions.
One other small thing to note: I’m not including cold request time in the Lambda calculations because we expect the majority of our requests to be warm, though I’m not super confident about this assumption.
Key takeaway: image classification at scale on Lambda can be comparable to equivalent EC2 if you’re willing to sacrifice some model accuracy.
In this article, we’ve covered the architectural decisions and performance tuning that made image classification on Lambda work for us. We found that environmental limitations call for multiple small Lambda functions instead of a single large function, and that coordinating calls between functions should be performed outside of Lambda. We also tested a number of library and model configurations and found that compiling Tensorflow to support available CPU extensions and switching to lightweight CNN architectures like MobileNet allowed us to find a price point on par with, and sometimes cheaper than, EC2 instances.
In the next article in the series, we’re going to look at how we went about solving issues around deployment, local development, and testing on Lambda.
 – Rasmus Rothe, Radu Timofte, Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV), 2016
 – Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014
 – M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, 2014.
 – Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Going Deeper with Convolutions. eprint arXiv:1409.4842. 2014
 – Chollet, François. Deep Learning with Depthwise Separable Convolutions. eprint arXiv:1610.02357. 2016
 – Ishakian, Vatche; Muthusamy, Vinod; Slominski, Aleksander. Serving deep learning models in a serverless platform. eprint arXiv:1710.08460. 2017
 – Howard, Andrew G.; Zhu, Menglong; Chen, Bo; Kalenichenko, Dmitry; Wang, Weijun; Weyand, Tobias; Andreetto, Marco; Adam, Hartwig. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. eprint arXiv:1704.04861. 2017