Anomaly Detection on Videos of Rocket 3D Printing
During my PhD research, I did a lot of high-speed imaging. Most of the time, only around 0.1% of the frames, or even fewer, contain an event of interest. Back then, we had to go frame by frame through tens of thousands of frames to find the target ones before we could perform any analysis or modeling on them.
As I gained more experience with machine learning and deep learning algorithms, I realized that my graduate lab and many others could have automated that tedious frame selection process with machine learning. This realization motivated me to work on this 2-week project as an Insight Data Science Fellow: using Deep Learning to detect anomalies from videos of 3D rocket printing.
Relativity Space (RS) is a company based in Los Angeles. It is creating a fully autonomous factory that 3D prints rockets (hence the "rock-et" in rock-et-ral) using the largest metal 3D printing platform in the world. Their approach can build rockets 10x faster at 90% lower cost.
The challenge: automated video anomaly detection
Most of the printing process happens without issue, but every now and then something anomalous occurs that can damage the part or the printer.
The company has identified a few known anomalous modes (shown below). Yet there could be other types of anomalies that have not been captured, or that have not even occurred yet.
To prevent damage once anomalies start to happen, a human operator can monitor a video feed and stop the printer when things don't look right. But this is not only a waste of creative man-hours; it is also hard for human eyes to catch every anomaly instantly.
Therefore, an automated anomaly detection system is desired: one that takes a video stream and halts the printer or sends alerts when anomalous frames are detected, so that it can run live during a print. Ideally, the underlying algorithm should identify all kinds of anomalies, seen and unseen, and its sensitivity should be tunable to whatever false positive rate works best in production.
Solution: a deep learning approach
Given all these considerations, I needed an algorithmic solution that fits the following constraints:
- Inputs are raw videos, with no additional metadata provided
- Huge amounts of data to leverage (TBs of video have been and will be produced by RS)
- Very little labeled data (a highly imbalanced set)
- Future unseen anomaly patterns (a supervised model does not fit)
- High precision requirement (low precision wastes human hours on confirmations and halts)
- Extremely high recall requirement (the printer and part can be in the $M cost range)
- The algorithm must be productionizable in real time (low latency)
These considerations led me to a deep learning based, unsupervised anomaly detection solution. There has been a lot of research applying deep learning models to video problems, and the paper "Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning," by researchers at Harvard University, fits the challenges above particularly well. They showed that an unsupervised (self-supervised) deep net can predict future frames in complex natural image streams, such as car-mounted camera videos. This approach generalizes to other, more interesting scenarios, like cat videos, or, for this project, videos of 3D rocket printing.
The model: a CNN+LSTM based autoencoder
An autoencoder is a deep neural network trained to learn a representation (encoding) of its input and then reconstruct (decode) that representation into an output as close to the input as possible. Strange as it sounds, it is surprisingly effective: for an autoencoder, the input and the target output are the same. Since no labels are involved, it is well suited to unsupervised setups.
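The idea can be sketched in a few lines of Keras. This is a minimal dense example; the layer sizes (64-d input, 8-d code) are made up for illustration and are not the network used in this project:

```python
from tensorflow.keras import layers, models

# Minimal dense autoencoder: compress a 64-d input to an 8-d code,
# then decode it back to 64-d. Sizes are illustrative only.
inp = layers.Input(shape=(64,))
code = layers.Dense(8, activation="relu")(inp)       # encoder
out = layers.Dense(64, activation="sigmoid")(code)   # decoder
autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")

# Note that in training the input doubles as the target:
# autoencoder.fit(x_normal, x_normal, epochs=10, batch_size=32)
```

The `fit(x, x)` call is the whole trick: the network is graded on how well its output matches its own input.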
To use an autoencoder for anomaly detection, we first train the model on "normal" inputs only. Take the white cat below, for example: we want to find the best model setup (structure and parameters) so that we can reconstruct the cat perfectly.
Once the model is trained, we move to the prediction stage. When a new input arrives that is very different from the training examples, the model cannot reconstruct it well. It throws a large reconstruction error, which signals an anomaly in the input data. More intuitively, when you feed a black cat's image through the trained network, the network can only reconstruct a blurry cat.
Following the same logic as the image examples above, in this video anomaly detection setup I stacked my network with convolutional neural nets (CNNs) to handle the high-dimensional image input, and recurrent neural nets (LSTMs) (hence the "ral" in rock-et-ral) to handle the sequence of frames. For the reconstruction error measurement, I simply used the mean squared error between predicted and actual video frames. Below, I walk through the actual modeling process.
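A CNN+LSTM frame predictor of this flavor can be assembled in Keras roughly as follows. This is a small sketch with placeholder layer widths, not the actual 6.9M-parameter network trained for the project:

```python
from tensorflow.keras import layers, models

# Sketch of a CNN+LSTM next-frame predictor. Filter counts and depths
# are placeholders chosen for readability, not the production network.
def build_predictor(seq_len=5, h=256, w=320):
    frames = layers.Input(shape=(seq_len, h, w, 1))          # grayscale frame sequence
    x = layers.TimeDistributed(                              # CNN applied to each frame
        layers.Conv2D(16, 3, strides=2, padding="same",
                      activation="relu"))(frames)
    x = layers.ConvLSTM2D(16, 3, padding="same")(x)          # temporal memory across frames
    nxt = layers.Conv2DTranspose(1, 3, strides=2,            # upsample back to frame size
                                 padding="same",
                                 activation="sigmoid")(x)    # predicted next frame
    model = models.Model(frames, nxt)
    model.compile(optimizer="adam", loss="mse")              # MSE = reconstruction error
    return model
```

The model consumes a short window of past frames and emits a single predicted next frame, with MSE as both the training loss and, later, the anomaly signal.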
The videos are stored on Azure Blob storage with a resolution of 1280 x 1024 at a frame rate of 55 fps. Durations vary from a few seconds to ~2 hours, and the alignment, orientation, and lighting conditions differ across recording dates.
I used OpenCV to realign and resize the videos. My script reads the video data frame by frame, then resizes and downsamples it to a resolution of 320 x 256 at 1 frame per second. The script can easily be adapted to read live video streams and produce data for real-time evaluation.
The deep neural net is written in Keras with a TensorFlow backend. I sampled about 8 hours of normal video across multiple days to train the model. Given the model structure and the input image and sequence setup, the network has a total of 6,909,818 parameters.
Not surprisingly, to power the training of this model, I turned to GPUs on the cloud. Training ran on a single AWS GPU instance with a Tesla GPU and 11 GB of memory, and took roughly 10 hours for 150 epochs.
Prediction and Detection
Now that the model is trained, it's time to make predictions! This time, let's walk through a prediction on a piece of abnormal video.
The setup is that when an image frame arrives at time t_n, the model predicts the next frame, t_{n+1}, based on the frames it has seen before. In the picture below, we can see the actual vs. predicted image frames broken down by time step. At t_0, we ask the model to predict what is going to happen at t_1, based on what it sees at that moment. At t_1, the model predicts what is going to happen at t_2, but this time, in addition to what it sees at t_1, it also uses the memory preserved from t_0 (what you would expect from a stateful computation). This process continues as the video streams.
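The streaming loop above can be sketched independently of the network itself. Here `predict_next` stands in for any trained model; in the test-style example below I use a naive "persistence" predictor (repeat the last frame) purely as an assumption for illustration:

```python
import numpy as np
from collections import deque

def stream_errors(predict_next, frames, seq_len=5):
    # predict_next: any callable mapping a (seq_len, H, W) history array
    # to the next frame; a trained network would play this role.
    history = deque(maxlen=seq_len)   # rolling window of the most recent frames
    errors = []
    for frame in frames:
        if len(history) == seq_len:
            pred = predict_next(np.stack(history))
            errors.append(float(((pred - frame) ** 2).mean()))  # per-frame MSE
        history.append(frame)
    return errors
```

For example, on a sequence of frames that brighten by 1.0 each step, the persistence predictor `lambda h: h[-1]` yields a constant MSE of 1.0 once the history window fills.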
The next step is to compute the reconstruction error between predicted and actual frames. There are many choices for this computation, but mean squared error is one of the most straightforward options that fits the problem, since it amplifies the bigger differences.
In the figure below, you can see the reconstruction error (y-axis) plotted against time (x-axis). Initially, the error is very low: printing has not started, and our model recognizes that nothing is wrong. At time t_1, the error suddenly spikes to a huge value as the model first encounters an anomalous frame, because the prediction it made for t_1 was based on a normal actual frame at t_0. At the next frame, t_2, the error drops to a lower value that is still significantly higher than for normal videos. This valuable behavior comes from the fact that the model never saw videos of these abnormal sparks during training, and so cannot predict the next frame accurately. The reconstruction error therefore stays high, exactly as in the black cat example earlier. Finally, we can see that a simple threshold can be drawn to distinguish normal and anomalous frames perfectly.
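One simple way to place such a threshold, sketched here as an assumption rather than the project's exact rule, is to flag any frame whose error exceeds the normal-video mean by k standard deviations, where k tunes the false positive rate:

```python
import numpy as np

def fit_threshold(normal_errors, k=3.0):
    # hypothetical rule: threshold = mean + k * std of errors measured
    # on normal videos only; larger k means fewer false positives
    errors = np.asarray(normal_errors, dtype=float)
    return errors.mean() + k * errors.std()

def flag_anomalies(errors, threshold):
    # boolean mask: True for frames whose error crosses the threshold
    return np.asarray(errors, dtype=float) > threshold
```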
Anomaly detection in action!
Now let's see the algorithm in action! In the demo below, the top-left video is the real incoming stream, the right-hand side is the video predicted by the model, and the bottom is the time series of reconstruction error, in sync with the video frames. Despite highs and lows, a moving average of the error stays consistently above the simple threshold line we drew.
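The moving average itself is a one-liner over the error series; smoothing keeps a single noisy dip or spike from flipping the alarm on or off:

```python
import numpy as np

def moving_average(errors, window=5):
    # uniform moving average over the per-frame error series;
    # "valid" mode returns only positions with a full window
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(errors, dtype=float), kernel, mode="valid")
```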
Performance of the model
So far, we have visually confirmed that the model detects anomalies in the example I demonstrated. But how does it perform more broadly? I manually labelled 100 videos (50 positives and 50 negatives) and plotted the precision-recall curve: 100% recall at 90% precision, which looks very promising.
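Tracing such a curve amounts to computing precision and recall at each candidate threshold over the per-video anomaly scores. A minimal sketch, using only NumPy and a toy label/score layout of my own:

```python
import numpy as np

def precision_recall(labels, scores, threshold):
    # precision/recall at a single threshold; sweeping the threshold
    # over all observed score values traces the precision-recall curve
    labels = np.asarray(labels)
    preds = np.asarray(scores, dtype=float) >= threshold
    tp = np.sum(preds & (labels == 1))     # anomalies correctly flagged
    fp = np.sum(preds & (labels == 0))     # normal videos falsely flagged
    fn = np.sum(~preds & (labels == 1))    # anomalies missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)
```

scikit-learn's `precision_recall_curve` does the full threshold sweep in one call, but the hand-rolled version makes the bookkeeping explicit.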
When more labelled data is available, we can be more confident about the model's performance on a set that follows the true distribution. This can easily be realized by sampling from the model's different prediction-result bins and having a human expert label the videos.
In this project, I built an unsupervised deep learning framework that detects anomalies in videos of a 3D printing process. The model, constructed from a CNN+RNN (LSTM), predicts future frames from input image sequences. I used the reconstruction error to signal anomalies and achieved 90% precision at 100% recall on a 50/50 set. Relativity Space is very excited about the results, and we are in the process of discussing platforms for productionization and deployment.
Finally, I want to thank Ed Mehr, my contact at Relativity Space, who was very responsive to my requests and provided much useful information that helped me better understand the problem. I am also very grateful to Alex Gude, data science advisor at Relativity Space, who provided tremendously helpful advice during different stages of this project. It has been a great pleasure interacting with and learning from them. I look forward to seeing the model used in real rocket-printing production!