In this post, you will learn how easily you can add new deep learning training Workflows in Onepanel and use them directly from CVAT to train models on annotated data. We will use a recently released model from Facebook Research, DEtection TRansformer (DETR), as an example.
Object detection is one of the most widely used computer vision tasks. Consequently, there is a lot of research going on in this area, and we have some really good models for the task. One of the most popular models for object detection is Faster R-CNN, which builds on R-CNN and Fast R-CNN to further improve speed and accuracy. However, Faster R-CNN consists of multiple components, so it is not an end-to-end model. By now, it has been well established that end-to-end models make a lot of things easier, and they have led to some great results in NLP tasks such as machine translation. In this blog, we are going to see how DEtection TRansformer, a recently introduced model from Facebook Research, provides a great alternative to Faster R-CNN as an end-to-end model and how you can use it on Onepanel with just a few clicks.
The DETR paper isn't the first one to propose an end-to-end model for object detection. Previous approaches used sequential models such as recurrent neural networks to predict bounding boxes, but the results weren't on par with state-of-the-art models. Of course, we could use a conventional fully connected network if the number of boxes were fixed, but that usually isn't the case. To address the issue of permutation invariance (i.e., predicted boxes can be in any order), these approaches used a bipartite matching loss.
DETR uses a bipartite matching loss as well, but turns to Transformers instead of recurrent neural networks. The image below shows the architecture of DEtection TRansformer.
As the paper mentions, DETR views object detection as a direct set prediction problem. Here is a brief summary of how it works.
The first step is straightforward. They use ResNet-50 or ResNet-101 pre-trained on ImageNet to generate the feature maps. This can be achieved in a few lines of code using torchvision. Since a detailed explanation of the Transformer is beyond the scope of this blog, the following sections attempt to explain it only briefly. For more information, check out this excellent blog post by Jay Alammar on the Transformer.
The Transformer is an encoder-decoder architecture that leverages self-attention layers to gather information from the whole sequence. Transformers have gained a lot of popularity lately and are used in many state-of-the-art models for NLP tasks such as machine translation.
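To make "gathering information from the whole sequence" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: a real Transformer layer first applies learned query, key, and value projections and uses several attention heads, both of which are left out here, and the function names are made up for this illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Simplified: no learned projections and a single head (assumption).
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Three toy "token" vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention = self_attention(tokens, tokens, tokens)
```

Every output vector is a mixture of all input vectors, which is how each position gets to see the whole sequence in a single step.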
For DETR, the encoder takes the feature map combined with a positional encoding as input. Positional encoding tells the Transformer where each element sits in the original input. For example, in machine translation it is important to know where the words "San Francisco" appear in the input sequence "San Francisco is in California". Unlike an RNN, the Transformer does not consume words sequentially and hence does not inherently know their order. Typically, positional encoding is achieved by summing the word embedding and a positional embedding. Positional embeddings can be generated from pairs of sines and cosines at different frequencies, evaluated at each position. This post provides a great explanation of positional encoding in Transformers.
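As a rough illustration of the sine and cosine scheme, the sketch below builds the 1D sinusoidal table from the original Transformer formulation and adds it to some toy word embeddings. The function name, the sizes, and the embedding values are assumptions chosen for the example; DETR itself works on 2D feature maps, so this only shows the sequence version described above.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of positional encodings.

    Even columns hold sin(pos / 10000^(i / d_model)) and the following
    odd column holds the matching cosine, where i is the even column index.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row[:d_model])  # trim if d_model is odd
    return table

# Toy embeddings for a 5-word sentence (values are arbitrary placeholders).
word_embeddings = [[0.1 * (p + 1)] * 4 for p in range(5)]
pe = sinusoidal_encoding(num_positions=5, d_model=4)

# Positional encoding is applied by element-wise summation.
encoder_input = [[w + p for w, p in zip(word_row, pe_row)]
                 for word_row, pe_row in zip(word_embeddings, pe)]
```

Because every position gets a distinct pattern of values, two identical words at different positions end up with different input vectors.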
Coming back to the encoder, DETR first uses a 1 x 1 convolutional kernel to reduce the channel dimension of the feature map. With a 1 x 1 kernel we can control the number of feature maps (the depth) without changing the spatial size of the input. Then the input is flattened and passed through multiple encoder layers. Here, the encoder layers follow the standard structure: each has a self-attention layer followed by a feed-forward network.
The decoding part is also very similar to the standard architecture, with the major difference being parallel decoding. DETR decodes, say, N objects in parallel, in contrast to the standard approach where a model such as an RNN makes predictions one time step at a time. Each decoder layer has a self-attention layer, an encoder-decoder attention layer, and a feed-forward network. The encoder-decoder attention helps the decoder focus on the relevant parts of the input.
Finally, the output is computed by a feed-forward network (FFN) and a linear projection layer. The FFN outputs the normalized center coordinates, height, and width of a bounding box, whereas the linear projection layer predicts the class label using a softmax function.
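The effect of a 1 x 1 convolution followed by flattening can be sketched with nested lists. This is only an illustration, assuming a tiny 4-channel, 2 x 3 feature map and a hand-written weight matrix; in the real model the weights are learned and the shapes are much larger.

```python
def conv1x1(feature_map, weight):
    """Apply a 1 x 1 convolution (no bias).

    feature_map: [C_in][H][W], weight: [C_out][C_in].
    Each output pixel is a linear mix of the input channels at that pixel,
    so H and W stay the same and only the depth changes.
    """
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(weight[o][c] * feature_map[c][y][x] for c in range(c_in))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(weight))]

def flatten_hw(feature_map):
    """Turn [C][H][W] into a sequence of H*W vectors of length C."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]

# Toy feature map: 4 channels, height 2, width 3.
features = [[[c * 10 + y * 3 + x for x in range(3)] for y in range(2)]
            for c in range(4)]
# Hypothetical weights reducing 4 channels to 2.
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.5, 0.5, 0.0]]

reduced = conv1x1(features, weight)   # shape [2][2][3]
sequence = flatten_hw(reduced)        # 6 vectors of length 2
```

The flattened sequence is what the encoder layers consume: one vector per spatial location, each of the reduced depth.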
An important thing to note here is that the model predicts a fixed-size set of N objects, where N is much larger than the number of objects in the ground-truth data, so the authors use a special class to represent "no object was detected in this slot".
Since the number of predicted objects is much larger than the number of objects in the ground-truth data, they pad the ground-truth set with nulls representing "no object". Using a pair-wise matching cost, predicted boxes are then matched one-to-one with target boxes such that the total cost is minimal. As the authors say, this process plays a similar role to matching anchors to ground-truth objects in models such as SSD.
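Here is a small sketch of how the raw outputs for one prediction slot could be turned into a readable detection. The helper names, the class list, the logits, and the image size are all assumptions for illustration; the only things taken from the description above are the normalized (center x, center y, width, height) box format, the softmax over classes, and the extra "no object" class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def box_cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel corner coordinates."""
    cx, cy, w, h = box
    return ((cx - w / 2) * image_width, (cy - h / 2) * image_height,
            (cx + w / 2) * image_width, (cy + h / 2) * image_height)

def decode_slot(class_logits, box, class_names, image_width, image_height):
    """Pick the most likely class for a slot and convert its box to pixels."""
    probs = softmax(class_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    corners = box_cxcywh_to_xyxy(box, image_width, image_height)
    return class_names[best], probs[best], corners

# Hypothetical label set; the last entry is the special "no object" class.
class_names = ["cat", "dog", "no object"]

label, score, corners = decode_slot(
    class_logits=[2.0, 0.5, 0.1],
    box=(0.5, 0.5, 0.2, 0.4),
    class_names=class_names,
    image_width=100,
    image_height=200,
)
empty_label, _, _ = decode_slot([0.1, 0.2, 3.0], (0.5, 0.5, 0.1, 0.1),
                                class_names, 100, 200)
```

At inference time, slots whose most likely class is "no object" would simply be dropped.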
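The padding and matching step can be sketched as follows. To keep the example dependency-free it tries every permutation, which only works for a handful of slots; in practice an assignment solver such as SciPy's `linear_sum_assignment` is the usual choice. The pair-wise cost here (negative class probability plus L1 box distance, and zero for padded slots) is a simplified stand-in, and all names and numbers are made up for illustration.

```python
from itertools import permutations

NO_OBJECT = None  # padding entry for unused ground-truth slots

def pair_cost(prediction, target):
    """Simplified matching cost between one prediction and one target."""
    if target is NO_OBJECT:
        return 0.0  # padded slots add a constant, so they do not bias matching
    label, box = target
    l1 = sum(abs(p - t) for p, t in zip(prediction["box"], box))
    return -prediction["probs"][label] + l1

def match(predictions, targets):
    """Brute-force minimum-cost one-to-one assignment (small N only)."""
    n = len(predictions)
    padded = list(targets) + [NO_OBJECT] * (n - len(targets))
    cost = [[pair_cost(p, t) for t in padded] for p in predictions]
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total, padded

# N = 3 prediction slots, but only 2 ground-truth objects.
predictions = [
    {"probs": {"cat": 0.10, "dog": 0.80}, "box": (0.7, 0.7, 0.2, 0.2)},
    {"probs": {"cat": 0.90, "dog": 0.05}, "box": (0.3, 0.3, 0.2, 0.2)},
    {"probs": {"cat": 0.05, "dog": 0.05}, "box": (0.5, 0.5, 0.1, 0.1)},
]
targets = [("cat", (0.3, 0.3, 0.2, 0.2)), ("dog", (0.7, 0.7, 0.2, 0.2))]

assignment, total_cost, padded_targets = match(predictions, targets)
# assignment[i] is the index of the padded target matched to prediction i.
```

In this toy case the first prediction is matched to the dog, the second to the cat, and the third to the padded "no object" slot.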
The loss function used here combines a negative log-likelihood for the class label with a box loss. For the box loss, a combination of L1 loss and generalized IoU loss is used: L1 loss alone grows with box size, so the IoU term keeps the loss comparable across small and big boxes.
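To see why an IoU term helps with scale, here is a minimal sketch of a box loss that mixes L1 distance with a generalized IoU (GIoU) term, the IoU variant described in the DETR paper. The function names and the default weights of 1.0 are placeholders for this example, and the code assumes boxes in normalized (cx, cy, w, h) format with positive area.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def generalized_iou(box_a, box_b):
    """GIoU of two boxes with positive area; the result lies in (-1, 1]."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    # Smallest axis-aligned box that encloses both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return intersection / union - (hull - union) / hull

def box_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Weighted sum of L1 distance and (1 - GIoU); weights are placeholders."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - generalized_iou(pred, target))

# The same relative offset at two different scales.
small_pred, small_target = (0.20, 0.2, 0.1, 0.1), (0.25, 0.2, 0.1, 0.1)
large_pred, large_target = (0.40, 0.4, 0.2, 0.2), (0.50, 0.4, 0.2, 0.2)

l1_small = sum(abs(p - t) for p, t in zip(small_pred, small_target))
l1_large = sum(abs(p - t) for p, t in zip(large_pred, large_target))
giou_small = generalized_iou(small_pred, small_target)
giou_large = generalized_iou(large_pred, large_target)
perfect = box_loss(large_target, large_target)
```

The L1 distance doubles when the boxes double in size, while the GIoU value stays the same, which is the scale behaviour the combined loss is after.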
Now that you have some understanding of how DEtection TRansformer works, you can follow these instructions in our documentation to add and train this model on your own annotated data.