Introduction
DETR (DEtection TRansformer) is a method that solves the task of object detection with a transformer architecture. [1] It is one of the most popular end-to-end object detection methods.
What it is. Unlike traditional computer vision techniques, DETR approaches object detection as a direct set prediction problem. Its main ingredients are a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. Because the predictions are made in parallel rather than one at a time, decoding is simple and efficient.
End-to-end detection
End-to-end detection vs non end-to-end [2]
In object detection research, there are two deep learning-based approaches: the two-stage detector and the one-stage detector. They differ in their object detection processes.
- The two-stage detector involves multiple stages: it first generates region proposals in the image and then performs classification on each region proposal. Methods that fall under the two-stage category include Fast R-CNN, Faster R-CNN, R-FCN, FPN, and Cascade R-CNN.
- The one-stage detector directly performs classification and bounding box regression without generating region proposals, and later applies Non-Maximum Suppression (NMS) to remove duplicate prediction boxes. Methods categorized as one-stage detectors include YOLO, SSD, and RetinaNet.
What it is. End-to-end object detection means that the detection pipeline has no non-differentiable components, e.g., NMS. The input of the network is the image, and the output is a direct prediction of a classification over object categories (or background) and a box regression for each object. The whole network is trained in an end-to-end manner with back-propagation. [2]
How does it work?
An end-to-end detector generates a set of predictions (classification and box regression) in parallel. A matching cost is then computed between each prediction and each ground truth (target class, target box), and the final set of predictions is the assignment with the minimum total cost.
What are the blue and orange points in the figure above? In end-to-end detection this is called positive sample assignment. End-to-end detectors apply one-to-one assignment during training: for each ground-truth box, only the one sample with the minimum matching cost is assigned as the positive sample; all others are negative samples. The positive sample is usually selected by bipartite matching (Kuhn, 1955) to avoid sample conflict, i.e., the case where two ground-truth boxes share the same positive sample. [2]
Matching cost [2]
End-to-end detection uses two matching costs:
Location cost
\[C_{loc} = λ_{iou} \cdot C_{iou} + λ_{L1} \cdot C_{L1}\]- where $C_{L1}$ and $C_{iou}$ are the L1 cost and IoU cost between a sample and a ground-truth box, respectively.
- $λ_{L1}$ and $λ_{iou}$ are coefficients
Classification cost
\[C = λ_{cls} \cdot C_{cls} + C_{loc}\]- where $C_{cls}$ is the classification cost between the predicted classification and the ground-truth category label, and $C_{loc}$ is defined in the previous equation.
- $λ_{cls}$ is a coefficient.
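To make these formulas concrete, here is a minimal sketch (my own illustration, not code from [2]) that computes the combined cost for a single prediction/ground-truth pair. The λ values and box coordinates are made-up examples, and the IoU term is taken as negative IoU (one common choice; the exact form varies between papers).

```python
import torch
from torchvision.ops import box_iou

def cxcywh_to_xyxy(b):
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

# Assumed coefficients and boxes, for illustration only.
lambda_cls, lambda_iou, lambda_l1 = 2.0, 2.0, 5.0

pred_prob = torch.tensor(0.8)                         # predicted probability of the GT class
pred_box  = torch.tensor([[0.50, 0.50, 0.20, 0.20]])  # (cx, cy, w, h), normalized
gt_box    = torch.tensor([[0.52, 0.48, 0.22, 0.18]])

c_cls = -pred_prob                                    # higher confidence -> lower cost
c_l1  = torch.abs(pred_box - gt_box).sum()            # L1 cost between the boxes
c_iou = -box_iou(cxcywh_to_xyxy(pred_box),            # negative IoU: more overlap -> lower cost
                 cxcywh_to_xyxy(gt_box)).squeeze()

c_loc = lambda_iou * c_iou + lambda_l1 * c_l1         # location cost
c     = lambda_cls * c_cls + c_loc                    # total matching cost
```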
Explain like I'm 10
You have a 5x5-pixel image, and every pixel has information (in this case, an embedding) that stores pixel data like edges, colors, etc. The embeddings are generated by the model and have size 512.
Analogy: it is like searching for a song on YouTube. You type the song name into the search box (query) and YouTube gives you results (values) based on video details like title and description (keys).
Similarly, end-to-end detection uses object queries generated by the model (query) to find objects in the image (value) based on the image information in the embeddings (key). What if there is more than one object in the image? Bipartite matching is used to assign each prediction to an object by computing a matching cost.
Transformer
I will not cover the details of the transformer architecture, but I will explain what it is and summarize how it works.
The Transformer is a self-attention network with a multi-head attention mechanism. [3] Transformers gained significant popularity in the field of natural language processing (NLP). When you hear about ChatGPT, it is Transformer-based.
The Transformer model architecture is based on the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when making predictions. Unlike previous sequence-based models, such as recurrent neural networks (RNNs), the Transformer does not rely on sequential processing and can parallelize computations, making it more efficient.
Key components
Transformer Architecture [3]
Components
- Encoder: The input sequence is passed through a stack of identical encoder layers. Each encoder layer consists of two sub-layers: multi-head self-attention mechanism and a position-wise feed-forward neural network.
- Decoder: The output sequence is generated by passing the target sequence through a stack of identical decoder layers. In addition to the two sub-layers present in the encoder, the decoder also includes a third sub-layer that performs multi-head attention over the encoder’s output.
- Positional Encoding: Since the Transformer model does not have any inherent notion of word order, positional encoding is used to provide the model with positional information about the words in the input sequence. Example: "I eat fried chicken" != "fried chicken eat I".
- Attention Mechanism: The self-attention mechanism allows the model to attend to different words in the input sequence when making predictions. It calculates attention weights based on the similarity between each word and other words in the sequence.
- Feed-Forward Neural Network: The position-wise feed-forward neural network applies a set of fully connected layers to each position in the sequence independently.
How Transformer works
I’ll explain how a Transformer works in a simplified way.
Imagine you have a sentence like “I love playing soccer.” A Transformer is like a smart machine that can understand and generate sentences. It does this by breaking down the sentence into smaller parts called “tokens.” In this case, the tokens would be “I,” “love,” “playing,” and “soccer.”
The Transformer has two main parts called the encoder and the decoder. The encoder’s job is to understand the input sentence, while the decoder’s job is to generate a new sentence based on that understanding.
In the encoder, each token is processed one by one. The Transformer looks at each token and pays attention to the other tokens in the sentence. It figures out which tokens are important for understanding the meaning of each token. For example, it recognizes that the word “love” is related to the words “I” and “playing.”
Once the encoder has processed all the tokens, it passes the information to the decoder. The decoder then uses that information to generate a new sentence. It starts with a special token called “start,” and then it generates the next token based on what it has learned from the encoder. It continues generating tokens one by one until it reaches a special token called “end,” which marks the end of the generated sentence.
The Transformer learns how to do this by training on a lot of example sentences. It adjusts its internal parameters based on how well it can understand and generate sentences. This training process helps the Transformer become better at understanding and generating sentences over time.
In Summary
A Transformer is a machine learning model that can understand and generate sentences. It breaks down sentences into tokens, uses an encoder to understand the input, and a decoder to generate new sentences based on that understanding. It learns how to do this by training on lots of example sentences.
References
Jay Alammar explains the Transformer with great visualizations here: jalammar.github.io/illustrated-transformer/
How Attention works in Transformer
I will explain attention with an example adapted from Jay Alammar, but with some customization.
Step 1
- There are 2 input words, "Pembelajaran" and "Mesin" ("machine learning" in Indonesian)
- Every word has an embedding $X$, so "Pembelajaran" has shape (1x4) and "Mesin" has the same shape. The Transformer default is $d_{model}=512$; in this case, $d_{model}=4$
- Multiply the embedding with randomly initialized weight matrices:
- Multiply the embedding with $W^Q$ to produce $Q$ (query)
- Multiply the embedding with $W^K$ to produce $K$ (key)
- Multiply the embedding with $W^V$ to produce $V$ (value)
Step 2
"Step 2" in the image shows how the score in "Step 3" is computed: $q_1$ (1x3) $\cdot$ $k_1$ (3x1) = $score$ (1x1)
Step 3
\[\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax} \left(\frac{QK^T}{\sqrt{d_k}} \right) V\]- Apply scaled dot-product attention to the word "Pembelajaran":
- Score = $q_n \cdot k_n$ (dot product)
- Divide by $\sqrt{d_k}$, where $d_k = \frac{d_{model}}{h}$; with the default values $d_{model}=512$ and $h=8$, $d_k=64$
- Apply softmax to the result
- Multiply with $V$
Why multiply the softmax output with $V$? $V$ holds the values. In this case, for the word "Pembelajaran", the model puts the most attention on $V_1$, i.e., on itself, just like the "Explain like I'm 10" analogy above.
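Here is a minimal PyTorch sketch of Steps 1-3 with the toy dimensions above (2 words, $d_{model}=4$, $d_k=3$). The weights are random placeholders, not the values shown in the figures.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_k = 4, 3

X = torch.randn(2, d_model)          # embeddings for "Pembelajaran" and "Mesin"
W_Q = torch.randn(d_model, d_k)      # randomly initialized projection weights
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # each is (2, 3)

scores = Q @ K.T / d_k ** 0.5        # (2, 2) dot-product scores, scaled by sqrt(d_k)
weights = F.softmax(scores, dim=-1)  # attention weights for each word
Z = weights @ V                      # (2, 3) weighted sum of the values
```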
Step 4
- Step 4 is the result of $head_1$, that is, $Z_0$.
- The Transformer has 8 heads.
- Repeat Steps 1-3 with a different set of weights $W$ in every head
Step 5
- After every head computes its $Z$ (2x3), concatenate all 8 heads so we have $Z$ (2x24)
Step 6
- Multiply $Z$ with $W_0$ (24x4)
- So the model gets back the input size (2x4)
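And a sketch of Steps 4-6, continuing the snippet above: each of the 8 heads repeats Steps 1-3 with its own weights, the results are concatenated, and the concatenation is projected back to $d_{model}$ with $W_O$ (again, random weights, only to show the shapes).

```python
# Continuing the sketch above: 8 heads, each producing a (2, 3) output.
h = 8
heads = []
for _ in range(h):
    W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))  # fresh weights per head
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    Z_i = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V
    heads.append(Z_i)

Z_cat = torch.cat(heads, dim=-1)     # (2, 24): all heads concatenated
W_O = torch.randn(h * d_k, d_model)  # (24, 4)
Z_out = Z_cat @ W_O                  # (2, 4): back to the model dimension
```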
DETR
How Does DETR Work?
If you want to see the workflow of DETR explained in English, you can watch this YouTube video from Nicolas Carion (one of the authors of DETR).
Architecture
DETR (DEtection TRansformer) has a simple pipeline for detecting objects in an image. DETR predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, like spatial anchors or non-maximum suppression. Unlike most existing detection methods, DETR doesn't require any customized layers and thus can be reproduced easily in any framework that provides standard CNN and transformer classes.
Training
- Pass the image through a CNN backbone to get feature maps
- Pass the features through the Transformer encoder-decoder and the FFN heads (classifier and bounding box) to get a set of N predictions (class, bounding box)
- Apply a set loss function that performs bipartite matching between predicted and ground-truth objects to penalize false or extra detections (a quick way to inspect the resulting N predictions is sketched below)
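To see this output format in practice, the sketch below loads a pretrained model through torch.hub, assuming the hub entry point published in the official facebookresearch/detr repository, and prints the shapes of the N = 100 predictions. The dummy input is only meant to show shapes; a real image should be resized and ImageNet-normalized first.

```python
import torch

# Assumes the torch.hub entry point from the official facebookresearch/detr repo.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

img = torch.rand(1, 3, 800, 1066)   # dummy batch; real images should be ImageNet-normalized
with torch.no_grad():
    out = model(img)

print(out['pred_logits'].shape)     # (1, 100, 92): N=100 queries, COCO classes + "no object"
print(out['pred_boxes'].shape)      # (1, 100, 4): normalized (cx, cy, w, h)
```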
References
Aman Arora covers the details of DETR here: amaarora.github.io/posts/2021-07-26-annotateddetr.html. Below, I give a simplified explanation of the DETR architecture and process.
Input Data
- A batch of images is trained at once: $[(image, image), (anno, anno)]$
- The input images are batched together, applying 0-padding adequately to ensure they all have the same dimensions ($H_0$, $W_0$) as the largest image of the batch, and saved in a NestedTensor. The "Annotated DETR" post explains how NestedTensor works. [4]
- So, the shapes of the input will be:
- Tensors $(batch, 3, H_0, W_0)$
- Mask $(batch, H_0, W_0)$, storing $True$ for padded positions and $False$ for real pixels
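Here is a minimal sketch of how such a padded batch and mask can be built. It is a simplified stand-in for DETR's NestedTensor, not the actual implementation (see [4] for that); the image sizes are assumed examples.

```python
import torch

# Simplified stand-in for DETR's NestedTensor: pad to the largest image, keep a mask.
images = [torch.rand(3, 480, 640), torch.rand(3, 512, 500)]

H0 = max(img.shape[1] for img in images)
W0 = max(img.shape[2] for img in images)

tensors = torch.zeros(len(images), 3, H0, W0)
mask = torch.ones(len(images), H0, W0, dtype=torch.bool)   # True = padded position

for i, img in enumerate(images):
    _, h, w = img.shape
    tensors[i, :, :h, :w] = img
    mask[i, :h, :w] = False                                 # False = real pixels

print(tensors.shape, mask.shape)    # (2, 3, 512, 640), (2, 512, 640)
```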
CNN Backbone
- The paper uses ResNet-50 as the backbone.
- DETR can use any CNN backbone
- The output shapes will be:
- $f ∈ R^{C×H×W}$, with $C=2048$ and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$
- Tensors $(batch, 2048, h_0, w_0)$
- Mask $(batch, h_0, w_0)$, downsampled to the feature-map resolution
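A sketch of the backbone step, using torchvision's ResNet-50 with its average-pool and classification head removed so it acts as a feature extractor; the input size is an assumed example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet-50 without the average-pool and classification head, used as a feature extractor
# (pretrained weights can be loaded if desired; shapes are the point here).
backbone = nn.Sequential(*list(resnet50().children())[:-2])

x = torch.rand(2, 3, 800, 1066)     # padded batch (batch, 3, H0, W0)
f = backbone(x)                     # (batch, 2048, ~H0/32, ~W0/32)
print(f.shape)                      # torch.Size([2, 2048, 25, 34])
```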
Positional Encoding
- A great explanation of positional encoding is given in [4]
- The positional encoding produces an output of shape
- $pos ∈ R^{256×H×W}$
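For reference, here is a simplified version of the 2D sine positional encoding described in [4] (normalization and some details omitted). It maps a padding mask of shape $(batch, h_0, w_0)$ to positional features of shape $(batch, 256, h_0, w_0)$.

```python
import torch

def sine_position_encoding(mask, num_pos_feats=128, temperature=10000):
    """Simplified 2D sine positional encoding (normalization omitted)."""
    not_mask = ~mask                                        # valid (non-padded) positions
    y_embed = not_mask.cumsum(1, dtype=torch.float32)       # row index of each position
    x_embed = not_mask.cumsum(2, dtype=torch.float32)       # column index of each position

    dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / num_pos_feats)

    pos_x = x_embed[:, :, :, None] / dim_t
    pos_y = y_embed[:, :, :, None] / dim_t
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=4).flatten(3)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=4).flatten(3)
    return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)   # (batch, 2*num_pos_feats, h, w)

mask = torch.zeros(2, 25, 34, dtype=torch.bool)   # toy mask with no padding
pos = sine_position_encoding(mask)
print(pos.shape)                                  # torch.Size([2, 256, 25, 34])
```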
Transformer
DETR uses the same transformer architecture as the "Attention Is All You Need" paper.
Transformer Encoder
- Before the input is passed to the encoder:
- First, a 1x1 convolution reduces the channel dimension of the high-level activation map $f$ from $C$ to a smaller dimension $d$ creating a new feature map $z_0 ∈ R^{d×H×W}$.
- Here, $d = 256$; therefore, the new feature map becomes $z_0 ∈ R^{256×H×W}$, which matches the shape of the positional encodings.
- The encoder expects a sequence as input, hence we collapse the spatial dimensions of $z_0$ into one dimension, resulting in a $d \times HW$ feature map.
- Each encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed forward network (FFN).
- Input shape:
- $HW \times batch \times d$
- Output shape:
- $HW \times batch \times d$
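A sketch of the encoder step using PyTorch's standard nn.TransformerEncoder. The real DETR implementation also adds the positional encodings inside every attention layer, which is glossed over here; the shapes are the point.

```python
import torch
import torch.nn as nn

batch, C, d = 2, 2048, 256
h0, w0 = 25, 34

f = torch.rand(batch, C, h0, w0)              # backbone feature map
input_proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv: 2048 -> 256 channels
z0 = input_proj(f)                            # (batch, 256, h0, w0)

src = z0.flatten(2).permute(2, 0, 1)          # (HW, batch, d): the encoder input sequence

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
memory = encoder(src)                         # (HW, batch, d)
print(memory.shape)                           # torch.Size([850, 2, 256])
```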
Object Queries
- Object queries encode where objects may be located in the image (see the "Explain like I'm 10" analogy earlier, and the sketch after this list)
- The N object queries are decoded into N possible predictions (class, bounding box) after the decoder
- Object queries are generated by nn.Embedding
- In the paper $N=100$, i.e., the maximum number of detected objects is 100
- Input shape:
- $N \times batch \times d$
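A minimal sketch of how the object queries can be created with nn.Embedding and broadcast over the batch.

```python
import torch
import torch.nn as nn

N, d, batch = 100, 256, 2

query_embed = nn.Embedding(N, d)                               # learned object queries
queries = query_embed.weight.unsqueeze(1).repeat(1, batch, 1)  # (N, batch, d)
print(queries.shape)                                           # torch.Size([100, 2, 256])
```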
Transformer Decoder
- DETR uses the object queries as input to the decoder
- The DETR decoder decodes all N queries in parallel, unlike the original transformer decoder, which is autoregressive
- The decoder takes 2 inputs
- The first input is the object queries
- Input shape:
- $N \times batch \times d$
- The second input is the output of the encoder, attended to by the multi-head (cross-)attention
- Input shape:
- $HW \times batch \times d$
- Output shape:
- $N \times batch \times d$
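A sketch of the decoder step with PyTorch's standard nn.TransformerDecoder. The real DETR decoder also injects the query and positional embeddings inside attention and returns intermediate layer outputs, which is omitted here.

```python
import torch
import torch.nn as nn

N, d, batch, HW = 100, 256, 2, 850

queries = torch.zeros(N, batch, d)   # object queries (target sequence), see the sketch above
memory = torch.rand(HW, batch, d)    # encoder output

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

hs = decoder(tgt=queries, memory=memory)   # self-attention over queries, cross-attention over memory
print(hs.shape)                            # torch.Size([100, 2, 256])
```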
Explain like I'm 10
Why decoder need 2 input?
Analogy: when preparing for an exam tomorrow, it is important to follow a structured study plan. You learn about topic A first, then topic B, and so on (this is self-attention). After all topics are covered, you can move on to quiz exercises that tie topics A-C together (multi-head cross-attention).
In the same way, the decoder needs to learn where objects could be located, i.e., the object queries (self-attention), and the correlation of the object queries with the output of the encoder (multi-head cross-attention).
FFN Classifier & Bounding Box
- The output from the decoder goes to the FFN classifier and bounding-box heads
- Output FFN Classifier shape:
- $batch \times N \times (total class+1)$
- Output Bounding Box shape:
- $batch \times N \times 4$
- Bounding box = $C_x, C_y, H, W$ (normalized center coordinates and box height/width)
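A sketch of the two heads; the paper uses a 3-layer MLP for the box head, written here with nn.Sequential for brevity, and the class count is an assumed example.

```python
import torch
import torch.nn as nn

batch, N, d, num_classes = 2, 100, 256, 91

hs = torch.rand(batch, N, d)               # decoder output, rearranged to (batch, N, d)

class_head = nn.Linear(d, num_classes + 1) # +1 for the "no object" class
bbox_head = nn.Sequential(                 # 3-layer MLP for box regression
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)

pred_logits = class_head(hs)               # (batch, N, num_classes + 1)
pred_boxes = bbox_head(hs).sigmoid()       # (batch, N, 4), normalized box coordinates
print(pred_logits.shape, pred_boxes.shape)
```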
Loss Function
The loss function in DETR is fairly simple. There are 2 steps:
- Calculate the best match of predictions with respect to given ground truths using a graph technique (bipartite matching) with a cost function.
- Define a loss to penalize the class and box predictions from the best match (minimum cost from bipartite matching).
What is bipartite matching?
From GeeksForGeeks,
- A matching in a Bipartite Graph is a set of the edges chosen in such a way that no two edges share an endpoint. A maximum matching is a matching of maximum size (maximum number of edges).
- In a maximum matching, if any edge is added to it, it is no longer a matching. There can be more than one maximum matching for a given Bipartite Graph.
Note
In DETR, we have to find the best set of predicted boxes (class, bounding box) for the given ground truths. This part eliminates the need for Non-Maximum Suppression (NMS) and the anchors used in other object detectors today.
Example of Bipartite Matching
This is an example of bipartite matching: finding a worker-job assignment based on the cost of each assignment.
This is an example of bipartite matching in the object detection case: how we can match predictions to ground truths.
Bipartite Matching in object detection
But how do we calculate the matching cost in DETR?
Bipartite Matching
Find the best-matching permutation (lowest total cost):
\[\hat{\sigma}=\underset{\sigma \in \mathfrak{S}_N}{\arg \min } \sum_i^N \mathcal{L}_{\operatorname{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)\]- $\hat{\sigma}$ is the best match between predictions and ground truths (lowest cost)
- $y_i$ is the set of ground truths (GT)
- $\hat{y}_{\sigma(i)}$ is the set of predictions
- $N$ is the number of predictions
Matching cost function
\[-\mathbb{1}_{\left\{c_i \neq \varnothing\right\}} \hat{p}_{\sigma(i)}\left(c_i\right)+\mathbb{1}_{\left\{c_i \neq \varnothing\right\}} \mathcal{L}_{\text {box }}\left(b_i, \hat{b}_{\sigma(i)}\right)\]- $\hat{p}_{\sigma(i)}\left(c_i\right)$ is the predicted probability of the ground-truth class $c_i$
- $-\mathbb{1}_{\left\{c_i \neq \varnothing\right\}} \hat{p}_{\sigma(i)}\left(c_i\right)$ means that when the class is not empty (not "no object"), the class probability is multiplied by -1. Why -1? Because we want to find the lowest / minimum cost.
Box cost function
\[\mathcal{L}_{b o x}\left(b_i, \hat{b}_{\sigma(i)}\right)=\lambda_{\text {iou }} \mathcal{L}_{\text {iou }}\left(b_i, \hat{b}_{\sigma(i)}\right)+\lambda_{L 1}\left\|b_i-\hat{b}_{\sigma(i)}\right\|_1\]- $\mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)})$ is the GIoU cost
- $\left\|b_i-\hat{b}_{\sigma(i)}\right\|_1$ is the L1 norm
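A minimal sketch of the matching step using scipy's Hungarian solver. The GIoU term is omitted for brevity, and the λ values, sizes, and labels are assumed examples, not the exact numbers used in DETR.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Toy sizes; the GIoU term is left out for brevity (class + L1 costs only).
N, num_classes, num_gt = 5, 3, 2
lambda_cls, lambda_l1 = 1.0, 5.0                          # assumed coefficients

pred_prob  = torch.rand(N, num_classes + 1).softmax(-1)   # (N, classes + "no object")
pred_boxes = torch.rand(N, 4)                             # (N, 4), normalized boxes
gt_labels  = torch.tensor([0, 2])                         # ground-truth classes
gt_boxes   = torch.rand(num_gt, 4)                        # (num_gt, 4)

cost_class = -pred_prob[:, gt_labels]                     # (N, num_gt): -p of the GT class
cost_l1    = torch.cdist(pred_boxes, gt_boxes, p=1)       # (N, num_gt): pairwise L1 distance

C = lambda_cls * cost_class + lambda_l1 * cost_l1         # total matching cost matrix
pred_idx, gt_idx = linear_sum_assignment(C.numpy())       # Hungarian algorithm
print(pred_idx, gt_idx)                                   # matched (prediction, ground-truth) pairs
```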
DETR Loss
After the model gets the best-matching set, we can calculate the loss.
\[\mathcal{L}_{\text {Hungarian }}(y, \hat{y})=\sum_{i=1}^N\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_i\right)+\mathbb{1}_{\left\{c_i \neq \varnothing\right\}} \mathcal{L}_{\text {box }}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right)\right]\]- $\hat{\sigma}$ is the best match from the previous step
- $-\log \hat{p}_{\hat{\sigma}(i)}\left(c_i\right)$ is the Negative Log-Likelihood (NLL) of the prediction
- $\mathcal{L}_{\text{box}}\left(b_i,\hat{b}_{\hat{\sigma}(i)}\right)$ is the box loss
Box loss function
\[\mathcal{L}_{b o x}\left(b_i, \hat{b}_{\sigma(i)}\right)=\lambda_{\text {iou }} \mathcal{L}_{\text {iou }}\left(b_i, \hat{b}_{\sigma(i)}\right)+\lambda_{L 1}\left\|b_i-\hat{b}_{\sigma(i)}\right\|_1\]- $\mathcal{L}_{\text {iou }}\left(b_i, \hat{b}_{\sigma(i)}\right)$ is the GIoU loss (1 - GIoU)
- $\left\|b_i-\hat{b}_{\sigma(i)}\right\|_1$ is the L1-norm loss
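Continuing the matching sketch above (it reuses the variables defined there), here is a minimal version of the loss, again without the GIoU term: the class loss is an NLL over all N queries, with unmatched queries assigned the "no object" class, and the L1 loss is computed only on the matched boxes.

```python
import torch
import torch.nn.functional as F

# Continuing the matching sketch above (reuses N, num_classes, pred_prob,
# pred_boxes, gt_labels, gt_boxes, pred_idx, gt_idx, lambda_l1).
pred_idx = torch.as_tensor(pred_idx)
gt_idx   = torch.as_tensor(gt_idx)

target_classes = torch.full((N,), num_classes, dtype=torch.long)  # default: "no object"
target_classes[pred_idx] = gt_labels[gt_idx]                      # matched queries get GT labels

loss_cls = F.nll_loss(pred_prob.log(), target_classes)            # -log p of the target class
loss_l1  = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx],
                     reduction='sum') / len(gt_idx)               # L1 over matched boxes only

loss = loss_cls + lambda_l1 * loss_l1                             # GIoU term omitted here
```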
References
[1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-End Object Detection with Transformers." arXiv, 2020. https://arxiv.org/abs/2005.12872
[2] P. Sun et al., "What Makes for End-to-End Object Detection?" arXiv, 2020. doi: 10.48550/ARXIV.2012.05780. https://arxiv.org/abs/2012.05780
[3] A. Vaswani et al., "Attention Is All You Need." arXiv, 2017. doi: 10.48550/ARXIV.1706.03762.
[4] A. Arora, "The Annotated DETR." https://amaarora.github.io/posts/2021-07-26-annotateddetr.html