DETR set prediction loss

DETR 논문 바로가기

DETR 논문을 읽었다. ‘vision 문제중 하나인 object detector를 Transformer로 풀었다 NMS도 필요없다. 또 panoptic segmentation 에까지 확장 가능하다.’ 라는 내용인데, 읽으면서 몇가지 의문이 들었다. 구현을 염두해둔 의문(positional encoding은 어떻게 코드로 적용할까나? 번역, 챗봇과 다른 vision문제에서 Transformer는 Attention map을 어떻게 그릴까나?)과 DETR에서 가장 중요한 내용인 hungarian algorithm 이다. hungarian algorithm에 관련된 의문을 제외하곤 페이스북의 공식 DETR 깃허브의 코드들을 보면서 의문점을 해결했다.

DETR Attention 참고

DETR positional encoding 참고

따라서 본 글에서는 논문에서 Hungarian algorithm 내용이 언급된 3.1 Object detection set prediction loss 부분을 심층 탐구 한다.(시간이 흘러 나중에 이 부분을 다시보면 이해가 잘 안될 것 같아서…)

Hungarian algorithm 알고리즘에 대한 글은 여기를 보시라.(사실 Hungarian algorithm 알고리즘을 정확히 알아야 아래의 내용을 제대로 이해할 수 있다.)

3.1 Object detection set prediction loss

To find a bipartite matching between these two sets we search for a permutation of N elements with $\sigma\epsilon\mathfrak{S}_N$ the lowest cost:

\[\hat{\sigma}=\underset {\sigma\epsilon\mathfrak{S}_N}{\operatorname{arg min}}\sum_{i}^N\zeta_{match}(y_i,\hat{y}_\sigma(i))\]

where $\zeta_{match}(y_i,\hat{y}_\sigma(i))$ is pair-wise mathcing cost between ground truth $y_i$ and a prediction with index $\sigma(i)$ This optimal assignment is computed efficiently with the Hungarian algorithm

정리하자면, predicted object 와 ground truth간의 Optimal Bipartite Matching을 찾기 위해, 가장 낮은 비용을 갖는 N 요소들의 순서($\sigma\epsilon\mathfrak{S}_{N}$)를 찾는 것이 위 식의 목적이고, 그 순서($\hat{\sigma}$)를 Hungarian algorithm을 사용해서 찾겠다는 말이다.

참고 - 식의 의미?

ground truth와 prediction이 매칭되었을 때 비용이 낮아진다. 그 때의 index(arg)을 구하는 것이다. 식을 보면 일단 최소 비용을 찾고(min) 이 비용에 대한 index(arg)를 찾는다.

$\sigma\epsilon\mathfrak{S}_{N}$ 는 $\sigma$(index) 가 $\mathfrak{S}_N$(N 요소들의 순서 집합)의 원소라는 얘기이다.

그 외 참고

scipy Hungarian algorithm 문서 참고

MathJax 작성 참고

DETR set prediction loss

3.1 Object detection set prediction loss

참고 - 식의 의미?

그 외 참고

CTC Loss의 이해

Attention Is ALL You Need

Beam Search Decoder

태그