Feature Matching
An object as a bag of local features
- deals well with occlusion
- scale invariant
- rotation invariant
- position of parts is not important
- Works well for object instances
- Not great for generic object categories
Pros
- Simple
- Efficient algorithms
- Robust to deformations
Cons
- No spatial reasoning
Bag Of Words
Works well for image-level classification, less so for object detection, because the spatial layout is lost and objects have shape and boundaries
Represent a data item (document, texture, image) as a histogram over features
Term Frequency Inverse Document Frequency (TFIDF)
$TF = n(w_i, d) = {\#~of~occurrences~of~w_i~in~d \over \#~of~all~words~in~d}$
$IDF = \alpha_i = \log {D \over \sum_{d'} 1[w_i \in d']} = \log {\#~of~all~documents \over \#~of~documents~containing~w_i}$
$TFIDF = n(w_i, d)\,\alpha_i = n(w_i, d) \log {D \over \sum_{d'} 1[w_i \in d']} = TF \cdot IDF$
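A minimal NumPy sketch of the TF-IDF weighting above; the word-count matrix `counts` and the toy 3-document example are assumptions for illustration only.

```python
import numpy as np

def tfidf(counts):
    """counts: (D, V) array, counts[d, i] = # of times word w_i appears in document d.
    Returns the (D, V) TF-IDF matrix following the formulas above."""
    counts = np.asarray(counts, dtype=float)
    D = counts.shape[0]
    tf = counts / counts.sum(axis=1, keepdims=True)   # n(w_i, d) / # of words in d
    df = (counts > 0).sum(axis=0)                     # # of documents containing w_i
    idf = np.log(D / np.maximum(df, 1))               # guard against unused words
    return tf * idf                                   # elementwise TF * IDF

# toy usage: 3 documents, 4-word vocabulary
X = np.array([[2, 0, 1, 0],
              [0, 3, 1, 0],
              [1, 1, 1, 1]])
print(tfidf(X))
```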
Texture recognition
Characterized by repetition of basic elements or textons
Assumes that each texture is sufficiently represented purely by the occurrence distribution of the textons.
Standard BOW Pipeline
Dictionary Learning: Learn Visual Words using clustering
Which Features to Extract
BOW aggregates local information
SIFT (the descriptor) is preferred due to its discriminativeness
- To find specific textured objects, sparse sampling from interest points is often more reliable.
- Multiple complementary interest operators offer more image coverage.
- For object categorization, dense sampling offers better coverage.
Clustering is a common method for learning a visual vocabulary or codebook
If the training set is sufficiently representative, the codebook will be universal
Encode: Build Bags-of-Words (BOW) vectors for each image
- Quantization: associate each image feature to a visual word (nearest cluster center)
- Histogram: count the number of visual word occurrences
Classify: Train and test data using BOWs
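A sketch of the dictionary-learning and encode steps of the pipeline above, assuming scikit-learn's KMeans; the random arrays stand in for real local descriptors (e.g. SIFT) and the vocabulary size is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_vocabulary(all_descriptors, k=100):
    """Cluster local descriptors (one row per descriptor) into k visual words."""
    return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

def encode_bow(descriptors, kmeans):
    """Quantize each descriptor to its nearest cluster center and histogram the counts."""
    words = kmeans.predict(descriptors)                               # visual-word index per descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()                                          # L1-normalized BoW vector

# toy usage with random "descriptors" standing in for real 128-D SIFT features
rng = np.random.default_rng(0)
vocab = learn_vocabulary(rng.normal(size=(1000, 128)), k=50)
bow = encode_bow(rng.normal(size=(200, 128)), vocab)
```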
Pros
- flexible to geometry / deformations / viewpoint
- compact summary of image content
- provides fixed dimensional vector representation for sets
- very good results in practice
Cons
- background and foreground mixed when bag covers whole image
- optimal vocabulary formation remains unclear
- basic model ignores geometry: must verify afterwards, or encode it via the features themselves
kNN
Important to normalize: dimensions have different scales
Hyper-parameters are highly problem-dependent, need to choose via validation data.
Pros
- simple yet effective
Cons
- search is expensive (can be sped up) - O(mn)
- storage requirements - O(m)
- difficulties with high-dimensional data - distance comparisons become less meaningful as dimensionality grows
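A minimal sketch of k-NN classification with per-dimension standardization (as noted above, normalization matters because dimensions have different scales); the value of k is assumed and would normally be chosen on validation data.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Plain k-NN: standardize each dimension, then take a majority vote over the k nearest neighbors."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    Xtr, Xte = (X_train - mu) / sigma, (X_test - mu) / sigma
    preds = []
    for x in Xte:
        d = np.linalg.norm(Xtr - x, axis=1)      # O(m) distances per query
        nn = np.argsort(d)[:k]                   # indices of the k nearest neighbors
        vals, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])    # majority vote
    return np.array(preds)
```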
Spatial Reasoning
- Extract features
- Match features
- Spatial verification
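One common way to do the spatial-verification step above is to fit a global transformation with RANSAC and keep only consistent matches. A sketch using OpenCV, assuming keypoints and matches come from an OpenCV detector/matcher (e.g. SIFT + a brute-force matcher); the reprojection threshold is an assumption.

```python
import numpy as np
import cv2

def verify_matches(kp1, kp2, matches, thresh=5.0):
    """Fit a homography to putative matches with RANSAC and keep the geometrically
    consistent inliers. kp1/kp2: lists of cv2.KeyPoint; matches: list of cv2.DMatch."""
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, thresh)
    inliers = [m for m, keep in zip(matches, mask.ravel()) if keep]
    return H, inliers
```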
Pros
- Retains spatial constraints
- Robust to deformations
Cons
- Computationally expensive
- Hard to generalize to large intra-class variation (e.g., modeling chairs)
Window Classification
Template Matching
- Get image window
- Extract features
- Compare with template
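A minimal sketch of the "compare with template" step using normalized cross-correlation; the window and template are assumed to be same-sized grayscale arrays.

```python
import numpy as np

def match_score(window, template):
    """Normalized cross-correlation between an image window and a same-sized template.
    Scores near 1 indicate a strong match; near 0 (or negative) a poor one."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.linalg.norm(w) * np.linalg.norm(t) + 1e-8
    return float((w * t).sum() / denom)
```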
Pros
- Retains spatial constraints
- Can be highly discriminative
Cons
- Many many possible windows to evaluate
- Requires large amounts of data
- Sometimes (very) slow to train
Window-Based Object Detection Framework
- Build/train object model
- Choose a representation
- Learn or fit parameters of model / classifier
- Generate candidate windows in new image
- test all locations and scales
- Scale the image instead of the template
- Score the candidate
- Resolve detections (NMS)
- Rescore candidates based on entire set
- Threshold & apply non-maximum suppression
Why scale the image and not the template?
Template is complicated and scale-specific; unlikely to survive a resizing operation, so we scale input image instead.
When it is simple, e.g. Viola-Jones, it is okay (and actually more efficient) to rescale the detector / template / features.
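A sketch of the non-maximum suppression step from the framework above (greedy NMS); the box format (x1, y1, x2, y2) and the IoU threshold are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it by more
    than iou_thresh, repeat on the remainder. Boxes are (x1, y1, x2, y2)."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]           # candidates sorted by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best box against the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-8)
        order = order[1:][iou <= iou_thresh]  # discard suppressed boxes
    return keep
```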
Evaluation
- Area of Overlap
$AO(B_{gt}, B_p) = {|B_{gt} \cap B_p| \over |B_{gt} \cup B_p|}$
- Average Precision
Averages precision over the entire range of recall
A good score requires both high recall and high precision
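A direct implementation of the Area of Overlap formula above for two axis-aligned boxes; the (x1, y1, x2, y2) box format and the 0.5 acceptance threshold mentioned in the comment are assumptions.

```python
def area_of_overlap(b_gt, b_p):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b_gt[0], b_p[0]), max(b_gt[1], b_p[1])
    ix2, iy2 = min(b_gt[2], b_p[2]), min(b_gt[3], b_p[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    a_gt = (b_gt[2] - b_gt[0]) * (b_gt[3] - b_gt[1])
    a_p  = (b_p[2] - b_p[0]) * (b_p[3] - b_p[1])
    return inter / (a_gt + a_p - inter)

# a detection is typically counted as correct when AO exceeds a threshold such as 0.5
print(area_of_overlap((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3
```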
Deal with different viewpoint
Often train different models for a few different viewpoints
Support Vector Machines (SVM)
Maximum Margin solution: most stable to perturbations of data
Margin: the gap ${2 \over \parallel w \parallel}$ between the parallel supporting hyperplanes is maximized
Trade-off between the margin and the mistakes
Multi-class SVMs
One vs. all
- Training: learn an SVM for each class vs. the rest
- Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value
One vs. one
- Training: learn an SVM for each pair of classes
- Testing: each learned SVM “votes” for a class to assign to the test example
Convert the Classification Score to Confidence Score
$y = {1 \over 1 + e^{Ax + B}}$, where $x$ is the classifier's decision value
$A$ and $B$ are fitted to classifier outputs (e.g. on held-out validation data) such that the same classification decisions result
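A sketch of fitting the sigmoid above (Platt-style scaling) to held-out decision values; the use of scipy's generic minimizer and the initial guess are assumptions, not the numerically robust procedure used in practice.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit y = 1 / (1 + exp(A*s + B)) to held-out SVM decision values `scores`
    and binary labels (0/1) by minimizing the negative log-likelihood."""
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid log(0)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    A, B = minimize(nll, x0=[-1.0, 0.0]).x
    return A, B

# usage: confidence for a new decision value s is 1 / (1 + exp(A*s + B))
```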
Pros
- Widely available and built into virtually any machine learning package
- Kernel-based framework is very powerful, flexible
- Compact model
- Work very well in practice, even with very small training sample sizes
Cons
- No “direct” multi-class SVM, must combine two-class SVMs
- Can be tricky to select best kernel function for a problem
- Computation, memory
- During training time, must compute matrix of kernel values for every pair of examples
- Learning can take a very long time for large-scale problems
Dalal-Triggs Pedestrian Detector
Applies a linear SVM to window-based object detection using HOG (histogram of oriented gradients) features.
Object model = sum of scores of features at fixed positions!
- Extract fixed-sized (64x128 pixel) window at each position and scale
- Normalize gamma & color
- Compute HOG (histogram of oriented gradients) features within each window
- Orientations: 9 bins (for unsigned angles 0-180°)
- Votes weighted by magnitude
- Bilinear interpolation between cells
- Histograms over k x k pixel cells (similar to SIFT)
- Contrast normalized over overlapping spatial blocks
- Normalize wrt surrounding cells
- Concatenate all cell responses from block into vector
- Normalize vector
- Extract responses from cell of interest
- Do this 4x for each overlapping block
$\#~features = \#~cells \times \#~orientations \times \#~normalizations~by~neighboring~cells$
- Score the window with a linear SVM classifier
- Perform non-maxima suppression to remove overlapping detections with lower scores
Pedestrian detection with HOG
- Learn a pedestrian template using a SVM
- At test time, compare feature map with template over sliding windows.
- Find local maxima of response
- Multi-scale template: repeat over multiple levels of a HOG pyramid
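A sketch of learning the pedestrian template described above, assuming scikit-image's `hog` and scikit-learn's `LinearSVC`; the window lists and the value of C are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(window):
    """HOG descriptor for a 64x128 grayscale window, roughly following Dalal-Triggs:
    9 unsigned orientation bins, 8x8-pixel cells, 2x2-cell blocks, L2-Hys normalization."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_detector(pos_windows, neg_windows):
    """Train a linear SVM on HOG vectors of positive (pedestrian) and negative crops."""
    X = np.array([hog_features(w) for w in pos_windows + neg_windows])
    y = np.array([1] * len(pos_windows) + [0] * len(neg_windows))
    return LinearSVC(C=0.01).fit(X, y)
```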
Pros
- Sliding window detection and global appearance descriptors follow a simple implementation protocol
- Good feature choices critical
- Past successes for certain classes
Cons
- High computational complexity
- $\#~locations \times \#~orientations \times \#~scales$ evaluations!
- If training binary detectors independently, cost increases linearly with number of classes
- FP rate can be high
Limitations
- Some objects are not box shaped
- Non-rigid, deformable objects not captured well with representations assuming a fixed 2d structure; or must assume fixed viewpoint
- Objects with less-regular textures not captured well with holistic appearance-based descriptions
- If considering windows in isolation, context is lost
- In practice, often entails large, cropped training set (expensive)
- Requiring good match to a global appearance description can lead to sensitivity to partial occlusions
Tricks of the trade
- Template size
- Typical choice is size of smallest detectable object
- “Jittering” to create synthetic positive examples
- Create slightly rotated, translated, scaled, mirrored versions as extra positive examples
- Bootstrapping to get hard negative examples
- Randomly sample negative examples
- Train detector
- Sample negative examples that score > -1
- Repeat until all high-scoring negative examples fit in memory
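A sketch of the bootstrapping loop above on precomputed feature vectors; the pool of candidate negatives, the number of rounds, and the SVM settings are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_hard_negatives(X_pos, neg_pool, rounds=3, n_init=1000, seed=0):
    """Train on random negatives, then repeatedly add negatives from `neg_pool`
    that the current detector scores above -1 (hard negatives) and retrain.
    Assumes neg_pool has at least n_init rows of feature vectors."""
    rng = np.random.default_rng(seed)
    X_neg = neg_pool[rng.choice(len(neg_pool), n_init, replace=False)]
    for _ in range(rounds):
        X = np.vstack([X_pos, X_neg])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
        clf = LinearSVC(C=0.01).fit(X, y)
        hard = neg_pool[clf.decision_function(neg_pool) > -1]   # negatives near or inside the margin
        X_neg = np.unique(np.vstack([X_neg, hard]), axis=0)
    return clf
```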
Viola-Jones Face Detector
Slow to train but detection is very fast
- Integral images for fast feature evaluation
$I_{\Sigma}(x, y) = \sum_{x' \le x,~y' \le y} i(x', y')$
So that only 3 additions/subtractions (4 table lookups) are required for a rectangle of any size
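A minimal sketch of the integral image and the constant-time rectangle sum it enables; the toy 4x4 image is an assumption for illustration.

```python
import numpy as np

def integral_image(img):
    """I_sum(x, y): sum of all pixels above and to the left of (x, y), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1]: 4 lookups, 3 additions/subtractions."""
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(16).reshape(4, 4)
assert rect_sum(integral_image(img), 1, 1, 2, 2) == img[1:3, 1:3].sum()
```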
- Boosting for feature selection
Features that are fast to compute - “Haar-like features”
- Differences of sums of intensity $Value = \sum (pixels~in~white~area) -\sum(pixels~in~black~area) $
- Computed at different positions and scales within sliding window
In each boosting round:
- Find the weak classifier that achieves the lowest weighted training error.
- Raise the weights of training examples misclassified by current weak classifier.
- Compute final classifier as linear combination of all weak classifiers.
- Weight of each classifier is directly proportional to its accuracy.
$h(x) = sign(\sum^M_{j=1}\alpha_jh_j(x))$
$h$: final strong learner
$x$: window
$\alpha_j$: learner weight
$h_j$: weak learner
Choose weak learner that minimizes error on the weighted training set, then reweight
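A sketch of the boosting loop above (discrete AdaBoost); the candidate weak learners are assumed to be given as callables returning ±1 (e.g. threshold tests on single Haar-like features), and the number of rounds is an assumption.

```python
import numpy as np

def adaboost(X, y, weak_learners, rounds=50):
    """Discrete AdaBoost. weak_learners: list of h(X) -> {-1, +1}; y in {-1, +1}.
    Returns the final strong classifier h(x) = sign(sum_j alpha_j * h_j(x))."""
    n = len(y)
    w = np.full(n, 1.0 / n)                               # example weights
    chosen, alphas = [], []
    for _ in range(rounds):
        # pick the weak learner with the lowest weighted training error
        errs = [np.sum(w * (h(X) != y)) for h in weak_learners]
        j = int(np.argmin(errs))
        err = max(errs[j], 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)             # more accurate -> larger weight
        w *= np.exp(-alpha * y * weak_learners[j](X))     # raise weights of misclassified examples
        w /= w.sum()
        chosen.append(weak_learners[j])
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, chosen)))
```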
- Attentional cascade for fast non-face window rejection
Reject many negative examples but detect almost all positive examples
- Set target detection and false positive rates for each stage
- Keep adding features to current stage until target rates have been met
- Need to lower boosting threshold to maximize detection (as opposed to minimizing total classification error)
- Test on a validation set
- If the overall false positive rate is not low enough, add another stage
- Use false positives from current stage as the negative training examples for the next stage
Chain classifiers that are progressively more complex and have lower false positive rates
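A minimal sketch of evaluating such a cascade on one window; the stage representation (a scoring function plus a lowered threshold per stage) is an assumption.

```python
def cascade_classify(window, stages):
    """Attentional cascade: `stages` is a list of (score_fn, threshold) pairs of
    increasing complexity. A window must pass every stage to be accepted;
    most non-face windows are rejected cheaply by the early stages."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:   # threshold lowered to maximize detection
            return False                   # rejected: stop immediately
    return True
```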
Boosting vs. SVM
Advantages of boosting
- Integrates classifier training with feature selection
- Complexity of training is linear instead of quadratic in the number of training examples
- Flexibility in the choice of weak learners, boosting scheme
- Testing is fast
Disadvantages
- Needs many training examples
- Training is slow
- Often doesn’t work as well as SVM (especially for many-class problems)
For a 1 Mpix image (on the order of $10^6$ candidate windows), to avoid a false positive in every image the per-window false positive rate has to be less than $10^{-6}$
Summary
- Sliding window for search
- Features based on differences of intensity (gradient, wavelet, etc.)
- Excellent results require careful feature design
- Boosting for feature selection
- Integral images, cascade for speed
- Bootstrapping to deal with many negative examples