RoI Pooling, RoI Align & Deformable RoI Pooling

Understanding region-based feature extraction for object detection, from quantized pooling to sub-pixel alignment and adaptive sampling


What is RoI Pooling?

Region of Interest (RoI) pooling is a fundamental operation in two-stage object detectors like Faster R-CNN and Mask R-CNN. Given a CNN feature map and proposed regions of varying sizes, RoI pooling extracts fixed-size feature vectors for each region—enabling downstream classification and bounding box regression.

The challenge is that region proposals have arbitrary positions and sizes, but the detection head expects fixed-size inputs. RoI Pooling solved this with quantized max pooling, but its rounding errors became problematic for pixel-precise tasks. RoI Align eliminated quantization using bilinear interpolation, while Deformable RoI Pooling added learned offsets for shape-adaptive sampling.

The Problem: Arbitrary Regions, Fixed Networks

Two-stage detectors face a fundamental mismatch:

  1. Region proposals come in arbitrary sizes (50×30, 200×150, 80×80...)
  2. Detection heads (FC layers) require fixed-size inputs (7×7×512)

We need to extract features from each proposed region and resize them to a fixed spatial size—but how do we handle regions that don't align with the feature map grid?

RoI Pooling: The Original Approach

RoI Pooling: The Quantization Problem

[Interactive demo: drag the RoI (blue rectangle) to see how quantization introduces misalignment. For example, an RoI at (x=1.30, y=0.80) with size 3.40×2.60 is floored to (x=1, y=0) with size 3×3, an area error of 1.8%, before each bin is max-pooled into a 2×2 output.]

Why This Matters

RoI Pooling applies two levels of quantization: once when mapping RoI coordinates to feature map cells, and again when dividing into pooling bins. For instance-level tasks like segmentation, this misalignment causes significant accuracy loss. Mask R-CNN introduced RoI Align to solve this problem.

How RoI Pooling Works

  1. Map RoI to Feature Map: Scale the RoI coordinates by the stride (e.g., 16x for VGG). A 160×96 RoI becomes 10×6 on the feature map.

  2. Quantize Coordinates: Round floating-point coordinates to integers. This is the first source of quantization error.

  3. Divide into Bins: Split the quantized region into a fixed grid (e.g., 7×7). Bin sizes are also quantized to integers.

  4. Max Pool Each Bin: Apply max pooling within each bin to produce the output feature.
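The four steps above can be sketched as a minimal single-channel implementation (assuming a NumPy feature map; the function name and defaults are illustrative, not a reference implementation):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2, stride=16):
    """Quantized RoI max pooling (Fast R-CNN style) on a single-channel map.

    roi: (x1, y1, x2, y2) in image coordinates.
    """
    # Steps 1-2: map to feature-map coordinates, then quantize with floor.
    x1, y1, x2, y2 = [int(np.floor(c / stride)) for c in roi]
    h = max(y2 - y1, 1)
    w = max(x2 - x1, 1)

    out = np.zeros((output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            # Step 3: integer bin boundaries (the second quantization).
            ys, ye = y1 + i * h // output_size, y1 + (i + 1) * h // output_size
            xs, xe = x1 + j * w // output_size, x1 + (j + 1) * w // output_size
            ye, xe = max(ye, ys + 1), max(xe, xs + 1)  # keep bins non-empty
            # Step 4: max pool within the bin.
            out[i, j] = feature_map[ys:ye, xs:xe].max()
    return out
```

On an 8×8 map holding the values 0..63, a 64×64-pixel RoI starting at (16, 16) maps cleanly onto feature cells 1..5, so each 2×2 bin simply takes the largest (bottom-right) value it covers.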

The Two Levels of Quantization

RoI Pooling introduces two rounds of rounding:

  1. RoI Boundary Quantization: When mapping RoI coordinates to the feature map
  2. Bin Size Quantization: When dividing the region into pooling bins

For a 7×7 output, each level can introduce up to 0.5 cell error, compounding to potentially 1 full cell of misalignment. At 16× downsampling, this means up to 16 pixels of error in the original image!
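A quick back-of-the-envelope check of the first quantization level, using a hypothetical box edge:

```python
# How floor quantization translates back into pixel error at a 16x stride.
# The box coordinate here is a made-up example value.
stride = 16
x1_img = 23.0                             # RoI left edge in image pixels
x1_feat = x1_img / stride                 # 1.4375 on the feature map
x1_quant = int(x1_feat)                   # floor -> 1
error_px = (x1_feat - x1_quant) * stride  # 0.4375 * 16 = 7.0 pixels
print(error_px)  # 7.0
```

A fractional-cell error on the feature map is multiplied back up by the stride, so even a single rounding step can shift the effective RoI edge by several image pixels.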

RoI Align: Eliminating Quantization

RoI Align: Bilinear Interpolation

[Interactive demo: click anywhere on the grid to see how a value is interpolated from the 4 nearest neighbors. For example, the sample point (x=2.30, y=1.70) is computed as f(x,y) = Σ wᵢⱼ × Qᵢⱼ = 0.21×65 + 0.09×85 + 0.49×80 + 0.21×0 = 60.5.]

Unlike RoI Pooling, RoI Align samples at exact floating-point positions using bilinear interpolation. The value at any point is a weighted combination of the 4 nearest feature map values, where each neighbor's weight is the product of its proximity to the point along x and along y. This eliminates misalignment and is critical for Mask R-CNN.

RoI Align (Mask R-CNN, 2017) eliminates both quantization steps:

  1. No rounding of RoI coordinates
  2. Floating-point bin boundaries
  3. Regular sampling points within each bin (typically 4 per bin)
  4. Bilinear interpolation to compute values at exact positions

Bilinear Interpolation Formula

def bilinear_interpolate(feature_map, x, y):
    """
    Compute the feature value at floating-point position (x, y)
    using its 4 nearest neighbors.
    """
    h, w = feature_map.shape
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the border

    # Distance-based weights (each is the area of the opposite
    # sub-rectangle; the four weights sum to 1)
    wa = (x0 + 1 - x) * (y0 + 1 - y)  # weight for (x0, y0)
    wb = (x - x0) * (y0 + 1 - y)      # weight for (x1, y0)
    wc = (x0 + 1 - x) * (y - y0)      # weight for (x0, y1)
    wd = (x - x0) * (y - y0)          # weight for (x1, y1)

    # Weighted sum of the 4 neighbors
    return (wa * feature_map[y0, x0] + wb * feature_map[y0, x1] +
            wc * feature_map[y1, x0] + wd * feature_map[y1, x1])
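A full RoI Align pass can then be sketched on top of bilinear interpolation. This is a simplified single-channel version with illustrative names and defaults, not the reference implementation:

```python
import numpy as np

def roi_align(feature_map, roi, output_size=2, stride=16, samples=2):
    """RoI Align sketch: no rounding anywhere; `samples` x `samples` regular
    sample points per bin, each bilinearly interpolated, then averaged."""
    H, W = feature_map.shape

    def bilinear(x, y):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)  # clamp at border
        lx, ly = x - x0, y - y0
        return ((1 - lx) * (1 - ly) * feature_map[y0, x0]
                + lx * (1 - ly) * feature_map[y0, x1]
                + (1 - lx) * ly * feature_map[y1, x0]
                + lx * ly * feature_map[y1, x1])

    # Floating-point RoI on the feature map -- no quantization.
    fx1, fy1, fx2, fy2 = [c / stride for c in roi]
    bin_h = (fy2 - fy1) / output_size
    bin_w = (fx2 - fx1) / output_size

    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regular sample positions inside the bin.
                    y = fy1 + (i + (sy + 0.5) / samples) * bin_h
                    x = fx1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(x, y))
            out[i, j] = np.mean(vals)  # average the per-bin samples
    return out
```

On a constant feature map the output is exactly that constant for any RoI, which is a handy sanity check that the interpolation weights sum to 1.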

Why This Matters for Segmentation

For bounding box detection, small misalignments might be acceptable. But for instance segmentation (predicting per-pixel masks), a 1-pixel shift can completely change which object a pixel belongs to. RoI Align's sub-pixel precision is essential for Mask R-CNN's performance.

Deformable RoI Pooling: Shape Adaptation

Deformable RoI Pooling: Learned Offsets

[Interactive demo: learned offsets adapt a 3×3 sampling grid (9 points, 18 learnable offset parameters, max offset 15 px in this demo) to the object shape; for a wide object, the offsets spread horizontally to cover it.]
Offsets are predicted by a small FC layer trained end-to-end. The network learns to focus on informative regions for each object class.

Shape-Adaptive Sampling

While RoI Align eliminates quantization, it still uses a fixed rectangular grid. Deformable RoI Pooling adds learnable 2D offsets to each sampling point, allowing the network to adaptively focus on irregular object boundaries, occluded parts, or semantically important regions. This is especially useful for objects with complex shapes.

Deformable RoI Pooling (Deformable ConvNets, 2017) extends RoI Align with learned offsets:

  1. A small FC layer predicts 2D offsets (Δx, Δy) for each sampling point
  2. Offsets are added to the regular grid positions
  3. Bilinear interpolation samples at the shifted locations
  4. Offsets are trained end-to-end via backpropagation

This allows the network to adaptively focus on:

  • Irregular object boundaries (non-rectangular shapes)
  • Semantically important parts (face of a person, wheels of a car)
  • Occluded regions (sampling around occlusions)
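A minimal single-channel sketch of this offset-augmented sampling (for simplicity the offsets are passed in directly rather than predicted by the small FC layer, and each bin uses one sample point; names are illustrative):

```python
import numpy as np

def deformable_roi_pool(feature_map, roi, offsets, output_size=3, stride=16):
    """Deformable RoI pooling sketch.

    offsets: array of shape (output_size, output_size, 2) holding the
    learned (dx, dy) per bin, in feature-map cells. In a real network
    these come from an FC layer trained end-to-end.
    """
    H, W = feature_map.shape

    def bilinear(x, y):
        x, y = np.clip(x, 0, W - 1), np.clip(y, 0, H - 1)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
        lx, ly = x - x0, y - y0
        return ((1 - lx) * (1 - ly) * feature_map[y0, x0]
                + lx * (1 - ly) * feature_map[y0, x1]
                + (1 - lx) * ly * feature_map[y1, x0]
                + lx * ly * feature_map[y1, x1])

    fx1, fy1, fx2, fy2 = [c / stride for c in roi]
    bin_h = (fy2 - fy1) / output_size
    bin_w = (fx2 - fx1) / output_size

    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # Regular bin center, shifted by the learned offset.
            y = fy1 + (i + 0.5) * bin_h + offsets[i, j, 1]
            x = fx1 + (j + 0.5) * bin_w + offsets[i, j, 0]
            out[i, j] = bilinear(x, y)
    return out
```

With all offsets zero this reduces to a regular grid; a uniform horizontal offset shifts every sample one cell to the right, which is exactly how the network "stretches" the grid toward wide objects.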

Method Comparison

| Property | RoI Pooling | RoI Align | Deformable RoI |
|---|---|---|---|
| Quantization | 2 levels | None | None |
| Alignment error | Up to 16 px (at 16× stride) | Sub-pixel | Adaptive (learned) |
| Shape handling | Fixed grid | Fixed grid | Learned offsets |
| Complexity | O(k²) | O(k² × 4) | O(k² × 4 + FC) |
| Extra parameters | 0 | 0 | O(k²) |
| Speed | Fastest | Fast | Moderate |
| Best for | Classification, when precision is not critical | Instance segmentation, precise localization | Complex or articulated shapes |

Evolution Timeline

  • 2015: Fast R-CNN (RoI Pooling)
  • 2017: Mask R-CNN (RoI Align)
  • 2017: Deformable ConvNets (Deformable RoI Pooling)

Use Cases

  • Object Detection: Classification and localization of objects (Faster R-CNN with RoI Align)
  • Instance Segmentation: Per-pixel masks for each object (Mask R-CNN requires RoI Align)
  • Human Pose Estimation: Detecting keypoints on articulated bodies (Deformable RoI)
  • Video Object Detection: Tracking objects across frames
  • Medical Imaging: Detecting lesions with precise boundaries
  • Autonomous Driving: Real-time detection of vehicles and pedestrians

Key Formulas

RoI Pooling (max pooling with quantization):

output[i,j] = max(features[⌊y1 + i×h/k⌋ : ⌊y1 + (i+1)×h/k⌋, ⌊x1 + j×w/k⌋ : ⌊x1 + (j+1)×w/k⌋])

RoI Align (bilinear sampling):

output[i,j] = avg(bilinear(features, sample_points[i,j]))

Deformable RoI (offset-augmented):

output[i,j] = avg(bilinear(features, sample_points[i,j] + Δ[i,j])) where Δ = offset_predictor(pooled_features)

Best Practices

  1. Use RoI Align by Default: Unless speed is critical, RoI Align's precision benefits outweigh its small overhead

  2. Match Feature Map Resolution: Ensure RoI coordinates are correctly scaled by the backbone's stride (16 or 32)

  3. Use Multiple Sampling Points: 4 samples per bin (2×2) is standard; more helps for small objects

  4. Consider FPN for Multi-Scale: Assign RoIs to appropriate FPN levels based on size

  5. Regularize Deformable Offsets: Add L2 regularization to prevent offsets from diverging

Further Reading

  • Girshick, "Fast R-CNN" (ICCV 2015): introduces RoI Pooling
  • Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (NeurIPS 2015)
  • He et al., "Mask R-CNN" (ICCV 2017): introduces RoI Align
  • Dai et al., "Deformable Convolutional Networks" (ICCV 2017): introduces Deformable RoI Pooling