RoI Pooling, RoI Align & Deformable RoI Pooling

Understanding region-based feature extraction for object detection, from quantized pooling to sub-pixel alignment and adaptive sampling


What is RoI Pooling?

Region of Interest (RoI) pooling is a fundamental operation in two-stage object detectors like Faster R-CNN and Mask R-CNN. Given a CNN feature map and proposed regions of varying sizes, RoI pooling extracts fixed-size feature vectors for each region—enabling downstream classification and bounding box regression.

The challenge is that region proposals have arbitrary positions and sizes, but the detection head expects fixed-size inputs. RoI Pooling solved this with quantized max pooling, but its rounding errors became problematic for pixel-precise tasks. RoI Align eliminated quantization using bilinear interpolation, while Deformable RoI Pooling added learned offsets for shape-adaptive sampling.

The Problem: Arbitrary Regions, Fixed Networks

Two-stage detectors face a fundamental mismatch:

  1. Region proposals come in arbitrary sizes (50×30, 200×150, 80×80...)
  2. Detection heads (FC layers) require fixed-size inputs (7×7×512)

We need to extract features from each proposed region and resize them to a fixed spatial size—but how do we handle regions that don't align with the feature map grid?

RoI Pooling: The Original Approach

RoI Pooling: The Quantization Problem

[Interactive demo: drag the RoI (blue rectangle) to see how quantization introduces misalignment. For example, an RoI at (x=1.30, y=0.80) with size 3.40×2.60 is floored to (x=1, y=0) with size 3×3, an area error of 1.8%, before each bin is max-pooled into a 2×2 output.]

Why This Matters

RoI Pooling applies two levels of quantization: once when mapping RoI coordinates to feature map cells, and again when dividing into pooling bins. For instance-level tasks like segmentation, this misalignment causes significant accuracy loss. Mask R-CNN introduced RoI Align to solve this problem.

How RoI Pooling Works

  1. Map RoI to Feature Map: Scale the RoI coordinates by the stride (e.g., 16x for VGG). A 160×96 RoI becomes 10×6 on the feature map.

  2. Quantize Coordinates: Round floating-point coordinates to integers. This is the first source of quantization error.

  3. Divide into Bins: Split the quantized region into a fixed grid (e.g., 7×7). Bin sizes are also quantized to integers.

  4. Max Pool Each Bin: Apply max pooling within each bin to produce the output feature.
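The four steps above can be sketched as a minimal single-channel implementation (assuming a NumPy feature map; the function name and defaults are illustrative, not a reference implementation):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2, stride=16):
    """Quantized RoI max pooling (Fast R-CNN style) on a single-channel map.

    roi: (x1, y1, x2, y2) in image coordinates.
    """
    # Steps 1-2: map to feature-map coordinates, then quantize with floor.
    x1, y1, x2, y2 = [int(np.floor(c / stride)) for c in roi]
    h = max(y2 - y1, 1)
    w = max(x2 - x1, 1)

    out = np.zeros((output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            # Step 3: integer bin boundaries (the second quantization).
            ys, ye = y1 + i * h // output_size, y1 + (i + 1) * h // output_size
            xs, xe = x1 + j * w // output_size, x1 + (j + 1) * w // output_size
            ye, xe = max(ye, ys + 1), max(xe, xs + 1)  # keep bins non-empty
            # Step 4: max pool within the bin.
            out[i, j] = feature_map[ys:ye, xs:xe].max()
    return out
```

On an 8×8 map holding the values 0..63, a 64×64-pixel RoI starting at (16, 16) maps cleanly onto feature cells 1..5, so each 2×2 bin simply takes the largest (bottom-right) value it covers.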

The Two Levels of Quantization

RoI Pooling introduces two rounds of rounding:

  1. RoI Boundary Quantization: When mapping RoI coordinates to the feature map
  2. Bin Size Quantization: When dividing the region into pooling bins

For a 7×7 output, each level can introduce up to 0.5 cell error, compounding to potentially 1 full cell of misalignment. At 16× downsampling, this means up to 16 pixels of error in the original image!
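A quick back-of-the-envelope check of the first quantization level, using a hypothetical box edge:

```python
# How floor quantization translates back into pixel error at a 16x stride.
# The box coordinate here is a made-up example value.
stride = 16
x1_img = 23.0                             # RoI left edge in image pixels
x1_feat = x1_img / stride                 # 1.4375 on the feature map
x1_quant = int(x1_feat)                   # floor -> 1
error_px = (x1_feat - x1_quant) * stride  # 0.4375 * 16 = 7.0 pixels
print(error_px)  # 7.0
```

A fractional-cell error on the feature map is multiplied back up by the stride, so even a single rounding step can shift the effective RoI edge by several image pixels.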

RoI Align: Eliminating Quantization

RoI Align: Bilinear Interpolation

[Interactive demo: click anywhere on the grid to see how a value is interpolated from the 4 nearest neighbors. For example, the sample point (x=2.30, y=1.70) is computed as f(x,y) = Σ wᵢⱼ × Qᵢⱼ = 0.21×65 + 0.09×85 + 0.49×80 + 0.21×0 = 60.5.]

Unlike RoI Pooling, RoI Align samples at exact floating-point positions using bilinear interpolation. The value at any point is a weighted combination of the 4 nearest feature map values, where each neighbor's weight is the product of its proximity to the point along x and along y. This eliminates misalignment and is critical for Mask R-CNN.

RoI Align (Mask R-CNN, 2017) eliminates both quantization steps:

  1. No rounding of RoI coordinates
  2. Floating-point bin boundaries
  3. Regular sampling points within each bin (typically 4 per bin)
  4. Bilinear interpolation to compute values at exact positions

Bilinear Interpolation Formula

def bilinear_interpolate(feature_map, x, y):
    """
    Compute the feature value at floating-point position (x, y)
    using its 4 nearest neighbors.
    """
    h, w = feature_map.shape
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the border

    # Distance-based weights (each is the area of the opposite
    # sub-rectangle; the four weights sum to 1)
    wa = (x0 + 1 - x) * (y0 + 1 - y)  # weight for (x0, y0)
    wb = (x - x0) * (y0 + 1 - y)      # weight for (x1, y0)
    wc = (x0 + 1 - x) * (y - y0)      # weight for (x0, y1)
    wd = (x - x0) * (y - y0)          # weight for (x1, y1)

    # Weighted sum of the 4 neighbors
    return (wa * feature_map[y0, x0] + wb * feature_map[y0, x1] +
            wc * feature_map[y1, x0] + wd * feature_map[y1, x1])
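A full RoI Align pass can then be sketched on top of bilinear interpolation. This is a simplified single-channel version with illustrative names and defaults, not the reference implementation:

```python
import numpy as np

def roi_align(feature_map, roi, output_size=2, stride=16, samples=2):
    """RoI Align sketch: no rounding anywhere; `samples` x `samples` regular
    sample points per bin, each bilinearly interpolated, then averaged."""
    H, W = feature_map.shape

    def bilinear(x, y):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)  # clamp at border
        lx, ly = x - x0, y - y0
        return ((1 - lx) * (1 - ly) * feature_map[y0, x0]
                + lx * (1 - ly) * feature_map[y0, x1]
                + (1 - lx) * ly * feature_map[y1, x0]
                + lx * ly * feature_map[y1, x1])

    # Floating-point RoI on the feature map -- no quantization.
    fx1, fy1, fx2, fy2 = [c / stride for c in roi]
    bin_h = (fy2 - fy1) / output_size
    bin_w = (fx2 - fx1) / output_size

    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regular sample positions inside the bin.
                    y = fy1 + (i + (sy + 0.5) / samples) * bin_h
                    x = fx1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(x, y))
            out[i, j] = np.mean(vals)  # average the per-bin samples
    return out
```

On a constant feature map the output is exactly that constant for any RoI, which is a handy sanity check that the interpolation weights sum to 1.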

Why This Matters for Segmentation

For bounding box detection, small misalignments might be acceptable. But for instance segmentation (predicting per-pixel masks), a 1-pixel shift can completely change which object a pixel belongs to. RoI Align's sub-pixel precision is essential for Mask R-CNN's performance.

Deformable RoI Pooling: Shape Adaptation

Deformable RoI Pooling: Learned Offsets

[Interactive demo: learned offsets adapt a 3×3 sampling grid (9 points, 18 learnable offset parameters, max offset 15 px in this demo) to the object shape; for a wide object, the offsets spread horizontally to cover it.]
Offsets are predicted by a small FC layer trained end-to-end. The network learns to focus on informative regions for each object class.

Shape-Adaptive Sampling

While RoI Align eliminates quantization, it still uses a fixed rectangular grid. Deformable RoI Pooling adds learnable 2D offsets to each sampling point, allowing the network to adaptively focus on irregular object boundaries, occluded parts, or semantically important regions. This is especially useful for objects with complex shapes.

Deformable RoI Pooling (Deformable ConvNets, 2017) extends RoI Align with learned offsets:

  1. A small FC layer predicts 2D offsets (Δx, Δy) for each sampling point
  2. Offsets are added to the regular grid positions
  3. Bilinear interpolation samples at the shifted locations
  4. Offsets are trained end-to-end via backpropagation

This allows the network to adaptively focus on:

  • Irregular object boundaries (non-rectangular shapes)
  • Semantically important parts (face of a person, wheels of a car)
  • Occluded regions (sampling around occlusions)
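A minimal single-channel sketch of this offset-augmented sampling (for simplicity the offsets are passed in directly rather than predicted by the small FC layer, and each bin uses one sample point; names are illustrative):

```python
import numpy as np

def deformable_roi_pool(feature_map, roi, offsets, output_size=3, stride=16):
    """Deformable RoI pooling sketch.

    offsets: array of shape (output_size, output_size, 2) holding the
    learned (dx, dy) per bin, in feature-map cells. In a real network
    these come from an FC layer trained end-to-end.
    """
    H, W = feature_map.shape

    def bilinear(x, y):
        x, y = np.clip(x, 0, W - 1), np.clip(y, 0, H - 1)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
        lx, ly = x - x0, y - y0
        return ((1 - lx) * (1 - ly) * feature_map[y0, x0]
                + lx * (1 - ly) * feature_map[y0, x1]
                + (1 - lx) * ly * feature_map[y1, x0]
                + lx * ly * feature_map[y1, x1])

    fx1, fy1, fx2, fy2 = [c / stride for c in roi]
    bin_h = (fy2 - fy1) / output_size
    bin_w = (fx2 - fx1) / output_size

    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # Regular bin center, shifted by the learned offset.
            y = fy1 + (i + 0.5) * bin_h + offsets[i, j, 1]
            x = fx1 + (j + 0.5) * bin_w + offsets[i, j, 0]
            out[i, j] = bilinear(x, y)
    return out
```

With all offsets zero this reduces to a regular grid; a uniform horizontal offset shifts every sample one cell to the right, which is exactly how the network "stretches" the grid toward wide objects.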

Method Comparison

| Property | RoI Pooling | RoI Align | Deformable RoI |
|---|---|---|---|
| Quantization | 2 levels | None | None |
| Alignment error | Up to 16 px (at 16× stride) | Sub-pixel | Adaptive (learned) |
| Shape handling | Fixed grid | Fixed grid | Learned offsets |
| Complexity | O(k²) | O(k² × 4) | O(k² × 4 + FC) |
| Extra parameters | 0 | 0 | O(k²) |
| Speed | Fastest | Fast | Moderate |
| Best for | Classification, when precision is not critical | Instance segmentation, precise localization | Complex or articulated shapes |

Evolution Timeline

  • 2015: Fast R-CNN (RoI Pooling)
  • 2017: Mask R-CNN (RoI Align)
  • 2017: Deformable ConvNets (Deformable RoI Pooling)

Use Cases

  • Object Detection: Classification and localization of objects (Faster R-CNN with RoI Align)
  • Instance Segmentation: Per-pixel masks for each object (Mask R-CNN requires RoI Align)
  • Human Pose Estimation: Detecting keypoints on articulated bodies (Deformable RoI)
  • Video Object Detection: Tracking objects across frames
  • Medical Imaging: Detecting lesions with precise boundaries
  • Autonomous Driving: Real-time detection of vehicles and pedestrians

Key Formulas

RoI Pooling (max pooling with quantization):

output[i,j] = max(features[⌊y1 + i×h/k⌋ : ⌊y1 + (i+1)×h/k⌋, ⌊x1 + j×w/k⌋ : ⌊x1 + (j+1)×w/k⌋])

RoI Align (bilinear sampling):

output[i,j] = avg(bilinear(features, sample_points[i,j]))

Deformable RoI (offset-augmented):

output[i,j] = avg(bilinear(features, sample_points[i,j] + Δ[i,j])) where Δ = offset_predictor(pooled_features)

Best Practices

  1. Use RoI Align by Default: Unless speed is critical, RoI Align's precision benefits outweigh its small overhead

  2. Match Feature Map Resolution: Ensure RoI coordinates are correctly scaled by the backbone's stride (16 or 32)

  3. Use Multiple Sampling Points: 4 samples per bin (2×2) is standard; more helps for small objects

  4. Consider FPN for Multi-Scale: Assign RoIs to appropriate FPN levels based on size

  5. Regularize Deformable Offsets: Add L2 regularization to prevent offsets from diverging

Further Reading

  • Girshick, "Fast R-CNN" (ICCV 2015): introduces RoI Pooling
  • Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" (NeurIPS 2015)
  • He et al., "Mask R-CNN" (ICCV 2017): introduces RoI Align
  • Dai et al., "Deformable Convolutional Networks" (ICCV 2017): introduces Deformable RoI Pooling