What is RoI Pooling?
Region of Interest (RoI) pooling is a fundamental operation in two-stage object detectors like Faster R-CNN and Mask R-CNN. Given a CNN feature map and proposed regions of varying sizes, RoI pooling extracts a fixed-size feature map (for example 7×7) for each region, which enables downstream classification and bounding box regression.
The challenge is that region proposals have arbitrary positions and sizes, but the detection head expects fixed-size inputs. RoI Pooling solved this with quantized max pooling, but its rounding errors became problematic for pixel-precise tasks. RoI Align eliminated quantization using bilinear interpolation, while Deformable RoI Pooling added learned offsets for shape-adaptive sampling.
The Problem: Arbitrary Regions, Fixed Networks
Two-stage detectors face a fundamental mismatch:
- Region proposals come in arbitrary sizes (50×30, 200×150, 80×80...)
- Detection heads (FC layers) require fixed-size inputs (7×7×512)
We need to extract features from each proposed region and resize them to a fixed spatial size—but how do we handle regions that don't align with the feature map grid?
RoI Pooling: The Original Approach
How RoI Pooling Works
- Map RoI to Feature Map: Divide the RoI coordinates by the backbone stride (e.g., 16 for VGG). A 160×96 RoI becomes 10×6 on the feature map.
- Quantize Coordinates: Round the resulting floating-point coordinates to integers. This is the first source of quantization error.
- Divide into Bins: Split the quantized region into a fixed grid (e.g., 7×7). Bin boundaries are also rounded to integers, which is the second source of error.
- Max Pool Each Bin: Apply max pooling within each bin to produce one output value per bin and channel.
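The four steps can be sketched in plain Python for a single-channel feature map stored as a list of rows. This is a minimal illustration, not a reference implementation: the function name `roi_max_pool`, its parameters, and the choice to treat the rounded end coordinate as exclusive are assumptions made for this sketch, and real implementations differ in such details.

```python
import math

def roi_max_pool(feature_map, roi, stride=16, out_size=2):
    """Quantized RoI max pooling on a 2D list-of-rows feature map.

    roi is (x1, y1, x2, y2) in image pixels (hypothetical signature).
    """
    height, width = len(feature_map), len(feature_map[0])
    # Steps 1-2: divide by the stride, then round to the nearest cell.
    x1, y1, x2, y2 = (math.floor(c / stride + 0.5) for c in roi)
    # Keep the region inside the map and at least one cell wide.
    x1 = min(max(x1, 0), width - 1)
    y1 = min(max(y1, 0), height - 1)
    x2 = min(max(x2, x1 + 1), width)
    y2 = min(max(y2, y1 + 1), height)
    roi_w, roi_h = x2 - x1, y2 - y1

    output = []
    for i in range(out_size):
        # Step 3: integer bin edges (floor the start, ceil the end).
        ys = y1 + math.floor(i * roi_h / out_size)
        ye = y1 + math.ceil((i + 1) * roi_h / out_size)
        row = []
        for j in range(out_size):
            xs = x1 + math.floor(j * roi_w / out_size)
            xe = x1 + math.ceil((j + 1) * roi_w / out_size)
            # Step 4: max over all cells that fall inside this bin.
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        output.append(row)
    return output

fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(fmap, (0, 0, 64, 64))
```

Here a 64×64 pixel RoI covers the whole 4×4 map, so each of the 2×2 output bins takes the maximum of one 2×2 block of cells.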
The Two Levels of Quantization
RoI Pooling introduces two rounds of rounding:
- RoI Boundary Quantization: When mapping RoI coordinates to the feature map
- Bin Size Quantization: When dividing the region into pooling bins
Each rounding step can shift a boundary by up to 0.5 feature-map cells, so the two levels can compound to roughly 1 full cell of misalignment. At 16× downsampling, one cell corresponds to 16 pixels in the original image, so the error can reach about 16 pixels.
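The arithmetic for a single boundary can be checked with a few lines of Python. The helper name `boundary_quantization_error` is hypothetical, and the sketch assumes round-to-nearest at a stride of 16; it only covers the first level of quantization.

```python
import math

def boundary_quantization_error(coord_px, stride=16):
    """Error in image pixels caused by rounding one mapped RoI coordinate."""
    exact = coord_px / stride                # position on the feature map
    quantized = math.floor(exact + 0.5)      # round to the nearest cell
    return abs(quantized - exact) * stride   # convert cells back to pixels

err_aligned = boundary_quantization_error(160)  # 160 / 16 = 10.0, already on the grid
err_worst = boundary_quantization_error(200)    # 200 / 16 = 12.5, half a cell off
```

A coordinate that lands exactly halfway between two cells loses half a cell, which is 8 pixels at this stride. A second rounding of the same size at the bin level is what brings the combined worst case to about one cell.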
RoI Align: Eliminating Quantization
RoI Align (Mask R-CNN, 2017) eliminates both quantization steps:
- No rounding of RoI coordinates
- Floating-point bin boundaries
- Regular sampling points within each bin (typically 4 per bin)
- Bilinear interpolation to compute values at exact positions
Bilinear Interpolation Formula
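    import math

    def bilinear_interpolate(feature_map, x, y):
        """
        Compute the feature value at floating-point position (x, y)
        using the 4 nearest neighbors.
        feature_map is a 2D array indexed as [row, col], e.g. a NumPy array.
        """
        height, width = feature_map.shape[:2]
        # Clamp so that all four neighbors stay inside the map
        x = min(max(x, 0.0), width - 1.0)
        y = min(max(y, 0.0), height - 1.0)
        x0, y0 = math.floor(x), math.floor(y)
        x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
        dx, dy = x - x0, y - y0

        # Distance weights
        wa = (1 - dx) * (1 - dy)  # weight for (x0, y0)
        wb = dx * (1 - dy)        # weight for (x1, y0)
        wc = (1 - dx) * dy        # weight for (x0, y1)
        wd = dx * dy              # weight for (x1, y1)

        # Weighted sum
        return (wa * feature_map[y0, x0] + wb * feature_map[y0, x1] +
                wc * feature_map[y1, x0] + wd * feature_map[y1, x1])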
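Putting the four RoI Align points together, a rough sketch for a single-channel map stored as a list of rows might look like the following. The names `bilinear_sample` and `roi_align` and their parameters are assumptions for illustration. The sketch also assumes that integer coordinates fall on cell centers; libraries differ on this half-cell convention, so results near boundaries may not match a particular implementation.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinear interpolation on a 2D list-of-rows map, clamped to the map."""
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feature_map, roi, stride=16, out_size=2, samples=2):
    """roi = (x1, y1, x2, y2) in image pixels; samples x samples points per bin."""
    x1, y1, x2, y2 = (c / stride for c in roi)  # no rounding of coordinates
    bin_w = (x2 - x1) / out_size                # floating-point bin sizes
    bin_h = (y2 - y1) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside the bin.
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature_map, x, y)
            row.append(total / (samples * samples))  # average the samples
        output.append(row)
    return output

# A map whose value equals its column index, so results are easy to check.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
aligned = roi_align(ramp, (16, 16, 48, 48))
```

On this ramp, each output value equals the horizontal center of its bin, which is a fractional position that quantized pooling could not represent.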
Why This Matters for Segmentation
For bounding box detection, small misalignments might be acceptable. But for instance segmentation (predicting per-pixel masks), a 1-pixel shift can completely change which object a pixel belongs to. RoI Align's sub-pixel precision is essential for Mask R-CNN's performance.
Deformable RoI Pooling: Shape Adaptation
Deformable RoI Pooling (Deformable ConvNets, 2017) augments regular-grid RoI pooling with learned offsets:
- A small FC layer, fed with the regularly pooled features, predicts a 2D offset (Δx, Δy) for each bin
- Offsets are added to the regular grid positions
- Bilinear interpolation samples at the shifted locations
- Offsets are trained end-to-end via backpropagation
This allows the network to adaptively focus on:
- Irregular object boundaries (non-rectangular shapes)
- Semantically important parts (face of a person, wheels of a car)
- Occluded regions (sampling around occlusions)
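The sampling side of this idea can be sketched as follows, with the offsets passed in directly rather than predicted. The function names, the list-of-rows feature map, and the `offsets[i][j] = (dx, dy)` layout are assumptions for illustration. Scaling the normalized offsets by `gamma` and the RoI size follows the general shape of the Deformable ConvNets formulation, but the default value used here should be treated as an assumption, and the FC offset predictor itself is not modeled.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinear interpolation on a 2D list-of-rows map, clamped to the map."""
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom

def deformable_roi_pool(feature_map, roi, offsets, stride=16,
                        out_size=2, samples=2, gamma=0.1):
    """offsets[i][j] = (dx, dy): normalized offset for bin (i, j)."""
    x1, y1, x2, y2 = (c / stride for c in roi)
    roi_w, roi_h = x2 - x1, y2 - y1
    bin_w, bin_h = roi_w / out_size, roi_h / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Scale the normalized offset by gamma and the RoI size.
            dx = gamma * offsets[i][j][0] * roi_w
            dy = gamma * offsets[i][j][1] * roi_h
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regular grid position, shifted by this bin's offset.
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h + dy
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w + dx
                    total += bilinear_sample(feature_map, x, y)
            row.append(total / (samples * samples))
        output.append(row)
    return output

# A map whose value equals its column index: 4 rows, 6 columns.
ramp = [[float(c) for c in range(6)] for _ in range(4)]
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]
right = [[(1.0, 0.0)] * 2 for _ in range(2)]
plain = deformable_roi_pool(ramp, (16, 16, 48, 48), zero)
shifted = deformable_roi_pool(ramp, (16, 16, 48, 48), right, gamma=0.5)
```

With zero offsets the result matches plain regular-grid sampling. With a rightward offset, every bin reads values one cell further along the ramp. In a trained model each bin would receive its own offset, so bins can drift toward different parts of the object.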
Method Comparison
| Property | RoI Pooling | RoI Align | Deformable RoI |
|---|---|---|---|
| Alignment | Poor (up to about 1 cell off) | Exact (sub-pixel) | Learned (adaptive) |
| Complexity | O(k²) | O(k² × 4) | O(k² × 4 + FC) |
| Best Use Case | Classification, when precision not critical | Instance segmentation, precise localization | Complex shapes, articulated objects |
Evolution Timeline
| Aspect | RoI Pooling | RoI Align | Deformable RoI |
|---|---|---|---|
| Quantization | 2 levels | None | None |
| Alignment Error | Up to 16px | Sub-pixel | Adaptive |
| Shape Handling | Fixed grid | Fixed grid | Learned offsets |
| Speed | Fastest | Fast | Moderate |
| Parameters | 0 | 0 | FC offset layer (2k² outputs) |
| Best For | Classification | Segmentation | Complex shapes |
Use Cases
- Object Detection: Classification and localization of objects (Faster R-CNN, originally with RoI Pooling and commonly with RoI Align in newer implementations)
- Instance Segmentation: Per-pixel masks for each object (Mask R-CNN requires RoI Align)
- Human Pose Estimation: Detecting keypoints on articulated bodies (Deformable RoI)
- Video Object Detection: Tracking objects across frames
- Medical Imaging: Detecting lesions with precise boundaries
- Autonomous Driving: Real-time detection of vehicles and pedestrians
Key Formulas
RoI Pooling (max pooling with quantization):
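output[i,j] = max(features[⌊y1 + i×h/k⌋ : ⌊y1 + (i+1)×h/k⌋, ⌊x1 + j×w/k⌋ : ⌊x1 + (j+1)×w/k⌋])
where (x1, y1) is the top-left corner of the quantized RoI on the feature map, h and w are its height and width in cells, and k is the output grid size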
RoI Align (bilinear sampling):
output[i,j] = avg(bilinear(features, sample_points[i,j]))
Deformable RoI (offset-augmented):
output[i,j] = avg(bilinear(features, sample_points[i,j] + Δ[i,j])) where Δ = offset_predictor(pooled_features)
Best Practices
- Use RoI Align by Default: Unless speed is critical, RoI Align's precision benefits outweigh its small overhead
- Match Feature Map Resolution: Ensure RoI coordinates are divided by the stride of the feature map they are pooled from (e.g., 16 or 32), i.e., multiplied by a spatial scale of 1/stride
- Use Multiple Sampling Points: 4 samples per bin (2×2) is standard; more samples may help for small objects
- Consider FPN for Multi-Scale: Assign RoIs to appropriate FPN levels based on size
- Regularize Deformable Offsets: Add L2 regularization to prevent offsets from diverging
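For the FPN point, the size-based assignment rule from the Feature Pyramid Networks paper, k = ⌊k0 + log2(√(wh) / 224)⌋, can be sketched as below. The function names and the clamping range P2 to P5 are assumptions for this illustration; the spatial scale follows from level P_k having stride 2^k.

```python
import math

def fpn_level(roi, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick the pyramid level for an RoI given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = roi
    w = max(x2 - x1, 1e-6)  # guard against degenerate boxes
    h = max(y2 - y1, 1e-6)
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return min(max(k, k_min), k_max)  # clamp to the available levels

def spatial_scale(level):
    """Level P_k has stride 2**k, so coordinates are scaled by 1 / 2**k."""
    return 1.0 / (2 ** level)

level_large = fpn_level((0, 0, 224, 224))  # canonical size maps to k0
level_small = fpn_level((0, 0, 32, 32))    # small boxes clamp to the finest level
```

Small RoIs are routed to finer, higher-resolution levels and large RoIs to coarser ones, and the matching spatial scale is then used when mapping the RoI onto that level's feature map.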
Further Reading
- Fast R-CNN - Original RoI Pooling
- Mask R-CNN - RoI Align for instance segmentation
- Deformable Convolutional Networks - Deformable RoI Pooling
- Feature Pyramid Networks - Multi-scale RoI assignment
- Cascade R-CNN - Multi-stage RoI refinement
