What is RoI Pooling?
Region of Interest (RoI) pooling is a fundamental operation in two-stage object detectors like Faster R-CNN and Mask R-CNN. Given a CNN feature map and proposed regions of varying sizes, RoI pooling extracts fixed-size feature vectors for each region—enabling downstream classification and bounding box regression.
The challenge is that region proposals have arbitrary positions and sizes, but the detection head expects fixed-size inputs. RoI Pooling solved this with quantized max pooling, but its rounding errors became problematic for pixel-precise tasks. RoI Align eliminated quantization using bilinear interpolation, while Deformable RoI Pooling added learned offsets for shape-adaptive sampling.
The Problem: Arbitrary Regions, Fixed Networks
Two-stage detectors face a fundamental mismatch:
- Region proposals come in arbitrary sizes (50×30, 200×150, 80×80...)
- Detection heads (FC layers) require fixed-size inputs (7×7×512)
We need to extract features from each proposed region and resize them to a fixed spatial size—but how do we handle regions that don't align with the feature map grid?
RoI Pooling: The Original Approach
*(Interactive demo: drag the RoI to see how quantization introduces misalignment between the proposed region and the pooled bins.)*
Why This Matters
RoI Pooling applies two levels of quantization: once when mapping RoI coordinates to feature map cells, and again when dividing into pooling bins. For instance-level tasks like segmentation, this misalignment causes significant accuracy loss. Mask R-CNN introduced RoI Align to solve this problem.
How RoI Pooling Works
1. Map RoI to Feature Map: Scale the RoI coordinates by the backbone stride (e.g., 16× for VGG-16). A 160×96 RoI becomes 10×6 on the feature map.
2. Quantize Coordinates: Round the floating-point coordinates to integers. This is the first source of quantization error.
3. Divide into Bins: Split the quantized region into a fixed grid (e.g., 7×7). Bin sizes are also quantized to integers, a second source of error.
4. Max Pool Each Bin: Apply max pooling within each bin to produce the fixed-size output feature.
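The four steps can be sketched in NumPy. This is a simplified single-channel sketch (real implementations operate on batched C×H×W tensors and quantize with flooring, as in the Key Formulas section below):

```python
import numpy as np

def roi_pool(features, roi, output_size=7, stride=16):
    """Quantized RoI max pooling sketch.
    `features` is a 2D (single-channel) feature map;
    `roi` = (x1, y1, x2, y2) in image pixel coordinates."""
    # Steps 1-2: map to the feature map and quantize (floor) the coordinates
    x1, y1, x2, y2 = (int(c / stride) for c in roi)
    w, h = max(x2 - x1, 1), max(y2 - y1, 1)
    out = np.zeros((output_size, output_size), dtype=features.dtype)
    for i in range(output_size):
        for j in range(output_size):
            # Step 3: integer bin boundaries (second quantization)
            ys, ye = y1 + i * h // output_size, y1 + (i + 1) * h // output_size
            xs, xe = x1 + j * w // output_size, x1 + (j + 1) * w // output_size
            ye, xe = max(ye, ys + 1), max(xe, xs + 1)  # at least one cell per bin
            # Step 4: max pool within the bin
            out[i, j] = features[ys:ye, xs:xe].max()
    return out
```

Note that for the 160×96 RoI above, the quantized region is 10×6 cells, so the 7×7 bins come out uneven: some span two cells, others only one.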
The Two Levels of Quantization
RoI Pooling introduces two rounds of rounding:
- RoI Boundary Quantization: When mapping RoI coordinates to the feature map
- Bin Size Quantization: When dividing the region into pooling bins
For a 7×7 output, each level can introduce up to 0.5 cell error, compounding to potentially 1 full cell of misalignment. At 16× downsampling, this means up to 16 pixels of error in the original image!
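To make the error concrete, here is a quick worked example of the first quantization level (the RoI corner coordinates are hypothetical):

```python
stride = 16  # VGG-16 downsampling factor

# Hypothetical RoI corner at (x=23, y=37) in image pixels
x, y = 23, 37
fx, fy = x / stride, y / stride   # 1.4375, 2.3125 on the feature map

# Level 1: RoI boundary quantization (rounding to the nearest cell)
qx, qy = round(fx), round(fy)     # 1, 2
err_cells = max(abs(fx - qx), abs(fy - qy))   # 0.4375 cells

# Projected back into image pixels, this level alone contributes:
print(err_cells * stride)  # 7.0 px (worst case: 0.5 cell = 8 px per level)
```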
RoI Align: Eliminating Quantization
*(Interactive demo: click anywhere on the grid to see how values are interpolated from the 4 nearest neighbors.)*
No Quantization
Unlike RoI Pooling, RoI Align samples at exact floating-point positions using bilinear interpolation. The value at any point is a weighted combination of the 4 nearest feature map values, with weights inversely proportional to distance. This eliminates misalignment and is critical for Mask R-CNN.
RoI Align (Mask R-CNN, 2017) eliminates both quantization steps:
- No rounding of RoI coordinates
- Floating-point bin boundaries
- Regular sampling points within each bin (typically 4 per bin)
- Bilinear interpolation to compute values at exact positions
Bilinear Interpolation Formula
```python
def bilinear_interpolate(feature_map, x, y):
    """Compute the feature value at floating-point position (x, y)
    using the 4 nearest neighbors."""
    x0, y0 = int(x), int(y)
    x1, y1 = x0 + 1, y0 + 1

    # Distance weights
    wa = (x1 - x) * (y1 - y)  # weight for (x0, y0)
    wb = (x - x0) * (y1 - y)  # weight for (x1, y0)
    wc = (x1 - x) * (y - y0)  # weight for (x0, y1)
    wd = (x - x0) * (y - y0)  # weight for (x1, y1)

    # Clamp neighbor indices so they stay inside the feature map
    h, w = feature_map.shape[:2]
    x1, y1 = min(x1, w - 1), min(y1, h - 1)

    # Weighted sum
    return (wa * feature_map[y0, x0] + wb * feature_map[y0, x1] +
            wc * feature_map[y1, x0] + wd * feature_map[y1, x1])
```
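Putting bilinear interpolation to work, here is a minimal single-channel RoI Align sketch, assuming the standard 2×2 regularly spaced sampling points per bin, averaged (real implementations vectorize this over channels and batches):

```python
import numpy as np

def bilinear(fm, x, y):
    """Bilinear interpolation at floating-point (x, y)."""
    x0, y0 = int(x), int(y)
    h, w = fm.shape
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    lx, ly = x - x0, y - y0
    return ((1 - lx) * (1 - ly) * fm[y0, x0] + lx * (1 - ly) * fm[y0, x1] +
            (1 - lx) * ly * fm[y1, x0] + lx * ly * fm[y1, x1])

def roi_align(features, roi, output_size=7, stride=16, samples=2):
    """RoI Align sketch: no rounding anywhere.
    `roi` = (x1, y1, x2, y2) in image pixel coordinates."""
    x1, y1, x2, y2 = (c / stride for c in roi)   # exact feature-map coords
    bin_w = (x2 - x1) / output_size
    bin_h = (y2 - y1) / output_size
    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            vals = []
            # samples x samples regularly spaced points inside the bin
            for sy in range(samples):
                for sx in range(samples):
                    py = y1 + (i + (sy + 0.5) / samples) * bin_h
                    px = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(features, px, py))
            out[i, j] = sum(vals) / len(vals)   # average the sampled values
    return out
```

The original Mask R-CNN paper averages the sampled values per bin; max over the samples works nearly as well.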
Why This Matters for Segmentation
For bounding box detection, small misalignments might be acceptable. But for instance segmentation (predicting per-pixel masks), a 1-pixel shift can completely change which object a pixel belongs to. RoI Align's sub-pixel precision is essential for Mask R-CNN's performance.
Deformable RoI Pooling: Shape Adaptation
*(Interactive demo: see how learned offsets adapt the sampling grid to object shapes, e.g. spreading horizontally to capture wide objects.)*
Offsets are predicted by a small FC layer trained end-to-end. The network learns to focus on informative regions for each object class.
Shape-Adaptive Sampling
While RoI Align eliminates quantization, it still uses a fixed rectangular grid. Deformable RoI Pooling adds learnable 2D offsets to each sampling point, allowing the network to adaptively focus on irregular object boundaries, occluded parts, or semantically important regions. This is especially useful for objects with complex shapes.
Deformable RoI Pooling (Deformable ConvNets, 2017) extends RoI Align with learned offsets:
- A small FC layer predicts 2D offsets (Δx, Δy) for each sampling point
- Offsets are added to the regular grid positions
- Bilinear interpolation samples at the shifted locations
- Offsets are trained end-to-end via backpropagation
This allows the network to adaptively focus on:
- Irregular object boundaries (non-rectangular shapes)
- Semantically important parts (face of a person, wheels of a car)
- Occluded regions (sampling around occlusions)
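A sketch of the sampling step, with one offset per bin passed in explicitly. In the actual Deformable ConvNets design the offsets are predicted by a small FC layer from the pooled features and normalized by the RoI size; here they are raw feature-map offsets for clarity:

```python
import numpy as np

def bilinear(fm, x, y):
    """Bilinear interpolation at floating-point (x, y)."""
    x0, y0 = int(x), int(y)
    h, w = fm.shape
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    lx, ly = x - x0, y - y0
    return ((1 - lx) * (1 - ly) * fm[y0, x0] + lx * (1 - ly) * fm[y0, x1] +
            (1 - lx) * ly * fm[y1, x0] + lx * ly * fm[y1, x1])

def deformable_roi_pool(features, roi, offsets, k=7, stride=16):
    """One sample per bin at the bin center plus a per-bin (dx, dy) offset.
    `offsets` has shape (k, k, 2) in feature-map cells; in the real model it
    is predicted by an FC layer and trained end-to-end."""
    x1, y1, x2, y2 = (c / stride for c in roi)
    bw, bh = (x2 - x1) / k, (y2 - y1) / k
    h, w = features.shape
    out = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Regular grid position shifted by the learned offset
            px = x1 + (j + 0.5) * bw + offsets[i, j, 0]
            py = y1 + (i + 0.5) * bh + offsets[i, j, 1]
            # Keep the shifted sample inside the feature map
            px = min(max(px, 0.0), w - 1.0)
            py = min(max(py, 0.0), h - 1.0)
            out[i, j] = bilinear(features, px, py)
    return out
```

With all offsets at zero this reduces to single-sample RoI Align; nonzero offsets move each bin's sample toward whatever the network has learned to attend to.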
Method Comparison
| Property | RoI Pooling | RoI Align | Deformable RoI |
|---|---|---|---|
| Alignment | Poor (up to ~1 cell off) | Exact (sub-pixel) | Learned (adaptive) |
| Complexity | O(k²) | O(k² × 4) | O(k² × 4 + FC) |
| Best Use Case | Classification, when precision not critical | Instance segmentation, precise localization | Complex shapes, articulated objects |
Evolution Timeline
| Aspect | RoI Pooling | RoI Align | Deformable RoI |
|---|---|---|---|
| Quantization | 2 levels | None | None |
| Alignment Error | Up to 16px | Sub-pixel | Adaptive |
| Shape Handling | Fixed grid | Fixed grid | Learned offsets |
| Speed | Fastest | Fast | Moderate |
| Parameters | 0 | 0 | O(k²) |
| Best For | Classification | Segmentation | Complex shapes |
Use Cases
- Object Detection: Classification and localization of objects (Faster R-CNN with RoI Align)
- Instance Segmentation: Per-pixel masks for each object (Mask R-CNN requires RoI Align)
- Human Pose Estimation: Detecting keypoints on articulated bodies (Deformable RoI)
- Video Object Detection: Tracking objects across frames
- Medical Imaging: Detecting lesions with precise boundaries
- Autonomous Driving: Real-time detection of vehicles and pedestrians
Key Formulas
RoI Pooling (max pooling with quantization):
output[i,j] = max(features[⌊y1 + i×h/k⌋ : ⌊y1 + (i+1)×h/k⌋, ⌊x1 + j×w/k⌋ : ⌊x1 + (j+1)×w/k⌋])
RoI Align (bilinear sampling):
output[i,j] = avg(bilinear(features, sample_points[i,j]))
Deformable RoI (offset-augmented):
output[i,j] = avg(bilinear(features, sample_points[i,j] + Δ[i,j])) where Δ = offset_predictor(pooled_features)
Best Practices
- Use RoI Align by Default: Unless speed is critical, RoI Align's precision benefits outweigh its small overhead.
- Match Feature Map Resolution: Ensure RoI coordinates are correctly scaled by the backbone's stride (typically 16 or 32).
- Use Multiple Sampling Points: 4 samples per bin (2×2) is standard; more can help for small objects.
- Consider FPN for Multi-Scale Detection: Assign each RoI to an appropriate FPN level based on its size.
- Regularize Deformable Offsets: Add L2 regularization to keep learned offsets from diverging.
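The FPN level-assignment rule mentioned in the best practices is simple enough to show directly. This is the heuristic from the FPN paper, with k0 = 4 for the canonical 224×224 ImageNet scale:

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Assign an RoI of size (w, h) in image pixels to an FPN level (P2-P5),
    following the heuristic k = floor(k0 + log2(sqrt(w*h) / 224))."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k, k_max))   # clamp to the available pyramid levels

print(fpn_level(224, 224))  # 4: the canonical ImageNet scale maps to P4
print(fpn_level(112, 112))  # 3: smaller RoIs go to finer, higher-resolution levels
```

Large RoIs are pooled from coarse levels and small RoIs from fine ones, so every region sees roughly the same receptive-field-to-object ratio.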
Further Reading
- Fast R-CNN - Original RoI Pooling
- Mask R-CNN - RoI Align for instance segmentation
- Deformable Convolutional Networks - Deformable RoI Pooling
- Feature Pyramid Networks - Multi-scale RoI assignment
- Cascade R-CNN - Multi-stage RoI refinement
