Why Calculus in Machine Learning?
Calculus is the mathematics of change and is essential for:
- Optimization: Finding minima/maxima of loss functions
- Gradient Descent: Computing direction of steepest descent
- Backpropagation: Propagating errors through neural networks
- Understanding Dynamics: How models change during training
Derivatives: The Foundation
What is a Derivative?
The derivative measures the rate of change of a function:
f'(x) = lim[h→0] (f(x+h) - f(x)) / h
Geometric Interpretation: Slope of the tangent line at a point
Physical Interpretation: Instantaneous rate of change
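The limit definition can be watched numerically. The sketch below is only an illustration, with `forward_difference` and `cube` as names chosen for this example: it evaluates the difference quotient for shrinking h and shows the secant slope approaching the tangent slope f'(2) = 12 for f(x) = x³.

```python
def forward_difference(f, x, h):
    """Difference quotient (f(x + h) - f(x)) / h from the limit definition."""
    return (f(x + h) - f(x)) / h


def cube(x):
    return x ** 3  # exact derivative is 3x^2, so f'(2) = 12


# Shrinking h moves the secant slope toward the tangent slope at x = 2.
for h in (1.0, 0.1, 0.01, 0.001):
    print(f"h = {h:<6} slope = {forward_difference(cube, 2.0, h):.6f}")
```

In floating point, h cannot shrink indefinitely: below a certain size, rounding error in f(x + h) - f(x) starts to dominate the quotient.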
Common Derivatives
| Function | Derivative |
|---|---|
| c (constant) | 0 |
| x^n | nx^(n-1) |
| e^x | e^x |
| ln(x) | 1/x |
| sin(x) | cos(x) |
| cos(x) | -sin(x) |
| sigmoid(x) | sigmoid(x)(1 - sigmoid(x)) |
| tanh(x) | 1 - tanh²(x) |
Rules of Differentiation
- Linearity: (af + bg)' = af' + bg'
- Product Rule: (fg)' = f'g + fg'
- Chain Rule: (f∘g)' = f'(g(x)) · g'(x)
- Quotient Rule: (f/g)' = (f'g - fg')/g²
Partial Derivatives
For functions of multiple variables f(x, y, z):
∂f/∂x = Rate of change with respect to x (y, z held constant)
Gradient Vector
The gradient combines all partial derivatives:
∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z]
Properties:
- Points in direction of steepest increase
- Perpendicular to level curves/surfaces
- Magnitude equals the rate of change in that steepest direction
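The first and third properties can be checked on a small example. This sketch assumes an illustrative function f(x, y) = x² + 3y² and an arbitrary point (1, 2). It scans unit directions in one-degree steps and compares the steepest one with the analytic gradient.

```python
import math


def f(x, y):
    return x ** 2 + 3 * y ** 2


def grad_f(x, y):
    """Analytic gradient [∂f/∂x, ∂f/∂y] of f(x, y) = x^2 + 3y^2."""
    return (2 * x, 6 * y)


point = (1.0, 2.0)
gx, gy = grad_f(*point)
grad_norm = math.hypot(gx, gy)


def directional_slope(angle, step=1e-6):
    """Rate of change of f at `point` along the unit vector at `angle`."""
    dx, dy = math.cos(angle), math.sin(angle)
    x, y = point
    return (f(x + step * dx, y + step * dy) - f(x, y)) / step


# Scan 360 unit directions and keep the one with the largest slope.
angles = [math.radians(k) for k in range(360)]
best_angle = max(angles, key=directional_slope)

print("gradient angle (deg):        ", math.degrees(math.atan2(gy, gx)))
print("steepest sampled angle (deg):", math.degrees(best_angle))
print("gradient norm:", grad_norm, "best slope:", directional_slope(best_angle))
```

The sampled direction agrees with the gradient direction to within the one-degree grid, and the slope along it is close to the gradient norm.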
Chain Rule in Deep Learning
Forward Pass
Given nested functions:
z = f(g(h(x)))
Output is computed layer by layer:
a = h(x)
b = g(a)
z = f(b)
Backward Pass (Backpropagation)
Derivatives are computed in reverse:
dz/dx = dz/db · db/da · da/dx = f'(b) · g'(a) · h'(x)
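As a concrete instance, the sketch below picks three illustrative scalar functions (h(x) = x², g(a) = sin(a), f(b) = e^b; these choices are assumptions for the example). It runs the forward pass, multiplies the local derivatives in reverse, and compares the result with a central difference.

```python
import math

x = 0.5

# Forward pass: evaluate layer by layer and keep the intermediates.
a = x ** 2          # a = h(x)
b = math.sin(a)     # b = g(a)
z = math.exp(b)     # z = f(b)

# Backward pass: multiply local derivatives from the output back to the input.
dz_db = math.exp(b)     # f'(b)
db_da = math.cos(a)     # g'(a)
da_dx = 2 * x           # h'(x)
dz_dx = dz_db * db_da * da_dx


# Independent check with a central difference on the composed function.
def composed(t):
    return math.exp(math.sin(t ** 2))


eps = 1e-6
dz_dx_numeric = (composed(x + eps) - composed(x - eps)) / (2 * eps)
print(dz_dx, dz_dx_numeric)
```

The backward pass reuses `a` and `b` from the forward pass, which is why frameworks store intermediate activations during training.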
Neural Network Example
    # Forward pass
    z1 = W1 @ x + b1                # Linear
    a1 = relu(z1)                   # Activation
    z2 = W2 @ a1 + b2               # Linear
    loss = cross_entropy(z2, y)     # Loss

    # Backward pass (chain rule); d_v denotes ∂loss/∂v
    d_z2 = softmax(z2) - y          # ∂cross_entropy/∂z2, assuming softmax cross-entropy with one-hot y
    d_W2 = d_z2 @ a1.T
    d_a1 = W2.T @ d_z2
    d_z1 = d_a1 * (z1 > 0)          # relu'(z1): 1 where z1 > 0, else 0
    d_W1 = d_z1 @ x.T
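A runnable version of this example might look like the following NumPy sketch. The layer sizes, the random initialization, and the choice of softmax cross-entropy on the logits `z2` are assumptions made for illustration. The bias gradients and a finite-difference spot check are added for completeness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs, 5 hidden units, 3 classes, one example.
x = rng.normal(size=(4, 1))
y = np.array([[0.0], [1.0], [0.0]])          # one-hot target
W1, b1 = rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(3, 5)), np.zeros((3, 1))


def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))   # shift for stability
    return e / e.sum(axis=0, keepdims=True)


def forward(W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0.0)                       # ReLU
    z2 = W2 @ a1 + b2
    loss = -np.sum(y * np.log(softmax(z2)))        # softmax cross-entropy
    return z1, a1, z2, loss


z1, a1, z2, loss = forward(W1, b1, W2, b2)

# Backward pass, one chain-rule factor per line.
d_z2 = softmax(z2) - y            # gradient of softmax cross-entropy w.r.t. logits
d_W2 = d_z2 @ a1.T
d_b2 = d_z2
d_a1 = W2.T @ d_z2
d_z1 = d_a1 * (z1 > 0)            # ReLU derivative: 1 where z1 > 0, else 0
d_W1 = d_z1 @ x.T
d_b1 = d_z1

# Spot-check one weight with a central difference.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[3]
           - forward(W1_minus, b1, W2, b2)[3]) / (2 * eps)
print(d_W1[0, 0], numeric)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check when writing a backward pass by hand.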
Optimization Fundamentals
Critical Points
Where f'(x) = 0:
- Local minimum: f''(x) > 0
- Local maximum: f''(x) < 0
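- Inconclusive: f''(x) = 0; the point may be a minimum, a maximum, or an inflection point (for example, x³ at 0). In several variables, a critical point whose Hessian has both positive and negative eigenvalues is a saddle point.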
Gradient Descent
Basic update rule:
θ(t+1) = θ(t) - α∇f(θ(t))
Where:
- θ: Parameters
- α: Learning rate
- ∇f: Gradient of loss function
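The update rule can be sketched in a few lines. The quadratic objective, the starting point, and the step count below are illustrative assumptions, chosen so that the minimizer (3, -1) is known in advance.

```python
def grad_f(theta):
    """Gradient of f(θ) = (θ₀ - 3)² + 2(θ₁ + 1)², minimized at (3, -1)."""
    return [2 * (theta[0] - 3), 4 * (theta[1] + 1)]


def gradient_descent(theta, alpha=0.1, steps=100):
    for _ in range(steps):
        g = grad_f(theta)
        # θ(t+1) = θ(t) - α∇f(θ(t)), applied per coordinate
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


theta_final = gradient_descent([0.0, 0.0])
print(theta_final)
```

With α = 0.1 each step shrinks the error in the two coordinates by factors of 0.8 and 0.6. On this particular objective, a learning rate above 0.5 would make the second coordinate diverge.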
Variants of Gradient Descent
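- Batch GD: Computes the gradient over the entire dataset for each update
- Stochastic GD: Uses a single sample per update
- Mini-batch GD: Uses a small subset of the data per update
- Momentum: Adds a velocity term that accumulates past gradients
- Adam: Adapts the learning rate per parameter using running averages of the gradient and its square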
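The last two variants differ from the basic rule only in how the step is formed. The sketch below applies a heavy-ball momentum update and the standard Adam update to an illustrative one-dimensional loss f(θ) = (θ - 3)². The hyperparameter values are common defaults, assumed here rather than taken from the text.

```python
def grad(theta):
    return 2 * (theta - 3.0)        # gradient of f(θ) = (θ - 3)², minimum at θ = 3


def momentum_descent(theta, alpha=0.1, beta=0.9, steps=1000):
    velocity = 0.0
    for _ in range(steps):
        # Velocity is a decaying sum of past gradient steps.
        velocity = beta * velocity - alpha * grad(theta)
        theta = theta + velocity
    return theta


def adam(theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps)
    return theta


theta_momentum = momentum_descent(0.0)
theta_adam = adam(0.0)
print(theta_momentum, theta_adam)
```

Both runs should end close to θ = 3. Batch, stochastic, and mini-batch descent would differ only in how `grad` is estimated from data.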
Common Loss Functions and Derivatives
Mean Squared Error (MSE)
L = (1/n)Σ(y_i - ŷ_i)²
∂L/∂ŷ_i = (2/n)(ŷ_i - y_i)
Cross-Entropy Loss
L = -Σ y_i log(ŷ_i)
∂L/∂ŷ_i = -y_i/ŷ_i
Binary Cross-Entropy
L = -[y log(ŷ) + (1-y)log(1-ŷ)]
∂L/∂ŷ = (ŷ - y)/(ŷ(1-ŷ))
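One reason the binary cross-entropy derivative matters in practice: when ŷ comes from a sigmoid, the chain rule cancels the denominator and leaves ŷ - y. The sketch below checks this numerically for one arbitrary logit and label. The variable names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


z, y = 0.8, 1.0
y_hat = sigmoid(z)

dL_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))   # binary cross-entropy derivative
dyhat_dz = y_hat * (1 - y_hat)                    # sigmoid derivative
dL_dz = dL_dyhat * dyhat_dz                       # chain rule

# Central-difference check of dL/dz on the composed loss.
eps = 1e-6
dL_dz_numeric = (bce(sigmoid(z + eps), y) - bce(sigmoid(z - eps), y)) / (2 * eps)

print(dL_dz, y_hat - y, dL_dz_numeric)
```

All three printed values agree up to rounding, so the gradient with respect to the logit is simply the prediction error.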
Activation Functions and Derivatives
ReLU
f(x) = max(0, x)
f'(x) = {1 if x > 0, 0 if x ≤ 0}
(The derivative is undefined at x = 0; using 0 there is a convention.)
Sigmoid
f(x) = 1/(1 + e^(-x))
f'(x) = f(x)(1 - f(x))
Tanh
f(x) = (e^x - e^(-x))/(e^x + e^(-x))
f'(x) = 1 - f(x)²
Softmax
f(x_i) = e^(x_i) / Σ_j e^(x_j)
∂f_i/∂x_j = f_i(δ_ij - f_j)
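Unlike the elementwise activations, softmax has a full Jacobian matrix because every output depends on every input. A minimal sketch of the formula above follows, using an arbitrary three-element input. The function names are chosen for this example.

```python
import math


def softmax(xs):
    m = max(xs)                                   # shift to avoid overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(xs):
    """J[i][j] = f_i * (δ_ij - f_j)."""
    f = softmax(xs)
    n = len(f)
    return [[f[i] * ((1.0 if i == j else 0.0) - f[j]) for j in range(n)]
            for i in range(n)]


J = softmax_jacobian([1.0, 2.0, 3.0])
for row in J:
    print([round(v, 4) for v in row])
```

Each row sums to zero because the softmax outputs always sum to one, so any change in the inputs must redistribute probability rather than create it.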
Automatic Differentiation
Modern frameworks compute derivatives automatically:
Computational Graph
    # Forward pass builds graph
    x = Variable(2.0)
    y = Variable(3.0)
    z = x * y        # z = 6
    w = z + x        # w = 8
    loss = w ** 2    # loss = 64

    # Backward pass computes gradients
    loss.backward()
    # x.grad = 2w(y + 1) = 2*8*4 = 64
    # y.grad = 2wx       = 2*8*2 = 32
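The snippet above relies on a `Variable` type that records the graph. As an illustration only, and not any particular framework's API, a minimal scalar version supporting `+`, `*`, and `**` could be sketched like this:

```python
class Variable:
    """Scalar node in a computational graph (illustrative, not a framework API)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_node, local derivative d(self)/d(parent)).
        self.parents = parents

    def __add__(self, other):
        return Variable(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Variable(self.value * other.value,
                        ((self, other.value), (other, self.value)))

    def __pow__(self, exponent):
        return Variable(self.value ** exponent,
                        ((self, exponent * self.value ** (exponent - 1)),))

    def backward(self):
        # Order the graph so each node is processed only after every node
        # that depends on it has passed its gradient down.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


x = Variable(2.0)
y = Variable(3.0)
z = x * y          # 6
w = z + x          # 8
loss = w ** 2      # 64
loss.backward()
print(x.grad, y.grad)   # 64.0 32.0
```

Because `x` feeds into both `z` and `w`, its gradient is accumulated with `+=` from two paths (48 through `z` and 16 directly), which matches the hand-computed value 2w(y + 1).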
Benefits
- No manual derivative calculation
- Handles complex architectures
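- Efficient computation: reverse mode obtains the full gradient of a scalar loss at a cost that is a small constant multiple of the forward pass
- Derivatives exact up to floating-point rounding (no finite-difference truncation error)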
Multivariate Calculus
Jacobian Matrix
For vector function f: ℝⁿ → ℝᵐ:
J = [∂f_i/∂x_j] =
[ ∂f₁/∂x₁  ...  ∂f₁/∂xₙ ]
[    ⋮      ⋱      ⋮    ]
[ ∂fₘ/∂x₁  ...  ∂fₘ/∂xₙ ]
Hessian Matrix
Second-order derivatives:
H = [∂²f/∂x_i∂x_j]
Uses:
- Newton's method optimization
- Analyzing convexity
- Finding saddle points
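To illustrate the saddle-point use, the sketch below estimates the Hessian of an assumed example function f(x, y) = x² - y² at its critical point with central differences. It then classifies the point from the signs of the two eigenvalues. The helper names and tolerance are choices made for this example.

```python
import math


def f(x, y):
    return x ** 2 - y ** 2      # critical point at the origin


def hessian_2d(f, x, y, h=1e-4):
    """Central-difference estimate of (f_xx, f_xy, f_yy)."""
    f_xx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    f_yy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    f_xy = (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return f_xx, f_xy, f_yy


def symmetric_eigenvalues(f_xx, f_xy, f_yy):
    """Eigenvalues of the symmetric 2x2 matrix [[f_xx, f_xy], [f_xy, f_yy]]."""
    mean = (f_xx + f_yy) / 2
    radius = math.sqrt(((f_xx - f_yy) / 2) ** 2 + f_xy ** 2)
    return mean - radius, mean + radius


def classify(eigs, tol=1e-6):
    lo, hi = eigs
    if lo > tol:
        return "local minimum"      # all eigenvalues positive
    if hi < -tol:
        return "local maximum"      # all eigenvalues negative
    if lo < -tol and hi > tol:
        return "saddle point"       # mixed signs
    return "inconclusive"


eigs = symmetric_eigenvalues(*hessian_2d(f, 0.0, 0.0))
print(eigs, classify(eigs))
```

The same eigenvalue test underlies convexity analysis: a function whose Hessian has no negative eigenvalues anywhere is convex.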
Integration in ML
Probability Distributions
P(a ≤ X ≤ b) = ∫[a,b] f(x)dx
Expected Values
E[X] = ∫ x·f(x)dx
Marginalization
p(x) = ∫ p(x,y)dy
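These integrals rarely have closed forms in practice, so they are often approximated numerically. As a small sketch, the code below uses a composite trapezoid rule on a normal density to estimate one probability and one expected value. The distribution parameters and the truncation range are assumptions for the example.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def trapezoid(g, a, b, n=10_000):
    """Composite trapezoid rule for the integral of g over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n))
    return total * h


# P(-1 <= X <= 1) for a standard normal X
prob = trapezoid(normal_pdf, -1.0, 1.0)

# E[X] for X ~ N(2, 1); the infinite range is truncated to mu ± 10 sigma
mean = trapezoid(lambda x: x * normal_pdf(x, mu=2.0), -8.0, 12.0)

print(round(prob, 4), round(mean, 4))
```

The first value reproduces the familiar "68% within one standard deviation" figure, and the second recovers the mean of the distribution.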
Optimization Techniques
Newton's Method
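x(n+1) = x(n) - H(x(n))⁻¹ ∇f(x(n))
Converges quadratically near a minimum with a positive-definite Hessian, but forming and inverting the n×n Hessian costs O(n²) memory and O(n³) time per step.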
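In one dimension the Hessian is just the second derivative, which makes the update easy to follow. The sketch below minimizes an illustrative convex function f(x) = x² + e^x, chosen for this example. The printed gradient shrinks very quickly from step to step.

```python
import math


def grad(x):
    return 2 * x + math.exp(x)      # f'(x) for f(x) = x² + e^x


def hess(x):
    return 2 + math.exp(x)          # f''(x), always positive so f is convex


x = 0.0
for step in range(6):
    x = x - grad(x) / hess(x)       # x(n+1) = x(n) - H⁻¹∇f, with scalar H
    print(step, x, grad(x))
```

In n dimensions the division becomes a linear solve against the Hessian, which is the expensive part the quasi-Newton methods below avoid.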
Conjugate Gradient
Suited to large-scale problems: it needs only gradients or Hessian-vector products, never the explicit Hessian.
L-BFGS
Approximates the inverse Hessian from a short history of gradient and parameter differences.
Practical Tips
Numerical Stability
- Gradient clipping: Prevent exploding gradients
- Log-sum-exp trick: Avoid overflow in softmax
- Batch normalization: Stabilize intermediate values
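The first two items are short enough to sketch directly. The function names below are illustrative, and the clipping shown is the common "rescale by global L2 norm" form, assumed here since the text does not specify a variant.

```python
import math


def log_sum_exp(xs):
    """log(Σ exp(x_j)), computed after subtracting the maximum."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_softmax(xs):
    lse = log_sum_exp(xs)
    return [x - lse for x in xs]


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


logits = [1000.0, 1001.0, 1002.0]     # math.exp(1000.0) alone would overflow
print(log_softmax(logits))
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

Subtracting the maximum leaves the result mathematically unchanged but keeps every exponent at or below zero, so no intermediate value can overflow.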
Debugging Gradients
- Gradient checking: Compare analytic gradients with numerical ones, grad_numerical = (f(x+ε) - f(x-ε))/(2ε)
- Visualize gradients: Plot histogram of gradient values
- Monitor gradient norms: Track ||∇θ|| during training
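A gradient check can be packaged as a small reusable helper. In this sketch the loss, its hand-written gradient, and the test point are hypothetical stand-ins for whatever function is being debugged. A relative error rather than an absolute one is used so the check does not depend on the gradient's scale.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of f at the list of floats x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


def l2_norm(v):
    return sum(vi ** 2 for vi in v) ** 0.5


def relative_error(a, b):
    den = l2_norm(a) + l2_norm(b)
    return l2_norm([ai - bi for ai, bi in zip(a, b)]) / den if den > 0 else 0.0


# Hypothetical loss and hand-written gradient to be checked
def loss(theta):
    return theta[0] ** 2 * theta[1] + 3 * theta[1]


def loss_grad(theta):
    return [2 * theta[0] * theta[1], theta[0] ** 2 + 3]


theta = [1.5, -2.0]
analytic = loss_grad(theta)
err = relative_error(analytic, numerical_gradient(loss, theta))
print("gradient norm:", l2_norm(analytic))
print("relative error:", err)
```

A relative error many orders of magnitude below one suggests the analytic gradient is correct. A value near one usually points to a sign error or a missing term.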
Common Pitfalls
- Vanishing gradients: Deep networks with saturating activations (sigmoid, tanh)
- Exploding gradients: Large learning rates, RNNs
- Saddle points: Common in high dimensions
- Local minima: Non-convex optimization
- Numerical errors: Accumulation in long chains
Advanced Topics
Stochastic Calculus
For understanding:
- Stochastic gradient descent dynamics
- Diffusion models
- Brownian motion in optimization
Variational Calculus
For:
- Variational autoencoders
- Optimal control
- Physics-informed neural networks
Differential Geometry
For:
- Natural gradients
- Information geometry
- Manifold learning
Summary
Calculus provides the mathematical machinery for:
- Computing gradients for optimization
- Understanding how changes propagate
- Analyzing convergence and stability
- Developing new algorithms
Master these concepts to understand not just how to use ML algorithms, but why they work and how to improve them.
