What is Linear Algebra?
Linear algebra is the branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces. It's fundamental to machine learning because:
- Data representation: Features are vectors, datasets are matrices
- Transformations: Neural networks perform linear transformations
- Optimization: Gradient descent operates in vector spaces
- Dimensionality reduction: PCA, SVD rely on linear algebra
Core Concepts
1. Scalars
A scalar is a single number. In ML contexts:
- Learning rate (α = 0.01)
- Regularization parameter (λ = 0.1)
- Individual predictions
2. Vectors
A vector is an ordered array of numbers:
# Column vector (most common in ML)
x = [x₁]
    [x₂]
    [x₃]

# Row vector
x = [x₁, x₂, x₃]
Properties:
- Dimension: Number of elements
- Magnitude: ||x|| = √(x₁² + x₂² + ... + xₙ²)
- Direction: Orientation in space
Operations:
- Addition: Element-wise addition
- Scalar multiplication: Multiply each element
- Dot product: x·y = Σ(xᵢ × yᵢ)
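A minimal NumPy sketch of these operations, using small example vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

x + y              # element-wise addition -> [5. 7. 9.]
2 * x              # scalar multiplication -> [2. 4. 6.]
np.dot(x, y)       # dot product -> 32.0
np.linalg.norm(x)  # magnitude ||x|| = √14 ≈ 3.742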
3. Matrices
A matrix is a 2D array of numbers:
A = [a₁₁ a₁₂ a₁₃]
    [a₂₁ a₂₂ a₂₃]
    [a₃₁ a₃₂ a₃₃]
Properties:
- Shape: (rows, columns)
- Rank: Number of linearly independent rows/columns
- Determinant: Scalar that describes transformation scaling
Operations:
- Addition: Element-wise (same shape required)
- Multiplication: AB ≠ BA in general (non-commutative)
- Transpose: Flip rows and columns
- Inverse: A⁻¹ such that AA⁻¹ = A⁻¹A = I (exists only for square, non-singular matrices)
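The same properties and operations in NumPy, using a small example matrix:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

A + B                      # element-wise addition (same shape required)
A @ B                      # matrix multiplication; A @ B != B @ A in general
A.T                        # transpose
np.linalg.matrix_rank(A)   # rank -> 2
np.linalg.det(A)           # determinant -> -2.0
np.linalg.inv(A)           # inverse; A @ np.linalg.inv(A) ≈ I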
4. Tensors
Generalization to n-dimensional arrays:
- Scalar: 0D tensor
- Vector: 1D tensor
- Matrix: 2D tensor
- Higher-order tensors: Used in deep learning, e.g. a 4D image batch (batch × height × width × channels)
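A quick NumPy illustration of tensor ranks and shapes (the sizes here are arbitrary):

import numpy as np

scalar = np.array(3.0)               # 0D tensor, shape ()
vector = np.zeros(8)                 # 1D tensor, shape (8,)
matrix = np.zeros((8, 8))            # 2D tensor, shape (8, 8)
batch  = np.zeros((32, 28, 28, 3))   # 4D tensor: batch × height × width × channels

print(scalar.ndim, vector.ndim, matrix.ndim, batch.ndim)  # 0 1 2 4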
Key Operations for ML
Matrix Multiplication
Essential for neural network forward pass:
# Weight matrix × Input vector
y = Wx + b

# Where:
# W: weight matrix (m × n)
# x: input vector (n × 1)
# b: bias vector (m × 1)
# y: output vector (m × 1)
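In NumPy this is a single matrix–vector product; the layer sizes and random values below are placeholders for illustration:

import numpy as np

m, n = 3, 4                   # assumed layer sizes
W = np.random.randn(m, n)     # weight matrix (m × n)
x = np.random.randn(n)        # input vector (n,)
b = np.random.randn(m)        # bias vector (m,)

y = W @ x + b                 # output vector (m,)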
Dot Product
Measures similarity between vectors:
similarity = x · y = ||x|| ||y|| cos(θ)

# Applications:
# - Cosine similarity
# - Attention mechanisms
# - Feature matching
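For example, cosine similarity follows directly by dividing the dot product by the two magnitudes:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
# 1.0 here, because y points in the same direction as x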
Eigendecomposition
For a symmetric matrix A:
A = QΛQ^T
Where:
- Q: Matrix of eigenvectors
- Λ: Diagonal matrix of eigenvalues
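A small check of this factorization with NumPy (np.linalg.eigh is the routine for symmetric matrices; the example matrix is arbitrary):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric matrix

eigvals, Q = np.linalg.eigh(A)       # eigenvalues and orthonormal eigenvectors
Lambda = np.diag(eigvals)

np.allclose(A, Q @ Lambda @ Q.T)     # True: A = QΛQ^T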
Applications:
- PCA (Principal Component Analysis)
- Spectral clustering
- Network analysis
Linear Transformations
Common Transformations
- Scaling: Stretch/shrink along axes
- Rotation: Rotate around origin
- Reflection: Mirror across line
- Shearing: Slant parallel to axis
- Projection: Map to lower dimension
Transformation Matrix Examples
# Scaling by factor of 2
S = [[2, 0],
     [0, 2]]

# Rotation by θ
R = [[cos(θ), -sin(θ)],
     [sin(θ),  cos(θ)]]

# Reflection across x-axis
F = [[1, 0],
     [0, -1]]
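Applying a transformation means multiplying the matrix by a vector; a small sketch with a 90° rotation and a uniform scaling:

import numpy as np

theta = np.pi / 2                              # rotate by 90°
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[2.0, 0.0],
              [0.0, 2.0]])                     # scale by 2

v = np.array([1.0, 0.0])
R @ v                                          # ≈ [0, 1]
S @ v                                          # [2, 0]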
Vector Spaces
Basis Vectors
A set of linearly independent vectors that span the space:
# Standard basis in R²
e₁ = [1, 0]
e₂ = [0, 1]

# Any vector can be expressed as:
v = a·e₁ + b·e₂
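For instance, with the standard basis the coefficients are just the vector's components:

import numpy as np

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

v = 3 * e1 + 4 * e2    # -> [3., 4.]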
Subspaces
Important subspaces in ML:
- Column space: All possible outputs Ax (the range of the transformation)
- Null space: All inputs that map to zero (Ax = 0)
- Row space: The span of the rows; the orthogonal complement of the null space
Norms and Distances
Common Norms
# L1 norm (Manhattan distance)
||x||₁ = Σ|xᵢ|

# L2 norm (Euclidean distance)
||x||₂ = √(Σxᵢ²)

# L∞ norm (Maximum norm)
||x||∞ = max|xᵢ|
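All three are available through np.linalg.norm by changing the order argument:

import numpy as np

x = np.array([3.0, -4.0])

np.linalg.norm(x, 1)       # L1 norm -> 7.0
np.linalg.norm(x)          # L2 norm (default) -> 5.0
np.linalg.norm(x, np.inf)  # L∞ norm -> 4.0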
Applications in ML
- L1 regularization: Promotes sparsity (Lasso)
- L2 regularization: Prevents large weights (Ridge)
- Distance metrics: k-NN, clustering
Matrix Decompositions
Singular Value Decomposition (SVD)
A = UΣV^T
Applications:
- Dimensionality reduction
- Recommender systems
- Image compression
- Natural language processing
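A common use is low-rank approximation: keep only the top k singular values and the corresponding singular vectors. A sketch with an arbitrary random matrix:

import numpy as np

A = np.random.randn(6, 4)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of components to keep
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]    # best rank-k approximation of A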
LU Decomposition
A = LU
Where L is lower triangular, U is upper triangular.
Applications:
- Solving linear systems
- Computing determinants
- Matrix inversion
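SciPy exposes this factorization directly; it returns an extra permutation matrix P because pivoting is used for numerical stability:

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

P, L, U = lu(A)              # A = P @ L @ U
np.allclose(A, P @ L @ U)    # True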
QR Decomposition
A = QR
Where Q is orthogonal, R is upper triangular.
Applications:
- Least squares problems
- Eigenvalue algorithms
- Gram-Schmidt process
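For example, an overdetermined system Ax ≈ b can be solved in the least-squares sense by factoring A and back-substituting through R (random data used here purely for illustration):

import numpy as np

A = np.random.randn(10, 3)        # more equations than unknowns
b = np.random.randn(10)

Q, R = np.linalg.qr(A)            # A = QR, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)   # least-squares solution of Ax ≈ b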
Applications in Machine Learning
1. Neural Networks
# Forward propagation
z¹ = W¹x + b¹
a¹ = σ(z¹)
z² = W²a¹ + b²
y = σ(z²)
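A minimal NumPy version of this two-layer forward pass, with arbitrary layer sizes and sigmoid as the activation σ:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(4)                              # 4 input features (assumed)
W1, b1 = np.random.randn(5, 4), np.random.randn(5)  # hidden layer: 5 units
W2, b2 = np.random.randn(1, 5), np.random.randn(1)  # output layer: 1 unit

z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y  = sigmoid(z2)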
2. Principal Component Analysis (PCA)
- Center the data: Xc = X − μ
- Compute the covariance matrix: C = (1/n) Xc^T Xc
- Find the eigenvectors of C (the principal components)
- Project onto the top k eigenvectors: X_reduced = Xc W_k
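These steps map directly onto NumPy; the data here is random and 5-dimensional purely for illustration:

import numpy as np

X = np.random.randn(100, 5)                       # 100 samples, 5 features
Xc = X - X.mean(axis=0)                           # center the data

C = (Xc.T @ Xc) / len(X)                          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)              # symmetric, so use eigh

k = 2
W_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k eigenvectors
X_reduced = Xc @ W_k                              # shape (100, 2)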
3. Gradient Descent
# Parameter update
θ = θ - α∇J(θ)

# Where:
# ∇J(θ) is the gradient (vector of partial derivatives)
# α is the learning rate (scalar)
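A toy example of this update rule on the quadratic objective J(θ) = ||θ − c||², whose gradient is 2(θ − c):

import numpy as np

c = np.array([3.0, -1.0])     # minimizer of the toy objective
theta = np.zeros(2)
alpha = 0.1                   # learning rate

for _ in range(100):
    grad = 2 * (theta - c)    # ∇J(θ)
    theta = theta - alpha * grad

# theta ≈ c after enough iterations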
4. Attention Mechanisms
# Scaled dot-product attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
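The formula is a handful of matrix products; a minimal NumPy sketch with assumed shapes (3 positions, d_k = 4):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_k = 4
Q = np.random.randn(3, d_k)
K = np.random.randn(3, d_k)
V = np.random.randn(3, d_k)

output = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # shape (3, d_k)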
Computational Considerations
Time Complexity
- Vector addition: O(n)
- Dot product: O(n)
- Matrix multiplication: O(n³) naive, ≈O(n^2.81) with Strassen's algorithm
- Matrix inversion: O(n³)
- SVD: O(min(m²n, mn²))
Numerical Stability
- Condition number: Measure of sensitivity to input changes
- Ill-conditioned matrices: Small changes cause large effects
- Regularization: Add small values to diagonal for stability
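A quick illustration of the last two points with a nearly singular matrix (values chosen arbitrarily):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])     # nearly singular, hence ill-conditioned

np.linalg.cond(A)                 # very large condition number (~4e4)

lam = 1e-3
A_reg = A + lam * np.eye(2)       # add a small value to the diagonal
np.linalg.cond(A_reg)             # much smaller (~2e3)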
Python Implementation
import numpy as np

# Vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

# Dot product
dot = np.dot(v1, v2)  # 32

# Matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = A @ B  # or np.matmul(A, B)

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)

# SVD
U, S, Vt = np.linalg.svd(A)

# Solve linear system Ax = b
b = np.array([1, 2])
x = np.linalg.solve(A, b)
Common Pitfalls
- Broadcasting errors: Shape mismatches in operations
- Singular matrices: No inverse exists
- Numerical precision: Floating-point errors accumulate
- Memory issues: Large matrices exhaust RAM
- Non-conformable dimensions: Invalid multiplication
Summary
Linear algebra provides the mathematical foundation for:
- Data representation and manipulation
- Model operations (forward/backward pass)
- Optimization algorithms
- Dimensionality reduction techniques
- Understanding model behavior
Master these concepts to build intuition about how machine learning algorithms work at a fundamental level.
