What is Linear Algebra?
Linear algebra is the branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces. It's fundamental to machine learning because:
- Data representation: Features are vectors, datasets are matrices
- Transformations: Neural networks perform linear transformations
- Optimization: Gradient descent operates in vector spaces
- Dimensionality reduction: PCA, SVD rely on linear algebra
Core Concepts
1. Scalars
A scalar is a single number. In ML contexts:
- Learning rate (α = 0.01)
- Regularization parameter (λ = 0.1)
- Individual predictions
2. Vectors
A vector is an ordered array of numbers:
# Column vector (most common in ML)
x = [x₁]
    [x₂]
    [x₃]

# Row vector
x = [x₁, x₂, x₃]
Properties:
- Dimension: Number of elements
- Magnitude: ||x|| = √(x₁² + x₂² + ... + xₙ²)
- Direction: Orientation in space
Operations:
- Addition: Element-wise addition
- Scalar multiplication: Multiply each element
- Dot product: x·y = Σ(xᵢ × yᵢ)
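A minimal NumPy sketch of these operations, using small example vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

x + y              # element-wise addition -> [5. 7. 9.]
2 * x              # scalar multiplication -> [2. 4. 6.]
np.dot(x, y)       # dot product -> 32.0
np.linalg.norm(x)  # magnitude ||x|| = √14 ≈ 3.742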
3. Matrices
A matrix is a 2D array of numbers:
A = [a₁₁ a₁₂ a₁₃]
    [a₂₁ a₂₂ a₂₃]
    [a₃₁ a₃₂ a₃₃]
Properties:
- Shape: (rows, columns)
- Rank: Number of linearly independent rows/columns
- Determinant: Scalar that describes transformation scaling
Operations:
- Addition: Element-wise (same shape required)
- Multiplication: AB ≠ BA in general (non-commutative)
- Transpose: Flip rows and columns
- Inverse: A⁻¹ such that AA⁻¹ = A⁻¹A = I (exists only for square, non-singular matrices)
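The same properties and operations in NumPy, using a small example matrix:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

A + B                      # element-wise addition (same shape required)
A @ B                      # matrix multiplication; A @ B != B @ A in general
A.T                        # transpose
np.linalg.matrix_rank(A)   # rank -> 2
np.linalg.det(A)           # determinant -> -2.0
np.linalg.inv(A)           # inverse; A @ np.linalg.inv(A) ≈ I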
4. Tensors
Generalization to n-dimensional arrays:
- Scalar: 0D tensor
- Vector: 1D tensor
- Matrix: 2D tensor
- Higher-order tensors: Used in deep learning, e.g. a 4D image batch (batch × height × width × channels)
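A quick NumPy illustration of tensor ranks and shapes (the sizes here are arbitrary):

import numpy as np

scalar = np.array(3.0)               # 0D tensor, shape ()
vector = np.zeros(8)                 # 1D tensor, shape (8,)
matrix = np.zeros((8, 8))            # 2D tensor, shape (8, 8)
batch  = np.zeros((32, 28, 28, 3))   # 4D tensor: batch × height × width × channels

print(scalar.ndim, vector.ndim, matrix.ndim, batch.ndim)  # 0 1 2 4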
Key Operations for ML
Matrix Multiplication
Essential for neural network forward pass:
# Weight matrix × Input vector
y = Wx + b

# Where:
# W: weight matrix (m × n)
# x: input vector (n × 1)
# b: bias vector (m × 1)
# y: output vector (m × 1)
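In NumPy this is a single matrix–vector product; the layer sizes and random values below are placeholders for illustration:

import numpy as np

m, n = 3, 4                   # assumed layer sizes
W = np.random.randn(m, n)     # weight matrix (m × n)
x = np.random.randn(n)        # input vector (n,)
b = np.random.randn(m)        # bias vector (m,)

y = W @ x + b                 # output vector (m,)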
Dot Product
Measures similarity between vectors:
similarity = x · y = ||x|| ||y|| cos(θ)

# Applications:
# - Cosine similarity
# - Attention mechanisms
# - Feature matching
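For example, cosine similarity follows directly by dividing the dot product by the two magnitudes:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
# 1.0 here, because y points in the same direction as x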
Eigendecomposition
For a symmetric matrix A:
A = QΛQ^T
Where:
- Q: Matrix of eigenvectors
- Λ: Diagonal matrix of eigenvalues
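A small check of this factorization with NumPy (np.linalg.eigh is the routine for symmetric matrices; the example matrix is arbitrary):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # symmetric matrix

eigvals, Q = np.linalg.eigh(A)       # eigenvalues and orthonormal eigenvectors
Lambda = np.diag(eigvals)

np.allclose(A, Q @ Lambda @ Q.T)     # True: A = QΛQ^T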
Applications:
- PCA (Principal Component Analysis)
- Spectral clustering
- Network analysis
Linear Transformations
Common Transformations
- Scaling: Stretch/shrink along axes
- Rotation: Rotate around origin
- Reflection: Mirror across line
- Shearing: Slant parallel to axis
- Projection: Map to lower dimension
Transformation Matrix Examples
# Scaling by factor of 2
S = [[2, 0],
     [0, 2]]

# Rotation by θ
R = [[cos(θ), -sin(θ)],
     [sin(θ),  cos(θ)]]

# Reflection across x-axis
F = [[1, 0],
     [0, -1]]
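Applying a transformation means multiplying the matrix by a vector; a small sketch with a 90° rotation and a uniform scaling:

import numpy as np

theta = np.pi / 2                              # rotate by 90°
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[2.0, 0.0],
              [0.0, 2.0]])                     # scale by 2

v = np.array([1.0, 0.0])
R @ v                                          # ≈ [0, 1]
S @ v                                          # [2, 0]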
Vector Spaces
Basis Vectors
A set of linearly independent vectors that span the space:
# Standard basis in R²
e₁ = [1, 0]
e₂ = [0, 1]

# Any vector can be expressed as:
v = a·e₁ + b·e₂
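For instance, with the standard basis the coefficients are just the vector's components:

import numpy as np

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

v = 3 * e1 + 4 * e2    # -> [3., 4.]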
Subspaces
Important subspaces in ML:
- Column space: All possible outputs Ax (the range of the transformation)
- Null space: All inputs that map to zero (Ax = 0)
- Row space: The span of the rows; the orthogonal complement of the null space
Norms and Distances
Common Norms
# L1 norm (Manhattan distance)
||x||₁ = Σ|xᵢ|

# L2 norm (Euclidean distance)
||x||₂ = √(Σxᵢ²)

# L∞ norm (Maximum norm)
||x||∞ = max|xᵢ|
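All three are available through np.linalg.norm by changing the order argument:

import numpy as np

x = np.array([3.0, -4.0])

np.linalg.norm(x, 1)       # L1 norm -> 7.0
np.linalg.norm(x)          # L2 norm (default) -> 5.0
np.linalg.norm(x, np.inf)  # L∞ norm -> 4.0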
Applications in ML
- L1 regularization: Promotes sparsity (Lasso)
- L2 regularization: Prevents large weights (Ridge)
- Distance metrics: k-NN, clustering
Matrix Decompositions
Singular Value Decomposition (SVD)
A = UΣV^T
Applications:
- Dimensionality reduction
- Recommender systems
- Image compression
- Natural language processing
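A common use is low-rank approximation: keep only the top k singular values and the corresponding singular vectors. A sketch with an arbitrary random matrix:

import numpy as np

A = np.random.randn(6, 4)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of components to keep
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]    # best rank-k approximation of A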
LU Decomposition
A = LU
Where L is lower triangular, U is upper triangular.
Applications:
- Solving linear systems
- Computing determinants
- Matrix inversion
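SciPy exposes this factorization directly; it returns an extra permutation matrix P because pivoting is used for numerical stability:

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

P, L, U = lu(A)              # A = P @ L @ U
np.allclose(A, P @ L @ U)    # True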
QR Decomposition
A = QR
Where Q is orthogonal, R is upper triangular.
Applications:
- Least squares problems
- Eigenvalue algorithms
- Gram-Schmidt process
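For example, an overdetermined system Ax ≈ b can be solved in the least-squares sense by factoring A and back-substituting through R (random data used here purely for illustration):

import numpy as np

A = np.random.randn(10, 3)        # more equations than unknowns
b = np.random.randn(10)

Q, R = np.linalg.qr(A)            # A = QR, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)   # least-squares solution of Ax ≈ b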
Applications in Machine Learning
1. Neural Networks
# Forward propagation
z¹ = W¹x + b¹
a¹ = σ(z¹)
z² = W²a¹ + b²
y = σ(z²)
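A minimal NumPy version of this two-layer forward pass, with arbitrary layer sizes and sigmoid as the activation σ:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(4)                              # 4 input features (assumed)
W1, b1 = np.random.randn(5, 4), np.random.randn(5)  # hidden layer: 5 units
W2, b2 = np.random.randn(1, 5), np.random.randn(1)  # output layer: 1 unit

z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y  = sigmoid(z2)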
2. Principal Component Analysis (PCA)
- Center the data: Xc = X − μ
- Compute the covariance matrix: C = (1/n) Xc^T Xc
- Find the eigenvectors of C (the principal components)
- Project onto the top k eigenvectors: X_reduced = Xc W_k
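These steps map directly onto NumPy; the data here is random and 5-dimensional purely for illustration:

import numpy as np

X = np.random.randn(100, 5)                       # 100 samples, 5 features
Xc = X - X.mean(axis=0)                           # center the data

C = (Xc.T @ Xc) / len(X)                          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)              # symmetric, so use eigh

k = 2
W_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k eigenvectors
X_reduced = Xc @ W_k                              # shape (100, 2)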
3. Gradient Descent
# Parameter update
θ = θ - α∇J(θ)

# Where:
# ∇J(θ) is the gradient (vector of partial derivatives)
# α is the learning rate (scalar)
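A toy example of this update rule on the quadratic objective J(θ) = ||θ − c||², whose gradient is 2(θ − c):

import numpy as np

c = np.array([3.0, -1.0])     # minimizer of the toy objective
theta = np.zeros(2)
alpha = 0.1                   # learning rate

for _ in range(100):
    grad = 2 * (theta - c)    # ∇J(θ)
    theta = theta - alpha * grad

# theta ≈ c after enough iterations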
4. Attention Mechanisms
# Scaled dot-product attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
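The formula is a handful of matrix products; a minimal NumPy sketch with assumed shapes (3 positions, d_k = 4):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_k = 4
Q = np.random.randn(3, d_k)
K = np.random.randn(3, d_k)
V = np.random.randn(3, d_k)

output = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # shape (3, d_k)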
Computational Considerations
Time Complexity
- Vector addition: O(n)
- Dot product: O(n)
- Matrix multiplication: O(n³) naive, ≈O(n^2.81) with Strassen's algorithm
- Matrix inversion: O(n³)
- SVD: O(min(m²n, mn²))
Numerical Stability
- Condition number: Measure of sensitivity to input changes
- Ill-conditioned matrices: Small changes cause large effects
- Regularization: Add small values to diagonal for stability
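A quick illustration of the last two points with a nearly singular matrix (values chosen arbitrarily):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])     # nearly singular, hence ill-conditioned

np.linalg.cond(A)                 # very large condition number (~4e4)

lam = 1e-3
A_reg = A + lam * np.eye(2)       # add a small value to the diagonal
np.linalg.cond(A_reg)             # much smaller (~2e3)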
Python Implementation
import numpy as np

# Vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

# Dot product
dot = np.dot(v1, v2)  # 32

# Matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = A @ B  # or np.matmul(A, B)

# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(A)

# SVD
U, S, Vt = np.linalg.svd(A)

# Solve linear system Ax = b
b = np.array([1, 2])
x = np.linalg.solve(A, b)
Common Pitfalls
- Broadcasting errors: Shape mismatches in operations
- Singular matrices: No inverse exists
- Numerical precision: Floating-point errors accumulate
- Memory issues: Large matrices exhaust RAM
- Non-conformable dimensions: Invalid multiplication
Summary
Linear algebra provides the mathematical foundation for:
- Data representation and manipulation
- Model operations (forward/backward pass)
- Optimization algorithms
- Dimensionality reduction techniques
- Understanding model behavior
Master these concepts to build intuition about how machine learning algorithms work at a fundamental level.
