Python Green Threads vs OS Threads

What are Threads?

Threads are the smallest unit of execution that can be scheduled by an operating system. They allow programs to perform multiple tasks concurrently, sharing the same memory space within a process. However, not all threads are created equal!

OS Threads vs Green Threads

[Figure: Thread Models Comparison. Left panel, OS Threads (Preemptive): the kernel schedules threads across multiple CPU cores (four in the diagram), so true parallel execution is possible. Right panel, Green Threads (Cooperative): the runtime schedules green threads on a single OS thread, so execution is concurrent but not parallel.]

Key Insights

OS Threads:

• True parallelism on multiple cores
• Heavy memory footprint (MB per thread)
• Expensive context switches (kernel mode)
• Best for CPU-bound tasks (in CPython this is limited by the GIL, discussed below)

Green Threads:

• Concurrent but not parallel
• Lightweight (KB per thread)
• Fast context switches (user space only)
• Best for I/O-bound tasks

OS Threads (Native/Kernel Threads)

OS threads, also known as native threads or kernel threads, are managed directly by the operating system’s kernel. Each OS thread corresponds to a kernel-level thread that the OS scheduler manages.

Characteristics of OS Threads

Kernel Management: Created and scheduled by the OS kernel
True Parallelism: Can run simultaneously on multiple CPU cores
Preemptive Scheduling: OS can interrupt and switch threads at any time
Higher Overhead: Context switching involves kernel transitions
System Resources: Each thread consumes kernel resources (stack, registers, etc.)

OS Thread Implementation

import threading
import time

def cpu_intensive_task(n):
    """Simulate CPU-intensive work"""
    total = 0
    for i in range(n * 1000000):
        total += i
    return total

# Create OS threads in Python
threads = []
for i in range(4):
    thread = threading.Thread(target=cpu_intensive_task, args=(10,))
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

Advantages of OS Threads

True Parallelism: Can utilize multiple CPU cores effectively
System Integration: Full access to OS services and system calls
Blocking I/O Handling: One thread blocking doesn’t affect others
Language Agnostic: Supported by the OS, not language-specific

Disadvantages of OS Threads

Resource Intensive: Each thread requires significant memory (typically 1–8 MB for stack)
Context Switch Overhead: Kernel-mode transitions are expensive (~1–10 microseconds)
Limited Scalability: Creating thousands of threads can exhaust system resources
Synchronization Complexity: Requires careful handling of locks and shared state
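The synchronization cost in the last point can be made concrete. The following is a minimal sketch, not taken from any particular codebase: several OS threads increment a shared counter, and a threading.Lock keeps each read-modify-write step atomic. The names run_workers and increment are illustrative.

```python
import threading

def run_workers(num_threads=4, increments=10_000):
    """Increment a shared counter from several OS threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def increment():
        nonlocal counter
        for _ in range(increments):
            # The OS may preempt a thread at any point, so the
            # read-modify-write below is guarded by the lock.
            with lock:
                counter += 1

    threads = [threading.Thread(target=increment) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

if __name__ == "__main__":
    print(run_workers())
```

With the lock in place the final count equals num_threads * increments. Dropping the lock makes the result depend on how the scheduler interleaves the threads.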

Green Threads (User-Space Threads)

Green threads are threads that are scheduled by a runtime library or virtual machine instead of the operating system. They run entirely in user space and are invisible to the kernel.

Characteristics of Green Threads

User-Space Management: Scheduled by the language runtime or library
Cooperative or Preemptive: Depends on implementation
Lightweight: Minimal memory overhead (typically KB instead of MB)
No True Parallelism: In the basic model, all green threads run on a single OS thread (hybrid M:N runtimes such as Go's are the exception)
Fast Context Switching: No kernel transitions required
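To make user-space scheduling less abstract, here is a toy round-robin scheduler built on plain generators. It is a sketch of the cooperative idea only, not how asyncio or gevent are implemented; worker and run_round_robin are hypothetical names.

```python
from collections import deque

def worker(name, steps, log):
    """A toy green thread: does one unit of work, then yields control."""
    for step in range(steps):
        log.append(f"{name}:{step}")
        yield  # Cooperative yield point: hand control back to the scheduler

def run_round_robin(green_threads):
    """Run generator-based green threads until all of them finish."""
    ready = deque(green_threads)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # Resume until the next yield
            ready.append(current)  # Still alive: back of the queue
        except StopIteration:
            pass                   # Finished: drop it

if __name__ == "__main__":
    log = []
    run_round_robin([worker("A", 2, log), worker("B", 3, log)])
    print(log)  # ['A:0', 'B:0', 'A:1', 'B:1', 'B:2']
```

Everything here happens inside one OS thread with no kernel involvement, which is why switching is cheap. It is also why a worker that never yields would stall every other worker.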

Green Thread Implementations

Python’s asyncio (Coroutines)

import asyncio

async def io_task(name, duration):
    """Simulate I/O-bound work"""
    print(f"Task {name} starting")
    await asyncio.sleep(duration)  # Cooperative yield point
    print(f"Task {name} completed")
    return f"Result from {name}"

async def main():
    # Create multiple coroutines (green threads)
    tasks = [
        io_task("A", 2),
        io_task("B", 1),
        io_task("C", 3)
    ]

    # Run concurrently on a single OS thread
    results = await asyncio.gather(*tasks)
    print(f"Results: {results}")

# Event loop manages green thread scheduling
asyncio.run(main())

Gevent (Green Thread Library)

import gevent
from gevent import monkey
monkey.patch_all()  # Patch standard library for green thread support

def fetch_url(url):
    """Simulate network request"""
    print(f"Fetching {url}")
    gevent.sleep(1)  # Yields control to other green threads
    return f"Content from {url}"

# Create green threads
greenlets = [
    gevent.spawn(fetch_url, f"http://example.com/{i}")
    for i in range(1000)  # Can create thousands easily!
]

# Wait for all to complete
gevent.joinall(greenlets)

Modern Async with Error Handling

import asyncio
import aiohttp

async def fetch_data(session, url):
    try:
        async with asyncio.timeout(10):  # asyncio.timeout() requires Python 3.11+
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
    except (TimeoutError, aiohttp.ClientError):
        return None  # or log and retry

async def main():
    async with aiohttp.ClientSession() as session:
        # Failed requests come back as None instead of raising
        return await asyncio.gather(*[
            fetch_data(session, f"http://api.example.com/{i}")
            for i in range(100)
        ])

results = asyncio.run(main())

Advantages of Green Threads

Lightweight: Very low memory overhead per thread
Fast Context Switching: No kernel involvement (~0.1–1 microseconds)
High Concurrency: Can create millions of green threads
Simplified Synchronization: No true parallelism means fewer race conditions
Better for I/O: Excellent for I/O-bound workloads
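"Fewer race conditions" does not mean none. Code between two await points runs without interruption, but state read before an await can be stale afterwards. The sketch below (function names are illustrative) shows updates being lost across a yield point, and asyncio.Lock preventing it.

```python
import asyncio

async def unsafe_increment(state):
    current = state["counter"]
    await asyncio.sleep(0)          # Yield point between read and write
    state["counter"] = current + 1  # May overwrite another task's update

async def safe_increment(state, lock):
    async with lock:                # Serializes the whole read-await-write
        current = state["counter"]
        await asyncio.sleep(0)
        state["counter"] = current + 1

async def run_demo(n=10):
    unsafe = {"counter": 0}
    await asyncio.gather(*(unsafe_increment(unsafe) for _ in range(n)))

    safe = {"counter": 0}
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(safe, lock) for _ in range(n)))
    return unsafe["counter"], safe["counter"]

if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The unsafe counter ends up below n because several tasks read the same value before any of them writes it back. The locked version reaches n.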

Disadvantages of Green Threads

No True Parallelism: Cannot utilize multiple CPU cores
Blocking Issues: A blocking system call can freeze all green threads
CPU-Bound Limitations: Poor performance for CPU-intensive tasks
Runtime Dependency: Requires specific runtime support
Debugging Complexity: Stack traces can be confusing

How the Event Loop Works

The asyncio event loop is a single-threaded scheduler for coroutine progress. It runs ready callbacks, lets each coroutine run until its next await, parks waiting work behind Futures, then asks OS primitives such as epoll or kqueue which file descriptors are ready. When I/O completes, asyncio resolves Futures and schedules task wakeups back onto the ready queue. There is no parallel Python execution in the loop itself: one coroutine makes progress at a time.

[Interactive figure: asyncio Event Loop, a seven-step walkthrough of one loop iteration through the stages ready, run, await, selector, ready. Step 1, "Ready callbacks are selected": the ready queue holds wakeups for Task A and Task C, no coroutine is running yet, Future B is pending on a socket read, and the selector is watching fd 18 for that read.]

ntodo = len(self._ready)
for i in range(ntodo):
    handle = self._ready.popleft()
    handle._run()

The loop takes a snapshot of the ready queue for this tick. Callbacks scheduled while that snapshot runs wait for the next tick, so one callback cannot keep extending the current pass forever.
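The queueing behaviour can be observed from user code. This is a small sketch using loop.call_soon; the callback names a, b, and c are illustrative. Callback a schedules c while it runs, yet c still lands behind b, which was already waiting in the ready queue.

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    order = []
    done = asyncio.Event()

    def a():
        order.append("A")
        loop.call_soon(c)  # Scheduled during the current pass

    def b():
        order.append("B")

    def c():
        order.append("C")
        done.set()

    loop.call_soon(a)
    loop.call_soon(b)
    await done.wait()  # Suspend main() so the loop can run the callbacks
    return order

if __name__ == "__main__":
    print(asyncio.run(main()))  # ['A', 'B', 'C']
```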

The #1 asyncio Mistake: Blocking the Loop

If you call a blocking function inside a coroutine, the entire event loop freezes. No other coroutines can run until the blocking call completes.

import asyncio
import time
import requests

# BAD: blocks the entire event loop while each call runs
# (db and sql stand in for a blocking database client and its query)
async def bad_handler(url, db, sql):
    time.sleep(5)           # WRONG: use await asyncio.sleep(5)
    requests.get(url)       # WRONG: use aiohttp
    data = db.query(sql)    # WRONG: use asyncpg or aiomysql

# GOOD: run blocking code in a thread pool
async def good_handler(url):
    loop = asyncio.get_running_loop()
    # Offload blocking call to the default thread pool
    return await loop.run_in_executor(None, requests.get, url)
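One way to check that offloading keeps the loop responsive is to run a heartbeat coroutine alongside the blocking call. The sketch below assumes Python 3.9+ for asyncio.to_thread, a convenience wrapper that runs a function in the default thread pool. blocking_io is a stand-in for a real blocking library call.

```python
import asyncio
import time

def blocking_io(seconds):
    """Stand-in for a blocking library call such as requests.get()."""
    time.sleep(seconds)
    return "done"

async def heartbeat(ticks, interval=0.02):
    """Record a tick each time the event loop gets to run this coroutine."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)

async def run_demo():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    # The blocking call runs in a worker thread; the loop stays responsive
    result = await asyncio.to_thread(blocking_io, 0.2)
    beat.cancel()
    return result, len(ticks)

if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

If blocking_io were called directly inside run_demo, the heartbeat would record no ticks until the call returned.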

Python’s GIL: The Plot Twist

Python’s Global Interpreter Lock (GIL) means that, on standard CPython builds, only one OS thread executes Python bytecode at a time. For CPU-bound work, adding threads therefore brings no parallel speedup and often makes performance worse, because threads contend for the GIL and add context-switch overhead.

For I/O-bound work, threads still help because the GIL is released during system calls (read, write, recv, send). While one thread waits for a network response, another can run Python code.
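A quick way to see this is to time a blocking wait run sequentially and then in a thread pool. In this sketch time.sleep stands in for a network call (it also releases the GIL while waiting), and the helper names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a blocking network call; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay

def timed(fn):
    """Return how long fn() takes, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def sequential(n=4, delay=0.1):
    for _ in range(n):
        fake_io(delay)

def threaded(n=4, delay=0.1):
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(fake_io, [delay] * n))

if __name__ == "__main__":
    print(f"sequential: {timed(sequential):.2f}s")
    print(f"threaded:   {timed(threaded):.2f}s")
```

The threaded run should take roughly one delay rather than n of them, because the waits overlap.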

GIL in Action: Threads Slower Than Sequential

import threading
import time

# Even with multiple OS threads, the GIL prevents true parallelism
def count(n):
    while n > 0:
        n -= 1

# These threads won't run in parallel due to GIL
t1 = threading.Thread(target=count, args=(100000000,))
t2 = threading.Thread(target=count, args=(100000000,))

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
print(f"Time with threads: {time.time() - start}")

# Often slower than sequential due to GIL contention!
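The snippet above times only the threaded run. To compare it against a sequential baseline on your own machine, a sketch along these lines may help. The function names are illustrative, n is reduced so it finishes quickly, and the numbers depend on the interpreter build and hardware.

```python
import threading
import time

def count(n):
    while n > 0:
        n -= 1

def time_sequential(n, workers=2):
    """Run count(n) `workers` times back to back."""
    start = time.perf_counter()
    for _ in range(workers):
        count(n)
    return time.perf_counter() - start

def time_threaded(n, workers=2):
    """Run count(n) in `workers` OS threads at once."""
    threads = [threading.Thread(target=count, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 5_000_000
    print(f"sequential: {time_sequential(n):.2f}s")
    print(f"threaded:   {time_threaded(n):.2f}s")
```

On a GIL-enabled build the two timings are usually close, with the threaded run often somewhat slower.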

Python 3.13: Free Threading (PEP 703)

Python 3.13 introduces an experimental free-threaded build (python3.13t) in which the GIL can be disabled. With the GIL off, Python threads can execute bytecode in true parallel on multiple cores for the first time in CPython.

import sys
print(sys._is_gil_enabled())  # False on python3.13t

# CPU-bound threads can now run in parallel
# 4 threads on 4 cores → up to ~4x speedup in the best case
# (single-threaded code may run somewhat slower on this build)

What Changes

CPU-bound threading: True parallelism. Rerunning the counting example above on python3.13t shows the difference.
Reference counting: Replaced with biased reference counting + deferred RC for thread safety.
C extensions: Must be updated to be thread-safe. Support varies by package and version, so check each dependency before relying on it.

What Doesn’t Change

asyncio is still better for I/O: Thousands of coroutines are still lighter than thousands of threads.
multiprocessing still works: It remains the way to get CPU parallelism on standard (GIL-enabled) builds, including Python < 3.13.
The ecosystem needs time: Free-threading is experimental in 3.13 and opt-in.
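Because free-threading is opt-in, code that cares about it may want to check at runtime. The sketch below shows one possible check: gil_status is a hypothetical helper, while sysconfig.get_config_var("Py_GIL_DISABLED") and sys._is_gil_enabled() are the interpreter-provided pieces (the latter exists only on 3.13+).

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled_now) for this interpreter."""
    # Py_GIL_DISABLED is 1 on free-threaded builds; it is missing or 0
    # on standard builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in 3.13; older versions always have a GIL
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return free_threaded_build, gil_enabled

if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL enabled: {enabled}")
```

The two values can differ: a free-threaded build can still run with the GIL switched back on.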

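# Install free-threaded Python (options vary by platform and may change)
# On macOS/Windows: the python.org 3.13 installers offer the free-threaded build as an optional component
# On Ubuntu: sudo apt install python3.13-nogil  (assumes the deadsnakes PPA is enabled)

# Run with free threading
python3.13t your_script.py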

Key Differences

Aspect	OS Threads	Green Threads
Management	Kernel/OS	User-space runtime
Memory per Thread	1–8 MB	1–64 KB
Context Switch Time	1–10 μs	0.1–1 μs
True Parallelism	Yes	No
Number of Threads	100s–1000s	10,000s–1,000,000s
CPU Cores Utilized	Multiple	Single
Blocking System Calls	Thread-local	Global impact
Scheduling	Preemptive	Cooperative/Preemptive
Best For	CPU-bound tasks	I/O-bound tasks

Hybrid Approaches

Some systems combine both models for optimal performance.

M:N Threading (Erlang/Go Model)

// Go example: goroutines are green threads that the runtime maps onto a pool of OS threads
package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    // Create thousands of goroutines (green threads)
    for i := 0; i < 10000; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            fmt.Printf("Goroutine %d\n", id)
        }(i)
    }
    wg.Wait() // Without this, main could exit before the goroutines run
}

Python multiprocessing + asyncio

import multiprocessing
import asyncio

async def async_worker(data):
    """Green thread worker for I/O"""
    await asyncio.sleep(0.1)
    return data * 2

def process_worker(chunk):
    """OS process for CPU work"""
    # Run event loop in each process
    async def process_chunk():
        tasks = [async_worker(item) for item in chunk]
        return await asyncio.gather(*tasks)

    return asyncio.run(process_chunk())

# Combine multiprocessing (true parallelism) with asyncio (green threads)
if __name__ == "__main__":
    data = range(1000)
    chunks = [data[i:i+100] for i in range(0, len(data), 100)]

    with multiprocessing.Pool() as pool:
        results = pool.map(process_worker, chunks)

Choosing the Right Model

Decision Guide

Is your bottleneck CPU or I/O?
- CPU → multiprocessing (or Python 3.13t threads)
- I/O → asyncio (or threads for blocking libraries)
Do you need >1000 concurrent connections?
- Yes → asyncio (threads become expensive in memory and scheduling at 10K+)
- No → threading is simpler and fine
Are you using blocking libraries (requests, psycopg2)?
- Yes → threading or loop.run_in_executor()
- No (aiohttp, asyncpg) → asyncio native
Do you need both CPU parallelism and high I/O concurrency?
- Yes → multiprocessing + asyncio in each process
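The guide can also be written down as a small function, which is handy for documenting a team convention. This is only one reading of the questions above: choose_model and its parameters are hypothetical, and the 1000-connection threshold is taken directly from the guide.

```python
def choose_model(cpu_bound, connections=1, blocking_libs=False, free_threaded=False):
    """Hypothetical helper that mirrors the decision guide above."""
    if cpu_bound and connections > 1000:
        # Needs both CPU parallelism and high I/O concurrency
        return "multiprocessing + asyncio in each process"
    if cpu_bound:
        return "threads (free-threaded build)" if free_threaded else "multiprocessing"
    if blocking_libs:
        return "threading or loop.run_in_executor()"
    if connections > 1000:
        return "asyncio"
    return "threading"

if __name__ == "__main__":
    print(choose_model(cpu_bound=False, connections=20_000))  # asyncio
    print(choose_model(cpu_bound=True))                       # multiprocessing
```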

Production Patterns

Thread Pool Sizing

import os

# For I/O-bound thread pools
max_workers = min(32, (os.cpu_count() or 1) + 4)

# For CPU-bound (use multiprocessing instead, but if you must)
max_workers = os.cpu_count() or 1
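As a usage sketch, the I/O-bound sizing rule plugs straight into ThreadPoolExecutor. Here io_bound_job is a hypothetical stand-in for a blocking call, and run_jobs is an illustrative wrapper.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_job(item):
    """Hypothetical stand-in for a blocking I/O call."""
    time.sleep(0.01)
    return item * 2

def run_jobs(items):
    # Same sizing rule as above for I/O-bound pools
    max_workers = min(32, (os.cpu_count() or 1) + 4)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though jobs overlap in time
        return list(pool.map(io_bound_job, items))

if __name__ == "__main__":
    print(run_jobs(range(5)))  # [0, 2, 4, 6, 8]
```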

Asyncio Backpressure with Semaphores

import asyncio
import aiohttp

async def limited_fetch(sem, session, url):
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def fetch_all(urls, limit=50):
    # Limit concurrent requests to prevent overwhelming the server
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        # Even with 10,000 URLs, only `limit` requests are in flight at once
        tasks = [limited_fetch(sem, session, url) for url in urls]
        return await asyncio.gather(*tasks)

Structured Concurrency (Python 3.11+)

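import asyncio

# fetch_users, fetch_orders, and fetch_inventory are your own coroutines
async def main():
    async with asyncio.TaskGroup() as tg:
        task1 = tg.create_task(fetch_users())
        task2 = tg.create_task(fetch_orders())
        task3 = tg.create_task(fetch_inventory())
    # Reaching this line means every task finished. If any task fails, the
    # rest are cancelled and the block raises an ExceptionGroup instead.
    users, orders, inventory = task1.result(), task2.result(), task3.result()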

Graceful Shutdown

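import asyncio
import signal

async def shutdown(loop, sig=None):
    # Cancel every task except this one, then wait for them to unwind
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()

loop = asyncio.new_event_loop()
# add_signal_handler is available on Unix event loops only
for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, lambda s=sig: loop.create_task(shutdown(loop, s)))

# Schedule application tasks with loop.create_task(...) here, then:
loop.run_forever()
loop.close()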

Performance Comparison

Context Switch Overhead

# Measuring context switch time
import threading
import asyncio
import time

# OS Thread context switch
def thread_switch_test():
    event1 = threading.Event()
    event2 = threading.Event()
    switches = 100000

    def thread1():
        for _ in range(switches):
            event2.set()
            event1.wait()
            event1.clear()

    def thread2():
        for _ in range(switches):
            event2.wait()
            event2.clear()
            event1.set()

    t1 = threading.Thread(target=thread1)
    t2 = threading.Thread(target=thread2)

    start = time.time()
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    total_time = time.time() - start
    return total_time / (switches * 2)

# Green thread (coroutine) context switch
async def coro_switch_test():
    switches = 100000
    counter = 0

    async def coro1():
        nonlocal counter
        for _ in range(switches):
            counter += 1
            await asyncio.sleep(0)

    async def coro2():
        nonlocal counter
        for _ in range(switches):
            counter += 1
            await asyncio.sleep(0)

    start = time.time()
    await asyncio.gather(coro1(), coro2())
    total_time = time.time() - start

    return total_time / (switches * 2)

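if __name__ == "__main__":
    print(f"OS thread hand-off: {thread_switch_test() * 1e6:.2f} µs")
    print(f"Coroutine switch:   {asyncio.run(coro_switch_test()) * 1e6:.2f} µs")

# Results vary by machine and Python version. Both loops also include
# Event and event-loop overhead, so treat the numbers as relative rather
# than as raw context-switch costs. The thread hand-off is usually the
# slower of the two.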

Real-World Examples

Web Servers

Traditional (OS Threads) — Apache:

One thread per connection
Limited to ~10,000 concurrent connections
High memory usage

Modern (Green Threads) — Node.js/Python asyncio:

Single-threaded event loop
Can handle 100,000+ concurrent connections
Low memory footprint

Database Connection Pools

OS Threads:

from concurrent.futures import ThreadPoolExecutor
import psycopg2

def query_database(query):
    conn = psycopg2.connect("postgresql://...")
    try:
        with conn.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()
    finally:
        conn.close()  # Close the connection even if the query raises

# Limited by thread overhead
with ThreadPoolExecutor(max_workers=100) as executor:
    futures = [executor.submit(query_database, f"SELECT * FROM table_{i}")
               for i in range(100)]
    results = [future.result() for future in futures]

Green Threads:

import asyncio
import asyncpg

async def query_database(pool, query):
    async with pool.acquire() as conn:
        return await conn.fetch(query)

async def main():
    # Can handle thousands of concurrent queries
    pool = await asyncpg.create_pool("postgresql://...")

    tasks = [query_database(pool, f"SELECT * FROM table_{i}")
             for i in range(10000)]

    results = await asyncio.gather(*tasks)
    await pool.close()
    return results

Key Takeaways

Green threads for I/O, OS threads/processes for CPU: asyncio handles 100K+ concurrent connections on one core; multiprocessing uses all cores for computation.
The GIL makes CPU-bound threading no faster than sequential, and often slower: threads contend for the lock, adding overhead with no parallelism gain. Use multiprocessing or a free-threaded build (python3.13t).
Never block the event loop: one time.sleep() or requests.get() inside a coroutine freezes everything. Use run_in_executor() for blocking calls.
Python 3.13 opens the door to parallel CPU threading: the experimental free-threaded build can run without the GIL, enabling true thread parallelism for the first time. The ecosystem is still catching up.
Production asyncio needs backpressure: use asyncio.Semaphore to limit concurrency, TaskGroup for structured concurrency, and proper signal handlers for graceful shutdown.