Skip to main content

Python Garbage Collection

Summary
Understand CPython garbage collection: reference counting, generational GC for circular references, weak references, and gc module tuning strategies.

What garbage collection means in CPython

Garbage collection answers one question: can the running program still reach this object? If yes, it stays alive; if no, the runtime can reclaim it.

CPython uses two mechanisms together. Reference counting is the fast path — most objects vanish the instant their last reference disappears. The cyclic collector is the backup — it finds groups of objects that point at each other but are no longer reachable from the program.

The explorer below is real: every refcount is a measured sys.getrefcount() value and the collection numbers are real gc.collect() results. Step through it.

CPython garbage collection — real refcounts

Every number below is measured CPython — real sys.getrefcount() values and real gc.collect() results. Each act steps on its own; scroll the story from reference counting down to the generational collector.

1

Reference counting

Every object carries a count of how many references point at it. Bind a new name and it ticks up; drop one and it ticks down. At zero, CPython frees it on the spot.

rootsdatalist[10, 20, 30]1
refcount(data)
1
data = [10, 20, 30]
create data = [10, 20, 30]

data now has 1 reference. Each binding that points at the list is one count.

Step 1 / 6
2

The reference cycle

Reference counting has one blind spot: objects that reference each other keep each other's count above zero, even when nothing else can reach them.

rootsabNode A1Node B1
refcounts
A=1 B=1
a = Node('A')
b = Node('B')
a = Node('A'); b = Node('B')

Two fresh objects, each with a single reference — its name.

Step 1 / 4
3

The cyclic collector

A separate, generational collector catches exactly these unreachable cycles using trial deletion — temporarily subtracting internal references to see which objects are truly reachable.

rootsNode A1Node B1
refcounts
A=1 B=1
# A <-> B, no external refs
the leaked cycle

Here is the leak from before — a cycle with no path from any root. Reference counting is stuck.

Step 1 / 3
4

weakref breaks the cycle

A weak reference points at an object without raising its refcount — so a weak back-edge never forms a counted cycle, and reference counting alone can reclaim the object.

rootsparentchildNodeparent1Nodechild1
refcounts
P=1 C=1
parent = Node('parent')
child = Node('child')
parent = Node('parent'); child = Node('child')

Two objects, each with a single reference — its name.

Step 1 / 4
5

Generations

The cyclic collector is generational. New objects start in generation 0 and are swept often; survivors are promoted to older generations that are scanned far less. Real default thresholds: (700, 10, 10).

gen 0thr 7000↓ promotegen 1thr 100↓ promotegen 2thr 100
thresholds
700, 10, 10
three generations

Every tracked container object is born into generation 0. A generation is collected once its counter crosses its threshold.

Step 1 / 5

Reference counting

Every object carries a count of the references pointing at it. Bind a new name, pass it into a list, store it on an attribute — the count ticks up. Drop a name with del or rebinding, and it ticks down. When the count hits zero, CPython frees the object on the spot, no collector involved. This handles the overwhelming majority of objects, and it is why CPython memory use is so predictable.

The count lives in the object header (ob_refcnt), and the interpreter adjusts it with the Py_INCREF / Py_DECREF macros — among the most frequently executed operations in all of CPython:

#define Py_INCREF(op) ((op)->ob_refcnt++) #define Py_DECREF(op) do { if (--(op)->ob_refcnt == 0) _Py_Dealloc(op); } while (0)

Because every assignment touches a counter, those updates must be atomic — which is one of the reasons the GIL exists: a single global lock is cheaper than making every refcount operation thread-safe on its own. The cost of all this bookkeeping is real overhead in tight loops, and it has one blind spot.

The cycle problem

Reference counting cannot reclaim a reference cycle. If a.link = b and b.link = a, each object still has a reference — from the other — even after every external name is gone. The counts never reach zero, so the objects leak. Self-referential data structures (doubly linked lists, parent/child trees, objects holding a callback that closes over them) all create this shape.

The generational cyclic collector

For exactly these cases, CPython runs a second collector over the container objects it tracks, organized into three generations:

  • New objects start in generation 0, which is scanned most often. Survivors are promoted to generation 1, then generation 2 — the older a cycle gets, the less often it is checked.
  • The default thresholds are gc.get_threshold()(700, 10, 10): gen 0 is collected after 700 net allocations, and the older generations after every 10 collections of the younger one.
  • Detection uses trial deletion: temporarily subtract references that live inside the candidate set. Any object whose count then drops to zero was only kept alive by the cycle — it is unreachable, and gets reclaimed.

The gc module

import gc gc.collect() # force a full collection; returns objects reclaimed gc.get_threshold() # (700, 10, 10) by default gc.get_count() # per-generation allocations since the last collection gc.disable() # turn the cyclic collector off (refcounting stays on) gc.freeze() # move surviving objects out of GC's view (e.g. post-import)

Disabling the cyclic collector is a legitimate latency tactic for short-lived, cycle-free workloads — refcounting still reclaims everything acyclic. gc.freeze() after startup keeps long-lived objects from being rescanned forever.

Inspecting the collector

gc.get_stats() returns per-generation totals — collections run, objects collected, and uncollectable objects. After a workload that churned through cyclic garbage, a real run looked like:

>>> gc.get_stats() [{'collections': 41, 'collected': 17160, 'uncollectable': 0}, # gen 0 {'collections': 2, 'collected': 800, 'uncollectable': 0}, # gen 1 {'collections': 1, 'collected': 0, 'uncollectable': 0}] # gen 2

Generation 0 ran 41 times and reclaimed the bulk; the older generations ran rarely and collected little — exactly the generational payoff. A climbing uncollectable count is your signal to hunt down finalizer cycles.

Weak references

A weakref lets you reference an object without raising its refcount, so it does not keep the object alive — the clean way to observe or cache without creating a cycle:

import weakref cache = weakref.WeakValueDictionary() # entries vanish when values die cache['key'] = obj # no strong reference held parent.child = child child.parent = weakref.ref(parent) # break the would-be cycle

Pitfalls

  • Don't rely on __del__ for timely cleanup. An object in a cycle is finalized only when the collector runs, and the order finalizers fire within a cycle is undefined — since Python 3.4 they do run (PEP 442), but never depend on the sequence. Use a context manager (with) for resources that must close deterministically.
  • Mutable default arguments persist. def f(items=[]) binds one list at definition time; it lives as long as the function does and accumulates across calls. Use None and create inside.

If you found this explanation helpful, consider sharing it with others.

Mastodon