Amortized Performance and Python Sets

Posted on December 11, 2016 by Brian Jaress
Tags: code, advice

There’s an old discussion on Stack Overflow on the asymptotic performance of Python sets. The top comments all point out that Python sets are based on hash table algorithms, and – according to Wikipedia – hash tables are \(O(1)\) in the average case and \(O(n)\) in the worst case. Then they point out it’s a bad idea to assume you’ll get the average case.

But there’s another side to the discussion. Farther down in the Stack Overflow comments, someone mentions amortized analysis and resizing. Someone else links to the code in Python. That code cites [4], which cites [2] and [1] for \(O(1)\) amortized time “regardless of the keys being inserted.” The same Wikipedia article that gives \(O(n)\) worst case at the top also cites [4] and others for “constant average cost per operation.”

So why the confusion, and what’s really going on? Amortized analysis isn’t as well known as worst case or average case, and that might be causing some confusion.

(There are also different ways of analyzing the possibility of problems with the hash function, but I don’t think that’s where most the confusion is coming from.)

Amortized Analysis

Amortized analysis is different from average-case analysis, and it’s usually the one that matters. Average-case analysis is an average over different possibilities, but amortized analysis is an average over a worst-case sequence of actions. That opens up the possibility of expensive actions being “paid for” by cheaper actions that precede them, even in the worst case where those expensive actions happen as often as possible. [3] [5]

You can’t prove anything about math using benchmarks, but I think some code and measurement would help make things clearer. I wrote some code to time insertions and deletions on a Python set and graph the results. There’s a part related to amortization:

def amortize(records):
    results = []
    cumulative_total = 0
    for count, duration in records:
        cumulative_total += duration
        results.append((count, cumulative_total / count))
    return results

Without amortization, each data point is the size of the table vs the time it took to insert the last batch of items, bringing it to that size. With amortization, it’s the total time to insert up to that many items starting from none, divided by the number of items.

For many applications, that’s what counts. If you’re using a Python set to record IDs of items that have already been processed, you’re eventually putting every ID in there. Amortized constant insert time means your algorithm is potentially linear (depending on what you actually do in your processing). Taking the worst-case time for one insert and multiplying it by \(n\) inserts would make you think your overall algorithm has to be quadratic.


The point is to compare trends as the number of items increases, so the details of my hardware aren’t as important as they would be in a typical benchmark. For the record, my old laptop reports having 3.8 GiB of memory and an Intel® Core™ i3-4000M CPU @ 2.40GHz × 4.

The idea of the benchmark is simple: add \(2 \times 10^7\) items to a set one at a time, timing each batch of \(2 \times 10^4\). Then delete them one at a time, timing each batch of \(2 \times 10^4\).

MAX = 2 * 10**7
STEP = 2 * 10**4

test_subject = set()
insert_results = timed_calls(insert_step(test_subject), MAX, STEP)
remove_results = timed_calls(remove_step(test_subject), MAX, STEP)

(I’ve reorganized the code in this post for clarity. Check out the full source if you prefer.)

def timed_calls(callback, target, stepsize):
    results = []
    for i in range(0, target, stepsize):
        before =
        callback(i, i + stepsize)
        after =
        results.append((i + stepsize, (after - before).total_seconds()))
    return results

The timed_calls function handles splitting the full range of items into sub-ranges and timing how long it takes to handle that each sub-range.



For insertions, we just take the sub-range and insert everything in it. It’s possible to insert integers into a Python set, but converting them to strings means some realistic hashing has to occur.

def insert_step(container):
    def insert(start, stop):
        for i in range(start, stop):
    return insert

And then we graph it:

        [count for count, duration in insert_results],
        [duration for count, duration in insert_results],
        "Size", "Duration (s)",
        "Time to Insert a Batch of Items, vs Size")

        [count for count, duration in insert_results],
        [duration for count, duration in amortize(insert_results)],
        "Size", "Duration (s)",
        "Amortized Time to Insert an Item, vs Size")
Time to Insert a Batch of Items, vs Size
Amortized Time to Insert an Item, vs Size

The un-amortized chart has a point for each batch of \(2 \times 10^4\), and the \(y\) value is simply the amount of time that batch took. (The \(x\) value is the number of items in the set once the batch was inserted.) If you only looked at that chart, you might think “Hey, a bunch of average cases and a few worst cases.”

The amortized chart is a flat line. Numerically, there’s some slope, but it’s got 14 zeros after the decimal point. If you only looked at that chart, you might think, “Well, we got the average case every time because it’s typical.”

But look at the charts and code together, and you realize that the points flying up to the sky on the un-amortized chart are in the flat line of the amortized chart. It’s the same \(x\) and the same raw duration from the same batch, it’s just used to compute a cumulative average. What happens is that those flyaway points get farther apart as they get higher, and they’re diluted by everything that comes between them.


def remove_step(container):
    def remove(start, stop):
        for i in range(start, stop):
    return remove

        [count for count, duration in remove_results],
        [duration for count, duration in remove_results],
        "Total Removed", "Duration (s)",
        "Time to Remove a Batch of Items vs Total Removed")

        [count for count, duration in remove_results],
        [duration for count, duration in amortize(remove_results)],
        "Total Removed", "Duration (s)",
        "Amortized Time to Remove an Item vs Total Removed")

Something similar happens with deletes, where everything snaps into a flat line under amortized analysis:

Time to Remove a Batch of Items vs Total Removed
Amortized Time to Remove an Item vs Total Removed

What does this mean?

I’m definitely not trying to be the last word in hash tables, and I could be mistaken in any of this code or analysis. That said, if you are analyzing the performance of an algorithm that puts many items into a hash table, you should probably use the amortized \(O(1)\) time, not the \(O(n)\) worst case. Don’t be put off by people who think you’re talking about the \(O(1)\) average case – that’s different.


[1] Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Heide, F.M. auf der, Rohnert, H. and Tarjan, R.E. 1994. Dynamic perfect hashing: Upper and lower bounds. SIAM Journal on Computing. 23, 4 (1994), 738–761.

[2] Fredman, M.L., Komlós, J. and Szemerédi, E. 1984. Storing a sparse table with 0(1) worst case access time. J. ACM. 31, 3 (Jun. 1984), 538–544.

[3] Harfst, G.C. and Reingold, E.M. 2000. A potential-based amortized analysis of the union-find data structure. SIGACT News. 31, 3 (Sep. 2000), 86–95.

[4] Knuth, D.E. 1998. The art of computer programming, volume 3: (2Nd ed.) sorting and searching. Addison Wesley Longman Publishing Co., Inc.

[5] Tarjan, R.E. 1985. Amortized computational complexity. SIAM Journal on Algebraic Discrete Methods. 6, 2 (1985), 306–318.