After Building a Thread Pool: What Do Real Concurrent Systems Still Need?

In the first four articles, we built a teaching version of a thread pool from scratch.

The first article used function pointers and void* to represent a task.

The second article used a mutex to protect the task queue so that multiple workers do not corrupt shared state.

The third article used condition variables so workers can sleep when there is no work and wake when a task arrives.

The fourth article assembled the pieces into:

thread_pool_init
thread_pool_submit
thread_pool_wait
thread_pool_destroy

At this point, we have a working thread pool.

It can:

create worker threads
accept submitted tasks
let workers take and run tasks
wait until all tasks finish
shut down cleanly

So a natural question arises:

Is this already an industrial-grade thread pool?

The answer is no.

It is a good teaching implementation, but real concurrent systems have many more constraints.

Writing a runnable thread pool is only the first step.

After that, we still need to think about:

deadlocks
thread count
shared-queue bottlenecks
work stealing
task granularity
memory ownership
return values
error handling
shutdown strategy

This article is a map of those next problems.

The question is:

After writing a thread pool, what does a real concurrent system still need to solve?


1. What has the teaching thread pool already solved?

Our current thread pool has already covered the core synchronization structure.

It has:

task abstraction
ring buffer
mutex
condition variables
worker loop
wait for completion
cooperative shutdown

The task abstraction is:

typedef void (*thread_task_fn)(void*);

typedef struct {
  thread_task_fn fn;
  void* arg;
} ThreadTask;

The worker waits for work:

while (pool->queue_size == 0 && !pool->stop) {
  pthread_cond_wait(&pool->not_empty, &pool->mutex);
}

The submitter wakes a worker:

pthread_cond_signal(&pool->not_empty);

The waiter waits until the pool is idle:

while (pool->queue_size > 0 || pool->working_count > 0) {
  pthread_cond_wait(&pool->all_done, &pool->mutex);
}

These are real, useful building blocks.

They are not toy ideas.

Many production systems still rely on the same concepts.

But production systems also care about things that this teaching version intentionally leaves out:

What if tasks submit more tasks?
What if one task waits for another task?
What if a task never returns?
What if the queue lock becomes a bottleneck?
What if there are too many threads?
What if task arguments are freed too early?
What if the caller needs a return value?
What if the pool is destroyed while other threads are still submitting?

These are the next layer.


2. First issue: locks can cause deadlocks

A mutex protects shared state.

But a mutex can also cause a deadlock.

A deadlock means several threads are waiting for each other forever, and none of them can make progress.

The classic example is:

pthread_mutex_lock(&a);
pthread_mutex_lock(&b);

/* use resources protected by a and b */

pthread_mutex_unlock(&b);
pthread_mutex_unlock(&a);

If another thread locks in the opposite order:

pthread_mutex_lock(&b);
pthread_mutex_lock(&a);

then this can happen:

Thread 1 holds a and waits for b.
Thread 2 holds b and waits for a.

Neither can continue.

The program looks alive, but it will never move forward.


3. Why does deadlock happen?

The classic deadlock conditions are:

1. Mutual exclusion
2. Hold and wait
3. No preemption
4. Circular wait

Mutual exclusion means:

Only one thread can hold a resource at a time.

Hold and wait means:

A thread holds one resource while waiting for another.

No preemption means:

The resource cannot be forcibly taken away.

Circular wait means:

A waits for B.
B waits for C.
C waits for A.

To prevent deadlock, we usually break at least one of these conditions.

In normal C mutex code, the most practical method is:

Always acquire locks in a consistent order.

For example:

Always lock a before b.

Do not sometimes write:

lock a, then lock b

and elsewhere write:

lock b, then lock a

Consistent lock ordering is one of the simplest and most important rules in concurrent code.
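
When two locks have no natural order, a common trick is to order them by address. Here is a minimal sketch, assuming the two mutexes are distinct and already initialized (lock_pair is a made-up helper, not a pthread function):

#include <pthread.h>
#include <stdint.h>

/* Acquire two distinct mutexes in a globally consistent order
   (by address), so no two threads can lock them in opposite orders. */
void lock_pair(pthread_mutex_t* a, pthread_mutex_t* b) {
  if ((uintptr_t)a < (uintptr_t)b) {
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
  } else {
    pthread_mutex_lock(b);
    pthread_mutex_lock(a);
  }
}

If every thread that needs both locks goes through lock_pair, the circular-wait condition can never form.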


4. Can our thread pool deadlock?

Our teaching thread pool uses one mutex:

pool->mutex

So it does not have a lock-ordering problem between multiple pool locks.

That is good.

But deadlocks can still appear at a higher level.

For example, suppose a task submits another task and then waits for it:

void task_a(void* arg) {
  thread_pool_submit(pool, task_b, arg);
  thread_pool_wait(pool);
}

This is dangerous.

If every worker is already running a task that waits for other tasks, the pool may stop making progress.

A similar problem appears when a task waits for a result that can only be produced by another task in the same saturated pool.

The thread pool implementation itself may be correct, but the usage pattern can still deadlock.

Another dangerous pattern is running user code while holding the pool mutex:

pthread_mutex_lock(&pool->mutex);
task.fn(task.arg);
pthread_mutex_unlock(&pool->mutex);

Our implementation avoids that.

It takes the task under the mutex, releases the mutex, and then runs the task:

pthread_mutex_unlock(&pool->mutex);

task.fn(task.arg);
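
Condensed, the whole worker cycle looks roughly like this, using the field names from the earlier articles (ring-buffer details and the shutdown exit path are elided):

pthread_mutex_lock(&pool->mutex);

while (pool->queue_size == 0 && !pool->stop) {
  pthread_cond_wait(&pool->not_empty, &pool->mutex);
}

/* ... pop one task from the ring buffer into `task` ... */
pool->working_count++;

pthread_mutex_unlock(&pool->mutex);

task.fn(task.arg);  /* user code runs with no pool lock held */

pthread_mutex_lock(&pool->mutex);
pool->working_count--;
if (pool->queue_size == 0 && pool->working_count == 0) {
  pthread_cond_signal(&pool->all_done);
}
pthread_mutex_unlock(&pool->mutex);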

This design is important.

User-provided task code can do anything:

block
submit more tasks
wait on other locks
call slow I/O

So the pool should not hold its internal lock while running user code.


5. Second issue: how many threads should we create?

Another real question is:

How many worker threads should the pool have?

The beginner answer is often:

thread_count = 8;

or:

Use the number of CPU cores.

But the correct answer depends on the workload.

For CPU-bound tasks, such as:

compression
image processing
hashing
pure computation

too many threads are not helpful.

If the machine has eight cores and we create one hundred CPU-bound workers, the operating system spends more time switching between threads.

That adds overhead:

context switching
cache misses
scheduler overhead

For CPU-bound work, a good starting point is often close to the number of CPU cores.

For I/O-bound tasks, such as:

network requests
disk I/O
database calls
waiting on external services

threads spend a lot of time blocked.

In that case, more threads may improve throughput because while one thread waits for I/O, another thread can run.

So:

CPU-bound workload: thread_count near CPU core count
I/O-bound workload: thread_count may be larger
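
On POSIX-like systems, the CPU-bound default can be derived at runtime. A minimal sketch, assuming _SC_NPROCESSORS_ONLN is available (a common extension on Linux, macOS, and the BSDs):

#include <unistd.h>

/* Ask the OS how many CPUs are currently online and use that as
   a default worker count for CPU-bound workloads. */
long cores = sysconf(_SC_NPROCESSORS_ONLN);
size_t thread_count = (cores > 0) ? (size_t)cores : 4;  /* fall back to a guess */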

But this is only a starting point.

Real systems usually need measurement.

The right number depends on:

hardware
task duration
blocking ratio
memory pressure
scheduler behavior
latency requirements

There is no universally correct number.


6. Third issue: one shared queue can become a bottleneck

Our teaching thread pool uses one shared task queue.

Every submitter inserts tasks into the same queue.

Every worker takes tasks from the same queue.

That means:

all threads compete for the same mutex

This is simple and correct.

But under heavy load, it may become a bottleneck.

The hot lock is:

pool->mutex

Every task submission needs it.

Every task removal needs it.

Every completion update needs it.

If tasks are very small, workers may spend a lot of time fighting over the queue instead of doing useful work.

For example:

take task
run tiny task
take task
run tiny task
take task
run tiny task

If each task is tiny, the lock overhead may dominate.

That is why more advanced thread pools often avoid a single global queue as the only scheduling structure.


7. Work stealing: each worker has its own queue

One common design is work stealing.

Instead of one global queue, each worker has its own local queue:

worker 0 -> local queue 0
worker 1 -> local queue 1
worker 2 -> local queue 2
worker 3 -> local queue 3

Normally, a worker takes work from its own queue:

worker 0 takes from queue 0

If its own queue is empty, it tries to steal work from another worker:

worker 0 steals from queue 2

The benefit is:

workers mostly touch their own queues
contention is reduced
cache locality may improve

A simplified model:

push local tasks to local queue
pop local tasks from local queue
if local queue is empty, steal from another queue
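
A sketch of that loop in C. Everything here is an assumption for illustration: Worker, deque_pop_bottom, and deque_steal_top are made-up helpers, not part of our teaching pool:

#include <sched.h>
#include <stdlib.h>

/* Illustrative worker loop for a work-stealing pool.
   deque_pop_bottom / deque_steal_top return a task or NULL if empty. */
void* stealing_worker(void* arg) {
  Worker* self = arg;

  while (!self->pool->stop) {
    ThreadTask* task = deque_pop_bottom(self->local_queue);

    if (task == NULL) {
      /* Local queue is empty: pick a random victim and try to steal. */
      size_t victim = (size_t)rand() % self->pool->worker_count;
      if (victim != self->index) {
        task = deque_steal_top(self->pool->workers[victim].local_queue);
      }
    }

    if (task != NULL) {
      task->fn(task->arg);
    } else {
      sched_yield();  /* nothing to do anywhere; a real pool would park */
    }
  }
  return NULL;
}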

This is especially useful for recursive or divide-and-conquer workloads.

For example:

sort left half
sort right half
combine result

Each task may produce more tasks.

Local queues help keep related work near the worker that created it.


8. Why steal from the other end of a queue?

Many work-stealing designs use a deque.

The owner worker pushes and pops from one end:

owner uses bottom

Thieves steal from the other end:

thieves use top

Why split the ends?

Because the owner is the common path.

Most of the time, a worker should operate on its own queue without fighting with other workers.

Stealing is the uncommon path.

So the design tries to make the common path cheap:

local push/pop should be fast
stealing may be slower

This reduces unnecessary sharing.

And reducing unnecessary sharing is one of the core goals of high-performance schedulers.

Our teaching thread pool does not implement work stealing.

That is fine.

But it is useful to know why production thread pools often become more complex.


9. Fourth issue: task granularity matters

Another important question:

How large should one task be?

If tasks are too large, parallelism is poor.

For example:

task 1 runs for 10 seconds
task 2 runs for 10 milliseconds
task 3 runs for 10 milliseconds

One worker may be stuck for a long time while others finish quickly.

This causes load imbalance.

If tasks are too small, overhead becomes too high.

For example:

each task only increments one integer

Then the cost of:

submitting the task
locking the queue
waking a worker
scheduling the task
updating completion state

may be larger than the useful work itself.

So task granularity is a tradeoff:

too coarse: poor load balancing
too fine: overhead dominates

Real systems often batch small work items together.

They may also split large work recursively when needed.
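
A minimal sketch of batching with our pool's API. SumChunk and sum_range are made up for this example; the caller is assumed to own data, its length n, and one SumChunk per chunk:

/* One task sums one chunk and writes into its own struct: no sharing. */
typedef struct {
  const int* data;
  size_t begin, end;   /* half-open range [begin, end) */
  long long partial;   /* this chunk's result */
} SumChunk;

void sum_range(void* arg) {
  SumChunk* c = arg;
  long long s = 0;
  for (size_t i = c->begin; i < c->end; ++i) {
    s += c->data[i];
  }
  c->partial = s;
}

/* Submit one task per 4096 elements instead of one per element. */
for (size_t begin = 0; begin < n; begin += 4096) {
  SumChunk* c = &chunks[begin / 4096];
  c->data = data;
  c->begin = begin;
  c->end = (begin + 4096 < n) ? begin + 4096 : n;
  thread_pool_submit(&pool, sum_range, c);
}

thread_pool_wait(&pool);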


10. Fifth issue: task arguments and memory ownership

Our task interface is:

typedef void (*thread_task_fn)(void*);

This is flexible, but it also raises an ownership question:

Who owns the memory pointed to by arg?

For example:

thread_pool_submit(&pool, print_number, &x);

Is x still alive when the task runs?

This is unsafe:

for (int i = 0; i < 100; ++i) {
  int x = i;
  thread_pool_submit(&pool, print_number, &x);
}

The pointer points to a local variable.

The task may run after the loop iteration has ended.

By then the pointer is dangling, or the memory it points to has been reused for a different value.

Better options include:

int nums[100];

for (int i = 0; i < 100; ++i) {
  nums[i] = i;
  thread_pool_submit(&pool, print_number, &nums[i]);
}

thread_pool_wait(&pool);

Here, nums stays alive until all tasks finish.

Another option is heap allocation:

int* value = malloc(sizeof(int));
*value = i;
thread_pool_submit(&pool, print_number_and_free, value);

But then the task function must clearly own and free the memory.
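
A minimal version of such a task. The contract: the task takes ownership of arg, and the submitter never touches the pointer again after submitting:

#include <stdio.h>
#include <stdlib.h>

void print_number_and_free(void* arg) {
  int* value = arg;        /* the task now owns this allocation */
  printf("%d\n", *value);
  free(value);             /* and is responsible for releasing it */
}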

So every task submission needs an ownership contract:

Who allocates the argument?
Who frees it?
How long must it remain valid?
Can the task store it after returning?

Many real bugs in C thread pools come from this exact issue.

The threading code is correct, but the task argument lifetime is wrong.


11. Sixth issue: do tasks have return values?

Our task function returns void:

typedef void (*thread_task_fn)(void*);

This is simple.

But real users often want:

submit a task
get a result later
know whether it succeeded
wait for that one task, not the whole pool

That leads to a future or promise-style design.

For example:

Future* future = thread_pool_submit(pool, compute, arg);

Then later:

void* result = future_get(future);

Now each task needs additional state:

pending
running
finished
failed
result pointer
error code
condition variable for waiters
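
A sketch of what that state could look like in C. This is not our pool's API; every name below is illustrative:

#include <pthread.h>

typedef enum { FUT_PENDING, FUT_RUNNING, FUT_FINISHED, FUT_FAILED } FutureState;

typedef struct {
  pthread_mutex_t mutex;   /* protects all fields below */
  pthread_cond_t  done;    /* signaled on FINISHED or FAILED */
  FutureState     state;
  void*           result;
  int             error;
} Future;

/* Block until this one task completes, then return its result
   (NULL on failure; the caller can inspect the error field). */
void* future_get(Future* f) {
  pthread_mutex_lock(&f->mutex);
  while (f->state != FUT_FINISHED && f->state != FUT_FAILED) {
    pthread_cond_wait(&f->done, &f->mutex);
  }
  void* r = (f->state == FUT_FINISHED) ? f->result : NULL;
  pthread_mutex_unlock(&f->mutex);
  return r;
}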

This is much more complex than a void (*)(void*) task.

It also raises new ownership questions:

Who owns the Future?
Who owns the result?
Who frees the error object?
What happens if the caller abandons the Future?

Our teaching thread pool avoids this on purpose.

It focuses on scheduling work, not returning values.


12. Seventh issue: what if a task fails?

In the teaching version, a task has no return value:

void task(void* arg);

So the pool cannot directly know whether the task succeeded.

A task may fail internally:

open file failed
network request failed
allocation failed
parse failed

But the pool only sees:

task.fn(task.arg);

When the function returns, the pool assumes the task is done.

Real systems need an error reporting policy.

Possible designs:

task writes result into user-provided memory
task stores error in a Future
task logs and returns nothing
task calls a callback on failure
task retries automatically
task cancels related tasks
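
As a concrete sketch of the first option, the submitter can embed an error slot in the argument struct. StatTask and stat_task are made up for illustration:

#include <errno.h>
#include <sys/stat.h>

typedef struct {
  const char* path;  /* input */
  int         err;   /* 0 on success, errno-style code on failure */
  long        size;  /* output; valid only if err == 0 */
} StatTask;

void stat_task(void* arg) {
  StatTask* t = arg;
  struct stat st;
  if (stat(t->path, &st) != 0) {
    t->err = errno;  /* the failure flows back through the argument */
    return;
  }
  t->err = 0;
  t->size = (long)st.st_size;
}

After thread_pool_wait, the caller inspects each err field and decides how to react.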

None of these is universally correct.

The right choice depends on the application.

This is why thread pool design is not only about threads.

It is also about task lifecycle management.


13. Eighth issue: shutdown policy

Our current destroy uses this policy:

stop accepting new tasks
wake all workers
finish already queued tasks
join all workers
free resources

This is a graceful shutdown.

But real systems may need different policies.

For example:

finish all queued tasks before exit
discard tasks that have not started
cancel running tasks if possible
wait only for a timeout
reject new tasks but let running tasks finish
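
One way to make the choice explicit is a mode parameter on the shutdown call. This is a hypothetical interface, not our teaching pool's actual signature:

/* Hypothetical: the caller picks the shutdown semantics explicitly. */
typedef enum {
  SHUTDOWN_GRACEFUL,  /* finish all queued tasks, then exit */
  SHUTDOWN_DISCARD,   /* drop queued tasks; let running tasks finish */
  SHUTDOWN_TIMED      /* graceful, but give up after a deadline */
} ShutdownMode;

void thread_pool_shutdown(ThreadPool* pool, ShutdownMode mode, int timeout_ms);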

Each policy has different semantics.

A web server may want graceful shutdown:

finish current requests
stop accepting new requests
exit after all active requests finish

A command-line tool may want faster shutdown:

stop as soon as possible
drop remaining work
clean up resources

A background system may need timeouts:

wait at most 5 seconds
then force shutdown or report failure

The teaching pool has one policy.

A production pool usually needs to document its shutdown semantics very clearly.


14. The distance between the teaching pool and a real scheduler

The teaching pool is useful because it exposes the core structure:

queue
workers
mutex
condition variables
shutdown flag
completion condition

But a real scheduler often adds:

dynamic thread count
bounded and unbounded queues
task priorities
task cancellation
timeouts
futures
work stealing
per-worker queues
metrics
backpressure
error propagation
debugging hooks

Each feature introduces new state.

Each new state needs a synchronization rule.

For example:

If tasks can be canceled, who can cancel them?
If priorities exist, how is starvation avoided?
If queues are unbounded, how is memory usage controlled?
If workers can be added dynamically, who owns their lifecycle?
If a task panics or crashes, how is the pool state repaired?

This is why production concurrency libraries look complicated.

The complexity is not there for decoration.

It comes from the number of states that must remain consistent under concurrency.


15. Closing

The teaching thread pool we built is not the final destination.

But it is a good foundation.

It teaches the core ideas:

represent work as a function and argument
store work in a queue
protect shared state with a mutex
sleep and wake threads with condition variables
track active workers
wait for completion
shut down cooperatively

Once these ideas are clear, more advanced thread pools are easier to read.

When you see a production scheduler, you can ask better questions:

Where is the shared state?
Who owns this memory?
Which lock protects this field?
Which condition wakes this thread?
Can this task wait for another task?
What happens during shutdown?

That is the real value of writing a small thread pool yourself.

Not because the small version is enough for every real system.

But because it gives you a mental model for understanding larger ones.

"It runs" is only the first step. A real concurrent system must balance correctness, performance, and maintainability.