Debugging third-party goroutine leaks
Some time ago, we had a microservice with bizarre behaviour: memory usage kept increasing, but very slowly. It could take days for the memory to fill up, and the pod would eventually get OOMKilled. Yet I could not find any variables in our code that were created when a request was received and not freed once the request finished.
Back then, I wasn’t aware that goroutine leaks are a common issue (or how to debug them), so this hypothesis never crossed my mind. I just assumed it was some third-party library bug that would be fixed eventually. But it wasn’t. Two months later, another team inherited this service and a colleague found a GitHub issue filed against one of the libraries imported by this application. It was about goroutine leaks, and it had not been fixed in mainline.
They replaced the library (an in-memory cache utility) with another one with greater community popularity and… Bingo! That was the cause of all those OOMKilled events: a goroutine leak inside that library.
It never crossed my mind that the cache library could be the problem because the application itself cached very little data (less than a megabyte). And indeed, it cleared the data when appropriate, but the author forgot to make the library clean up some goroutines too. That’s the definition of a goroutine leak: dangling goroutines that keep running in the background forever. Here’s an example:
import "fmt"

func leakyFunction() {
    ch := make(chan int)

    go func() {
        // This goroutine will be blocked forever
        // because no one ever sends to the channel.
        val := <-ch
        fmt.Println(val)
    }()

    // leakyFunction returns while the goroutine is still blocked.
}
Common causes of goroutine leaks include blocking on a channel that nothing ever sends to (as above), misused synchronization primitives, infinite loops, and plain logic errors.
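To illustrate the synchronization-primitive case, here is a minimal sketch (the worker bodies and the id%2 condition are made up for the example): a code path that skips wg.Done() leaves the goroutine waiting on wg.Wait() blocked forever.

import (
    "fmt"
    "sync"
)

func startWorkers(n int) {
    var wg sync.WaitGroup

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(id int) {
            if id%2 == 0 {
                return // BUG: this path never calls wg.Done()
            }
            defer wg.Done()
            // ... do some work ...
        }(i)
    }

    go func() {
        wg.Wait() // never returns: the counter never reaches zero
        fmt.Println("all workers finished") // never printed
    }()
}

The workers themselves finish, but the goroutine waiting on them leaks every time startWorkers is called.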
It gets worse if the goroutine holds data that cannot be GC’ed: the leaking goroutine forces that data to stay in RAM, making memory usage grow even faster. But even small goroutines like the example above are not free; each one starts with a few KB of stack, and if you let them multiply for long enough they can add up to MBs or GBs faster than you imagine.
That was exactly our issue with the cache library: goroutines were spawned as requests came in and never freed. Millions of requests later, the few KBs allocated for each goroutine had added up to GBs. Resource exhaustion, pod restarted.
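To make the memory-pinning effect concrete, here is a hypothetical sketch of that kind of pattern (not the actual library’s code): the eviction goroutine blocks forever, and because its closure captures key and value, the GC can never reclaim them.

import "fmt"

func store(key string, value []byte) {
    expired := make(chan struct{})

    go func() {
        // The closure captures `key` and `value`, pinning them in memory.
        <-expired // nothing ever sends or closes, so this blocks forever
        fmt.Printf("evicted %s (%d bytes)\n", key, len(value)) // never runs
    }()

    // ... put value into the cache, but never signal `expired` ...
}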
Debugging goroutine leaks
There are not many ways to debug goroutine leaks, especially when they’re in code you don’t control (third-party modules and libraries).
You can try starting the application, attaching a debugger, looking for all active goroutines, then attaching again minutes/hours later to see if:
- some of them are still there even though they shouldn’t be;
- some line of code appears to be continually spawning goroutines that should have been reclaimed already.
This may be the go-to approach if you have a system you don’t know much about that’s already leaking memory in production.
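If attaching a debugger to a production pod is not practical, the same before/after comparison can be done with Go’s built-in pprof endpoints. A minimal sketch (localhost:6060 is just a conventional choice of address):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose the profiling endpoints on an address that is not publicly reachable.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of the application ...
}

Fetching /debug/pprof/goroutine?debug=1 twice, some time apart, shows how many goroutines share each stack trace; a stack whose count only grows and that points into a third-party package is a prime suspect (debug=2 dumps every goroutine individually, including how long it has been blocked).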
Another option is Uber’s goleak, which catches goroutine leaks from unit tests. It works with existing tests, so you can check for leaks using already set-up test cases:
import (
    "testing"
    "go.uber.org/goleak"
)

// ...

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

// *existing tests*
However, if the code uses dependency injection for the problematic library (and no test injects the concrete implementation), you may need to write specific tests to ensure the injected libraries are not to blame, as a last resort if everything else passes.
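A sketch of such a test is below; NewService and newConcreteCache are hypothetical stand-ins for your own constructor and the concrete cache implementation you inject:

import (
    "testing"

    "go.uber.org/goleak"
)

func TestInjectedCacheDoesNotLeak(t *testing.T) {
    // Fails the test if any unexpected goroutine is still alive
    // when this deferred call runs.
    defer goleak.VerifyNone(t)

    // NewService and newConcreteCache are hypothetical: inject the real
    // implementation here instead of a test double.
    svc := NewService(newConcreteCache())
    svc.HandleRequest("some-key")
    svc.Close() // give the library a chance to shut down its goroutines
}

Unlike goleak.VerifyTestMain, goleak.VerifyNone checks a single test, so it pinpoints which scenario leaves goroutines behind.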
Preventing goroutine leaks
There are plenty of resources on the web about preventing goroutine leaks, so I will reference two that I like:
- Chapter 4 of the book Concurrency in Go by Katherine Cox-Buday explains in more depth how goroutines leak and how to avoid it, and it’s available to read for free on O’Reilly’s website. It’s a good starting point for diving deeper into the subject.
- Uber’s engineering blog also has a list of “Leaky Go Program Patterns” covering subtle cases of goroutine leaks (such as improper early returns) that the authors say were recurrent in their codebase. It’s a good resource for learning about more advanced, harder-to-spot leaks.