Debugging third-party goroutine leaks
Some time ago, we had a microservice with bizarre behaviour: memory usage kept increasing, but very slowly. It could take days for the memory to fill up, and the pod would eventually get OOMKilled. Yet I could not find any variables in our code that were created when a request was received and not freed once the request finished.
Back then, I wasn’t aware that goroutine leaks are a common issue (or how to debug them), so this hypothesis never crossed my mind. I just assumed it was some third-party library bug that would be fixed eventually. But it wasn’t. Two months later, another team inherited this service and a colleague found a GitHub issue filed against one of the libraries imported by this application. It was about goroutine leaks, and it had not been fixed in mainline.
They replaced the library (an in-memory cache utility) with another one with greater community popularity and… Bingo! That was the cause of all those OOMKilled events: a goroutine leak inside that library.
It never crossed my mind that the cache library could be the problem because the application itself cached very little data (less than a megabyte). And indeed, it cleared the data when appropriate, but the author forgot to make the library clean up some goroutines too. That’s the definition of a goroutine leak: dangling goroutines that keep running in the background forever. Here’s an example:
import "fmt"

func leakyFunction() {
    ch := make(chan int)

    go func() {
        // This goroutine will be blocked forever
        // because no one ever sends to the channel.
        val := <-ch
        fmt.Println(val)
    }()

    // leakyFunction returns while the goroutine is still blocked.
}
Common causes of goroutine leaks include blocking on a channel that nothing ever sends to (as above), misused synchronization primitives, infinite loops, and plain logic errors.
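To illustrate the synchronization-primitive case, here is a minimal sketch (the worker bodies and the id%2 condition are made up for the example): a code path that skips wg.Done() leaves the goroutine waiting on wg.Wait() blocked forever.

import (
    "fmt"
    "sync"
)

func startWorkers(n int) {
    var wg sync.WaitGroup

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(id int) {
            if id%2 == 0 {
                return // BUG: this path never calls wg.Done()
            }
            defer wg.Done()
            // ... do some work ...
        }(i)
    }

    go func() {
        wg.Wait() // never returns: the counter never reaches zero
        fmt.Println("all workers finished") // never printed
    }()
}

The workers themselves finish, but the goroutine waiting on them leaks every time startWorkers is called.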
It gets worse if the goroutine holds data that cannot be GC’ed: the leaking goroutine forces that data to stay in RAM, making memory usage grow even faster. But even small goroutines like the example above are not free; each one starts with a few KB of stack, and if you let them multiply for long enough they can add up to MBs or GBs faster than you imagine.
That was exactly our issue with the cache library: goroutines were spawned as requests came in and never freed. Millions of requests later, the few KBs allocated for each goroutine had added up to GBs. Resource exhaustion, pod restarted.
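To make the memory-pinning effect concrete, here is a hypothetical sketch of that kind of pattern (not the actual library’s code): the eviction goroutine blocks forever, and because its closure captures key and value, the GC can never reclaim them.

import "fmt"

func store(key string, value []byte) {
    expired := make(chan struct{})

    go func() {
        // The closure captures `key` and `value`, pinning them in memory.
        <-expired // nothing ever sends or closes, so this blocks forever
        fmt.Printf("evicted %s (%d bytes)\n", key, len(value)) // never runs
    }()

    // ... put value into the cache, but never signal `expired` ...
}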
Debugging goroutine leaks
There are not many ways to debug goroutine leaks, especially when they’re in code you don’t control (third-party modules and libraries).
You can try starting the application, attaching a debugger, looking for all active goroutines, then attaching again minutes/hours later to see if:
- some of them are still there even though they shouldn’t be;
- some line of code appears to be continually spawning goroutines that should have been reclaimed already.
This may be the go-to approach if you have a system you don’t know much about that’s already leaking memory in production.
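If attaching a debugger to a production pod is not practical, the same before/after comparison can be done with Go’s built-in pprof endpoints. A minimal sketch (localhost:6060 is just a conventional choice of address):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose the profiling endpoints on an address that is not publicly reachable.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... rest of the application ...
}

Fetching /debug/pprof/goroutine?debug=1 twice, some time apart, shows how many goroutines share each stack trace; a stack whose count only grows and that points into a third-party package is a prime suspect (debug=2 dumps every goroutine individually, including how long it has been blocked).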
Another option is Uber’s goleak, which catches goroutine leaks from unit tests. It works with existing tests, so you can check for leaks using already set-up test cases:
import (
    "testing"
    "go.uber.org/goleak"
)

// ...

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

// *existing tests*
However, if the code uses dependency injection for the problematic library (and no test injects the concrete implementation), you may need to write specific tests to ensure the injected libraries are not to blame, as a last resort if everything else passes.
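A sketch of such a test is below; NewService and newConcreteCache are hypothetical stand-ins for your own constructor and the concrete cache implementation you inject:

import (
    "testing"

    "go.uber.org/goleak"
)

func TestInjectedCacheDoesNotLeak(t *testing.T) {
    // Fails the test if any unexpected goroutine is still alive
    // when this deferred call runs.
    defer goleak.VerifyNone(t)

    // NewService and newConcreteCache are hypothetical: inject the real
    // implementation here instead of a test double.
    svc := NewService(newConcreteCache())
    svc.HandleRequest("some-key")
    svc.Close() // give the library a chance to shut down its goroutines
}

Unlike goleak.VerifyTestMain, goleak.VerifyNone checks a single test, so it pinpoints which scenario leaves goroutines behind.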
Preventing goroutine leaks
There are plenty of resources on the web about preventing goroutine leaks, so I will reference two that I like:
- Chapter 4 of the book Concurrency in Go by Katherine Cox-Buday explains in more depth how goroutines leak and how to avoid it, and it’s available to read for free on O’Reilly’s website. It’s a good starting point for diving deeper into the subject.
- Uber’s engineering blog also has a list of “Leaky Go Program Patterns” covering subtle cases of goroutine leaks (such as improper early returns) that the authors say were recurrent in their codebase. It’s a good resource for learning about more advanced, harder-to-spot leaks.