As I mentioned in a previous post, I made some memory-related improvements to Prometheus that'll be in the 1.5 release. Let's look at how I came across unneeded memory allocations and ultimately improved the code.
My initial goal was to provide guidance to Prometheus users on how many resources they should provision for a given load, and what flag settings were appropriate. So I set up a basic test rig on EC2 to simulate a multi-thousand-machine environment.
The Loadtest
My primary tool to understand the system was Prometheus itself, as Prometheus is (unsurprisingly) instrumented with Prometheus metrics. In addition, the Prometheus Go client provides a pile of process- and Go-related metrics out of the box, such as CPU usage and GC pause duration. I had a meta-Prometheus scraping the loaded Prometheus every second, and over the course of weeks of loadtesting built up a Grafana dashboard exposing the most relevant metrics.
I quickly found things that were both surprising and undesirable. My primary concern was that the rate of samples ingested wasn't even. No scrapes were being missed, but it was still a sign of unhealthiness. While some non-uniformity would be expected on a small system, with the thousands of targets I was simulating it should have been almost perfectly flat.
I noticed there was a correlation with spiky CPU usage, which is also undesirable. Spiky CPU usage usually means you need to provision for the peaks, which in this case were more than 60% above the average. Smoothing that out would help cut CPU requirements and reduce tail latency by quite a bit.
After false starts investigating various background maintenance processes, I realised this aligned with GC cycles.
Go Garbage Collection
At this point I need to stop and discuss how Go garbage collection works. Unlike Python's GC, which is based on reference counting, or Java's, which is generational mark-sweep, Go's GC is a relatively simple concurrent mark-sweep. Go keeps track of how much heap has been allocated, and once that reaches double what was live after the previous GC run (the default GOGC=100 setting), a new collection is triggered.
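You can watch this trigger from inside any Go program by reading the runtime's memory statistics. Here is a minimal sketch, separate from anything Prometheus itself does, that allocates some garbage and then prints the live heap size and the point at which the next collection will start:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Allocate some garbage so there is something for the collector to do.
    for i := 0; i < 1000; i++ {
        _ = make([]byte, 64*1024)
    }

    runtime.GC() // force a collection so the numbers below are fresh

    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    // NextGC is the heap size at which the next collection triggers;
    // with the default GOGC=100 it is roughly double the live heap.
    fmt.Printf("live heap:      %d bytes\n", m.HeapAlloc)
    fmt.Printf("next GC at:     %d bytes\n", m.NextGC)
    fmt.Printf("GC cycles:      %d\n", m.NumGC)
    fmt.Printf("total GC pause: %d ns\n", m.PauseTotalNs)
}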
Sweeps mean pauses, which are famously brief in Go. The pauses themselves were not the problem here, typically taking under 1ms. If you dig through the Go source code you'll find this comment:
// It uses a feedback control algorithm to adjust the memstats.gc_trigger
// trigger based on the heap growth and GC CPU utilization each cycle.
// This algorithm optimizes for heap growth to match GOGC and for CPU
// utilization between assist and background marking to be 25% of
// GOMAXPROCS. The high-level design of this algorithm is documented
// at https://golang.org/s/go15gcpacing.
What this means in practice is that while the garbage collector is running it aims to use 25% of the CPU capacity given by GOMAXPROCS, and in a process like Prometheus that allocates continuously it runs regularly. That lined up with the CPU spikes I was seeing.
Reducing Allocations
One option is GOGC, which controls how far the heap may grow beyond the live data before the next collection is triggered. Raising it, either via the environment variable or at runtime as in the sketch below, trades memory for less frequent collections, but that only goes so far.
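Both knobs can also be read and adjusted from inside the program. A small illustrative sketch, not something Prometheus itself does:

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // GOMAXPROCS(0) reports the current setting without changing it.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

    // SetGCPercent adjusts GOGC and returns the previous value. 200 lets
    // the heap grow to three times the live data before the next
    // collection, trading memory for fewer GC cycles.
    previous := debug.SetGCPercent(200)
    fmt.Println("previous GOGC:", previous)
}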
Thus we need to look at reducing the amount of garbage collection that is done, which means reducing the volume of allocations. Initial numbers indicated that Prometheus was allocating around 2kB per sample ingested.
To see where the allocations were coming from I used the Go heap profiler with the -alloc_space option, which reports everything that has been allocated rather than just what is currently live:
go tool pprof -alloc_space -svg :9090/debug/pprof/heap > heap.svg
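Prometheus serves these profiling endpoints itself; for any other Go service you can get the same /debug/pprof handlers by pulling in net/http/pprof. A minimal sketch, with :8080 as an example port:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // go tool pprof can then be pointed at http://localhost:8080/debug/pprof/heap
    log.Fatal(http.ListenAndServe(":8080", nil))
}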
Looking through the profile for places to improve, I noticed that metric cloning and relabelling accounted for a significant share of the allocations, much of it from intermediate maps and slices being created with make() and then thrown away. The go_memstats_alloc_bytes_total metric, which the Go client exposes from the runtime's TotalAlloc counter, made it easy to confirm the effect of each change on the allocation rate.
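To give a flavour of the kind of change this points at, rather than the actual Prometheus patches, here's a minimal sketch: giving make() a size hint means one allocation for a copy instead of several as the map grows. The labels package and copyLabels helper are hypothetical.

package labels

// copyLabels is a hypothetical helper that clones a label set.
func copyLabels(src map[string]string) map[string]string {
    // Giving make() a size hint allocates the map once, rather than
    // growing it several times as keys are added.
    dst := make(map[string]string, len(src))
    for k, v := range src {
        dst[k] = v
    }
    return dst
}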
These small changes had the net effect of halving the allocations, down to around 1kB per sample ingested.
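If you want to keep numbers like that visible while iterating, Go's testing package will report them for you. A sketch of a micro-benchmark against the hypothetical copyLabels helper above:

package labels

import "testing"

func BenchmarkCopyLabels(b *testing.B) {
    src := map[string]string{"job": "node", "instance": "host-1:9100"}

    b.ReportAllocs() // report B/op and allocs/op alongside the timings
    for i := 0; i < b.N; i++ {
        copyLabels(src)
    }
}

Running it with go test -bench=CopyLabels -benchmem prints bytes and allocations per operation alongside the timings.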
Being able to quickly bounce between pprof, Prometheus metrics and the source code was essential to making these improvements. Without all three it's unlikely I would have discovered these inefficiencies, let alone understood their impact.
The other takeaway is that the Go GC adds around 33% onto your required cores in non-batch usages (the collector targets 25% of your CPU, leaving the rest of the program with three quarters of the machine) and over 100% onto your required RAM (with GOGC=100 the heap can grow to double the live data before it is collected back). The Request Oriented Collector should improve upon this in future Go versions.
Want to utilise Prometheus in your loadtesting? Contact us.