Jaeger Exporter causing huge CPU spike

Thank you for the response, Manish!

My postulation here is that on server restart/wakeup there was a flood of traces being produced, and since the Jaeger exporter doesn’t let you configure the bundler’s BundleCountThreshold, it uses the very low default of only 10 (see bundler.go in googleapis/google-api-go-client at commit 78b596aa1e71326943165617fdb108dd18b34dbf on GitHub). That spawns a huge number of goroutines, and with the runtime scheduler load-balancing them across your cores, they consume a lot of CPU. Eventually the uploading/consumption rate catches up with the production rate and the number of goroutines stabilizes. A simple experiment to see this is to run the two Go programs below while monitoring your CPU usage, then trigger a goroutine dump with Ctrl + \ (SIGQUIT): for Scenario 1 you’ll see many goroutines tripping on startFlushLocked. Interestingly, the first goroutine in the dump is the one tripped up there.
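
For reference, the relevant constants in that bundler.go read roughly like this (quoted from memory of the linked commit, so double-check the exact values against the source; only the count threshold of 10 matters here):

const (
	DefaultDelayThreshold       = time.Second
	DefaultBundleCountThreshold = 10
	DefaultBundleByteThreshold  = 1e6 // 1M
	DefaultBufferedByteLimit    = 1e9 // 1G
)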

Scenario 1 (postulated current scenario):

package main

import (
	"net/http"
	"time"

	"google.golang.org/api/support/bundler"
)

func main() {
	// No BundleCountThreshold set, so the bundler falls back to its
	// default of only 10 items per bundle.
	bdlr := bundler.NewBundler((*int)(nil), func(bundle interface{}) {
		sl := bundle.([]*int)
		if len(sl) == 0 {
			panic("empty slice")
		}
	})

	// Flood the bundler with a burst of 100,000 items every 5 seconds.
	go func() {
		for {
			for i := 0; i < 1e5; i++ {
				i := i // give each item its own pointer; the loop variable is reused
				bdlr.Add(&i, 1)
			}
			<-time.After(5 * time.Second)
		}
	}()

	// Just using this to block instead of: for {}
	http.ListenAndServe(":8888", nil)
}
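
If you’d rather watch the goroutine count than read a full stack dump, here’s a small addition of mine (not part of the original experiment; it also needs "fmt" and "runtime" in the imports) that you can drop into Scenario 1’s main before the producer goroutine:

// Prints the live goroutine count once per second, so you can watch it
// spike on each burst of Adds and settle as the flush goroutines drain.
go func() {
	for range time.Tick(time.Second) {
		fmt.Println("goroutines:", runtime.NumGoroutine())
	}
}()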

Scenario 2 (ideal: a bundle count threshold high enough that the consumption rate can keep up with trace production):

package main

import (
	"net/http"
	"time"

	"google.golang.org/api/support/bundler"
)

func main() {
	bdlr := bundler.NewBundler((*int)(nil), func(bundle interface{}) {
		sl := bundle.([]*int)
		if len(sl) == 0 {
			panic("empty slice")
		}
	})
	// Raise the bundle size so flushes happen ~100x less often
	// than with the default threshold of 10.
	bdlr.BundleCountThreshold = 1e3

	// Same burst of 100,000 items every 5 seconds as in Scenario 1.
	go func() {
		for {
			for i := 0; i < 1e5; i++ {
				i := i // give each item its own pointer; the loop variable is reused
				bdlr.Add(&i, 1)
			}
			<-time.After(5 * time.Second)
		}
	}()

	// Just using this to block instead of: for {}
	http.ListenAndServe(":8888", nil)
}
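
With BundleCountThreshold raised to 1,000, each burst of 100,000 items triggers at most around 100 handler invocations instead of up to 10,000, so far fewer flush goroutines pile up at once.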

Remedies:

  • I’d highly suggest using the agent (opencensus.io/service/components/agent/) with the ocagent-exporter. It replaces the Jaeger exporter, and if you are using the Prometheus exporter it also lets you horizontally scale your Dgraph server deployments. It is practically a drop-in replacement for both exporters, like this:
	// Needs these imports: "log", "time",
	// "contrib.go.opencensus.io/exporter/ocagent",
	// "go.opencensus.io/trace", "go.opencensus.io/stats/view"
	oce, err := ocagent.NewExporter(
		ocagent.WithInsecure(),
		ocagent.WithReconnectionPeriod(5*time.Second),
		ocagent.WithAddress("localhost:55678"), // Only included here for demo purposes.
		ocagent.WithServiceName("ocagent-go-example"))
	if err != nil {
		log.Fatalf("Failed to create ocagent-exporter: %v", err)
	}
	trace.RegisterExporter(oce)
	view.RegisterExporter(oce)

I’ve talked to folks who use Prometheus, and there are issues with Prometheus’ client during scraping cycles.

Also, the ocagent-go-exporter has a default bundle size of 300, which is a simple approximation for a very high-traffic application. If your server creates spans at, say, 3,000 QPS, exporting will be invoked only 10 times per second instead of the current 300, which is very reasonable for a variety of scenarios, including a startup flood of spans.
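
To make that arithmetic concrete (nothing new here, just the numbers above):

package main

import "fmt"

func main() {
	const spansPerSecond = 3000
	// Jaeger exporter today: the bundler's default BundleCountThreshold of 10.
	fmt.Println("exports/sec at threshold 10: ", spansPerSecond/10) // 300
	// ocagent-go-exporter: its default bundle size of 300.
	fmt.Println("exports/sec at threshold 300:", spansPerSecond/300) // 10
}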

The agent also means your customers can deploy it as a sidecar just once in their entire cloud/cluster and change their export configurations without having to stop their Dgraph servers; they also don’t need a tracing backend running before their applications start, etc.

Hope this helps and I look forward to hearing back from you.

Thank you!
