Thank you for the response Manish!
My postulation here is that on server restart and wakeup there was a flood of traces being produced. Since the Jaeger exporter doesn't allow one to configure the bundler's BundleCountThreshold, it uses the bundler's very low default threshold of only 10 (see google-api-go-client/support/bundler/bundler.go at commit 78b596aa1e71326943165617fdb108dd18b34dbf on GitHub).
That causes a huge number of flush goroutines to be spawned, and with the runtime scheduler load-balancing them over your cores, a lot of CPU is consumed. Eventually the uploading/consumption rate catches up with the production rate and the number of goroutines stabilizes. A simple experiment to reproduce this is to run the Go programs below while monitoring your CPU usage, then trigger a goroutine dump with Ctrl+\ (SIGQUIT): for "Scenario 1" you'll see many goroutines tripping on startFlushLocked. Interestingly, it's the very first goroutine that gets tripped up:
Scenario 1: (postulated current scenario)
package main

import (
	"time"

	"google.golang.org/api/support/bundler"
)

func main() {
	bdlr := bundler.NewBundler((*int)(nil), func(bundle interface{}) {
		sl := bundle.([]*int)
		if len(sl) == 0 {
			panic("empty slice")
		}
	})
	// BundleCountThreshold is left at its default of 10, so every
	// 10 Adds trigger a flush.
	for {
		for i := 0; i < 1e5; i++ {
			i := i // avoid aliasing the loop variable
			bdlr.Add(&i, 1)
		}
		<-time.After(5 * time.Second)
	}
}
Scenario 2: (ideal, with a bundling count high enough for the consumption rate to keep up with trace production):
package main

import (
	"time"

	"google.golang.org/api/support/bundler"
)

func main() {
	bdlr := bundler.NewBundler((*int)(nil), func(bundle interface{}) {
		sl := bundle.([]*int)
		if len(sl) == 0 {
			panic("empty slice")
		}
	})
	// Bundle up to 1,000 items per flush instead of the default 10.
	bdlr.BundleCountThreshold = 1000
	for {
		for i := 0; i < 1e5; i++ {
			i := i // avoid aliasing the loop variable
			bdlr.Add(&i, 1)
		}
		<-time.After(5 * time.Second)
	}
}
Remedies:
- I'd highly suggest using the OpenCensus agent (opencensus.io/service/components/agent/) together with the ocagent-exporter. It replaces the Jaeger exporter, and, if you're also using the Prometheus exporter, it additionally enables you to horizontally scale your Dgraph server deployments. It is literally a drop-in replacement for both exporters, like this:
import (
	"log"
	"time"

	"contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

oce, err := ocagent.NewExporter(
	ocagent.WithInsecure(),
	ocagent.WithReconnectionPeriod(5*time.Second),
	ocagent.WithAddress("localhost:55678"), // Only included here for demo purposes.
	ocagent.WithServiceName("ocagent-go-example"))
if err != nil {
	log.Fatalf("Failed to create ocagent-exporter: %v", err)
}
trace.RegisterExporter(oce)
view.RegisterExporter(oce)
I've talked to folks who use Prometheus, and there are known issues with Prometheus' client during scraping cycles.
Also, the ocagent-go-exporter has a default bundle size of 300, which is a simple approximation for a very high-traffic application. If your server creates spans at, say, 3,000 QPS, exporting will be invoked only 10 times per second instead of the current 300 times. That is very reasonable for various scenarios, including a startup flood of spans.
The agent also means your customers can deploy it as a sidecar once in their entire cloud/cluster and change their export configurations without having to stop their Dgraph servers; they also don't need a backend running before their applications start, etc.
Hope this helps and I look forward to hearing back from you.
Thank you!