Jaeger cannot start: Badger is very slow deleting a file and Jaeger cannot init storage

When the data is bigger than 2G, jaeger-all-in-one cannot start. The log shows ‘deleting empty file’, but that file is not in the data directory. Very strange!

When Jaeger starts, reading the Badger files takes a very long time. Why does Badger need so long to init? The key TTL is 2h, the keys total about 1.5G, and the total data is about 2.2G. As you can see, reading the data started at 21:55:55 and finished at 22:36:50: 41 minutes!

I think you will need to profile Jaeger on the system you are referring to; it sounds like the problem must be related to that. Badger starts up in seconds with 500GiB in Dgraph.

Also, the ‘deleting empty file’ line will appear on almost every start. The file is missing because Badger deleted it, as the line states.

Thanks! If a data file (key or log) has an error, does Badger need a long time to recover? It seems Jaeger stopped because of an error writing data to disk, but when it restarted, the data init took so long that the liveness probe never succeeded, and finally the pod crashed!

Just from your logs in the screenshots above, it seems Badger itself should have been good to go after it printed ‘all X tables opened in Y’. The other lines you see, ‘LOG Compact’, are from runtime compaction, which happens periodically after the database is open.

Seems like maybe you would want to look at what happens in Jaeger between badger.Open() and whatever prints out that storage config line.
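
One way to narrow that down from the logs alone is to look for the largest pause between consecutive log lines. A minimal sketch in Python, assuming each log line starts with an `HH:MM:SS` timestamp and all lines fall on the same day; the sample lines are placeholders, not real Jaeger or Badger output, and `largest_gap` is a name made up for this example:

```python
from datetime import datetime

def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the widest pause."""
    fmt = "%H:%M:%S"
    # Assumption: the first 8 characters of every line are the timestamp.
    stamped = [(datetime.strptime(line[:8], fmt), line) for line in lines]
    best = (0.0, None, None)
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, l0, l1)
    return best

# Placeholder log lines (hypothetical text; only the timestamps matter).
sample = [
    "21:55:54 step A",
    "21:55:55 step B",
    "22:36:50 step C",
]
gap, before, after = largest_gap(sample)
print(f"{gap:.0f}s between {before!r} and {after!r}")
```

Whatever code runs between the two lines around the widest gap is the place to start profiling.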

Thank you! As you can see, Jaeger does not do any time-consuming tasks; it just waits for the file open to finish (see the picture below). From the second screenshot we can see that Jaeger started reading data at 21:55:55 and returned ready at 22:36:50, when it printed the Badger options. So I still think Badger takes a long time to become ready and return.

That NewCacheStore() call has prefill==true which iterates all the services in the database, and for each service iterates every operation for that service (happening here). Iterating the entire database in several serial iterations is what is taking the time there, not starting badger, which happens in 411ms in the logs from your original post. There are certainly more efficient ways to query badger than the ways done in that cache prefill logic.
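
To illustrate why that pattern gets expensive, here is a toy model in Python. It is not Jaeger's or Badger's actual code: `ToyStore` is a hypothetical store whose only primitive is a scan over all keys, and the (service, operation) key layout is an assumption made for the example. It counts how many keys are read by a per-service prefill versus a single pass.

```python
from collections import defaultdict

class ToyStore:
    """Hypothetical key store; every scan walks all keys and counts reads."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.reads = 0

    def scan(self):
        for key in self.keys:
            self.reads += 1
            yield key

def prefill_nested(store):
    # One scan to list the services, then one more full scan per service.
    services = sorted({svc for svc, _ in store.scan()})
    return {s: sorted({op for svc, op in store.scan() if svc == s})
            for s in services}

def prefill_single_pass(store):
    # Collect services and their operations together in one scan.
    cache = defaultdict(set)
    for svc, op in store.scan():
        cache[svc].add(op)
    return {s: sorted(ops) for s, ops in cache.items()}

keys = [(f"svc{i}", f"op{j}") for i in range(50) for j in range(20)]
nested, single = ToyStore(keys), ToyStore(keys)
assert prefill_nested(nested) == prefill_single_pass(single)
print(nested.reads, single.reads)
```

With 50 services of 20 operations each, the nested version reads 51,000 keys against 1,000 for the single pass, and the gap grows with the number of services. On a slow disk each of those reads costs more, which compounds the effect.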

Also note that in the log screenshots the small compactions are taking ~20s, and the first one listed took a very long time. I think maybe the disks backing this are very slow? Like, overlayfs slow (storing the files within a running container). Just a guess there, though.
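
A quick way to test that guess is a rough sequential-write probe against the data directory. This is only a sketch in Python: `write_throughput_mib_s` is a name invented here, the 64 MiB size is arbitrary, and it measures sequential writes with one final fsync, not the random reads that iteration and compaction also depend on.

```python
import os
import tempfile
import time

def write_throughput_mib_s(directory, total_mib=64, chunk_mib=1):
    """Write total_mib of data, fsync once, and return MiB per second."""
    chunk = os.urandom(chunk_mib * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(total_mib // chunk_mib):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temporary file
    return total_mib / elapsed

if __name__ == "__main__":
    # Point this at the directory that holds the Badger files.
    print(f"{write_throughput_mib_s(tempfile.gettempdir()):.1f} MiB/s")
```

Running it once inside the container and once on the host volume shows whether the container filesystem is the slow part.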


Got it, thank you very much. The disk is indeed not fast (RAID0). Thank you again!
