Concurrent queries with ReadOnlyTxn on predicates not initially served may be blocked

Moved from GitHub dgraph/3567

Posted by wangleiin:

If you suspect this could be a bug, follow the template.

  • What version of Dgraph are you using?
    I’m using release v1.0.15.

  • Have you tried reproducing the issue with latest release?
    Yes

  • What is the hardware spec (RAM, OS)?
    Not related.

  • Steps to reproduce the issue (command/config used to run Dgraph).
    Run concurrent queries like this:

func DgQuery(q string, c *dgo.Dgraph) ([]byte, error) {

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*30)
	defer cancel()

	txn := c.NewReadOnlyTxn()
	defer txn.Discard(ctx)

	resp, err := txn.Query(ctx, q)
	if err != nil {
		return []byte(""), err
	}

	b := resp.GetJson()
	if b == nil {
		err = fmt.Errorf("resp is empty")
		return []byte(""), err
	}
	return b, nil
}
func main() {
	c := NewClient()

	wg := new(sync.WaitGroup)
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(pred string) {
			defer wg.Done()
			q := fmt.Sprintf(`
		{
			all(func: has(%s)) {
				uid
				balance
			}
		}
	`, pred)
			result, err := client.DgQuery(q, c)
			if err != nil {
				log.Fatal(err)
			}
			log.Println(string(result))
		}(fmt.Sprintf("predicate%d", i))
	}
	wg.Wait()

}

With a ReadOnlyTxn and predicates that are not initially served, the queries may block until the 30s timeout expires.

  • Expected behaviour and actual result.

Expected: empty results, returned quickly.

Actual: the queries were blocked forever.

I found that in github.com/dgraph-io/dgraph/worker/groups.go,
func (g *groupi) processOracleDeltaStream batches the deltas it reads from deltaCh but does not update GroupChecksums while doing so. This may cause func (g *groupi) ChecksumsMatch(ctx context.Context) to block forever, because the checksums never compare equal (“==”).

wangleiin commented :

I wonder if the checksum should be updated in
github.com/dgraph-io/dgraph/worker/groups.go, line 832:

SLURP:
			for {
				select {
				case more := <-deltaCh:
					if more == nil {
						return
					}
					batch++
					delta.Txns = append(delta.Txns, more.Txns...)
					delta.MaxAssigned = x.Max(delta.MaxAssigned, more.MaxAssigned)
					delta.GroupChecksums = more.GroupChecksums // proposed addition
				default:
					break SLURP
				}
			}
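
The concern can be illustrated with a small, self-contained sketch. It is not the Dgraph code: the delta type, drain, and run are hypothetical stand-ins, and the values are made up. It only shows that a non-blocking drain loop which merges MaxAssigned but never copies the checksums keeps the checksums of the first delta in the batch. Whether this is the actual root cause of the issue is not confirmed.

```go
package main

import "fmt"

// delta is a hypothetical stand-in for an oracle delta: the highest assigned
// timestamp plus one checksum per group.
type delta struct {
	MaxAssigned    uint64
	GroupChecksums map[uint32]uint64
}

// drain merges every delta already queued on ch into first without blocking.
// copyChecksums toggles the assignment proposed above.
func drain(first delta, ch <-chan delta, copyChecksums bool) delta {
	merged := first
	for {
		select {
		case more := <-ch:
			if more.MaxAssigned > merged.MaxAssigned {
				merged.MaxAssigned = more.MaxAssigned
			}
			if copyChecksums {
				merged.GroupChecksums = more.GroupChecksums
			}
		default:
			// Nothing else is queued, so the batch is complete.
			return merged
		}
	}
}

// run queues two newer deltas behind a first one and drains them as a batch.
func run(copyChecksums bool) delta {
	ch := make(chan delta, 2)
	ch <- delta{MaxAssigned: 2, GroupChecksums: map[uint32]uint64{1: 200}}
	ch <- delta{MaxAssigned: 3, GroupChecksums: map[uint32]uint64{1: 300}}
	first := delta{MaxAssigned: 1, GroupChecksums: map[uint32]uint64{1: 100}}
	return drain(first, ch, copyChecksums)
}

func main() {
	for _, copyChecksums := range []bool{false, true} {
		d := run(copyChecksums)
		fmt.Printf("copy=%v maxAssigned=%d checksum[1]=%d\n",
			copyChecksums, d.MaxAssigned, d.GroupChecksums[1])
	}
}
```

Without the copy, the merged delta carries the newest MaxAssigned but the oldest checksum for group 1. With the copy, it carries the checksum of the last delta in the batch.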

martinmr commented :

Thanks for reporting the issue and including the steps to reproduce. I can reproduce the error in the 1.0 branch but not in the master branch (what will become version 1.1). I will try to check the differences between the two versions to see if I can find the fix.

martinmr commented :

I haven’t had luck finding the root cause but I can confirm that queries are getting blocked while trying to verify the checksums. Since that’s the case, this affects all transactions, not just read-only ones.

EDIT: It doesn’t seem like a blocking issue: I printed both the oracle and delta checksums as they are received. Both are received by the alpha but their values are different.

martinmr commented :

@manishrjain: Any ideas on how I could further debug this issue? So far I have only seen it in the 1.0 branch.