Mutation always gets stuck and times out

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 
Dgraph version   : v20.11.0-rc1-506-g32f1f5893
Dgraph codename  : unnamed-mod
Dgraph SHA-256   : d3322ca0ffad72ade603a2c963610d210e916b1d5e920151f07a42669f790cee
Commit SHA-1     : 32f1f5893
Commit timestamp : 2021-04-01 22:52:41 +0530
Branch           : release/v21.03
Go version       : go1.15.9
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

What is the hardware spec (RAM, OS)?

CentOS 7.6 32G RAM

Steps to reproduce the issue (command/config used to run Dgraph).

Expected behaviour and actual result.


I expected the mutation to succeed, but it always gets stuck, then times out, and never succeeds.

Experience Report for Feature Request

Note: Feature requests are judged based on user experience and modeled on Go Experience Reports. These reports should focus on the problems: they should not focus on and need not propose solutions.

What you wanted to do

We are inserting a lot of data. Once the number of inserted nodes reaches about 250,000, every new mutation times out.

What you actually did

The mutation data looks like this:

{
  set {
    _:r <dgraph.type> "RESPONSE" .
    _:r <response> "HTTP/1.1 302 Found\r\nDate: Wed, 30 Dec 2020 20:29:27 GMT\r\nServer: Apache/2.4.6\r\nLocation: https://209.250.0.97/\r\nCache-Control: max-age=1728<head>\n<title>302 Found</title>\n</head><body>\n<h1>Found</h1>\n<p>The document has moved <a href=\"https://209.250.0.97/\">here</a>.</p>\n</body></html>\n" .
    _:r <title> "302 Found" .
    _:r <md5> "3bb0aed81197f8131dabd9ed19de9a9f" .
    _:r <status_code> "302" .
    _:r <body> "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>302 Found</title>\n</head><body>\n<h1>Found</h1>\n<p>The document has moved <a href=\"https://209.250.0.97/\">here</a>.</p>\n</body></html>\n" .
    _:r <url> "http://209.250.0.97/" .
    _:r <uuids> _:u .
    _:u <RESPONSE> _:r .
    _:u <dgraph.type> "UUID" .
    _:u <uuid> "5fece32e9dc6d60d26b9d274" .
    _:u <time> "2020-12-30T15:29:34" .
    _:u <taskid> "5fead9f69dc6d6277cf06d9a" .
    _:p <dgraph.type> "PORT" .
    _:p <port> "80" .
    _:u <PORT> _:p .
    _:s <dgraph.type> "SERVICE" .
    _:s <service> "http" .
    _:u <SERVICE> _:s .
    _:t <dgraph.type> "TRANSPORT" .
    _:t <transport> "tcp" .
    _:u <TRANSPORT> _:t .
    _:i <dgraph.type> "IP" .
    _:i <ip> "209.250.0.97" .
    _:u <IP> _:i .
      _:f <dgraph.type> "FAVICON" .
      _:f <hash> "97289029" .
      _:f <data> "PCFkb2N0eXBlIGh0bWw+PGh0bWwgbIj48L3NjcmlwdD48c2NyaXB0IHR5cGU9InRleHQvamF2YXNjcmlwdCI+c2V0VGltZW91dChm\ndW5jdGlvbigpe2FuZ3VsYXIuYm9vdHN0cmFwKGRvY3VtZW50LFsiZmliby5zdHVkZW50YXBwIl0p\nO30pOzwvc2NyaXB0PjwvYm9keT48L2h0bWw+\n" .
      _:f <location> "http://209.250.0.97:80/favicon.ico" .
      _:u <FAVICON> _:f .
      _:r <FAVICON> _:f .
      _:app_1 <dgraph.type> "APP" .
      _:app_1 <app> "Apache" .
      _:app_1 <version> "2.4.6" .
      _:u <APP> _:app_1 .
          _:t1 <dgraph.type> "TAG" .
          _:t1 <tag> "Web servers" .
          _:app_1 <tags> _:t1 .
          _:u <TAG> _:t1 .
_:h <Date> "Wed, 30 Dec 2020 20:29:27 GMT" .
_:h <Server> "Apache/2.4.6" .
_:h <Location> "https://209.250.0.97/" .
_:h <Cache-Control> "max-age=172800" .
_:h <Expires> "Fri, 01 Jan 2021 20:29:27 GMT" .
_:h <Content-Length> "205" .
_:h <Connection> "close" .
_:h <Content-Type> "text/html; charset=iso-8859-1" .
_:h <dgraph.type> "HEADERS" .
_:r <HEADERS> _:h .
  }
}

In this case, the body and data fields may be somewhat large.

Why that wasn’t great, with examples

My guess is that either the data is too large or there are too many indexes, and that this makes the mutation too slow?
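
To make that guess a bit more concrete, here is a rough Python sketch that counts how many distinct 3-character windows a value would produce, compared with the single token a hash index needs. This is not Dgraph's actual tokenizer: the function names are made up for illustration, and treating a trigram index as roughly one entry per distinct window is a simplifying assumption.

```python
from __future__ import annotations


def trigram_tokens(text: str) -> set[str]:
    """Return the distinct 3-character windows of text.

    Assumption: this only approximates what a trigram tokenizer emits.
    """
    return {text[i:i + 3] for i in range(len(text) - 2)}


def estimate_index_entries(value: str) -> dict[str, int]:
    """Rough per-value index entry counts for two index types."""
    return {
        "hash": 1,  # a hash index stores one token per value
        "trigram": len(trigram_tokens(value)),
    }


if __name__ == "__main__":
    # A fragment of the <body> value from the mutation above.
    body = "<html><head>\n<title>302 Found</title>\n</head><body>\n<h1>Found</h1>"
    print(estimate_index_entries(body))
```

The point is only the order of magnitude: a long HTML body yields many index entries per node under a trigram index, while a hash index yields one.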

Any external references to support your case

Hi, did you run this on Ratel or did you run the mutation using something else?

The mutation looks pretty small, it shouldn’t time out.

The schema and indexes look like this:

uuid: string @index(hash) .
time: dateTime @index(month) .
taskid: string @index(hash) .
type UUID {
    uuid: string
    time: dateTime
    taskid: string
}
uuids: [uid] @reverse .

ip: string @index(hash) .
type: string @index(hash) .
isp: string @index(hash) .
asn: string @index(hash) .
idc: string @index(hash) .
IP: [uid] @reverse .
type IP {
    ip
    type
    isp
    asn
    idc
}

port: int @index(int) .
PORT: [uid] @reverse .
type PORT {
    port
}

transport: string @index(hash) .
TRANSPORT: [uid] @reverse .
type TRANSPORT {
    transport
}

version: string @index(hash) .
service: string @index(hash) .
SERVICE: [uid] @reverse .
type SERVICE {
    service
    version
}

app: string @index(hash) .
APP: [uid] @reverse .
type APP {
    app
    version
}

product: string @index(hash) .
PRODUCT: [uid] @reverse .
type PRODUCT {
    product
    version
}

vendor: string @index(hash) .
VENDOR: [uid] @reverse .
type VENDOR {
    vendor
}

database: string @index(hash) .
DATABASE: [uid] @reverse .
type DATABASE {
    database
    version
}

language: string @index(hash) .
LANGUAGE: [uid] @reverse .
type LANGUAGE {
    language
    version
}

os: string @index(hash) .
OS: [uid] @reverse .
type OS {
    os
    version
}

frontend: string @index(hash) .
FRONTEND: [uid] @reverse .
type FRONTEND {
    frontend
    version
}

backend: string @index(hash) .
BACKEND: [uid] @reverse .
type BACKEND {
    backend
    version
}

tag: string @index(hash) .
tags: [uid] @reverse .
TAG: [uid] @reverse .
type TAG {
    tag
}

response: string @index(fulltext) @upsert .
md5: string @index(hash) .
url: string @index(hash) .
title: string @index(hash) .
status_code: int @index(int) .
body: string @index(trigram) .
RESPONSE: [uid] @reverse .
type RESPONSE {
    url
    md5
    title
    status_code
    body
    response
}

data: string .
hash: string @index(hash) .
location: string @index(hash) .
type FAVICON {
    data
    hash
    location
}

@chewxy
Yes, I used both Ratel and pydgraph. In Ratel the mutation always gets stuck and never returns; in pydgraph it always fails with an RPC timeout: error code = DeadlineExceeded desc = context deadline exceeded

I just took your schema and your mutation and ran it on my local instance of Dgraph. It finished. Which version of Dgraph are you using?

@Ro0tk1t Like @chewxy, I could not reproduce this error. I used the RDF (data.rdf) and the schema with indexes (data.schema) mentioned above, and in a similar Vagrant environment that simulates yours, I was able to apply both the mutation and the schema successfully.

Apply data and schema

ALPHA="192.168.3.4:8080"

# data
curl --silent "$ALPHA/mutate?commitNow=true" \
  --header "Content-Type: application/rdf"  \
  --request POST \
  --data-binary "@data.rdf" | jq

# apply schema
curl "$ALPHA/alter" --silent \
  --request POST \
  --data-binary "@data.schema" | jq
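
If you would rather script this from Python than curl, something like the sketch below should be roughly equivalent to the mutation call above. It uses only the standard library. The endpoint and Content-Type are the same as in the curl command; the function names and the 30-second timeout are my own assumptions.

```python
import json
import urllib.request


def build_mutation_request(alpha: str, rdf: bytes) -> urllib.request.Request:
    """Build the same POST that the curl mutation command sends."""
    return urllib.request.Request(
        url=f"http://{alpha}/mutate?commitNow=true",
        data=rdf,
        headers={"Content-Type": "application/rdf"},
        method="POST",
    )


def run_mutation(alpha: str, rdf: bytes, timeout_s: float = 30.0) -> dict:
    """Send the mutation; a stuck request raises an exception after timeout_s."""
    request = build_mutation_request(alpha, rdf)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    with open("data.rdf", "rb") as handle:
        print(run_mutation("192.168.3.4:8080", handle.read()))
```

An explicit client-side timeout makes a hung mutation fail fast with an exception instead of blocking forever.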

Dump and verify schema

ALPHA="192.168.3.4:8080"
curl --silent "$ALPHA/query"\
  --header "Content-Type: application/dql" \
  --request POST  --data $'schema {}'

yields:

{"data":{"schema":[{"predicate":"APP","type":"uid","reverse":true,"list":true},{"predicate":"BACKEND","type":"uid","reverse":true,"list":true},{"predicate":"Cache-Control","type":"default"},{"predicate":"Connection","type":"default"},{"predicate":"Content-Length","type":"default"},{"predicate":"Content-Type","type":"default"},{"predicate":"DATABASE","type":"uid","reverse":true,"list":true},{"predicate":"Date","type":"default"},{"predicate":"Expires","type":"default"},{"predicate":"FAVICON","type":"uid","list":true},{"predicate":"FRONTEND","type":"uid","reverse":true,"list":true},{"predicate":"HEADERS","type":"uid","list":true},{"predicate":"IP","type":"uid","reverse":true,"list":true},{"predicate":"LANGUAGE","type":"uid","reverse":true,"list":true},{"predicate":"Location","type":"default"},{"predicate":"OS","type":"uid","reverse":true,"list":true},{"predicate":"PORT","type":"uid","reverse":true,"list":true},{"predicate":"PRODUCT","type":"uid","reverse":true,"list":true},{"predicate":"RESPONSE","type":"uid","reverse":true,"list":true},{"predicate":"SERVICE","type":"uid","reverse":true,"list":true},{"predicate":"Server","type":"default"},{"predicate":"TAG","type":"uid","reverse":true,"list":true},{"predicate":"TRANSPORT","type":"uid","reverse":true,"list":true},{"predicate":"VENDOR","type":"uid","reverse":true,"list":true},{"predicate":"app","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"asn","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"backend","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"body","type":"string","index":true,"tokenizer":["trigram"]},{"predicate":"data","type":"string"},{"predicate":"database","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"dgraph.drop.op","type":"string"},{"predicate":"dgraph.graphql.p_query","type":"string","index":true,"tokenizer":["sha256"]},{"predicate":"dgraph.graphql.schema","type":"string"},{"predicate":"dgraph.graphql.xid","type":"string","index":true,"tokenizer":["exact"],"upsert":true},{"predicate":"dgraph.type","type":"string","index":true,"tokenizer":["exact"],"list":true},{"predicate":"frontend","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"hash","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"idc","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"ip","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"isp","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"language","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"location","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"md5","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"os","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"port","type":"int","index":true,"tokenizer":["int"]},{"predicate":"product","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"response","type":"string","index":true,"tokenizer":["fulltext"],"upsert":true},{"predicate":"service","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"status_code","type":"int","index":true,"tokenizer":["int"]},{"predicate":"tag","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"tags","type":"uid","reverse":true,"list":true},{"predicate":"taskid","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"time","type":"datetime","index":true,"tokenizer":["month"]},{"predicate":"title","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"transport","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"type","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"url","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"uuid","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"uuids","type":"uid","reverse":true,"list":true},{"predicate":"vendor","type":"string","index":true,"tokenizer":["hash"]},{"predicate":"version","type":"string","index":true,"tokenizer":["hash"]}],"types":[{"fields":[{"name":"app"},{"name":"version"}],"name":"APP"},{"fields":[{"name":"backend"},{"name":"version"}],"name":"BACKEND"},{"fields":[{"name":"database"},{"name":"version"}],"name":"DATABASE"},{"fields":[{"name":"data"},{"name":"hash"},{"name":"location"}],"name":"FAVICON"},{"fields":[{"name":"frontend"},{"name":"version"}],"name":"FRONTEND"},{"fields":[{"name":"ip"},{"name":"type"},{"name":"isp"},{"name":"asn"},{"name":"idc"}],"name":"IP"},{"fields":[{"name":"language"},{"name":"version"}],"name":"LANGUAGE"},{"fields":[{"name":"os"},{"name":"version"}],"name":"OS"},{"fields":[{"name":"port"}],"name":"PORT"},{"fields":[{"name":"product"},{"name":"version"}],"name":"PRODUCT"},{"fields":[{"name":"url"},{"name":"md5"},{"name":"title"},{"name":"status_code"},{"name":"body"},{"name":"response"}],"name":"RESPONSE"},{"fields":[{"name":"service"},{"name":"version"}],"name":"SERVICE"},{"fields":[{"name":"tag"}],"name":"TAG"},{"fields":[{"name":"transport"}],"name":"TRANSPORT"},{"fields":[{"name":"uuid"},{"name":"time"},{"name":"taskid"}],"name":"UUID"},{"fields":[{"name":"vendor"}],"name":"VENDOR"},{"fields":[{"name":"dgraph.graphql.schema"},{"name":"dgraph.graphql.xid"}],"name":"dgraph.graphql"},{"fields":[{"name":"dgraph.graphql.p_query"}],"name":"dgraph.graphql.persisted_query"}]},"extensions":{"server_latency":{"parsing_ns":26590,"processing_ns":917826,"assign_timestamp_ns":6239776,"total_ns":9406401},"txn":{"start_ts":6},"metrics":{"num_uids":{"_total":0}}}}

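When reading a schema dump like this, a small Python sketch can pull out the predicates whose tokenizers are likely to be more expensive than hash. Which tokenizers count as costly here (fulltext, trigram, term) is an assumption for illustration, not something Dgraph reports, and the function name is hypothetical.

```python
from __future__ import annotations

# Assumption: these tokenizers produce many index entries per value.
COSTLY_TOKENIZERS = {"fulltext", "trigram", "term"}


def costly_predicates(schema_response: dict) -> dict[str, list[str]]:
    """Map each predicate to the costly tokenizers it uses, if any."""
    result: dict[str, list[str]] = {}
    for entry in schema_response.get("data", {}).get("schema", []):
        hits = sorted(COSTLY_TOKENIZERS.intersection(entry.get("tokenizer", [])))
        if hits:
            result[entry["predicate"]] = hits
    return result


if __name__ == "__main__":
    import json
    import sys

    # Usage: pipe the output of the schema query above into this script.
    print(costly_predicates(json.load(sys.stdin)))
```

Run against the dump above, this would flag only the predicates indexed with fulltext or trigram, which narrows down where to look first.
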
Dgraph version and build environment

The dgraph version was built from release/v21.03 branch and commit SHA-1 is f9d045acd.

The dgraph binary was placed alongside the other files:

.
└── vagrant_env
    ├── data.rdf
    ├── data.schema
    ├── dgraph
    └── Vagrantfile

So, something like this to build and copy the binary over to Vagrant env:

cd ~/dgraph && git checkout release/v21.03 && git pull && make 
cp ~/dgraph/dgraph/dgraph ~/vagrant_env

cd ~/vagrant_env
vagrant up # see below

Vagrant environment v2

I used a Vagrant environment similar to the one mentioned in the other issue, but added provisioning to automate the setup:

@hosts = {"zero"=>"192.168.3.2", "alpha"=>"192.168.3.4"}

$script = <<-SCRIPT
[[ -f /vagrant/dgraph ]] && cp /vagrant/dgraph /usr/local/bin/dgraph

if [[ $1 == "zero" ]]; then
  COMMAND="/usr/local/bin/dgraph zero --bindall=true --my 192.168.3.2:5080 --raft idx=1 &"
  firewall-cmd --zone=public --permanent --add-port=5080/tcp
  firewall-cmd --zone=public --permanent --add-port=6080/tcp
  firewall-cmd --reload
elif [[ $1 == "alpha" ]]; then
  COMMAND="/usr/local/bin/dgraph alpha --bindall=true --zero=192.168.3.2:5080 -v 5 --my=192.168.3.4:7080 --security whitelist=192.168.3.1/16 &"
  firewall-cmd --zone=public --permanent --add-port=7080/tcp
  firewall-cmd --zone=public --permanent --add-port=8080/tcp
  firewall-cmd --zone=public --permanent --add-port=9080/tcp
  firewall-cmd --reload
fi

cat <<-EOF > /etc/rc.d/rc.local
#!/bin/bash
touch /var/lock/subsys/local
${COMMAND}
EOF

chmod +x /etc/rc.d/rc.local
sudo systemctl enable rc-local
sudo systemctl start rc-local
SCRIPT

Vagrant.configure("2") do |config|
  @hosts.each do |hostname, ipaddr|
    config.vm.define hostname do |node|
      node.vm.box = "generic/centos7"
      node.vm.hostname = "#{hostname}"
      node.vm.network "private_network", ip: ipaddr
      node.vm.synced_folder ".", "/vagrant"

      node.vm.provision "shell" do |shell|
        shell.inline = $script
        shell.args = [hostname]
        shell.privileged = true
      end
    end
  end
end

To use that environment:

## bring up 2 VMs + provision (firewall, rc.local for dgraph)
vagrant up 


## test health/state
ZERO="192.168.3.2:6080"
ALPHA="192.168.3.4:8080"
curl -s $ZERO/state | jq '.zeros."1"'
curl -s $ALPHA/health | jq '.[0]'

@joaquin @chewxy
The problem appears once the number of data nodes grows large; it does not appear when the database is empty.

I finally found the reason: the fulltext and term indexes consume too many resources. It seems that only indexes such as hash and exact work well on a large dataset.

@joaquin
And there is still a new problem:

too many open files

To resolve the “Too many open files” error, you may try https://dgraph.io/docs/deploy/troubleshooting/#too-many-open-files .
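
Before changing anything, it can help to check what the open-file limit actually is. Below is a minimal Python sketch using the standard resource module (Unix only). It reports the limit of the Python process itself, so it only reflects Dgraph's limit if it runs as the same user in the same kind of session. The 65535 threshold is just a placeholder, not an official recommendation.

```python
from __future__ import annotations

import resource


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_low(soft: int, threshold: int = 65535) -> bool:
    """True if the soft limit is below threshold (a hypothetical cutoff)."""
    if soft == resource.RLIM_INFINITY:
        return False  # unlimited is never considered low
    return soft < threshold


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft={soft} hard={hard} low={limit_is_low(soft)}")
```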