Bulk loading 72.1M records from RDBMS with 0 output

Hello,

I have made several attempts to import a small fraction of our data from an RDBMS into Dgraph using the bulk loader, but whenever it reaches the REDUCE phase the output appears to be empty.

When I tried with just 2 tiny tables, however, I could see data in the p folder.

This is the command I attempted:

dgraph bulk -f ./bix/json-gzip/ -s ./bix/schema/bix_schema.txt --replace_out --out ./bulkdgraph --tmp ./tmpdgraph --num_go_routines 4 --mapoutput_mb 16
[05:23:31Z] MAP 04h51m37s nquad_count:788.9M err_count:0.000 nquad_speed:45.09k/sec edge_count:3.748G edge_speed:214.2k/sec
[05:23:32Z] MAP 04h51m38s nquad_count:788.9M err_count:0.000 nquad_speed:45.08k/sec edge_count:3.748G edge_speed:214.2k/sec
Shard tmpdgraph/map_output/000 -> Reduce tmpdgraph/shards/shard_0/000
[05:23:33Z] REDUCE 04h51m39s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:34Z] REDUCE 04h51m40s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:35Z] REDUCE 04h51m41s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:36Z] REDUCE 04h51m42s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:37Z] REDUCE 04h51m43s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:38Z] REDUCE 04h51m44s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:39Z] REDUCE 04h51m45s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:40Z] REDUCE 04h51m46s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
Num CPUs: 4
[05:23:47Z] REDUCE 04h51m53s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:51Z] REDUCE 04h51m57s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:52Z] REDUCE 04h51m59s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:53Z] REDUCE 04h52m00s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:54Z] REDUCE 04h52m01s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:55Z] REDUCE 04h52m02s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0
[05:23:56Z] REDUCE 04h52m03s 0.00% edge_count:0.000 edge_speed:0.000/sec plist_count:0.000 plist_speed:0.000/sec. Num Encoding: 0

I ran the above command on a single AWS t3a.xlarge instance. It took ~6 hours to run the import for 72.1M records (rows) spread across <400 columns.

You are likely running out of memory in the REDUCE phase. Can you capture the heap profile, the block profile, and the htop output, maybe every 30 seconds?
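
For example, something like this could collect those snapshots every 30 seconds (a rough sketch, assuming the bulk loader's pprof endpoint is reachable on localhost:8080, as noted further down in this thread):

while true; do
  ts=$(date +%s)
  curl -s http://localhost:8080/debug/pprof/heap -o "heap-${ts}.pb.gz"   # heap profile snapshot
  top -b -n 1 | head -n 20 > "top-${ts}.txt"                             # plain-text stand-in for an htop screenshot
  sleep 30
done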

Another similar thread: Fatal error: runtime: out of memory when bulk loader

cc @harshil_goel

This isn’t an OOM issue. The bulk loader is completing.

Not OOM directly, but the bulk loader gets into a state where it has held a lot of memory in the REDUCE phase, is then stuck (presumably waiting for more), and finally exits after some time. The exit status is 0, indicating success.

@harshil_goel and I are investigating this.

Hi @Paras, thank you for your quick reply.

I am trying to rerun the command, this time with the --skip_map_phase flag, since the map phase takes half a day. What I got was an Assert failed error.

dgraph bulk -f ./bix/json-gzip/ -s ./bix/schema/bix_schema.txt --replace_out --out ./bulkdgraph --tmp ./tmpdgraph --num_go_routines 4 --mapoutput_mb 16 --skip_map_phase
[Decoder]: Using assembly version of decoder
I0713 07:22:36.862743    2574 init.go:99]

Dgraph version   : v20.03.3
Dgraph SHA-256   : 08424035910be6b6720570427948bab8352a0b5a6d59a0d20c3ec5ed29533121
Commit SHA-1     : fa3c19120
Commit timestamp : 2020-06-02 16:47:25 -0700
Branch           : HEAD
Go version       : go1.14.1

For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph     , visit http://discuss.dgraph.io.
To say hi to the community       , visit https://dgraph.slack.com.

Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.


{
        "DataFiles": "./bix/json-gzip/",
        "DataFormat": "",
        "SchemaFile": "./bix/schema/bix_schema.txt",
        "GqlSchemaFile": "",
        "OutDir": "./bulkdgraph-tiny",
        "ReplaceOutDir": false,
        "TmpDir": "./tmpdgraph-tiny",
        "NumGoroutines": 5,
        "MapBufSize": 16777216,
        "SkipMapPhase": true,
        "CleanupTmp": true,
        "NumReducers": 1,
        "Version": false,
        "StoreXids": false,
        "ZeroAddr": "localhost:5080",
        "HttpAddr": "localhost:8080",
        "IgnoreErrors": false,
        "CustomTokenizers": "",
        "NewUids": false,
        "Encrypted": false,
        "MapShards": 1,
        "ReduceShards": 1,
        "BadgerKeyFile": "",
        "BadgerCompressionLevel": 1
}

The bulk loader needs to open many files at once. This number depends on the size of the data set loaded, the map file output size, and the level of indexing. 100,000 is adequate for most data set sizes. See `man ulimit` for details of how to change the limit.
Current max open files limit: 100000

Connecting to zero at localhost:5080
2020/07/13 07:22:37 Assert failed
github.com/dgraph-io/dgraph/x.AssertTrue
        /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:94
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).run
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/reduce.go:59
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).reduceStage
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/loader.go:299
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.run
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:241
github.com/dgraph-io/dgraph/dgraph/cmd/bulk.init.0.func1
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/bulk/run.go:49
github.com/spf13/cobra.(*Command).execute
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
        /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70
main.main
        /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78
runtime.main
        /usr/local/go/src/runtime/proc.go:203
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1373

I tried to look up the code on GitHub, but GitHub is currently down, so I have no clue what to do with the error above.

You can skip the map phase only if the expected output from the map phase (the tmp folder containing the map output) already exists.

Here is the info about the --skip_map_phase option.

     --skip_map_phase                 Skip the map phase (assumes that map output files already exist).
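
As an illustration (a hedged sketch using the paths from your command above, not something I have run against your data): as far as I can tell, the first run has to keep its tmp directory, e.g. with --cleanup_tmp=false, and the rerun must point --tmp at that same directory so the map output is found.

dgraph bulk -f ./bix/json-gzip/ -s ./bix/schema/bix_schema.txt --replace_out --out ./bulkdgraph --tmp ./tmpdgraph --cleanup_tmp=false
# later: reuse the existing map output and go straight to the REDUCE phase
dgraph bulk -f ./bix/json-gzip/ -s ./bix/schema/bix_schema.txt --replace_out --out ./bulkdgraph --tmp ./tmpdgraph --skip_map_phase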

Thank you for your reply @Paras

I managed to import the data successfully after provisioning an x1e.4xlarge (16 vCPU, 488GB RAM) EC2 Spot Fleet instance.

nohup dgraph bulk -f ./bix/json-gzip -s ./bix/schema/weet_schema.txt --replace_out --out ./bulkdgraph --tmp ./tmpdgraph --num_go_routines 32 --mapoutput_mb 64 --cleanup_tmp=false --skip_map_phase &> bulkdgraphlog &

It wasn’t straightforward, as I struggled with disk capacity. The final output takes 63GB in total, which I think is too big for a small fraction of the data. The MAP phase took roughly 1 hour, followed by another hour for the REDUCE phase.

I noticed several errors during the MAP phase:

[07:39:57Z] MAP 02m22s nquad_count:62.22M err_count:0.000 nquad_speed:436.8k/sec edge_count:165.8M edge_speed:1.164M/sec
E0714 07:39:58.576076    3644 xidmap.go:133] Error while getting lease: rpc error: code = Unknown desc = context deadline exceeded
[07:39:58Z] MAP 02m23s nquad_count:63.14M err_count:0.000 nquad_speed:440.0k/sec edge_count:168.2M edge_speed:1.172M/sec

...

[08:10:57Z] MAP 33m22s nquad_count:644.7M err_count:0.000 nquad_speed:322.0k/sec edge_count:2.771G edge_speed:1.384M/sec
E0714 08:10:57.720867    3644 xidmap.go:133] Error while getting lease: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[08:10:58Z] MAP 33m23s nquad_count:645.3M err_count:0.000 nquad_speed:322.1k/sec edge_count:2.775G edge_speed:1.386M/sec

...

[08:10:59Z] MAP 33m24s nquad_count:645.9M err_count:0.000 nquad_speed:322.3k/sec edge_count:2.779G edge_speed:1.387M/sec
E0714 08:10:59.792037    3644 xidmap.go:133] Error while getting lease: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[08:11:00Z] MAP 33m25s nquad_count:646.3M err_count:0.000 nquad_speed:322.3k/sec edge_count:2.782G edge_speed:1.387M/sec
E0714 08:11:00.872293    3644 xidmap.go:133] Error while getting lease: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0714 08:11:01.037544    3644 xidmap.go:133] Error while getting lease: rpc error: code = Unknown desc = No healthy connection found to Leader of group zero
E0714 08:11:01.372338    3644 xidmap.go:133] Error while getting lease: rpc error: code = Unknown desc = No healthy connection found to Leader of group zero
[08:11:01Z] MAP 33m26s nquad_count:646.6M err_count:0.000 nquad_speed:322.3k/sec edge_count:2.784G edge_speed:1.388M/sec
E0714 08:11:02.033083    3644 xidmap.go:133] Error while getting lease: rpc error: code = Unknown desc = No healthy connection found to Leader of group zero
[08:11:02Z] MAP 33m27s nquad_count:646.6M err_count:0.000 nquad_speed:322.2k/sec edge_count:2.784G edge_speed:1.387M/sec

I am not sure if those errors would lead to corrupt data. From htop, I could see the bulk loader was using ~71GB of RAM. I did capture the heap profile several times, though not every 30 seconds. I tried to capture the block profile, but there was no data; am I missing something?

After all the import drama, I copied the p folder to /var/lib/dgraph/p and started the dgraph-alpha service. But the service failed to start due to too many open files. As an additional note, I downsized the instance back to t3a.large, as I need to compare performance with Elasticsearch, which is currently running in the testing environment.

Jul 14 12:58:52 ip-172-31-45-91 systemd[1]: Started dgraph.io Alpha instance.
Jul 14 12:58:52 ip-172-31-45-91 dgraph[27165]: [Decoder]: Using assembly version of decoder
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: ContextifyFrames
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: Environment
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: Modules
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: IgnoreErrors
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Decoder]: Using assembly version of decoder
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: ContextifyFrames
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: Environment
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: Modules
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:58:53 Integration installed: IgnoreErrors
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.368838   27191 init.go:99]
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Dgraph version   : v20.03.3
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Dgraph SHA-256   : 08424035910be6b6720570427948bab8352a0b5a6d59a0d20c3ec5ed29533121
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Commit SHA-1     : fa3c19120
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Commit timestamp : 2020-06-02 16:47:25 -0700
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Branch           : HEAD
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Go version       : go1.14.1
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: For Dgraph official documentation, visit https://docs.dgraph.io.
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: For discussions about Dgraph     , visit http://discuss.dgraph.io.
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: To say hi to the community       , visit https://dgraph.slack.com.
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: Copyright 2015-2020 Dgraph Labs, Inc.
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369393   27191 run.go:608] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369428   27191 run.go:609] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:0.01 MyAddr: ZeroAddr:[localhost:5080] RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-07-14 12:58:53.086607981 +0000 UTC m=+0.012427411 LudicrousMode:false BadgerKeyFile:}
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369479   27191 run.go:610] worker.Config: {PostingDir:/var/lib/dgraph/p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: BadgerCompressionLevel:3 WALDir:/var/lib/dgraph/w MutationsMode:0 AuthToken: AllottedMemory:7144 HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369529   27191 server_state.go:75] Setting Badger Compression Level: 3
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369537   27191 server_state.go:84] Setting Badger table load option: mmap
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369543   27191 server_state.go:96] Setting Badger value log load option: mmap
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.369550   27191 server_state.go:141] Opening write-ahead log BadgerDB with options: {Dir:/var/lib/dgraph/w ValueDir:/var/lib/dgraph/w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x28325d0 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 KeepBlockIndicesInCache:false KeepBlocksInCache:false managedTxns:false maxBatchCount:0 maxBatchSize:0}
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.380574   27191 log.go:34] All 1 tables opened in 2ms
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.383540   27191 log.go:34] Replaying file id: 5 at offset: 94766
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.383727   27191 log.go:34] Replay took: 5.64µs
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.384050   27191 server_state.go:75] Setting Badger Compression Level: 3
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.384266   27191 server_state.go:84] Setting Badger table load option: mmap
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.384418   27191 server_state.go:96] Setting Badger value log load option: mmap
Jul 14 12:58:53 ip-172-31-45-91 dgraph[27165]: I0714 12:58:53.384510   27191 server_state.go:164] Opening postings BadgerDB with options: {Dir:/var/lib/dgraph/p ValueDir:/var/lib/dgraph/p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x28325d0 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 KeepBlockIndicesInCache:true KeepBlocksInCache:true managedTxns:false maxBatchCount:0 maxBatchSize:0}
Jul 14 12:58:56 ip-172-31-45-91 dgraph[27165]: I0714 12:58:56.414753   27191 log.go:34] 457 tables out of 1314 opened in 3.006s
Jul 14 12:58:59 ip-172-31-45-91 dgraph[27165]: I0714 12:58:59.415275   27191 log.go:34] 880 tables out of 1314 opened in 6.006s
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:59:00 ModuleIntegration wasn't able to extract modules: module integration failed
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: [Sentry] 2020/07/14 12:59:00 Sending fatal event [941117a56edb4ab8ae57c60a6a441a61] to o318308.ingest.sentry.io project: 1805390
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: 2020/07/14 12:59:00 open /var/lib/dgraph/p/000213.sst: too many open files
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: Opening file: "/var/lib/dgraph/p/000213.sst"
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/badger/v2.newLevelsController.func1
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:148
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: runtime.goexit
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /usr/local/go/src/runtime/asm_amd64.s:1373
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: Error while creating badger KV posting store
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/dgraph/x.Checkf
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:51
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/dgraph/worker.(*ServerState).initStorage
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:168
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/dgraph/worker.InitServerState
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/worker/server_state.go:57
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:612
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/dgraph/dgraph/cmd/alpha.init.2.func1
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/alpha/run.go:90
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/spf13/cobra.(*Command).execute
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/spf13/cobra.(*Command).ExecuteC
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/spf13/cobra.(*Command).Execute
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: github.com/dgraph-io/dgraph/dgraph/cmd.Execute
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: main.main
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: runtime.main
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /usr/local/go/src/runtime/proc.go:203
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]: runtime.goexit
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27165]:         /usr/local/go/src/runtime/asm_amd64.s:1373
Jul 14 12:59:00 ip-172-31-45-91 systemd[1]: dgraph-alpha.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 12:59:00 ip-172-31-45-91 systemd[1]: dgraph-alpha.service: Failed with result 'exit-code'.
Jul 14 12:59:00 ip-172-31-45-91 systemd[1]: dgraph-alpha.service: Service hold-off time over, scheduling restart.
Jul 14 12:59:00 ip-172-31-45-91 systemd[1]: dgraph-alpha.service: Scheduled restart job, restart counter is at 278.
Jul 14 12:59:00 ip-172-31-45-91 systemd[1]: Stopped dgraph.io Alpha instance.
Jul 14 12:59:00 ip-172-31-45-91 systemd[1]: Started dgraph.io Alpha instance.
Jul 14 12:59:00 ip-172-31-45-91 dgraph[27250]: [Decoder]: Using assembly version of decoder
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: ContextifyFrames
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: Environment
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: Modules
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: IgnoreErrors
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Decoder]: Using assembly version of decoder
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: ContextifyFrames
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: Environment
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: Modules
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: [Sentry] 2020/07/14 12:59:01 Integration installed: IgnoreErrors
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: I0714 12:59:01.369110   27274 init.go:99]
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: Dgraph version   : v20.03.3
Jul 14 12:59:01 ip-172-31-45-91 dgraph[27250]: Dgraph SHA-256   : 08424035910be6b6720570427948bab8352a0b5a6d59a0d20c3ec5ed29533121

Thanks @cabbage.merchant for the information.

How much data are you loading, in terms of the number of RDFs (does 72.1M records = RDFs?) and disk size? Could you share your schema file and the data? I am thinking you may have a lot of indexes.

This error basically means that the bulk loader was unable to lease additional UIDs from the Zero within 1 sec, so it retries. Eventually it will make progress, as long as the Zero is alive.

This error means that the Zero is unable to connect to its leader. Question: did you have more than 1 zero in the cluster?

It should not corrupt data. These may be transient connectivity errors.

Yes, bulk is going to consume a lot of RAM, and we are looking to improve its memory requirements. See Bulk loader xidmap memory optimization for some work in this area.

Please share your heap profile.

My bad about the block profile. You need to use the --profile_mode block --block_rate 1 options with bulk.

What was your default ulimit -n? In most cases, it is best to bump this number up to 100,000 or so.
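
For example (a quick sketch, run in the shell that will launch dgraph bulk):

ulimit -Sn          # current soft limit on open files
ulimit -Hn          # current hard limit
ulimit -n 100000    # raise the soft limit for this session (cannot exceed the hard limit)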

cc @harshil_goel

Hi @Paras

I assume a single RDF means a single (Subject, Predicate, Object) triple. Please correct me if my understanding is wrong; the RDF concept is very new to me.

By 72.1M records, I am referring to 72.1M rows in relational database terms, spread across <400 columns. Therefore, to calculate the number of RDFs, I need to figure out how many cells there are in total. If my calculation is right, there are about 720M cells = RDFs.

I had only 1 Zero service running during the import. My intention is to run one service on a single instance only, so I can compare performance between Dgraph and Elasticsearch for full-text search with a bit of aggregation and JOINs. I am aware they are not apples to apples, but if Dgraph can do full-text search, JOINs, and aggregations with performance equivalent to Elasticsearch, why not :smiley:? That explains why I have a lot of indices.

As for the schema, I can only give you a rough idea. The schema is generated from the INFORMATION_SCHEMA tables, and its purpose is just a quick test.

// Index directives appended to each generated predicate, keyed by Dgraph scalar type.
var x = new Dictionary<string, string>()
{
    {"int", "@index(int) @upsert"},
    {"float", "@index(float) @upsert"},
    {"string", "@index(term, fulltext, trigram) @upsert"},
    {"bool", "@index(bool) @upsert"},
    {"geo", "@index(geo) @upsert"},
    {"dateTime", "@index(hour) @upsert"}
};
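
To make that concrete, each generated predicate ends up with a schema line of roughly this shape (the predicate names below are made up; only the type-to-index mapping reflects the dictionary above):

customer_name: string @index(term, fulltext, trigram) @upsert .
order_total: float @index(float) @upsert .
created_at: dateTime @index(hour) @upsert .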

I didn’t know there was a --profile_mode flag. In that case, the heap profile I took was probably targeting the alpha service instead of the bulk tool itself. I am not sure whether the profiler would also capture the imported data; I cannot share it if it does, since the data belongs to our clients.

That too many open files error happened when the file descriptor limit was set to 100,000. Today I increased it to 1,000,000 and still got the same error.

Isn’t that 28 Billion RDFs? (72M * 400)

It seems you are indexing everything. I think that is a lot of indexes, especially the string indexes. Can you share the schema that you fed to the bulk loader?

You don’t need to have an alpha instance running while bulk loading; only a zero instance is required.
To gather the heap, CPU, and block profiles, you need to point the profiler at bulk via the following endpoints (note port 8080 for bulk), e.g. with go tool pprof as sketched below:

http://localhost:8080/debug/pprof/profile
http://localhost:8080/debug/pprof/heap
http://localhost:8080/debug/pprof/block (requires bulk to be started with --profile_mode block --block_rate 1)
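
For example (a sketch; it assumes the Go toolchain is available on the machine running bulk):

go tool pprof http://localhost:8080/debug/pprof/heap      # heap profile
go tool pprof http://localhost:8080/debug/pprof/profile   # CPU profile (samples for 30s by default)
go tool pprof http://localhost:8080/debug/pprof/block     # block profile (needs the flags above)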

cc @ibrahim

Can you look at your “p” dir and send the output of ls -lh? I want to know how many sst and vlog files you have, and their sizes.

Also, after setting ulimit to 100,000, can you run this command and send me the output? Make sure no alpha or zero is running while doing this.
dgraph debug -p <path-to-p-dir> --histogram -o=false

Hi @Paras

The calculation is not as simple as that. Each table has a different number of columns and rows. The number of cells for one table is that table’s total columns * total rows, and the overall number of cells is the sum of those per-table totals.
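
For example, with made-up numbers purely to illustrate: a table with 10M rows and 30 columns contributes 10M * 30 = 300M cells, a table with 2M rows and 5 columns contributes 10M cells, and the overall RDF count is the sum of those per-table products.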

Yes, I index everything because all fields in Elasticsearch are indexed. I am afraid I can’t share the schema, but basically all columns/predicates are indexed. To give you some additional overview:

[screenshot attached]

This is the heap profile that I took yesterday, prior to provisioning the x1e.4xlarge. It was running on the t3a.xlarge when the REDUCE phase failed. I pointed it at port 8080 as you described above. I am not sure when I will rerun the job just to capture the block profile.

There are 1314 sst files and 4 vlog files in the p folder.

Here is the debug output:

$ sudo dgraph debug -p /var/lib/dgraph/p --histogram -o=false
[Decoder]: Using assembly version of decoder
Opening DB: /var/lib/dgraph/p
badger 2020/07/15 06:39:19 INFO: 139 tables out of 1314 opened in 3.009s
badger 2020/07/15 06:39:22 INFO: 280 tables out of 1314 opened in 6.02s
badger 2020/07/15 06:39:25 INFO: 427 tables out of 1314 opened in 9.001s
badger 2020/07/15 06:39:28 INFO: 567 tables out of 1314 opened in 12.028s
badger 2020/07/15 06:39:31 INFO: 702 tables out of 1314 opened in 15.003s
badger 2020/07/15 06:39:34 INFO: 826 tables out of 1314 opened in 18.01s
badger 2020/07/15 06:39:37 INFO: 985 tables out of 1314 opened in 21.024s
badger 2020/07/15 06:39:40 INFO: 1086 tables out of 1314 opened in 24.039s
badger 2020/07/15 06:39:43 INFO: 1236 tables out of 1314 opened in 27.026s
badger 2020/07/15 06:39:46 INFO: All 1314 tables opened in 30.054s
badger 2020/07/15 06:39:46 INFO: Replaying file id: 3 at offset: 472316997
badger 2020/07/15 06:39:46 INFO: Replay took: 66.524246ms
badger 2020/07/15 06:39:47 INFO: Got compaction priority: {level:1 score:1.0491600222885609 dropPrefix:[]}
badger 2020/07/15 06:39:47 INFO: Running for level: 1
badger 2020/07/15 06:39:47 INFO: Got compaction priority: {level:2 score:1.0068540036678315 dropPrefix:[]}
badger 2020/07/15 06:39:47 INFO: Running for level: 2
badger 2020/07/15 06:39:49 INFO: LOG Compact 1->2, del 1 tables, add 1 tables, took 1.849272021s
badger 2020/07/15 06:39:49 INFO: Compaction for level: 1 DONE
badger 2020/07/15 06:39:49 INFO: Got compaction priority: {level:2 score:1.0068539693951606 dropPrefix:[]}
badger 2020/07/15 06:39:49 INFO: Running for level: 2
badger 2020/07/15 06:39:49 INFO: LOG Compact 2->3, del 1 tables, add 1 tables, took 1.732434933s
badger 2020/07/15 06:39:49 INFO: Compaction for level: 2 DONE
badger 2020/07/15 06:39:49 INFO: Got compaction priority: {level:3 score:1.0073740751296283 dropPrefix:[]}
badger 2020/07/15 06:39:49 INFO: Running for level: 3
badger 2020/07/15 06:39:50 INFO: LOG Compact 2->3, del 1 tables, add 1 tables, took 1.531729242s
badger 2020/07/15 06:39:50 INFO: Compaction for level: 2 DONE
badger 2020/07/15 06:39:50 INFO: Got compaction priority: {level:3 score:1.0073740717023612 dropPrefix:[]}
badger 2020/07/15 06:39:50 INFO: Running for level: 3
badger 2020/07/15 06:39:51 INFO: LOG Compact 3->4, del 1 tables, add 1 tables, took 1.532886688s
badger 2020/07/15 06:39:51 INFO: Compaction for level: 3 DONE
badger 2020/07/15 06:39:51 INFO: Got compaction priority: {level:3 score:1.0047984521463513 dropPrefix:[]}
badger 2020/07/15 06:39:51 INFO: Running for level: 3
badger 2020/07/15 06:39:52 INFO: LOG Compact 3->4, del 1 tables, add 1 tables, took 1.437778875s
badger 2020/07/15 06:39:52 INFO: Compaction for level: 3 DONE
badger 2020/07/15 06:39:52 INFO: Got compaction priority: {level:3 score:1.0022227748855947 dropPrefix:[]}
badger 2020/07/15 06:39:52 INFO: Running for level: 3
badger 2020/07/15 06:39:53 INFO: LOG Compact 3->4, del 1 tables, add 1 tables, took 2.235294842s
badger 2020/07/15 06:39:53 INFO: Compaction for level: 3 DONE
badger 2020/07/15 06:39:55 INFO: LOG Compact 3->4, del 1 tables, add 1 tables, took 2.815491232s
badger 2020/07/15 06:39:55 INFO: Compaction for level: 3 DONE

Thanks @cabbage.merchant
The dgraph debug output seems truncated. Could you send the whole output?

@cabbage.merchant how many vlog and sst files do you have in your p and w directories? Also, what are the sizes of the biggest and smallest files?

Hi @Paras, that dgraph debug output is not truncated. It stopped there.

@ibrahim there are 3 vlog and 1314 sst files in the p folder. Sadly, the w folder was removed while I was attempting to get the alpha service running.

Smallest .sst is 417 bytes, largest is 67.9MB.
Smallest .vlog is 453.1MB, largest is 1GB.

@Paras I reran it; it took ~27 minutes.

$ sudo dgraph debug -p /var/lib/dgraph/p --histogram -o=false
[Decoder]: Using assembly version of decoder
Opening DB: /var/lib/dgraph/p
badger 2020/07/16 23:53:42 INFO: 148 tables out of 1314 opened in 3.151s
badger 2020/07/16 23:53:45 INFO: 269 tables out of 1314 opened in 6.038s
badger 2020/07/16 23:53:48 INFO: 412 tables out of 1314 opened in 9.06s
badger 2020/07/16 23:53:51 INFO: 549 tables out of 1314 opened in 12.018s
badger 2020/07/16 23:53:54 INFO: 682 tables out of 1314 opened in 15.012s
badger 2020/07/16 23:53:57 INFO: 798 tables out of 1314 opened in 18.074s
badger 2020/07/16 23:54:00 INFO: 964 tables out of 1314 opened in 21.03s
badger 2020/07/16 23:54:03 INFO: 1062 tables out of 1314 opened in 24.034s
badger 2020/07/16 23:54:06 INFO: 1237 tables out of 1314 opened in 27.058s
badger 2020/07/16 23:54:08 INFO: All 1314 tables opened in 28.793s
badger 2020/07/16 23:54:08 INFO: Replaying file id: 3 at offset: 472316997
badger 2020/07/16 23:54:08 INFO: Replay took: 72.196622ms
Min commit: 1. Max commit: 90007, w.r.t 18446744073709551615
prefix =
Found 892413401 keys

Histogram of key sizes (in bytes)
Min value: 7
Max value: 83
Mean: 49.01
                   Range     Count
[         0,         32)  54513588
[        32,         64) 733663710
[        64,        128) 104236103

Histogram of value sizes (in bytes)
Min value: 4
Max value: 522936
Mean: 50.51
                   Range     Count
[         0,         32) 277762336
[        32,         64) 520877724
[        64,        128)  92801329
[       128,        256)    635476
[       256,        512)    173551
[       512,       1024)     58984
[      1024,       2048)     37474
[      2048,       4096)     30838
[      4096,       8192)     12499
[      8192,      16384)      5927
[     16384,      32768)      3555
[     32768,      65536)      2434
[     65536,     131072)      2619
[    131072,     262144)      1207
[    262144,     524288)      7448
badger 2020/07/17 00:20:17 INFO: Got compaction priority: {level:0 score:1.73 dropPrefix:[]}
badger 2020/07/17 00:20:17 INFO: Running for level: 0
badger 2020/07/17 00:20:17 INFO: LOG Compact 0->1, del 2 tables, add 1 tables, took 66.77077ms
badger 2020/07/17 00:20:17 INFO: Compaction for level: 0 DONE
badger 2020/07/17 00:20:17 INFO: Force compaction on level 0 done

Just wondering, is there any workaround for the too many open files issue?

Have you tried increasing ulimit -n? Try keeping it at 1 million or more. (You might have to increase the hard limit too.)
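
For reference, a rough sketch (adjust to your setup): the hard limit is typically raised in /etc/security/limits.conf, and note that a limit set in an interactive shell does not apply to a systemd-managed unit such as dgraph-alpha.service, which takes its limit from LimitNOFILE.

# /etc/security/limits.conf (takes effect on the next login)
*   soft   nofile   1048576
*   hard   nofile   1048576

# For the systemd-managed alpha, set the limit in the unit itself:
sudo systemctl edit dgraph-alpha.service
#   [Service]
#   LimitNOFILE=1048576
sudo systemctl daemon-reload
sudo systemctl restart dgraph-alpha.service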

Hi @harshil_goel, thank you for your reply.

I had already increased the limit to a million; you can see that in my post above. The hard limit was set to 1048576, and yet I still get the too many open files issue.

Hopefully there is a workaround for it.