July 31, 2014 - by Max Bruning
Recently, Joyent published a series of benchmarks using YCSB (Yahoo Cloud Service Benchmark) comparing the performance of MongoDB, PostgreSQL, and Elasticsearch between machines running in the Joyent Public Cloud and similarly configured machines running on Amazon Web Services.
When I looked at the results, my first impression was that the numbers looked good, but how could I tell what was really being benchmarked? To answer that, I needed to do "active benchmarking", which basically means analyzing the performance of the system while it is being tested. I decided to look specifically at the YCSB benchmark that used MongoDB.
Rather than use shards, a router, and the entire setup that is used in the published results, I used a single machine in the Joyent cloud that ran the YCSB benchmark with another machine in the Joyent cloud that was running MongoDB. The machine running YCSB is a standard 4GB instance, and the machine running MongoDB is a standard 7.5GB system.
In this blog, we'll first simply run the benchmark and watch the results. Then we'll run it again while doing active benchmarking, to try to determine the limiting factors (i.e., the bottleneck for the test results). We'll also take a look at scaling issues.
First, we'll run the benchmark to load 6 million records, each record consisting of 10 fields of 100 bytes each, for roughly 1KB of data per record.
# time ./bin/ycsb load mongodb -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -load
Loading workload...
Starting test.
0 sec: 0 operations;
mongo connection created with 199.192.241.168:27017/ycsb
10 sec: 17563 operations; 1753.49 current ops/sec; [INSERT AverageLatency(us)=302209.69]
20 sec: 73255 operations; 5525.55 current ops/sec; [INSERT AverageLatency(us)=48674.83]
30 sec: 147128 operations; 7385.82 current ops/sec; [INSERT AverageLatency(us)=106637.21]
40 sec: 228286 operations; 8114.18 current ops/sec; [INSERT AverageLatency(us)=117745.16]
...
640 sec: 4993046 operations; 8210.88 current ops/sec; [INSERT AverageLatency(us)=123054.32]
650 sec: 5074073 operations; 8102.7 current ops/sec; [INSERT AverageLatency(us)=120990.57]
660 sec: 5150469 operations; 7638.84 current ops/sec; [INSERT AverageLatency(us)=127621.01] [CLEANUP AverageLatency(us)=26]
670 sec: 5223843 operations; 7336.67 current ops/sec; [INSERT AverageLatency(us)=144175.32] [CLEANUP AverageLatency(us)=6.33]
680 sec: 5297258 operations; 7340.77 current ops/sec; [INSERT AverageLatency(us)=131909.57] [CLEANUP AverageLatency(us)=4.67]
690 sec: 5376306 operations; 7904.01 current ops/sec; [INSERT AverageLatency(us)=124980.39] [CLEANUP AverageLatency(us)=5.62]
...
[OVERALL], RunTime(ms), 771609.0
[OVERALL], Throughput(ops/sec), 7775.9590673514695
[INSERT], Operations, 6000000
[INSERT], AverageLatency(us), 122792.53285266667
[INSERT], MinLatency(us), 401
[INSERT], MaxLatency(us), 7160628
[INSERT], Return=0, 6000000
[INSERT], 0, 4387109
[INSERT], 1, 435791
...
[CLEANUP], Operations, 1000
[CLEANUP], AverageLatency(us), 70.878
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 66032
[CLEANUP], 95thPercentileLatency(ms), 0
[CLEANUP], 99thPercentileLatency(ms), 0
[CLEANUP], 0, 999
[CLEANUP], 1, 0
[CLEANUP], 2, 0
...
real 12m52.267s
user 13m33.122s
sys 3m24.784s
#
So loading 6,000,000 records took 12 minutes, 52 seconds.
Now we'll run workloada (a mixture of 50/50 reads/writes).
# time ./bin/ycsb run mongodb -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
java -cp /root/YCSB-master/infinispan/src/main/conf:/root/YCSB-master/mongodb/target/mongodb-binding-0.1.4.jar:/root/YCSB-master/mongodb/target/archive-tmp/mongodb-binding-0.1.4.jar:/root/YCSB-master/jdbc/src/main/conf:/root/YCSB-master/gemfire/src/main/conf:/root/YCSB-master/core/target/core-0.1.4.jar:/root/YCSB-master/dynamodb/conf:/root/YCSB-master/nosqldb/src/main/conf:/root/YCSB-master/voldemort/src/main/conf:/root/YCSB-master/hbase/src/main/conf com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
Loading workload...
Starting test.
0 sec: 0 operations;
mongo connection created with 199.192.241.168:27017/ycsb
10 sec: 11377 operations; 1130.24 current ops/sec; [UPDATE AverageLatency(us)=747864.1] [READ AverageLatency(us)=379445.04]
20 sec: 60430 operations; 4877.98 current ops/sec; [UPDATE AverageLatency(us)=86604.21] [READ AverageLatency(us)=69050.56]
30 sec: 135550 operations; 7480.58 current ops/sec; [UPDATE AverageLatency(us)=96481.73] [READ AverageLatency(us)=94490]
40 sec: 229001 operations; 9344.17 current ops/sec; [UPDATE AverageLatency(us)=68793.49] [READ AverageLatency(us)=64952.19]
50 sec: 328821 operations; 9920.49 current ops/sec; [UPDATE AverageLatency(us)=57943.78] [READ AverageLatency(us)=61490.36]
60 sec: 442787 operations; 11335.39 current ops/sec; [UPDATE AverageLatency(us)=68971.15] [READ AverageLatency(us)=60517.94]
70 sec: 555053 operations; 11201.96 current ops/sec; [UPDATE AverageLatency(us)=60592.3] [READ AverageLatency(us)=33707.96]
80 sec: 671583 operations; 11603.11 current ops/sec; [UPDATE AverageLatency(us)=74702.3] [READ AverageLatency(us)=46173.29]
...
[OVERALL], RunTime(ms), 522700.0
[OVERALL], Throughput(ops/sec), 11478.859766596519
[UPDATE], Operations, 2999228
[UPDATE], AverageLatency(us), 85524.36139733292
[UPDATE], MinLatency(us), 394
[UPDATE], MaxLatency(us), 14435871
[UPDATE], 95thPercentileLatency(ms), 365
[UPDATE], Return=0, 2999228
[UPDATE], 0, 869530
...
[READ], Operations, 3000772
[READ], AverageLatency(us), 29704.56337702431
[READ], MinLatency(us), 353
[READ], MaxLatency(us), 14292934
[READ], 95thPercentileLatency(ms), 115
[READ], 99thPercentileLatency(ms), 345
[READ], Return=0, 3000772
[READ], 0, 706416
...
[CLEANUP], Operations, 1000
[CLEANUP], AverageLatency(us), 102.235
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 81274
[CLEANUP], 95thPercentileLatency(ms), 0
[CLEANUP], 99thPercentileLatency(ms), 0
[CLEANUP], 0, 998
[CLEANUP], 1, 0
...
real 8m43.573s
user 13m53.838s
sys 3m40.015s
#
So of the 6 million read/write operations, there were 3,000,772 reads and 2,999,228 updates (close enough to a 50/50 split). Reads averaged about 29.7 msecs, with a minimum latency of 353 usecs and a maximum of about 14.3 seconds. Updates averaged about 85.5 msecs, with a similar maximum latency of about 14.4 seconds.
Let's try workloadb, a 95/5 split between reads and writes (95% reads).
# time ./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
...
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
Loading workload...
Starting test.
0 sec: 0 operations;
mongo connection created with 199.192.241.168:27017/ycsb
10 sec: 39561 operations; 3937.99 current ops/sec; [UPDATE AverageLatency(us)=64134.53] [READ AverageLatency(us)=77735.02]
20 sec: 128222 operations; 8806.22 current ops/sec; [UPDATE AverageLatency(us)=33152.18] [READ AverageLatency(us)=37736.67]
30 sec: 202620 operations; 7342.87 current ops/sec; [UPDATE AverageLatency(us)=75466.98] [READ AverageLatency(us)=58739.72]
40 sec: 286881 operations; 8426.1 current ops/sec; [UPDATE AverageLatency(us)=52512.36] [READ AverageLatency(us)=43191.99]
50 sec: 381094 operations; 9420.36 current ops/sec; [UPDATE AverageLatency(us)=31639.92] [READ AverageLatency(us)=26998.68]
60 sec: 483100 operations; 10199.58 current ops/sec; [UPDATE AverageLatency(us)=33720.43] [READ AverageLatency(us)=52603.76]
70 sec: 587825 operations; 10418.32 current ops/sec; [UPDATE AverageLatency(us)=42287.56] [READ AverageLatency(us)=37719.84]
...
[OVERALL], RunTime(ms), 559369.0
[OVERALL], Throughput(ops/sec), 10726.37203706319
[UPDATE], Operations, 299901
[UPDATE], AverageLatency(us), 44403.52669714339
[UPDATE], MinLatency(us), 379
[UPDATE], MaxLatency(us), 3324334
[UPDATE], 95thPercentileLatency(ms), 138
[UPDATE], 99thPercentileLatency(ms), 740
[UPDATE], Return=0, 299901
[UPDATE], 0, 145122
[UPDATE], 1, 14669
...
[READ], Operations, 5700099
[READ], AverageLatency(us), 37128.780924682185
[READ], MinLatency(us), 329
[READ], MaxLatency(us), 3876008
[READ], 95thPercentileLatency(ms), 137
[READ], 99thPercentileLatency(ms), 381
[READ], Return=0, 5700099
[READ], 0, 2938426
[READ], 1, 193083
...
[CLEANUP], Operations, 1000
[CLEANUP], AverageLatency(us), 95.518
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 91746
[CLEANUP], 95thPercentileLatency(ms), 0
[CLEANUP], 99thPercentileLatency(ms), 0
[CLEANUP], 0, 999
[CLEANUP], 1, 0
[CLEANUP], 2, 0
...
real 9m19.973s
user 15m37.622s
sys 3m28.291s
#
Here, there are 5,700,099 reads and 299,901 updates, roughly the 95/5 split that workloadb specifies. The remaining workloads yield similar results.
Now we'll run the benchmark and begin to do some active benchmarking.
For starters, let's use prstat(1M) on the system where mongod is running.
# prstat -c
Please wait...
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
24426 mongodb 70G 244M cpu21 60 0 3:06:37 12% mongod/2014
48484 root 3256K 2036K sleep 59 0 0:00:04 0.0% bash/1
42051 root 3272K 2248K sleep 59 0 0:00:00 0.0% bash/1
48278 root 5560K 1388K sleep 59 0 0:00:11 0.0% sshd/1
61323 root 7360K 2376K sleep 59 0 0:00:49 0.0% rsyslogd/6
41639 root 5560K 2904K sleep 59 0 0:00:00 0.0% sshd/1
59741 root 2676K 1388K sleep 59 0 0:00:06 0.0% init/1
61348 root 1652K 1008K sleep 59 0 0:00:04 0.0% utmpd/1
61357 root 1940K 1028K sleep 59 0 0:00:03 0.0% ttymon/1
59626 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
41641 root 6224K 3744K sleep 59 0 0:00:00 0.0% sshd/1
61359 root 2276K 1060K sleep 59 0 0:00:02 0.0% ttymon/1
61082 root 2488K 1276K sleep 59 0 0:00:02 0.0% pfexecd/3
59793 root 8824K 6256K sleep 59 0 0:00:36 0.0% svc.configd/13
61378 root 4340K 1496K sleep 59 0 0:00:10 0.0% sshd/1
Total: 23 processes, 2100 lwps, load averages: 12.39, 7.57, 3.78
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
24426 mongodb 70G 242M cpu22 52 0 3:06:49 12% mongod/2014
76885 root 4064K 3276K cpu6 59 0 0:00:00 0.0% prstat/1
48484 root 3256K 2036K sleep 59 0 0:00:04 0.0% bash/1
42051 root 3272K 2248K sleep 59 0 0:00:00 0.0% bash/1
48278 root 5560K 1388K sleep 59 0 0:00:11 0.0% sshd/1
61323 root 7360K 2376K sleep 59 0 0:00:49 0.0% rsyslogd/6
41639 root 5560K 2904K sleep 59 0 0:00:00 0.0% sshd/1
59741 root 2676K 1388K sleep 59 0 0:00:06 0.0% init/1
61348 root 1652K 1008K sleep 59 0 0:00:04 0.0% utmpd/1
61357 root 1940K 1028K sleep 59 0 0:00:03 0.0% ttymon/1
59626 root 0K 0K sleep 60 - 0:00:00 0.0% zsched/1
41641 root 6224K 3744K sleep 59 0 0:00:00 0.0% sshd/1
61359 root 2276K 1060K sleep 59 0 0:00:02 0.0% ttymon/1
61082 root 2488K 1276K sleep 59 0 0:00:02 0.0% pfexecd/3
59793 root 8824K 6256K sleep 59 0 0:00:36 0.0% svc.configd/13
Total: 23 processes, 2100 lwps, load averages: 12.02, 7.57, 3.80
(^c)
#
So mongod is using ~12% of the cpus on the machine. This machine has 24 cpus, so this is approximately 2.8 cpus at 100% utilization.
# psrinfo
0 on-line since 11/22/2013 03:58:51
1 on-line since 11/22/2013 03:58:58
2 on-line since 11/22/2013 03:58:58
3 on-line since 11/22/2013 03:58:58
4 on-line since 11/22/2013 03:58:58
5 on-line since 11/22/2013 03:58:58
6 on-line since 11/22/2013 03:58:58
7 on-line since 11/22/2013 03:58:59
8 on-line since 11/22/2013 03:58:59
9 on-line since 11/22/2013 03:58:59
10 on-line since 11/22/2013 03:58:59
11 on-line since 11/22/2013 03:58:59
12 on-line since 11/22/2013 03:58:59
13 on-line since 11/22/2013 03:58:59
14 on-line since 11/22/2013 03:58:59
15 on-line since 11/22/2013 03:58:59
16 on-line since 11/22/2013 03:58:59
17 on-line since 11/22/2013 03:58:59
18 on-line since 11/22/2013 03:58:59
19 on-line since 11/22/2013 03:58:59
20 on-line since 11/22/2013 03:58:59
21 on-line since 11/22/2013 03:58:59
22 on-line since 11/22/2013 03:58:59
23 on-line since 11/22/2013 03:58:59
#
Let's take a quick look at other statistics (vmstat, vfsstat, mpstat, and iostat) on the system running mongo.
# vmstat 1
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr rm s0 s1 zn in sy cs us sy id
0 0 118 87846368 6706592 4408 50302 5 57 58 0 29 -5789 -1 74 74 18565 68963 14973 10 9 81
0 0 175 81510632 9860540 8639 97242 0 0 0 0 0 0 0 31 31 28396 129491 48858 13 9 78
0 0 175 81454632 9853364 8364 94341 0 0 0 0 0 0 0 18 18 31016 125756 47106 14 9 77
0 0 175 81518144 9862564 3537 23890 0 0 0 0 0 20 0 188 188 28130 139412 50791 17 6 77
0 0 175 81517996 9862384 840 3519 0 0 0 0 0 0 0 25 25 25988 132854 56871 11 4 85
0 0 175 81518020 9862768 3545 22590 0 0 0 0 0 0 0 15 15 24562 131160 51082 7 5 88
0 0 175 81508880 9860908 5840 79660 0 0 0 0 0 0 0 30 30 29503 124580 47687 12 8 80
(^C)
#
The system as a whole is ~80% idle. There is no memory pressure on the system, but lots of page faults. We may come back to this later.
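Those page faults are likely minor faults from MongoDB touching its memory-mapped data files. If we wanted to confirm which processes are generating them, a quick sketch is a vminfo-provider one-liner such as the following (my addition, not one of the original runs; vminfo:::as_fault fires on each address-space fault, in the context of the faulting thread):
# dtrace -n 'vminfo:::as_fault{@[execname] = count();} tick-30s{exit(0);}'
On this system we would expect mongod to dominate the counts.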
Here's vfsstat:
# vfsstat 1
r/s w/s kr/s kw/s ractv wactv read_t writ_t %r %w d/s del_t zone
76.1 16.5 243.5 55.4 0.0 0.0 2.8 28.6 0 0 0.2 12.4 ded0419b (239)
43.7 0.0 4.9 0.0 0.0 0.0 2.9 0.0 0 0 0.0 0.0 ded0419b (239)
43.7 0.0 4.9 0.0 0.0 0.0 3.0 0.0 0 0 0.0 0.0 ded0419b (239)
43.5 5.8 4.9 417.6 0.0 0.0 3.0 128.1 0 0 0.0 0.0 ded0419b (239)
43.3 9.6 4.8 639.6 0.0 0.0 2.9 117.1 0 0 0.0 0.0 ded0419b (239)
(^C)
#
So there is not much file system activity in the zone where mongod is running, and the zone is not being I/O throttled (the 'd/s' and 'del_t' columns in the output). This is to be expected: unlike many (most?) databases, MongoDB uses memory-mapped files rather than read/write for database data, and periodically syncs the files to backing store.
Now mpstat:
# mpstat 1
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
...
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 20718 0 4 2637 117 4506 190 394 245 0 8766 24 42 0 35
1 3787 0 44 512 54 5167 114 276 144 0 10722 18 14 0 68
2 13064 0 11 672 39 8919 364 659 206 0 17084 21 24 0 55
3 1338 0 46 897 186 7578 270 378 206 0 13350 13 13 0 74
4 5740 0 11 617 49 6637 281 467 155 0 10217 13 15 0 72
5 134 0 21 427 56 4342 91 238 82 0 5418 6 7 0 87
6 3418 0 11 780 100 8558 326 553 171 0 13850 15 16 0 69
7 425 0 23 679 37 8544 248 373 204 0 15762 12 11 0 77
8 1619 0 1 710 52 8241 341 430 115 0 13678 18 10 0 71
9 51 0 25 458 41 5290 197 262 176 0 7459 9 9 0 82
10 2476 0 4 748 66 9833 458 607 165 0 18239 18 13 0 68
11 1378 0 1 830 75 9630 546 380 147 0 13152 13 12 0 75
12 4315 0 1 620 65 5900 127 479 133 0 11299 20 15 0 66
13 1903 0 8 472 105 5247 91 309 191 0 10625 15 13 0 72
14 2016 0 5 390 30 5982 169 376 139 0 10792 18 16 0 66
15 6371 0 53 630 271 3790 50 202 190 0 7473 10 13 0 77
16 4501 0 5 15233 14632 4090 132 365 413 0 8432 32 37 0 31
17 2962 0 9 7298 6753 3446 83 234 329 0 13258 47 20 0 33
18 4545 0 494 402 26 4010 94 260 163 0 7613 34 21 0 45
19 2 0 645 440 114 4668 47 219 113 0 8205 12 6 0 82
20 1029 0 52 840 159 9325 278 606 117 0 18382 25 13 0 62
21 1120 0 1431 348 71 6572 94 325 140 0 15726 22 12 0 67
22 5110 0 4 886 177 6302 162 377 120 0 12207 21 13 0 66
23 2075 0 1246 304 45 3236 75 193 99 0 7788 15 9 0 76
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1702 0 24 2553 140 8653 172 516 152 0 15491 14 13 0 73
1 520 0 624 475 79 6676 99 317 79 0 12575 14 8 0 78
2 2752 0 71 517 36 7173 151 465 103 0 14320 17 20 0 63
3 153 0 67 384 34 5310 181 261 141 0 8308 11 7 0 82
4 107 0 17 547 42 6262 218 368 97 0 8531 10 8 0 82
5 11 0 25 410 62 3691 78 210 64 0 4769 4 4 0 92
6 2216 0 38 774 97 7905 297 453 149 0 13275 12 11 0 77
7 7 0 17 561 52 7084 216 322 97 0 11671 14 8 0 78
8 18 0 58 508 31 5952 200 355 76 0 9057 9 7 0 84
9 6 0 55 396 57 5069 129 219 104 0 6318 5 5 0 90
10 42 0 79 644 65 8559 259 501 124 0 14886 12 9 0 79
11 385 0 87 756 102 9742 472 350 85 0 12346 10 9 0 81
12 1541 0 62 477 74 7014 116 543 92 0 14726 15 10 0 75
13 944 0 18 356 103 5551 48 329 96 0 9791 8 7 0 85
14 2 0 14 316 60 5367 74 347 62 0 9730 12 14 0 74
15 905 0 8 425 41 2503 67 128 43 0 5236 42 4 0 54
16 8 0 21 14268 13527 5552 135 413 294 0 11511 20 30 0 50
17 5 0 48 6972 6357 5826 93 304 337 0 11184 23 18 0 59
18 8 0 509 314 51 5479 81 299 60 0 10575 18 10 0 72
19 7 0 804 564 202 4123 31 194 65 0 13360 7 6 0 88
20 2 0 75 758 157 6378 152 356 64 0 11863 26 23 0 52
21 2287 0 453 339 56 5706 69 256 69 0 11302 21 8 0 70
22 3 0 64 867 172 5633 76 286 88 0 10716 11 18 0 71
23 63 0 302 305 74 4025 47 201 46 0 7674 8 4 0 87
#
The machine is busy, but over 50% idle. Nothing unusual here.
What system calls are being made by mongo?
# dtrace -n 'syscall:::entry/pid == $target/{@[probefunc]=count();} tick-60s{exit(0);}' -p `pgrep mongod`
dtrace: description 'syscall:::entry' matched 235 probes
CPU ID FUNCTION:NAME
16 76285 :tick-60s
getpid 1
pwrite 1
ioctl 2
read 2
close 3
open 3
nanosleep 12
mmap 30
memcntl 47
mincore 78
fdsync 527
getcwd 527
lseek 529
fstat 531
lwp_mutex_timedlock 998
lwp_mutex_wakeup 1137
yield 4617
pollsys 5964
write 6124
gtime 28018
send 477650
recv 956294
lwp_park 3121961
clock_gettime 3590578
#
The (undocumented) lwp_park() system call is used for user level synchronization, and is called when a user level thread needs to block for a user level mutex, reader/writer lock, or condition variable. The clock_gettime system call is made frequently, but should not consume too much time. The send/recv calls are interesting. The recv(2) call is done to retrieve requests from clients, and send(2) calls are made to return responses to clients.
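Counts alone don't show where time is spent. Here is a small sketch of a follow-up script (my own addition; call it syscall_time.d, a hypothetical name) that sums the wall-clock time mongod spends in each system call. Note that blocking calls such as lwp_park will dominate simply because the measurement includes time spent asleep:
#!/usr/sbin/dtrace -s
/* Sketch: total wall-clock time mongod spends in each system call. */
#pragma D option quiet

syscall:::entry
/pid == $target/
{
        self->ts = timestamp;
}

syscall:::return
/pid == $target && self->ts/
{
        @t[probefunc] = sum(timestamp - self->ts);
        self->ts = 0;
}

tick-60s
{
        trunc(@t, 10);          /* keep the 10 largest totals */
        normalize(@t, 1000000); /* report milliseconds */
        printa("%-20s %@d ms\n", @t);
        exit(0);
}
Run it the same way as the other scripts, e.g. ./syscall_time.d -p `pgrep mongod`.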
Snooping the network, saving the snoop output to a file, and running wireshark on the capture shows that the recv/send calls are probably all using the "Mongo Wire Protocol". Searching the internet for "Mongo Wire Protocol" yields http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/. From this document, we can write a DTrace script that snoops the traffic and displays the message headers being sent and received.
#!/usr/sbin/dtrace -s
/* assumes packets to/from mongodb */
#pragma D option quiet
/* from http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/ */
typedef int int32;

struct MsgHeader {
        int32 messageLength;    /* total message size, including this */
        int32 requestID;        /* identifier for this message */
        int32 responseTo;       /* requestID from the original request */
                                /* (used in responses from db) */
        int32 opCode;           /* request type - see table below */
};
BEGIN
{
printf("%10s: %-10s %-10s %-10s %-10s\n",
"Direction", "Length", "ReqID", "ResTo", "opCode");
}
/*
* When send/recv(2) are called, assume packet is in "mongo wire protocol"
* (see
* http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
*/
syscall::send:entry
/pid == $target/
{
self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
self->mL = self->smh->messageLength;
self->reqID = self->smh->requestID;
self->resTo = self->smh->responseTo;
self->opCode = self->smh->opCode;
}
syscall::send:entry
/pid == $target && self->mL > 0/
{
printf("%10s: %-10d %-10d %-10d %-10x\n",
probefunc, self->mL, self->reqID, self->resTo, self->opCode);
self->smh = 0;
self->mL = 0;
self->reqID = 0;
self->resTo = 0;
self->opCode = 0;
}
syscall::recv:entry
/pid == $target/
{
self->a1 = arg1; /* save user address of packet */
}
syscall::recv:return /* now the packet is there. */
/self->a1/
{
self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
self->mL = self->rmh->messageLength;
self->reqID = self->rmh->requestID;
self->resTo = self->rmh->responseTo;
self->opCode = self->rmh->opCode;
}
syscall::recv:return
/self->a1 && self->mL > 0/
{
printf("%10s: %-10d %-10d %-10d %-10d\n",
probefunc, self->mL, self->reqID, self->resTo, self->opCode);
self->a1 = 0;
self->rmh = 0;
self->mL = 0;
self->reqID = 0;
self->resTo = 0;
self->opCode = 0;
}
Running the above script while the ycsb benchmark is running shows:
# ./mongo_snoop.d -p `pgrep mongod`
Direction: Length ReqID ResTo opCode
recv: 61 962617 0 2004
recv: 61 961087 0 2004
recv: 61 961883 0 2004
send: 110 480867 962617 1
send: 110 480868 961087 1
recv: 61 962405 0 2004
send: 110 480869 961883 1
recv: 61 961498 0 2004
send: 110 480870 962405 1
send: 110 480871 961498 1
recv: 61 962013 0 2004
recv: 61 962198 0 2004
send: 110 480872 962013 1
send: 110 480873 962198 1
recv: 61 961709 0 2004
recv: 1190 962736 0 2002
recv: 1190 962735 0 2002
recv: 61 961907 0 2004
send: 110 480874 961709 1
send: 110 480875 961907 1
recv: 1190 962739 0 2002
recv: 61 961771 0 2004
...
In the above output, opCode 1 is OP_REPLY, a reply sent to a client request; its ResTo field matches the ReqID that was received from the client. opCode 2004 is OP_QUERY, and 2002 is OP_INSERT. The benchmark is currently loading the database.
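Before moving on to latencies, it can be handy to summarize the request mix rather than watching individual packets. Here is a small sketch of such a script (my addition, reusing the MsgHeader layout above and the same assumption that each recv(2) starts on a message boundary):
#!/usr/sbin/dtrace -s
/* Sketch: count messages received by mongod, by wire-protocol opCode. */
#pragma D option quiet

typedef int int32;

struct MsgHeader {
        int32 messageLength;
        int32 requestID;
        int32 responseTo;
        int32 opCode;
};

syscall::recv:entry
/pid == $target/
{
        self->buf = arg1;       /* user address of the incoming message */
}

syscall::recv:return
/self->buf && arg1 > 0/
{
        this->mh = (struct MsgHeader *)copyin(self->buf, sizeof (struct MsgHeader));
        @ops[this->mh->opCode] = count();
}

syscall::recv:return
/self->buf/
{
        self->buf = 0;
}

tick-60s
{
        printa("opCode %-6d %@8d messages\n", @ops);
        exit(0);
}
While loading, the counts should be dominated by OP_INSERT (2002); while running the workload, by OP_QUERY (2004), matching what mongo_snoop.d shows.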
Here is a script that determines latency between a request being received from a client, and the response for that request.
#!/usr/sbin/dtrace -s
/* assumes packets to/from mongodb */
#pragma D option quiet
typedef int int32;

struct MsgHeader {
        int32 messageLength;    /* total message size, including this */
        int32 requestID;        /* identifier for this message */
        int32 responseTo;       /* requestID from the original request */
                                /* (used in responses from db) */
        int32 opCode;           /* request type - see table below */
};
BEGIN
{
printf("wait 1 minute...");
}
/*
* When send/recv(2) are called, assume packet is in "mongo wire protocol"
* (see
* http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
*/
syscall::recv:entry
/pid == $target/
{
self->a1 = arg1; /* save user address of packet */
}
syscall::recv:return /* now the packet is there. */
/self->a1 && arg1 > 0/ /* arg1 = number of bytes received */
{
self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
self->mL = self->rmh->messageLength;
self->reqID = self->rmh->requestID;
}
syscall::recv:return
/self->a1 && self->mL > 0/
{
t[self->reqID] = timestamp;
self->a1 = 0;
self->rmh = 0;
self->mL = 0;
self->reqID = 0;
}
syscall::send:entry
/pid == $target/
{
self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
self->mL = self->smh->messageLength;
self->resTO = self->smh->responseTo;
}
syscall::send:entry
/pid == $target && self->mL > 0 && t[self->resTO] != 0/
{
@d["distribution in nsecs"] = quantize(timestamp-t[self->resTO]);
@a["avg"] = avg(timestamp-t[self->resTO]);
@t["total requests handled"] = count();
t[self->resTO] = 0;
self->smh = 0;
self->mL = 0;
self->resTO = 0;
}
syscall::send:entry
/pid == $target && self->mL > 0/
{
self->smh = 0;
self->mL = 0;
self->resTO = 0;
}
tick-60s
{
exit(0);
}
This script runs for one minute and shows the distribution of times taken to handle database requests. Running the script shows:
# ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
wait 1 minute...
distribution in nsecs
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 304521
65536 |@@@@@@@@@@@@ 133288
131072 |@ 5643
262144 | 271
524288 | 6
1048576 | 0
avg 65688
total requests handled 443729
#
So most requests take between about 33 and 131 microseconds, with a small tail out to roughly 1 millisecond. Further information can be retrieved from other data in the requests and replies, but that is left as an exercise. The dynvarsize setting gets rid of "dynamic variable drops" messages.
Here is output from the same script while the benchmark is running against (as opposed to loading) the database.
# ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
wait 1 minute...
distribution in nsecs
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@@@@@ 181524
65536 |@@@@@@@@@@@@ 155129
131072 |@@@@@ 64382
262144 |@@@ 39143
524288 |@@@ 38864
1048576 |@@ 25175
2097152 | 5632
4194304 | 1714
8388608 | 451
16777216 | 49
33554432 | 0
67108864 | 0
134217728 | 100
268435456 | 369
536870912 | 0
avg 548253
total requests handled 512532
#
Most of the requests take between about 33 microseconds and 1 millisecond, but some take between 134 and 536 milliseconds. The average latency is much higher when running the database than when loading it. Another thing to note is that while the database is loading, there are many OP_INSERT requests from the YCSB client; when it is running, the client only sends OP_QUERY requests (seen by running the mongo_snoop.d script from earlier). Most likely, the longer requests are updates as opposed to lookups, but that has not been verified (and is left to the reader as an exercise; one possible starting point is sketched below).
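As a sketch of how one might start on that exercise (my addition; it only breaks the latency down by the opCode of the originating request, so separating YCSB reads from updates would still require inspecting the query/command bodies themselves), mongo_time.d can be modified to key the distribution by request type:
#!/usr/sbin/dtrace -s
/* Sketch: request-to-reply latency, broken down by request opCode. */
/* Run with -x dynvarsize=256m, as with mongo_time.d.               */
#pragma D option quiet

typedef int int32;

struct MsgHeader {
        int32 messageLength;
        int32 requestID;
        int32 responseTo;
        int32 opCode;
};

syscall::recv:entry
/pid == $target/
{
        self->buf = arg1;       /* user address of the incoming message */
}

syscall::recv:return
/self->buf && arg1 > 0/
{
        this->mh = (struct MsgHeader *)copyin(self->buf, sizeof (struct MsgHeader));
        t[this->mh->requestID] = timestamp;          /* arrival time, by request ID */
        op[this->mh->requestID] = this->mh->opCode;  /* remember the request type */
}

syscall::recv:return
/self->buf/
{
        self->buf = 0;
}

syscall::send:entry
/pid == $target/
{
        this->mh = (struct MsgHeader *)copyin(arg1, sizeof (struct MsgHeader));
        this->resTo = this->mh->responseTo;
}

syscall::send:entry
/pid == $target && t[this->resTo] != 0/
{
        @lat[op[this->resTo]] = quantize(timestamp - t[this->resTo]);
        t[this->resTo] = 0;
        op[this->resTo] = 0;
}

tick-60s
{
        exit(0);        /* aggregations print automatically, keyed by opCode */
}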
As an experiment, I ran both ycsb and mongo on the same machine, and saw much higher ops/second numbers (over 30000). Running two ycsb clients on different machines did not yield a linear improvement (only around a 20% increase in total ops/second). More experimentation is needed here. It would be interesting to see the effects of multiple clients on the virtual machines actually used for the published results, but I have not tried this.
What happens if we increase the size of the database? Instead of a 6GB database, what if it were larger than the memory of the virtual machine? The virtual machine is 8GB. Trying a 10GB database introduced some problems.
Let's go back to prstat and watch the mongod process specifically, but using the larger database.
# prstat -cmLp `pgrep mongod`
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
24426 mongodb 2.0 0.8 0.0 0.0 0.0 2.1 95 0.3 2K 109 11K 0 mongod/22159
24426 mongodb 1.7 0.7 0.0 0.0 0.0 1.6 96 0.2 2K 32 10K 0 mongod/22115
24426 mongodb 1.6 0.6 0.0 0.0 0.0 1.7 96 0.2 1K 68 9K 0 mongod/22296
24426 mongodb 1.4 0.5 0.0 0.0 0.0 1.3 97 0.2 1K 61 7K 0 mongod/21496
24426 mongodb 1.3 0.5 0.0 0.0 0.0 2.0 96 0.2 1K 55 7K 0 mongod/22240
24426 mongodb 1.2 0.4 0.0 0.0 0.0 1.6 97 0.2 1K 59 6K 0 mongod/21649
24426 mongodb 1.2 0.4 0.0 0.0 0.0 1.7 97 0.2 1K 61 6K 0 mongod/22153
24426 mongodb 1.1 0.4 0.0 0.0 0.0 1.7 97 0.2 1K 52 6K 0 mongod/22293
24426 mongodb 1.1 0.4 0.0 0.0 0.0 1.0 97 0.1 1K 57 6K 0 mongod/22042
24426 mongodb 1.0 0.4 0.0 0.0 0.0 1.6 97 0.2 1K 43 5K 0 mongod/21461
24426 mongodb 1.0 0.4 0.0 0.0 0.0 1.5 97 0.1 1K 36 5K 0 mongod/22175
24426 mongodb 1.0 0.4 0.0 0.0 0.0 0.9 98 0.1 1K 17 6K 0 mongod/22073
24426 mongodb 1.0 0.4 0.0 0.0 0.0 0.6 98 0.1 1K 25 5K 0 mongod/22105
24426 mongodb 1.0 0.3 0.0 0.0 0.0 1.3 97 0.1 986 45 5K 0 mongod/22273
24426 mongodb 0.9 0.4 0.0 0.0 0.0 0.7 98 0.1 964 13 5K 0 mongod/21529
Total: 1 processes, 982 lwps, load averages: 0.39, 0.86, 3.93
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
24426 mongodb 1.2 0.4 0.0 0.0 0.0 0.8 97 0.1 1K 26 6K 0 mongod/22041
24426 mongodb 0.9 0.4 0.0 0.0 0.0 0.8 98 0.1 1K 18 5K 0 mongod/22040
24426 mongodb 0.9 0.3 0.0 0.0 0.0 0.7 98 0.1 923 26 4K 0 mongod/21844
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.7 98 0.1 860 17 4K 0 mongod/22089
24426 mongodb 0.8 0.3 0.0 0.0 0.0 1.0 98 0.1 907 32 4K 0 mongod/22175
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.6 98 0.1 869 18 4K 0 mongod/22121
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.5 98 0.1 837 22 4K 0 mongod/22028
24426 mongodb 0.8 0.3 0.0 0.0 0.0 1.0 98 0.1 802 38 4K 0 mongod/22151
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.7 98 0.1 858 18 4K 0 mongod/21407
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.7 98 0.1 898 11 4K 0 mongod/22013
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.9 98 0.1 798 45 4K 0 mongod/22230
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.8 98 0.1 785 35 4K 0 mongod/22153
24426 mongodb 0.8 0.3 0.0 0.0 0.0 1.2 98 0.1 870 40 4K 0 mongod/22248
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.9 98 0.1 784 43 4K 0 mongod/22245
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.0 98 0.1 872 28 4K 0 mongod/22195
Total: 1 processes, 982 lwps, load averages: 0.46, 0.87, 3.91
(^C)
#
The mongod process has, at the time this was run, 982 threads. Most of these threads spend ~98% of their time sleeping. There is a little contention for cpus (the LAT column is non-zero), and some time is spent waiting on locks (LCK is < 1%). Letting this run for a longer time shows that while most threads keep sleeping, threads 3 and 5 periodically become busy with lock waits and system time. Here is prstat output when threads 3 and 5 are busy.
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
24426 mongodb 1.0 0.4 0.0 0.0 0.0 1.9 96 0.3 1K 53 5K 0 mongod/22503
24426 mongodb 0.9 0.4 0.0 0.0 0.0 1.8 97 0.2 1K 94 5K 0 mongod/22852
24426 mongodb 0.8 0.3 0.0 0.0 0.0 1.2 98 0.1 936 24 4K 0 mongod/22540
24426 mongodb 0.7 0.4 0.0 0.0 0.0 99 0.3 0.0 273 11 975 0 mongod/5
24426 mongodb 0.8 0.3 0.0 0.0 0.0 1.1 98 0.1 881 56 4K 0 mongod/22700
24426 mongodb 0.8 0.3 0.0 0.0 0.0 2.0 97 0.2 1K 64 4K 0 mongod/22657
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.7 97 0.1 917 19 4K 0 mongod/22952
24426 mongodb 0.8 0.3 0.0 0.0 0.0 0.9 98 0.1 793 13 4K 0 mongod/22444
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.2 98 0.1 924 31 3K 0 mongod/23237
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.5 97 0.1 757 39 3K 0 mongod/22353
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.3 98 0.1 838 13 3K 0 mongod/22973
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.5 97 0.2 789 51 3K 0 mongod/22821
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.6 97 0.1 876 18 3K 0 mongod/23127
24426 mongodb 0.7 0.3 0.0 0.0 0.0 0.9 98 0.1 740 20 3K 0 mongod/22590
24426 mongodb 0.7 0.3 0.0 0.0 0.0 1.3 98 0.1 757 26 3K 0 mongod/22917
Total: 1 processes, 1013 lwps, load averages: 1.14, 0.39, 0.18
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
24426 mongodb 0.0 38 0.0 0.0 0.0 61 0.0 0.0 8 81 15 0 mongod/3
24426 mongodb 1.6 0.7 0.0 0.0 0.0 5.2 92 0.2 2K 47 11K 0 mongod/23059
24426 mongodb 1.3 0.5 0.0 0.0 0.0 6.5 91 0.2 1K 84 8K 0 mongod/22978
24426 mongodb 1.1 0.4 0.0 0.0 0.0 6.9 91 0.2 1K 103 6K 0 mongod/23061
24426 mongodb 1.0 0.4 0.0 0.0 0.0 6.0 92 0.1 1K 43 6K 0 mongod/23182
24426 mongodb 1.0 0.4 0.0 0.0 0.0 4.6 94 0.1 1K 27 6K 0 mongod/22815
24426 mongodb 1.0 0.4 0.0 0.0 0.0 6.4 92 0.2 1K 122 6K 0 mongod/22393
24426 mongodb 1.0 0.4 0.0 0.0 0.0 5.7 93 0.1 1K 31 6K 0 mongod/22391
24426 mongodb 0.9 0.4 0.0 0.0 0.0 5.9 93 0.1 1K 61 6K 0 mongod/22802
24426 mongodb 1.0 0.3 0.0 0.0 0.0 6.4 92 0.1 1K 85 6K 0 mongod/23270
24426 mongodb 0.9 0.3 0.0 0.0 0.0 6.3 92 0.1 1K 109 6K 0 mongod/22871
24426 mongodb 0.9 0.3 0.0 0.0 0.0 6.6 92 0.1 1K 85 5K 0 mongod/22916
24426 mongodb 0.9 0.4 0.0 0.0 0.0 6.1 93 0.1 1K 26 6K 0 mongod/22517
24426 mongodb 0.9 0.4 0.0 0.0 0.0 5.9 93 0.1 1K 59 6K 0 mongod/22864
24426 mongodb 0.9 0.3 0.0 0.0 0.0 6.4 92 0.2 1K 97 6K 0 mongod/22459
Total: 1 processes, 1013 lwps, load averages: 1.20, 0.42, 0.19
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
24426 mongodb 0.0 59 0.0 0.0 0.0 40 0.1 0.0 19 230 18 0 mongod/3
24426 mongodb 0.8 0.3 0.0 0.0 0.0 28 71 0.1 936 39 4K 0 mongod/22526
24426 mongodb 0.8 0.3 0.0 0.0 0.0 24 75 0.1 889 22 4K 0 mongod/23089
24426 mongodb 0.7 0.3 0.0 0.0 0.0 25 74 0.1 809 13 4K 0 mongod/22838
24426 mongodb 0.7 0.3 0.0 0.0 0.0 27 72 0.1 946 23 4K 0 mongod/23077
24426 mongodb 0.7 0.3 0.0 0.0 0.0 24 75 0.1 859 22 3K 0 mongod/23274
24426 mongodb 0.6 0.3 0.0 0.0 0.0 29 70 0.1 933 40 4K 0 mongod/22814
24426 mongodb 0.6 0.4 0.0 0.0 0.0 77 22 0.0 267 15 800 0 mongod/5
24426 mongodb 0.6 0.2 0.0 0.0 0.0 29 70 0.1 645 32 3K 0 mongod/22790
24426 mongodb 0.6 0.2 0.0 0.0 0.0 27 72 0.2 701 11 3K 0 mongod/22600
24426 mongodb 0.6 0.2 0.0 0.0 0.0 28 71 0.1 832 28 3K 0 mongod/22413
24426 mongodb 0.6 0.2 0.0 0.0 0.0 25 74 0.1 560 5 3K 0 mongod/22415
24426 mongodb 0.5 0.2 0.0 0.0 0.0 28 71 0.2 651 28 3K 0 mongod/23035
24426 mongodb 0.5 0.2 0.0 0.0 0.0 27 72 0.1 544 38 2K 0 mongod/22738
24426 mongodb 0.5 0.2 0.0 0.0 0.0 28 71 0.1 594 22 3K 0 mongod/23273
Total: 1 processes, 1013 lwps, load averages: 1.23, 0.44, 0.20
Let's see what thread 3 is doing. When it runs, it spends a lot of time in system mode (38 then 59 percent in the above output). We'll use DTrace to profile mongod to see what it is doing while in system mode.
# dtrace -n 'profile-97/pid == $target && tid == 3 && arg0/{@[stack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod` -o /var/tmp/mongod.tid3.out
#
And I'll download the output file and convert to a flamegraph (see https://github.com/davepacheco/node-stackvis).
# stackvis < mongod.tid3.out > mongod.tid3.htm
#
Opening the flamegraph shows that thread id 3 is spending almost all of its time in the memcntl(2) system call. Let's take a look at how memcntl(2) gets called.
# dtrace -n 'syscall::memcntl:entry/pid == $target && tid == 3/{@[ustack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod`
dtrace: description 'syscall::memcntl:entry' matched 2 probes
CPU ID FUNCTION:NAME
13 76285 :tick-60s
libc.so.1`memcntl+0xa
libc.so.1`msync+0xc9
mongod`_ZN5mongo14PosixFlushable5flushEv+0x3e
mongod`_ZN5mongo9MongoFile9_flushAllEb+0xfa
mongod`_ZN5mongo9MongoFile8flushAllEb+0x2b
mongod`_ZN5mongo12DataFileSync3runEv+0xf3
mongod`_ZN5mongo13BackgroundJob7jobBodyEv+0xe6
libboost_thread.so.1.55.0`thread_proxy+0x66
libc.so.1`_thrp_setup+0x8a
libc.so.1`_lwp_start
19
#
Taking a look at the log file for mongod (/var/log/mongodb/mongod.log), we see many instances where mongod took a long time to flush the data files.
# grep -i flush mongod.log
2014-06-23T16:12:40.792+0000 [DataFileSync] flushing mmaps took 20970ms for 33 files
2014-06-23T18:57:32.343+0000 [DataFileSync] flushing mmaps took 12508ms for 33 files
2014-07-19T17:51:37.175+0000 [DataFileSync] flushing mmaps took 11880ms for 33 files
...
#
Examining the MongoDB source code (see src/mongo/util/mmap_posix.cpp), the flush uses the msync(3C) library function, which in turn calls the memcntl(2) system call. The larger the database, the longer the flush (i.e., msync(3C)) takes.
The msync() function takes three arguments: a starting address, a length, and flags. On SmartOS, the kernel does a page lookup for each page in the range between the starting address and the starting address plus the length. For the database loaded and run here, with 6 million 1KB records (i.e., roughly a 6GB database), that is 6GB/4KB, or about 1.5 million, page lookups on every flush.
As the database gets larger, this can affect performance. Fortunately, in the case of MongoDB there is another, much faster way to sync the database to backing store: fsync(3C). While msync() flushes a range of modified pages to disk, fsync() does the same for the modified pages of a file, and for MongoDB the end result is equivalent. Each msync() call covers a 2GB region of memory that is mmap'd to a 2GB file, so fsync() can instead be called with the file descriptor of that mmap'd file. With fsync(), there is no need to do a per-page lookup: the pages hang off the vnode_t for the file, which is reached directly via the file descriptor. This gets rid of the long flush times.
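To see the difference on a live system, a sketch like the following compares how long mongod spends in memcntl(2) (the system call behind msync(3C), as the stack above showed) with how long it spends in fdsync (the system call behind fsync(3C) on SmartOS, visible in the earlier syscall counts). This is my addition, not part of the original investigation:
#!/usr/sbin/dtrace -s
/* Sketch: flush latency via memcntl (msync) vs. fdsync (fsync). */
#pragma D option quiet

syscall::memcntl:entry,
syscall::fdsync:entry
/pid == $target/
{
        self->ts = timestamp;
}

syscall::memcntl:return,
syscall::fdsync:return
/pid == $target && self->ts/
{
        @[probefunc] = quantize(timestamp - self->ts);
        self->ts = 0;
}

tick-60s
{
        exit(0);        /* prints one latency distribution per syscall */
}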
Another, more difficult, problem is that once the database grows beyond the 8GB memory size of the virtual machine, the memory capper for the zone kicks in. Performance at that point was two orders of magnitude worse than when the machine was below the memory cap. Basically, the benchmark did not scale well with respect to the size of the database. This led my colleague, Jerry Jelinek, to take a look at the memory capper. Two changes were made to the capper: it no longer stops processes while they are being paged out, and it resumes where it left off.
These changes made a vast difference: we no longer see performance drop precipitously with database size. Much more analysis can be done from this benchmark, but perhaps in another blog post (and maybe with a different database).