Dell PowerEdge R510 Performance Testing
Dell PowerEdge R510 servers are used for the es1001-es1010 hosts in eqiad.
Hardware stats
12 disks behind a hardware RAID card
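A quick way to confirm the controller and the number of physical disks it sees; a minimal sketch, assuming a PERC-style controller managed with MegaCli and that the binary is installed as MegaCli64 (the exact binary name varies per install):
# identify the RAID controller on the PCI bus
lspci | grep -i raid
# count the physical drives the controller reports (one "Device Id" line per disk)
MegaCli64 -PDList -aAll | grep -c 'Device Id'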
Performance statistics
These tests were done with the disks configured as follows (see the controller check sketched after the list):
- RAID 10
- 256k stripe
- no read-ahead
- write-back cache
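The stripe size, read-ahead and write-cache settings above can be cross-checked against the controller's logical drive configuration; a minimal sketch, again assuming MegaCli is installed as MegaCli64 and filtering on the usual MegaCli field names:
# show logical drive settings; "Strip Size" is the stripe size and the
# "Current Cache Policy" line includes the read-ahead and write-back state
MegaCli64 -LDInfo -Lall -aAll | egrep -i 'RAID Level|Strip Size|Cache Policy'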
Tests were performed with sysbench 0.4.10-1build1, the standard Debian package (installed with 'aptitude install sysbench').
cpu, memory, threads, mutex
root@es1003:/a/tmp/sysbench# for i in cpu memory threads mutex ; do sysbench --test=$i run; done
sysbench 0.4.10: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing CPU performance benchmark
Threads started!
Done.
Maximum prime number checked in CPU test: 10000
Test execution summary:
total time: 10.0859s
total number of events: 10000
total time taken by event execution: 10.0851
per-request statistics:
min: 0.96ms
avg: 1.01ms
max: 1.91ms
approx. 95 percentile: 1.76ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 10.0851/0.00
sysbench 0.4.10: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing memory operations speed test
Memory block size: 1K
Memory transfer size: 102400M
Memory operations type: write
Memory scope type: global
Threads started!
Done.
Operations performed: 104857600 (2259393.46 ops/sec)
102400.00 MB transferred (2206.44 MB/sec)
Test execution summary:
total time: 46.4096s
total number of events: 104857600
total time taken by event execution: 39.3877
per-request statistics:
min: 0.00ms
avg: 0.00ms
max: 0.19ms
approx. 95 percentile: 0.00ms
Threads fairness:
events (avg/stddev): 104857600.0000/0.00
execution time (avg/stddev): 39.3877/0.00
sysbench 0.4.10: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing thread subsystem performance test
Thread yields per test: 1000 Locks used: 8
Threads started!
Done.
Test execution summary:
total time: 2.0137s
total number of events: 10000
total time taken by event execution: 2.0128
per-request statistics:
min: 0.20ms
avg: 0.20ms
max: 0.33ms
approx. 95 percentile: 0.20ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 2.0128/0.00
sysbench 0.4.10: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 1
Doing mutex performance test
Threads started!
Done.
Test execution summary:
total time: 0.0021s
total number of events: 1
total time taken by event execution: 0.0020
per-request statistics:
min: 2.01ms
avg: 2.01ms
max: 2.01ms
approx. 95 percentile: 10000000.00ms
Threads fairness:
events (avg/stddev): 1.0000/0.00
execution time (avg/stddev): 0.0020/0.00
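All of the runs above use the sysbench default of a single thread. To see how the box behaves under concurrency closer to a real workload, the same loop can be rerun with one thread per core; a minimal sketch using the 0.4.x --num-threads option (results not recorded here):
# repeat the four tests with one sysbench thread per CPU core
NPROC=$(grep -c ^processor /proc/cpuinfo)
for i in cpu memory threads mutex ; do sysbench --test=$i --num-threads=$NPROC run ; done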
disks
This machine is also what currently (May 2014) runs Graphite, so it makes sense to have a synthetic benchmark of the RAID array itself. Graphite (carbon-cache) results in many small writes, so fio was run with a config like the one below. Note that numjobs=8 matches the number of carbon-cache processes that write to disk.
# cat carbon-cache.job
[global]
bs=4k
rw=randwrite
write_bw_log
write_lat_log

[carbon-cache]
group_reporting=1
directory=/a/tmp/
numjobs=8
size=10g
And executed like this: fio --output carbon-cache.out --bandwidth-log --latency-log carbon-cache.job
# cat carbon-cache.out
carbon-cache: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
...
carbon-cache: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=1
fio 1.59
Starting 8 processes
carbon-cache: (groupid=0, jobs=8): err= 0: pid=13789
write: io=81920MB, bw=29500KB/s, iops=7375 , runt=2843564msec
clat (usec): min=2 , max=132888 , avg=1065.97, stdev=4886.38
lat (usec): min=2 , max=132888 , avg=1066.14, stdev=4886.38
bw (KB/s) : min= 1, max=964232, per=12.68%, avg=3739.50, stdev=18964.43
cpu : usr=0.06%, sys=0.82%, ctx=1093250, majf=0, minf=247055
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w/d: total=0/20971520/0, short=0/0/0
lat (usec): 4=8.98%, 10=79.06%, 20=6.70%, 50=0.06%, 100=0.03%
lat (usec): 250=0.01%, 500=0.01%, 750=0.01%
lat (msec): 2=0.01%, 4=0.01%, 20=4.61%, 50=0.51%, 100=0.04%
lat (msec): 250=0.01%
Run status group 0 (all jobs):
WRITE: io=81920MB, aggrb=29500KB/s, minb=30208KB/s, maxb=30208KB/s, mint=2843564msec, maxt=2843564msec
Disk stats (read/write):
dm-0: ios=1/13680669, merge=0/0, ticks=1768/415097120, in_queue=415108824, util=99.96%, aggrios=330/13701997, aggrmerge=10/329684, aggrticks=474592/413980824, aggrin_queue=414435312, aggrutil=99.96%
sda: ios=330/13701997, merge=10/329684, ticks=474592/413980824, in_queue=414435312, util=99.96%
So with 4k random writes and 8 processes using synchronous read/write (ioengine=sync), it looks like ~7.3k IOPS can be sustained from the RAID controller.
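As a rough cross-check of that figure, the write rate hitting the device can be watched while the benchmark runs; a minimal sketch, assuming the array shows up as sda as in the fio disk stats above (the w/s column is the sustained write IOPS):
# extended per-device stats for sda, refreshed every 5 seconds
iostat -dxk sda 5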
TODO: more comprehensive tests with varying block sizes, sync/async IO, etc., and a performance comparison (a possible sweep is sketched below).
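A possible starting point for that TODO is sweeping block size and IO engine with fio's command-line options; a minimal, untested sketch reusing /a/tmp/ from above. Note that --direct=1 bypasses the page cache, unlike the buffered run documented here, and that an async engine such as libaio would also want a larger --iodepth:
# sweep block sizes for both sync and async engines, one output file per run
for engine in sync libaio ; do
  for bs in 4k 16k 64k 256k 1m ; do
    fio --name=sweep-${engine}-${bs} --directory=/a/tmp/ --rw=randwrite \
        --ioengine=${engine} --direct=1 --bs=${bs} --size=2g --numjobs=8 \
        --group_reporting --output=sweep-${engine}-${bs}.out
  done
done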