Beware of svctm in Linux’s iostat
I’ve been studying the source of iostat again and trying to understand whether all of its calculations I explained here are valid and correct. Two of the columns did not seem consistent to me. The await and svctm columns are supposed to measure the average time from beginning to end of requests including device queueing, and actual time to service the request on the device, respectively. But there’s really no instrumentation to support that distinction. The device statistics you can get from the kernel do not provide timing information about device queueing, only a) begin-to-end timing of completed requests and b) the time accumulated by requests that haven’t yet completed. I concluded that the await is correct, but the svctm cannot be.
I just looked at the sysstat website, and it has been updated recently to warn about this, too:
svctm
The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.



Can you clarify this? I’m not sure I understand why the instrumentation isn’t there, at least for a single disk. Is the underlying metric mis-calculated?
My understanding is that as long as utilization and throughput are provided, then you should be able to calculate service time (excluding queue time), right? Utilization Law being U = SX.
Sure, ‘utilization’ is wrong for an array, but for a single disk this is ok, isn’t it?
Nathan Webb
12 Sep 10 at 9:00 am
Little’s Law is that the number of jobs in the system is equal to the arrival rate times the service time. (All of the preceding are expressed as long-term averages.) You can compute response time, but not service time, with Little’s Law. Here I am using the terms carefully — response time is from the beginning of the job to the end of the job, and is the sum of the service time and the queue time.
The utilization number that iostat prints is at best misleading, too. It does not show the effects of concurrency, which is possible on a device with many disks behind it.
The kernel does not expose any instrumentation for service time or queue time, only response time. You cannot calculate queue time or service time from measurements of response time.
Xaprb
12 Sep 10 at 11:39 am
Yes, that’s Little’s Law (N=XR), but using the Utilization law (U=SX), you can calculate Service time as Utilization divided by throughput.
Again, though, this is only OK for a single disk. With the way that utilization is calculated by iostat, it would overstate the service time when dealing with multiple disks.
Nathan Webb
13 Sep 10 at 7:45 am
Nathan, your Utilization Law is just Little’s Law solved for a different dependent variable. It is valid. The problem is you are assuming there is some special meaning to the term “service time.” There isn’t. This is the same mistake that iostat makes. When utilization is computed over the total time spent in the system including the queueing, then “service time” will be in the same units. The time spent on the disk, as opposed to in the queue, is NOT instrumented. There is no magical way to derive what isn’t measured.
I urge you to look carefully at the exact documentation of /proc/diskstats. Follow the first link in the blog post above.
Xaprb
13 Sep 10 at 8:47 am
No, it’s definitely measured. OK, so after looking at the documentation you mentioned, I’m certain that it is correct for a single disk. I’ve also double-checked the source code for sysstat, in particular, iostat.c and common.c
I can’t see how you’ve concluded that “utilization is computed over the total time spent in the system including the queueing”.
Utilization, as calculated by iostat, is ‘change in Field 10’ divided by the time interval. For a single disk this is valid: it simply counts the device as ‘busy’ whenever it was doing something and idle whenever it was not. This is the standard Utilization = Busy time / total time interval.
Queuing doesn’t matter at this point, nor does the depth of the queue, nor does the amount of time spent in queues. It’s simply a count of how much time the device was doing something during the observation period.
From here it is easy to calculate service time (using standard Queuing Theory definition, excluding wait time), as Utilization divided by Throughput.
Also, my reading is that svctm is due to be removed because of a bug in the kernel, where field #9 sometimes has negative numbers.
Nathan Webb
14 Sep 10 at 2:08 am
ALL of the timing you get from the kernel is from __make_request() to end_that_request_last(). The __make_request() call puts the request onto the queue struct for the device. THERE IS NO MEASUREMENT FROM WHEN IT IS TAKEN OFF THE QUEUE AND SERVICED BY THE DEVICE. “Utilization of the device” is really “utilization of the request_queue struct plus the device.” Change in field 10 includes the time in the queue: “This field increases so long as field 9 is nonzero” and field 9 is “Incremented as requests are given to appropriate struct request_queue and decremented as they finish.” What you are calling “service time” is the time from when the request is given to the queue until the IO is completed.
Xaprb
14 Sep 10 at 8:31 am
Relax – no need to shout. Just chill out, take a step back, and try to rethink this. Seriously, I’m not new at this, and I use it regularly for my work. If it’s wrong, I need to understand why. Can you please have a look at this example (below), and let me know where the problem is. IMHO, there isn’t a problem (except for multiple disks OR where the kernel has the wrong value, including negatives, in field 9 due to a bug, as previously stated).
Here’s the example – hopefully this is formatted ok – it needs fixed-width:
Two read IOs arrive during a 10ms interval, the first at 2ms, the second at 3ms. The first read doesn’t have to wait, but the second read has to wait for 2 ms while the first is being serviced. The values from /proc/diskstats are printed at the bottom (reset from 0).
Remember: Field 10 is *only* incremented while Field 9 is non-zero, and it is only incremented by the amount of time since the last update (which in my example is every 1 millisecond).
Time (ms) | Read A  | Read B  | Field 9 | Field 10
0         |         |         | 0       | 0
1         |         |         | 0       | 0
2         | svctm=1 |         | 1       | 1
3         | svctm=2 | wait=1  | 2       | 2
4         | svctm=3 | wait=2  | 2       | 3
5         |         | svctm=1 | 1       | 4
6         |         | svctm=2 | 1       | 5
7         |         | svctm=3 | 1       | 6
8         |         |         | 0       | 6
9         |         |         | 0       | 6
10        |         |         | 0       | 6
from /proc/diskstats
Field 1 = 2
Field 2 = 0
Field 3 = 2
Field 4 = 8 <— This includes the 2ms of waiting
Field 5 = 0
Field 6 = 0
Field 7 = 0
Field 8 = 0
Field 9 = 0 <— currently no IO, so it's 0
Field 10 = 6 <— Wow! missing the wait time!
Field 11 = 8
Finally, according to common.c of sysstat:
– utilization = field 10 divided by time = 6ms / 10ms = 60%
– tput = (field 1 + field 5) per second = (2 + 0) * 100 = 200 iops
– svctm = utilization / tput = 0.6 / 200 = 0.003 second = 3ms
So in this example, svctm = 3ms, which is correct and doesn't include wait time.
Nathan Webb
14 Sep 10 at 10:17 pm
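The arithmetic in the example above can be reproduced with a short Python sketch. It assumes you already have the changes in the relevant /proc/diskstats fields over one interval; the function name and argument names are made up for illustration and are not taken from sysstat.

```python
def iostat_metrics(d_reads, d_writes, d_read_ms, d_write_ms, d_io_ms, interval_ms):
    """Derive iostat-style figures from deltas of /proc/diskstats fields.

    d_reads    -- change in field 1 (reads completed)
    d_writes   -- change in field 5 (writes completed)
    d_read_ms  -- change in field 4 (ms spent reading)
    d_write_ms -- change in field 8 (ms spent writing)
    d_io_ms    -- change in field 10 (ms during which I/O was in progress)
    """
    ios = d_reads + d_writes
    util = d_io_ms / interval_ms                 # U = busy time / elapsed time
    iops = ios * 1000.0 / interval_ms            # X, in I/Os per second
    svctm_ms = d_io_ms / ios if ios else 0.0     # S = U / X, expressed in ms
    await_ms = (d_read_ms + d_write_ms) / ios if ios else 0.0  # R, in ms
    return {"util": util, "iops": iops, "svctm_ms": svctm_ms, "await_ms": await_ms}


# Numbers from the two-read example: fields 1, 4 and 10 over a 10 ms interval.
m = iostat_metrics(d_reads=2, d_writes=0, d_read_ms=8, d_write_ms=0,
                   d_io_ms=6, interval_ms=10)
print(m)
```

With those inputs the sketch gives 60% utilization, 200 iops and a 3 ms svctm, matching the figures above, plus a 4 ms await (8 ms of response time over 2 reads).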
I am relaxed, sorry, didn’t mean to shout. It just seemed that you were not seeing the key thing I was trying to point out.
Your example is correct as shown because it’s a single-disk system and there can’t be any real concurrency on it. But what if the IOs are merged because they are adjacent? It’s a happy coincidence that this works out OK because when there is no concurrency and no IO merging, field 10 is simply the sum of disk service times. But in the general case (and I guess you’re working with single disks for some reason, but I can’t remember the last time I saw a single-disk database server, which is what I work with), it doesn’t work. When the measurement is showing aggregated stats due to several things happening at once, trying to get the disk service time separately from the queue wait would be like trying to get the disk head seek time separately from the time taken to read while the platter rotates. The time is spent in there somewhere, but who knows where :-)
In your example, too, you’d be better off computing time separately for reads and writes, as field 4 and 8 divided by 1 and 5, respectively. You won’t get rounding errors caused by the more complex calculations iostat does. Sometimes these actually matter a lot when the denominators are small in the iostat computations, because the precision of the kernel stats is relatively coarse-grained.
On another note, I have also seen the kernel bug you mentioned, which not only causes field 9 to be wrong, but field 10 as well. Field 10 isn’t reliable in my observations. I could dig out a sample and send it to you if you like.
Xaprb
15 Sep 10 at 7:59 am
You won’t accept that you made a mistake, will you? ;)
I’m pretty sure I mentioned that this is only for a single disk? Maybe 5 times?
FYI, Databases are often the worst-case for this calculation, due to the small IOs resulting in many concurrent IOs to disks. *But* other multi-disk scenarios, which involve large sequential reads and writes, e.g. file servers, DMS, media servers (i.e. VOD) actually don’t deviate too much from the single-disk scenario. The reason for this is because the IO tends to be balanced across disks, and when the IOs span the entire stripe set, the whole set’s utilisation is pretty much the same as the scenario for a single disk.
But, if you’re only interested in Databases, why the hell aren’t you using the SAN stats directly, instead of iostat? There’s your instrumentation….
BTW, IO Merging makes no difference. And let me repeat – time spent waiting in queues has no bearing on any of this. It’s not included in any of the relevant calculations. Refer to my example to see why it isn’t relevant to working out Service Time. I included it to make a point, which is that it isn’t involved in any calculations of utilisation or service time.
Your third paragraph isn’t right for several reasons. There’s nothing mathematically complex in either iostat.c, or common.c, or S=U/X.
No need to send me any samples about field 10 being wrong – I understand that, which is why I mentioned that the calculations are wrong if field 9 goes negative.
Nathan Webb
15 Sep 10 at 8:56 pm
You win: you did mention a single disk every time :-) I’m sorry that I got so focused on my point that I missed yours. I hope I didn’t upset you too much. At least I get to look silly on my own blog instead of someone else’s!
Many enterprises use SANs, but most of the customers I consult for do not. They generally use RAID10 arrays of 6 to 20 or so 15k RPM 2.5-inch server-class disks.
I’ve got lots of samples, which I know you don’t want, where theoretically equivalent equations don’t give a reasonable answer due to the coarse granularity of the measurements. When the device is responding in tenths of a millisecond or less, it makes a big difference which exact formula is used. With the advent of devices like FusionIO cards, we really need more accurate instrumentation. I should not have mentioned rounding errors — it isn’t rounding that’s the problem.
Xaprb
15 Sep 10 at 9:51 pm
Nathan, I’ve finally seen the rest of the point you were making (I think). These formulas are independent of queueing. You are right and I am wrong. In fact, now I remember that I knew that about Little’s Law once upon a time. Thanks for being patient with me while I insisted that I was right.
Xaprb
16 Sep 10 at 7:09 am
No problem, and no offence taken. It would have been better if I could have made my point more clearly in the first place.
Nathan Webb
19 Sep 10 at 7:35 pm
Guys, both of you, thanks very much. I’ve been trying to code Nagios plugins around these concepts and, as you can guess, most of my confusion has centered around how best to express svctm and await. Your discussion has helped tremendously.
Thanks again!
Mike R
25 Sep 10 at 6:01 pm
p.339 of High Performance MySQL gives formula:
concurrency = (r/s + w/s) * (svctm/1000)
Any thoughts on whether you would still use this formula, or modify it in some way?
gabriel
29 Sep 10 at 7:49 pm
Hi Gabriel,
That formula isn’t correct as the parts on the right are:
(r/s + w/s) = iops = throughput (X)
(svctm/1000) = Service time (S)
Using the Utilization Law, we can see that X*S equals utilization (U), not concurrency (N).
The formula should be:
N = XR
concurrency = (r/s + w/s) * (await/1000)
or
U = XS
utilization = (r/s + w/s) * (svctm/1000)
Without knowing the context of the formula, I’ll guess that the authors are trying to say how many concurrent transactions the system will support, given the service time of the disks. If that’s the case, then it’s a bizarre question, as a system will support any number of concurrent transactions, depending on what response time is seen as acceptable. E.g. a batch system doesn’t care about concurrency and response time as long as the throughput is as high as possible, and service levels are still being met. If I’ve got the context wrong, can you clarify?
The thing with Little’s Law and concurrency is that as throughput increases, utilization also increases, which leads to queuing, and a degradation of response time. In turn, transactions take longer to get through the system, which in turn leads to more concurrent transactions, without bounds.
BTW, I use formulas like the ones mentioned (U=SX and N=XR) all the time to figure out what level of utilization leads to unacceptable response times, and how many concurrent transactions should I expect to be able to handle while response time and utilization remain acceptable. So to answer your question, No, I wouldn’t use that formula, but I do use similar.
Nathan Webb
29 Sep 10 at 10:01 pm
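The two laws in the comment above translate directly into code. This is only a sketch: the function names are invented here, and the sample values are the r/s, w/s, await and svctm columns of one busy sda line from the iostat output quoted further down this thread.

```python
def concurrency(r_per_s, w_per_s, await_ms):
    """Little's Law, N = X * R: average number of requests in the system."""
    return (r_per_s + w_per_s) * (await_ms / 1000.0)


def utilization(r_per_s, w_per_s, svctm_ms):
    """Utilization Law, U = X * S: fraction of time the device was busy."""
    return (r_per_s + w_per_s) * (svctm_ms / 1000.0)


# r/s = 105.01, w/s = 1.00, await = 12.50 ms, svctm = 9.05 ms
n = concurrency(105.01, 1.00, 12.50)
u = utilization(105.01, 1.00, 9.05)
print(round(n, 3), round(u, 3))
```

For that sample N comes out at about 1.33 and U at about 0.96. The same line reports avgqu-sz as 1.32 and %util as 95.91, so the two formulas roughly reproduce columns iostat already prints.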
I’m now gun-shy and afraid to be sure I’m right about that formula. The intention there was not to measure the concurrency of requests TO the disk device, which would include queued requests, but to measure how many requests were concurrently resident IN the device. I hope I’m making that clear. In some filesystems, per-inode mutexes can prevent multiple requests from being sent to the device at once. When you have a RAID array with 10 disks, and only one request is getting through to the disk at once, that matters a lot, and you want to know when it is happening.
What’s confusing to me now is that in practice, for say a 10-disk array, I remember that the formula usually gave a pretty good answer. For filesystems where I know this problem exists, and a lot of multi-threaded IO was happening to a single huge file, I would see a result of pretty close to 1. For filesystems where that’s not an issue, the result was usually pretty close to 10. However, it’s been a while since I’ve actually done this calculation, so I think I need to revisit this.
Xaprb
2 Oct 10 at 1:59 pm
I should have said that I’m afraid to be sure that *my current thought process about that formula* is right or not. My current thought process is that Nathan is right, it gives the utilization, which can (correctly) go to 10 on a system with 10 disks underneath it.
Xaprb
2 Oct 10 at 2:03 pm
Hey guys (nathan and xaprb). Your post and comments are amazing. I am impressed that you guys have sifted through the source code and provided elaborate examples.
For me the epiphany was realizing that %util in a multi-disk environment does not represent actual device saturation.
I would like to however pen down and re-confirm why this is so from you veterans :). Can you confirm the below logic -
field 10 is incremented every millisecond if the value of Field 9 is non-zero. This would mean that as long as there is even “1” IO request being serviced field 10 will be incremented. Now even if the underlying array of disks is capable of servicing multiple requests, the fact that field 9 is a non-zero value does not mean that the underlying system cannot take any more IO requests. Therefore though %util would show up as 100% it does not actually mean the underlying disk subsystem is 100% utilized. It may actually mean that the device is still capable of handling more requests.
If the above is correct I request you folks to assist me on the following questions -
* how would one then determine actual device saturation when one has no SAN stats to go by. Is there any way to determine the same?
* is there any way to calculate as to how many simultaneous IO requests can be serviced by an underlying system? is it merely directly proportional to the number of disks or is there any calculation one may perform to obtain this count?
* additionally, in the above examples you mention that multiple disk systems that largely have sequential IO requests would closely mimic a single disk system. I did not entirely understand this part. However here is my attempt. In case of largely sequential IO requests, in a multi disk system which is striped, all disks are busy servicing the single IO request and hence even if field 9 has a value of 1 it represents 100% device saturation. As opposed to this in random IO, since each IO request could be serviced by independent disks separately (especially if the data requested is smaller than the stripe size per disk) a value of “1” in field 9 does not signify 100% utilization.
thanks in advance for any response.
- Bhavin
CEO, Directi
Bhavin Turakhia
2 Nov 10 at 12:27 am
I had one more follow-up question. Would the %util value be accurate in case of a single flash drive? Is a single flash drive capable of performing more than one IO request at a time?
Bhavin Turakhia
2 Nov 10 at 12:29 am
:) … and yea, one more confirmation – barring the “negative values in field 9” bug, svctm is inaccurate inasmuch as it will always report a higher value than the actual value. In fact the quantum of inaccuracy is dependent on how wrong the %util is, i.e. if %util is being reported as twice the actual amount then svctm is being reported as twice the actual svctm.
However if one is using svctm to generally get an indication of the ratio of the amount of time a request spends in queue vs the time spent in actually servicing the request by using await/svctm, then a high ratio would as such signify that most requests are spending a long time in queue as opposed to actual time being serviced. Would that be correct?
Bhavin Turakhia
2 Nov 10 at 12:36 am
Bhavin, in general it is very difficult to know exactly what happens on the underlying disks for, say, a RAID10 volume. The RAID controller might merge adjacent requests, reorder them, cache them, buffer them, predict and prefetch… and it has its own mapping of physical to logical block location, too.
The %util from iostat just shows what percent of the time the device as a whole was busy.
Flash vs spindle doesn’t really matter.
iostat is a useful tool (and so is sar). But in general I work directly with /proc/diskstats these days, and bypass iostat.
Xaprb
3 Nov 10 at 9:58 pm
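For anyone who wants to work with /proc/diskstats directly, as mentioned above, here is a minimal Python sketch. It assumes the layout of major number, minor number and device name followed by the eleven fields discussed in this thread. The field names are labels chosen here for readability, and the two sample lines are invented so that their difference matches the two-read example earlier in the comments.

```python
FIELD_NAMES = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",            # fields 1-4
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",       # fields 5-8
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",                   # fields 9-11
)


def parse_diskstats(text):
    """Return {device_name: {field_name: int}} from /proc/diskstats content."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(FIELD_NAMES):
            continue  # skip blank lines and entries that lack the full field set
        values = [int(v) for v in parts[3:3 + len(FIELD_NAMES)]]
        stats[parts[2]] = dict(zip(FIELD_NAMES, values))
    return stats


def delta(before, after):
    """Difference of the cumulative counters (field 9 is a gauge, so it is skipped)."""
    return {k: after[k] - before[k] for k in FIELD_NAMES if k != "ios_in_progress"}


# Invented samples; on a real system read the file twice, some interval apart.
sample_before = "   8       0 sda 100 0 800 400 50 0 400 300 0 500 700\n"
sample_after = "   8       0 sda 102 0 816 408 50 0 400 300 0 506 708\n"

d = delta(parse_diskstats(sample_before)["sda"], parse_diskstats(sample_after)["sda"])
print(d)
```

On Linux the text would come from `open("/proc/diskstats").read()`. Because all fields except field 9 are cumulative counters, only the difference between two samples is meaningful.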
If you are able to perform 10 IO operations in 50ms, would you consider it inaccurate to state that they completed with an average time of 5ms? What if those 10 IO operations are running concurrently and each operation actually takes a full 50ms to complete? Now do you say they had an average time of 5ms, or 50ms?
It looks like await and svctm are just different ways of calculating that average.
On a server that only handles one IO at a time, the difference in averages also happens to match up to the amount of time spent waiting.
Darcy Sherwood
16 Nov 10 at 8:14 pm
Darcy, the terms you’re using are ambiguous. Does “in 50ms” mean “during a 50ms interval” or “with a total time of 50ms”, for example? If 10 operations are running concurrently and each takes 50ms to complete, then field 11 in /proc/diskstats will show 500ms of weighted time.
Xaprb
17 Nov 10 at 3:09 pm
Let’s say that during a 50ms interval 10 IO operations run concurrently, each taking the full 50ms to complete. Field 10 will be 50, resulting in a svctm of 5, and field 11 will be 500, resulting in an await of 50.
Now let’s say that during a 50ms interval 10 IO operations run one at a time, taking 5ms each. Field 10 will be 50, resulting in a svctm of 5, and field 11 will be 50, resulting in an await of 5.
During both of these scenarios 10 IO operations complete during a 50ms period. svctm shows this by having the same value both times. So while the value for svctm is not incorrect, it is a bit confusing, but it’s hard to argue with 10 IOs in 50ms = 5ms/IO.
Since await on the other hand calculates weighted averages, it shows you the average time that each IO operation took, rather than the average time for all of the IO operations. Perhaps that’s the best way to describe the differences; await is the average time for a single IO operation to complete, svctm is the average time over all of the IO operations.
So if you’re curious to see the average time for an IO operation to complete, you want to be looking at await.
Darcy Sherwood
17 Nov 10 at 7:32 pm
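The two scenarios in the comment above can be checked with a small Python sketch. It models each I/O as a (start, end) pair in milliseconds, queueing included, and assumes that field 10 accumulates the time during which at least one request is outstanding while field 11 accumulates the total time all requests spend in the system. The function name is hypothetical.

```python
def busy_and_weighted_ms(requests):
    """requests: list of (start_ms, end_ms), one pair per I/O.

    Returns (busy_ms, weighted_ms): the length of the union of all the
    intervals (modelling field 10) and the sum of their lengths
    (modelling field 11).
    """
    weighted = sum(end - start for start, end in requests)
    busy = 0
    current_end = None
    for start, end in sorted(requests):
        if current_end is None or start > current_end:
            busy += end - start          # a new busy period begins
            current_end = end
        elif end > current_end:
            busy += end - current_end    # extends the current busy period
            current_end = end
    return busy, weighted


concurrent = [(0, 50)] * 10                            # ten I/Os at once, 50 ms each
serial = [(i * 5, (i + 1) * 5) for i in range(10)]     # ten I/Os back to back, 5 ms each

for label, reqs in (("concurrent", concurrent), ("serial", serial)):
    busy, weighted = busy_and_weighted_ms(reqs)
    print(label, "svctm =", busy / len(reqs), "await =", weighted / len(reqs))
```

Both cases give a svctm of 5 ms, while await is 50 ms for the concurrent case and 5 ms for the serial one. Feeding in the two-read example from earlier in the thread, [(2, 5), (3, 8)], gives 6 ms busy and 8 ms weighted, which matches fields 10 and 11 there.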
Hi Darcy,
I suggest that you should read through the comments to get an understanding of where we are up to with this.
Just to re-iterate, svctm is the average amount of time that the disk spends servicing IOs, excluding wait time. await is the svctm + the average time spent waiting in the queue.
The examples you have used are problematic, and give a false understanding of the differences and meanings of service time and average wait time.
Firstly, as I’ve stated (and restated), “Sure, ‘utilization’ is wrong for an array, but for a single disk this is ok, isn’t it?”. So discussions about a 10 disk array and 10 concurrent IOs aren’t useful as the results will be erroneous.
Your next example, of 10 IOs arriving one at a time, is rather unusual. Most operations arrive in a random manner, mathematically known as a Poisson distribution. With a random, memory-less arrival rate, you will get some transactions arriving at roughly the same time, and some of those transactions will need to wait to be serviced. In your example, each transaction arrives exactly as the previous one completes, and therefore the wait time = 0. This can happen, but only in unusual, non-random circumstances. The result of wait time = 0 means that await = (svctm + 0) = svctm, as your example shows. Normally when there are multiple IOs, wait time will be non-zero, and await will not equal svctm.
This statement is wrong: “await is the average time for a single IO operation to complete, svctm is the average time over all of the IO operations”
It’s wrong for several reasons, one reason being that a single IO operation doesn’t have an average time. An average can only apply to multiples.
Now, to address some of the confusion, and the question that Bhavin asks, there are ways to estimate the concurrency. You can use the average size of each IO, and knowledge about the RAID configuration, to very roughly estimate the concurrency (over long periods of time). BTW, I made a typo earlier when I said that iostat overestimates the svctm for concurrent transactions. That should be underestimates, as Darcy’s example so clearly shows. I would also use the physical disk characteristics (calculate theoretical service time) to validate my estimates. This can be useful, but I generally just create a model that links the application response time with the number of disk IOs. If that doesn’t work, then I might do the above to fine-tune my model.
It’s definitely not precise, and as xaprb makes clear, the RAID controller will do some merging and re-ordering, etc…
Nathan Webb
18 Nov 10 at 9:07 pm
Fantastic discussion – thank you!
I am still interested in concrete examples or calculations that show how to tell that a (database or other app) system has a “disk problem”, meaning that service times are too large and that striping across more disks is required to fix that.
Would the above formula
U = XS
utilization = (r/s + w/s) * (svctm/1000)
be the right way to say that the disk (or disk array) is too small / too slow to handle the burden?
Is this formula “better” than what “iostat” calculates for %util?
Thank you for any hint.
Rudolf
29 Nov 10 at 9:09 pm
I added the formulas for Concurrency and Utilisation to the output of iostat, multiplied by 100 to get comparable numbers:
sda 0.33 2.44 1.37 0.97 93.15 27.27 46.57 13.64 51.44 0.03 13.38 4.49 1.05 C:3.13092 U:1.05066
sda 16.12 24.32 105.01 1.00 4693.49 202.60 2346.75 101.30 46.19 1.32 12.50 9.05 95.91 C:132.513 U:95.9391
sda 5.59 28.97 41.06 12.39 1594.41 330.87 797.20 165.43 36.02 4.65 86.99 17.69 94.54 C:464.962 U:94.553
sda 3.90 19.02 32.83 0.70 1198.80 157.76 599.40 78.88 40.45 1.18 35.19 27.71 92.91 C:117.992 U:92.9116
sda 4.40 10.69 31.57 0.90 1294.71 92.71 647.35 46.35 42.73 1.20 36.71 27.76 90.13 C:119.197 U:90.1367
sda 5.00 19.20 34.40 0.90 1316.80 160.80 658.40 80.40 41.86 1.16 32.92 26.02 91.86 C:116.208 U:91.8506
sda 3.20 19.52 31.53 5.21 1257.26 198.60 628.63 99.30 39.63 1.66 45.07 24.86 91.31 C:165.587 U:91.3356
sda 4.40 13.19 41.96 0.60 1313.89 110.29 656.94 55.14 33.46 1.14 26.84 22.37 95.20 C:114.231 U:95.2067
sda 18.10 6.00 43.00 0.70 2532.80 52.80 1266.40 26.40 59.17 1.65 37.86 22.43 98.02 C:165.448 U:98.0191
sda 27.10 22.50 56.30 9.30 3708.80 254.40 1854.40 127.20 60.41 3.81 58.17 14.78 96.94 C:381.595 U:96.9568
sda 3.60 13.00 40.90 1.40 1260.80 115.20 630.40 57.60 32.53 1.04 24.48 21.77 92.08 C:103.55 U:92.0871
sda 3.90 9.01 39.24 0.90 1145.15 79.28 572.57 39.64 30.50 1.23 30.75 23.75 95.34 C:123.43 U:95.3325
sda 1.50 13.59 41.16 0.80 684.92 115.08 342.46 57.54 19.07 0.99 23.62 22.27 93.45 C:99.1095 U:93.4449
sda 1.00 21.90 67.60 11.70 709.60 268.80 354.80 134.40 12.34 8.16 102.87 11.65 92.40 C:815.759 U:92.3845
sda 0.00 18.12 0.00 14.51 0.00 261.06 0.00 130.53 17.99 1.16 79.71 1.69 2.45 C:115.659 U:2.45219
Utilisation seems to be the same as %util
But I still don’t get the meaning of Concurrency.
Can anybody help?
Rudolf
30 Nov 10 at 7:29 am
Sometimes, I have very high await and svctm, but very low %util:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util Conc. Util.
Thu Nov 4 02:33:13 CET 2010 dm-11 0.00 0.00 0.00 0.02 0.00 0.03 2.00 0.01 476.00 476.00 0.79 C:0.952 U:0.952
Also calculating the above Concurrency and Utilisation (here shown as C: and U:) does not help me to understand what is happening on the server.
Rudolf
30 Nov 10 at 10:37 am
@ Rudolf
“Sometimes, I have very high await and svctm, but very low %util:”
await = time in the service queue + service time
for example
await = 2 + 15; a long service time alone can create a high await time.
I think you have low IOPS and a high service time, which will generate low util because you don’t have many IOs. I feel you have an old hard disk, and that’s why it is taking so much time to service each request. I have my doubts about your service time.
Zahid Haseeb
16 Mar 12 at 4:06 pm