Swap-In Bug

Update: The bug was fixed as of Squid v1.1.17.


The profiling experiment for Squid caching proxy of the Utrecht University in Netherlands revealed a strange pattern in swap-in requests. It turned out to be a performance bug in Squid. This page discusses the phenomenon and suggests a cure for the bug. Note that there are other interesting observations based on this portion of the experiment, but this page is devoted exclusively to one particular problem described below.

The proxy was running Squid v1.1.15. At the time of writing, the Change Log for v1.1.16 did not mention any relevant modifications.

Introduction

Disk Resp. Time Vs. Size

To illustrate the problem, we plot the median disk response time of a request versus file size. Response times for files smaller than 16 KB were grouped using 1 KB granularity. Larger files used 2 KB granularity to get enough entries per group. The graph is based on 24 hour data.

Squid attempts to swap files using blocks of fixed length (e.g. 8 KB). For each I/O direction, we plot the total request response time and the time it takes to swap the first block. The "total" and "1st delay" curves for files smaller than 8 KB are the same (explained below).

Since various per I/O delays dominate disk transfer time, the size of an I/O is not very important (the number of I/Os is). This explains step-like shape of the "total" curves: If I/O block size is 8 KB, the times to read 5 KB and 8 KB are the same! Note that the first "step" on each "total" curve corresponds to the size of the Squid I/O block for this server (8 KB). "Steps" for requests larger than 3 blocks (24 KB) are not distinct: There are not enough files of that size to get a "stable" median; also disk and network delays in-between disk I/Os may spoil the picture.

The problem

The first disk delay always includes OS overhead on opening a file. Consecutive I/Os for the same file (if any) do not have this overhead. Clearly, the first delay should be the same for any file size. This holds for swap-out requests: The first delay for large files is the same as for small ones. However, this is not the case for swap-in requests(!): The first delay drops sharply for files larger than one I/O block (8 KB).

Explanation

We have spent quite some time looking for a bug in the profiling code... :)

Here is an explanation. When reading a block, Squid does not specify the actual file length. In other words, if the file size is 5 KB, then SwapIn handler will request 8 KB from diskHandleRead() via file_read() function. The diskHandleRead() function calls read(2) to get an 8 KB block but gets only 5 KB. Note that the EOF is not detected at this point since read(2) returns a positive number! Thus, diskHandleRead() assumes that there are 3 KB left in the file and schedules the second read to fetch them. The second call to read(2) returns 0 indicating an EOF condition. However, we did two read(2)'s instead of one, and a read(2) is a very expensive system call.

The SwapOut handler does not have this problem. It requests the exact amount of data from file_write().

Consequences

We get one extra read(2) call for files of all sizes because their last I/O block is almost never exactly 8 KB. The largest performance loss (more than 80% according to the graph) is for files smaller than 8 KB, of course.

The Cure

The following patch was supplied by Duane Wessels. It simply prevents diskHandleWrite() from immediately requesting more data if we read less than was requested. The code to be removed is in bold.

Index: src/disk.c
===================================================================
RCS file: /surf1/CVS/squid/src/disk.c,v
retrieving revision 1.56.2.7
diff -w -u -r1.56.2.7 disk.c
--- disk.c      1997/07/21 22:59:20     1.56.2.7
+++ disk.c      1997/10/01 16:31:55
@@ -401,6 +401,7 @@
        ctrl_dat->cur_len += len;
     ctrl_dat->offset = lseek(fd, 0L, SEEK_CUR);
 
+#if BAD_IDEA
     /* reschedule if need more data. */
     if (ctrl_dat->cur_len < ctrl_dat->req_len) {
        commSetSelect(fd,
@@ -410,6 +411,7 @@
            0);
        return DISK_OK;
     } else {
+#endif
        /* all data we need is here. */
        /* call handler */
        ctrl_dat->handler(fd,
@@ -419,7 +421,9 @@
            ctrl_dat->client_data);
        safe_free(ctrl_dat);
        return DISK_OK;
+#if BAD_IDEA
     }
+#endif
 }



http://www.cs.ndsu.nodak.edu/~rousskov/research/cache/squid/profiling/stats/ruu/swapin.bug.html
This page is maintained by Alex Rousskov (rousskov@plains.NoDak.edu).