Debugging the machine overload
One of the most common issues on Linux machines that is directly related to diagnosing low throughput is working with kernel overload, also known as CPU load. Before discussing the issue, it is important to understand what CPU load is.
CPU Load in a Nutshell
CPU load indicates the average number of processes waiting in the queue for their turn to be computed by the kernel. CPU load is expressed in numerical units and typically ranges from 1-50. To see the output of the specific load of a server, you can enter the command “uptime”.
➜ ~ uptime 12:55 up 1 day, 3:51, 1 user, load averages: 1.76 1.81 1.61
The output shows three values of the averaged load:
- Average load for 1 minute
- Average load for 5 minutes
- Average load for 15 minutes
This allows us to see the fluctuation of the load over time and easily determine if there was a spike in CPU load that may have caused strain on the server.
The load value is variable and when load = 1, it means that one calculation is tied to that CPU and is being computed on it. For example, if you have an 8-core processor and the load is at 1, each of the 8 cores is being used by one process. It is worth noting that, hypothetically, if the load is 0, then the system is completely idle.
How High is High Enough?
There is no simple answer to this question as it depends on the situation. A load value of 50 may not necessarily slow down the system, while a load of 10 could render the system completely unusable. Generally, the lower the load value, the better. Ideally, the load should be around 1 for most of the time.
What really matters when debugging the system is not the load value, but rather why the value is what it is. There are generally three ways that can cause high CPU load:
- CPU bounded load – load due to kernel overload
- OOM load – load due to insufficient RAM
- I/O load – load due to high I/O (disk usage)
TOP Command and its Output
The “top” command is used to visualize Linux processes. When you enter the command, you will see a lot of information, some of which are informative (uptime, number of users). Here is the beginning of the top command header:
top - 13:13:18 up 11 days, 12:55, 2 users, load average: 0.00, 0.00, 0.00
Tasks: 126 total, 1 running, 125 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 1.6 sy, 0.0 ni, 98.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1946.2 total, 92.6 free, 131.9 used, 1721.6 buff/cache
MiB Swap: 3814.0 total, 3455.1 free, 358.9 used. 1615.9 avail Mem
In addition to the header, you will also see a list of various kernel processes. The %CPU column shows the percentage of CPU usage for each process. Here is a breakdown of the abbreviations used in the %CPU column:
- us: user CPU time. This shows the percentage of kernel power used by a user of the system. This could be Apache, MySQL, or a shell script. If the user CPU value is high, it is a CPU bounded load problem.
- sy: system CPU time. This shows the time used for system calculations, usually something that the kernel calculates. If the system CPU value is high, it is a CPU bounded load problem.
- id: CPU idle time. This shows the time when the processor could be doing something but is not. The higher the CPU idle percentage, the better.
- wa: CPU I/O wait. This shows the time when the processor wants to compute something but cannot because it is waiting for a read or write operation. If this value is high, it is often a problem with disks.
Three Main Factors for Diagnosing CPU Load
CPU bounded load
This type of load has become quite negligible nowadays due to the exponential development and supply of highly oversized server processors to the market and end consumers. However, we may still encounter it, for example, on shared hosting, cheap VPSs, or poorly written programs. We can easily identify a CPU bounded problem by the high values of user CPU time and system CPU time, as well as by observing a specific process in the top
command that is consuming a high %CPU value.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24475 root 20 0 246m 66m 3504 S 90 0.7 0:00.91 update_conf.pl
3911 mysql 20 0 1932m 374m 7664 S 1 3.7 8:27.17 mysqld
In this case, we can see that a shell script running as root named update_conf.pl
is taking up 90% of the computational resources. There are two possible scenarios for a CPU bounded load:
- One process is consuming one specific core at 99%
- Multiple processes are competing for one core
In both cases, it is relatively easy to identify them in the top
command and the solution is simple – add computational resources to the machine. Finally, note that if an application is written as multithreaded, it may create multiple threads on one CPU, thus also overwhelming the CPU, which can lead to increased load even though no specific processes are visible in top
that are consuming the core.
out of memory load
Another reason for high load is the saturation of RAM, which typically results in the system starting to swap heavily. Since swaps are typically located on slow disks (compared to the speed of RAM), every process that moves from RAM to swap will start to crawl. This typically triggers a spiral where other processes that follow the slowed down processes in RAM also start to crawl, and the system starts to slowly collapse. It is important to be careful when diagnosing OOM loads, as if the system is writing to swap and swap is on disks, the diagnosis and debugging can easily be confused with I/O load. Therefore, before diagnosing an OOM load, it is advisable to check the RAM status in the top
command.
MiB Mem : 8985.5 total, 4336.9 free, 2261.0 used, 2387.6 buff/cache
MiB Swap: 3814.0 total, 3802.7 free, 11.3 used. 5592.9 avail Mem
In the Mem
and Swap
sections, we see the following values:
- Total: total capacity
- Free: how much RAM is free and unused
- Used: used RAM
Above all, the used
value often leads to confusion because it does not tell us anything about how data outside of RAM is being handled. If we delve a little deeper into the theory of how Linux systems work with cache, we will see the reason why.
File Cache in Linux Systems
When a file is accessed, the Linux kernel temporarily stores it in its RAM and when it is not needed for a while, it deletes it again. If a sufficient amount of RAM is available, the kernel tries to store everything possible in RAM. Then, when I want to access the same file again, the access speed is multiplied and the entire system starts to perform with high speed for the user’s data work. As the system runs, other processes occasionally need RAM, so the kernel deletes some files that are not accessed much and frees up space in RAM for processes. In many ways, it is much better to let the kernel work with RAM and not bother it with creating RAM disks. Therefore, in the top
command, we will not see as much in the used
section as we will see with the free
command with the -hm
flag (human readable).
➜ ~ free -hm
total used free shared buff/cache available
Mem: 8.8Gi 2.2Gi 4.2Gi 756Mi 2.3Gi 5.5Gi
Swap: 3.7Gi 11Mi 3.7Gi
In the Mem
section, we see the following values:
- Total: total RAM capacity
- Used: how much RAM is used
- Free: how much RAM is free and unused
- Shared: how many resources from RAM outside are shared or accessed by different programs, accessing the same content that is cached on a particular medium
- Buff/cache: This is the combined value of two values
- Buffer: how much is in the buffer for kernel needs
- Cache: how many files are maintained in the page cache
- Available: how much RAM is available if needed
If we suspect that our RAM is running out, we will notice it not only because our swap is running out, but also in the top
command in the %MEM
column for individual processes:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
401 mongodb 20 0 1520184 114892 28420 S 0.7 21.9 11:06.30 mongod
409 elastic+ 20 0 4836616 1.3g 22652 S 0.7 22.4 7:10.59 java
491 graylog 20 0 4849688 911716 19772 S 0.7 24.9 10:44.12 java
1 root 20 0 169400 6092 4680 S 0.0 0.1 0:02.04 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
The solution to the OOM load problem is easy, just increase the RAM.
I/O Load
I/O load is the final type of load that often occurs when Linux systems slow down, and we can recognize it mainly by the increased value of CPU iowait. However, we must be careful not to confuse this phenomenon with OOM load. The issue of I/O load is very complex and has been a key pillar of low throughput diagnostic issues for us. This problem could be the subject of a book, but we will limit ourselves to the essentials. Our basic tools for working with diagnostic issues are:
- iostat
- iotop
iostat
Iostat is a relatively complicated and complex tool, so for the purposes of this article, we will limit ourselves to the essentials. Iostat shows us basic input and output statistics. Iostat is not part of most distributions and needs to be installed via the sysstat package. The output of the extended iostat looks like this:
➜ ~ iostat -xh
Linux 4.19.0-16-amd64 (ispconfig.cnnc.cz) 04/03/2021 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.2% 0.0% 1.4% 0.2% 0.0% 96.2%
r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util Device
0.72 1.93 17.9k 40.4k 0.00 2.51 0.4% 56.6% 0.55 2.11 0.00 24.9k 21.0k 0.41 0.1% sda
0.79 0.85 9.1k 19.0k 0.00 1.88 0.2% 68.9% 13.46 11.64 0.02 11.5k 22.4k 3.40 0.6% sde
0.06 7.07 1.3k 138.9k 0.00 1.64 0.4% 18.9% 0.56 0.64 0.00 22.5k 19.7k 0.27 0.2% sdc
0.00 0.00 0.1k 0.0k 0.00 0.00 0.0% 10.5% 0.42 0.74 0.00 27.8k 3.2k 0.32 0.0% sdb
0.00 0.00 0.1k 0.0k 0.00 0.00 0.0% 36.2% 0.39 0.78 0.00 25.0k 5.9k 0.32 0.0% sdd
In the header, we see the sum of avg-cpu similar to top. The content is much more interesting, from which we collect only what is essential to us. Before the actual output, the prefix r is for reading and the prefix w is for writing:
rrqm/s
: the number of merged read requests per second that were in the queue on the devicewrqm/s
: the number of merged write requests per second that were in the queue on the deviceareq-sz
: the average request size for writing before it is processedawait
: the average time in ms for requests (read or write) before they are processedaqu-sz
(formerlyavgqu-sz
): the average time a request spent in the queue before being processedsvctm
: Note – this point no longer makes sense, it will be removed in the future.%util
: how much the device is utilized in percentagesDevice
: which device is being used
For I/O CPU debugging purposes, we are only interested in three values:
Device
Util
%iowait
These will tell us whether it is an I/O wait and then which disk is busy and how busy it is. As we can see, iostat
is not a sufficient tool on its own. We need something more.
iotop
Iotop is an analog to the top
command for I/O operations and it will help us find the process that is overloading our machine.
In the output, we see a relatively clear description of the total reads, total writes, current reads, and current writes.
Total DISK READ: 0.00 B/s | Total DISK WRITE: 31.20 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 54.60 K/s
Below that, we see individual processes and their read/write values.
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
561 be/4 mongodb 0.00 B/s 7.80 K/s 0.00 % 0.08 % mongod --config /etc/mongod.conf [WTJourn.Flusher]
215 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.06 % [jbd2/sda2-8]
758 be/4 graylog 0.00 B/s 3.90 K/s 0.00 % 0.03 % java -Xms1g -Xmx1g -XX:NewRatio=1 -se~/server/server.conf -np [scheduled-25]
568 be/4 mongodb 0.00 B/s 15.60 K/s 0.00 % 0.00 % mongod --config /etc/mongod.conf [ftdc]
582 be/4 elastics 0.00 B/s 3.90 K/s 0.00 % 0.00 % java -Xms1g -Xmx1g -XX:+UseConcMarkSw~/elasticsearch.pid --quiet [VM Thread]
The solution is not entirely trivial and is up for discussion. It could be a disk or controller failure, low throughput on switches, an overloaded array that cannot keep up with some application, or many other phenomena.