Debugging the machine overload

One of the most common issues on Linux machines that is directly related to diagnosing low throughput is working with kernel overload, also known as CPU load. Before discussing the issue, it is important to understand what CPU load is.

CPU Load in a Nutshell

CPU load indicates the average number of processes waiting in the queue for their turn to be computed by the kernel. CPU load is expressed in numerical units and typically ranges from 1-50. To see the output of the specific load of a server, you can enter the command “uptime”.




➜ ~ uptime 12:55 up 1 day, 3:51, 1 user, load averages: 1.76 1.81 1.61

The output shows three values of the averaged load:

Average load for 1 minute
Average load for 5 minutes
Average load for 15 minutes

This allows us to see the fluctuation of the load over time and easily determine if there was a spike in CPU load that may have caused strain on the server.

The load value is variable and when load = 1, it means that one calculation is tied to that CPU and is being computed on it. For example, if you have an 8-core processor and the load is at 1, each of the 8 cores is being used by one process. It is worth noting that, hypothetically, if the load is 0, then the system is completely idle.

How High is High Enough?

There is no simple answer to this question as it depends on the situation. A load value of 50 may not necessarily slow down the system, while a load of 10 could render the system completely unusable. Generally, the lower the load value, the better. Ideally, the load should be around 1 for most of the time.

What really matters when debugging the system is not the load value, but rather why the value is what it is. There are generally three ways that can cause high CPU load:

CPU bounded load – load due to kernel overload
OOM load – load due to insufficient RAM
I/O load – load due to high I/O (disk usage)

TOP Command and its Output

The “top” command is used to visualize Linux processes. When you enter the command, you will see a lot of information, some of which are informative (uptime, number of users). Here is the beginning of the top command header:




top - 13:13:18 up 11 days, 12:55,  2 users,  load average: 0.00, 0.00, 0.00
Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.6 sy,  0.0 ni, 98.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   1946.2 total,     92.6 free,    131.9 used,   1721.6 buff/cache
MiB Swap:   3814.0 total,   3455.1 free,    358.9 used.   1615.9 avail Mem

In addition to the header, you will also see a list of various kernel processes. The %CPU column shows the percentage of CPU usage for each process. Here is a breakdown of the abbreviations used in the %CPU column:

us: user CPU time. This shows the percentage of kernel power used by a user of the system. This could be Apache, MySQL, or a shell script. If the user CPU value is high, it is a CPU bounded load problem.
sy: system CPU time. This shows the time used for system calculations, usually something that the kernel calculates. If the system CPU value is high, it is a CPU bounded load problem.
id: CPU idle time. This shows the time when the processor could be doing something but is not. The higher the CPU idle percentage, the better.
wa: CPU I/O wait. This shows the time when the processor wants to compute something but cannot because it is waiting for a read or write operation. If this value is high, it is often a problem with disks.

Three Main Factors for Diagnosing CPU Load

CPU bounded load

This type of load has become quite negligible nowadays due to the exponential development and supply of highly oversized server processors to the market and end consumers. However, we may still encounter it, for example, on shared hosting, cheap VPSs, or poorly written programs. We can easily identify a CPU bounded problem by the high values of user CPU time and system CPU time, as well as by observing a specific process in the top command that is consuming a high %CPU value.




  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24475 root      20   0  246m  66m 3504 S   90  0.7   0:00.91 update_conf.pl
 3911 mysql     20   0 1932m 374m 7664 S    1  3.7   8:27.17 mysqld

In this case, we can see that a shell script running as root named update_conf.pl is taking up 90% of the computational resources. There are two possible scenarios for a CPU bounded load:

One process is consuming one specific core at 99%
Multiple processes are competing for one core

In both cases, it is relatively easy to identify them in the top command and the solution is simple – add computational resources to the machine. Finally, note that if an application is written as multithreaded, it may create multiple threads on one CPU, thus also overwhelming the CPU, which can lead to increased load even though no specific processes are visible in top that are consuming the core.

out of memory load

Another reason for high load is the saturation of RAM, which typically results in the system starting to swap heavily. Since swaps are typically located on slow disks (compared to the speed of RAM), every process that moves from RAM to swap will start to crawl. This typically triggers a spiral where other processes that follow the slowed down processes in RAM also start to crawl, and the system starts to slowly collapse. It is important to be careful when diagnosing OOM loads, as if the system is writing to swap and swap is on disks, the diagnosis and debugging can easily be confused with I/O load. Therefore, before diagnosing an OOM load, it is advisable to check the RAM status in the top command.




MiB Mem :   8985.5 total,   4336.9 free,   2261.0 used,   2387.6 buff/cache
MiB Swap:   3814.0 total,   3802.7 free,     11.3 used.   5592.9 avail Mem

In the Mem and Swap sections, we see the following values:

Total: total capacity
Free: how much RAM is free and unused
Used: used RAM

Above all, the used value often leads to confusion because it does not tell us anything about how data outside of RAM is being handled. If we delve a little deeper into the theory of how Linux systems work with cache, we will see the reason why.

File Cache in Linux Systems

When a file is accessed, the Linux kernel temporarily stores it in its RAM and when it is not needed for a while, it deletes it again. If a sufficient amount of RAM is available, the kernel tries to store everything possible in RAM. Then, when I want to access the same file again, the access speed is multiplied and the entire system starts to perform with high speed for the user’s data work. As the system runs, other processes occasionally need RAM, so the kernel deletes some files that are not accessed much and frees up space in RAM for processes. In many ways, it is much better to let the kernel work with RAM and not bother it with creating RAM disks. Therefore, in the top command, we will not see as much in the used section as we will see with the free command with the -hm flag (human readable).




➜ ~ free -hm
              total        used        free      shared  buff/cache   available
Mem:          8.8Gi       2.2Gi       4.2Gi       756Mi       2.3Gi       5.5Gi
Swap:         3.7Gi        11Mi       3.7Gi

In the Mem section, we see the following values:

Total: total RAM capacity
Used: how much RAM is used
Free: how much RAM is free and unused
Shared: how many resources from RAM outside are shared or accessed by different programs, accessing the same content that is cached on a particular medium
Buff/cache: This is the combined value of two values
Buffer: how much is in the buffer for kernel needs
Cache: how many files are maintained in the page cache
Available: how much RAM is available if needed

If we suspect that our RAM is running out, we will notice it not only because our swap is running out, but also in the top command in the %MEM column for individual processes:




PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  401 mongodb   20   0 1520184 114892  28420 S   0.7  21.9  11:06.30 mongod
  409 elastic+  20   0 4836616   1.3g  22652 S   0.7  22.4   7:10.59 java
  491 graylog   20   0 4849688 911716  19772 S   0.7  24.9  10:44.12 java
    1 root      20   0  169400   6092   4680 S   0.0   0.1   0:02.04 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp

The solution to the OOM load problem is easy, just increase the RAM.

I/O Load

I/O load is the final type of load that often occurs when Linux systems slow down, and we can recognize it mainly by the increased value of CPU iowait. However, we must be careful not to confuse this phenomenon with OOM load. The issue of I/O load is very complex and has been a key pillar of low throughput diagnostic issues for us. This problem could be the subject of a book, but we will limit ourselves to the essentials. Our basic tools for working with diagnostic issues are:

iostat
iotop

iostat

Iostat is a relatively complicated and complex tool, so for the purposes of this article, we will limit ourselves to the essentials. Iostat shows us basic input and output statistics. Iostat is not part of most distributions and needs to be installed via the sysstat package. The output of the extended iostat looks like this:




➜ ~ iostat -xh
Linux 4.19.0-16-amd64 (ispconfig.cnnc.cz)   04/03/2021  _x86_64_    (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.2%    0.0%    1.4%    0.2%    0.0%   96.2%
     r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util Device
    0.72    1.93     17.9k     40.4k     0.00     2.51   0.4%  56.6%    0.55    2.11   0.00    24.9k    21.0k   0.41   0.1% sda
    0.79    0.85      9.1k     19.0k     0.00     1.88   0.2%  68.9%   13.46   11.64   0.02    11.5k    22.4k   3.40   0.6% sde
    0.06    7.07      1.3k    138.9k     0.00     1.64   0.4%  18.9%    0.56    0.64   0.00    22.5k    19.7k   0.27   0.2% sdc
    0.00    0.00      0.1k      0.0k     0.00     0.00   0.0%  10.5%    0.42    0.74   0.00    27.8k     3.2k   0.32   0.0% sdb
    0.00    0.00      0.1k      0.0k     0.00     0.00   0.0%  36.2%    0.39    0.78   0.00    25.0k     5.9k   0.32   0.0% sdd

In the header, we see the sum of avg-cpu similar to top. The content is much more interesting, from which we collect only what is essential to us. Before the actual output, the prefix r is for reading and the prefix w is for writing:

rrqm/s: the number of merged read requests per second that were in the queue on the device
wrqm/s: the number of merged write requests per second that were in the queue on the device
areq-sz: the average request size for writing before it is processed
await: the average time in ms for requests (read or write) before they are processed
aqu-sz (formerly avgqu-sz): the average time a request spent in the queue before being processed
svctm: Note – this point no longer makes sense, it will be removed in the future.
%util: how much the device is utilized in percentages
Device: which device is being used

For I/O CPU debugging purposes, we are only interested in three values:

Device
Util
%iowait

These will tell us whether it is an I/O wait and then which disk is busy and how busy it is. As we can see, iostat is not a sufficient tool on its own. We need something more.

iotop

Iotop is an analog to the top command for I/O operations and it will help us find the process that is overloading our machine.

In the output, we see a relatively clear description of the total reads, total writes, current reads, and current writes.




Total DISK READ:         0.00 B/s | Total DISK WRITE:        31.20 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:      54.60 K/s

Below that, we see individual processes and their read/write values.

 


TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  561 be/4 mongodb     0.00 B/s    7.80 K/s  0.00 %  0.08 % mongod --config /etc/mongod.conf [WTJourn.Flusher]
  215 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.06 % [jbd2/sda2-8]
  758 be/4 graylog     0.00 B/s    3.90 K/s  0.00 %  0.03 % java -Xms1g -Xmx1g -XX:NewRatio=1 -se~/server/server.conf -np [scheduled-25]
  568 be/4 mongodb     0.00 B/s   15.60 K/s  0.00 %  0.00 % mongod --config /etc/mongod.conf [ftdc]
  582 be/4 elastics    0.00 B/s    3.90 K/s  0.00 %  0.00 % java -Xms1g -Xmx1g -XX:+UseConcMarkSw~/elasticsearch.pid --quiet [VM Thread]

The solution is not entirely trivial and is up for discussion. It could be a disk or controller failure, low throughput on switches, an overloaded array that cannot keep up with some application, or many other phenomena.