Linux performance optimization - memory problem troubleshooting
Related commands
The fields shared by cachestat and cachetop are explained in their man pages as follows:
- TIME        Timestamp.
- HITS        Number of page cache hits.
- MISSES      Number of page cache misses.
- DIRTIES     Number of dirty pages added to the page cache.
- READ_HIT%   Read hit percent of page cache usage.
- WRITE_HIT%  Write hit percent of page cache usage.
- BUFFERS_MB  Buffers size taken from /proc/meminfo.
- CACHED_MB   Cached amount of data in current page cache taken from /proc/meminfo.
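BUFFERS_MB and CACHED_MB come straight from /proc/meminfo. As a minimal illustration (a sketch of my own, not part of bcc), this C snippet reads the same two fields:
- #include <stdio.h>
-
- /* Minimal sketch: read Buffers and Cached (in kB) from /proc/meminfo,
-    the same source cachestat/cachetop use for BUFFERS_MB / CACHED_MB. */
- int main(void) {
-     FILE *fp = fopen("/proc/meminfo", "r");
-     if (!fp) { perror("fopen"); return 1; }
-
-     char line[256];
-     long buffers_kb = 0, cached_kb = 0;
-     while (fgets(line, sizeof(line), fp)) {
-         sscanf(line, "Buffers: %ld kB", &buffers_kb);
-         sscanf(line, "Cached: %ld kB", &cached_kb);
-     }
-     fclose(fp);
-
-     printf("Buffers: %ld MB, Cached: %ld MB\n",
-            buffers_kb / 1024, cached_kb / 1024);
-     return 0;
- }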
Install both tools on Ubuntu (binary packages)
- sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD
- echo "deb https://repo.iovisor.org/apt/$(lsb_release -cs) $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/iovisor.list
- sudo apt-get update
- sudo apt-get install bcc-tools libbcc-examples linux-headers-$(uname -r)
Install bcc on CentOS
- # install ELRepo
- rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
- rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
-
- # install a new kernel
- yum remove -y kernel-headers kernel-tools kernel-tools-libs
- yum --enablerepo="elrepo-kernel" install -y kernel-ml kernel-ml-devel kernel-ml-headers kernel-ml-tools kernel-ml-tools-libs kernel-ml-tools-libs-devel
-
- # update GRUB, then reboot
- grub2-mkconfig -o /boot/grub2/grub.cfg
- grub2-set-default 0
- reboot
-
- # confirm the kernel has been upgraded to 4.20.0-1.el7.elrepo.x86_64 after restarting
- uname -r
-
- # install bcc-tools
- yum install -y bcc-tools
-
- # add the bcc tools to PATH
- export PATH=$PATH:/usr/share/bcc/tools
-
- # verify that the installation succeeded
- cachestat
Install pcstat from a prebuilt binary
- if [ $(uname -m) == "x86_64" ]; then
-     curl -L -o pcstat https://github.com/tobert/pcstat/raw/2014-05-02-01/pcstat.x86_64
- else
-     curl -L -o pcstat https://github.com/tobert/pcstat/raw/2014-05-02-01/pcstat.x86_32
- fi
- chmod 755 pcstat
Results of executing pcstat
- pcstat /bin/cat hehe.log
- |----------+-----------+--------+--------+---------|
- | Name     | Size      | Pages  | Cached | Percent |
- |----------+-----------+--------+--------+---------|
- | /bin/cat | 35064     | 9      | 0      | 000.000 |
- | hehe.log | 25        | 1      | 0      | 000.000 |
- |----------+-----------+--------+--------+---------|
-
- cat hehe.log
- aaaaaaa
- bbbbbbbbbb
- ccccc
-
- # run pcstat a second time: after cat has run, both files are cached
- pcstat /bin/cat hehe.log
- |----------+-----------+--------+--------+---------|
- | Name     | Size      | Pages  | Cached | Percent |
- |----------+-----------+--------+--------+---------|
- | /bin/cat | 35064     | 9      | 9      | 100.000 |
- | hehe.log | 25        | 1      | 1      | 100.000 |
- |----------+-----------+--------+--------+---------|
/bin/cat is 35064 bytes and a page is 4 KB, so 35064 / 4096 ≈ 8.56, which rounds up to 9 pages.
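That rounding-up is plain ceiling division by the page size; a minimal sketch of the same arithmetic (sysconf reports the actual page size at runtime):
- #include <stdio.h>
- #include <unistd.h>
-
- /* Pages occupied by a file: ceiling division of its size by the page size. */
- int main(void) {
-     long page_size = sysconf(_SC_PAGESIZE);          /* typically 4096 */
-     long size = 35064;                               /* size of /bin/cat above */
-     long pages = (size + page_size - 1) / page_size; /* round up */
-     printf("%ld bytes -> %ld pages\n", size, pages); /* 35064 bytes -> 9 pages */
-     return 0;
- }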
Test for cache hits
Write a file with dd and read the file repeatedly
- dd if=/dev/sda1 of=file bs=1M count=512
- echo 3 > /proc/sys/vm/drop_caches
-
- # this time the cache is empty
- pcstat file
- |----------+-----------+--------+--------+---------|
- | Name     | Size      | Pages  | Cached | Percent |
- |----------+-----------+--------+--------+---------|
- | file     | 536870912 | 131072 | 0      | 000.000 |
- |----------+-----------+--------+--------+---------|
-
Test reading the data
- dd if=file of=/dev/null bs=1M
- 512+0 records in
- 512+0 records out
- 536870912 bytes (537 MB, 512 MiB) copied, 5.04981 s, 106 MB/s
-
- cachetop
- PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
- 3928 root python 5 0 0 100.0% 0.0%
- 3972 root python 5 0 0 100.0% 0.0%
- 4066 root dd 86868 85505 0 50.4% 49.6%
-
-
- # second read
- dd if=file of=/dev/null bs=1M
- 512+0 records in
- 512+0 records out
- 536870912 bytes (537 MB, 512 MiB) copied, 0.182855 s, 2.9 GB/s
-
- cachetop
- PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
- 4079 root bash 197 0 0 100.0% 0.0%
- 4079 root dd 131605 0 0 100.0% 0.0%
The second read is far faster (2.9 GB/s versus 106 MB/s) because it is served from the page cache. pcstat confirms that the file is now fully cached:
- pcstat file
- |----------+-----------+--------+--------+---------|
- | Name     | Size      | Pages  | Cached | Percent |
- |----------+-----------+--------+--------+---------|
- | file     | 536870912 | 131072 | 131072 | 100.000 |
- |----------+-----------+--------+--------+---------|
Test direct I/O
Read the file with dd, adding the direct flag
- dd if=file of=/dev/null bs=1M iflag=direct
- 512+0 records in
- 512+0 records out
- 536870912 bytes (537 MB, 512 MiB) copied, 4.91659 s, 109 MB/s
Observe the run with cachetop:
- cachetop 3
- 14:14:13 Buffers MB: 9 / Cached MB: 614 / Sort: HITS / Order: ascending
- PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
- 4161 root python 1 0 0 100.0% 0.0%
- 4162 root dd 518 0 0 100.0% 0.0%
cachetop here samples every 3 seconds, and dd shows 518 HITS per interval: 518 * 4 / 1024 / 3 ≈ 0.67 MB/s, nowhere near the 109 MB/s that dd itself reported. The page cache is clearly being bypassed. Running dd under strace confirms that it opens the file with the O_DIRECT flag:
- openat(AT_FDCWD, "file", O_RDONLY|O_DIRECT) = 3
- dup2(3, 0) = 0
- close(3) = 0
- lseek(0, 0, SEEK_CUR) = 0
- openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dstat also shows high iowait while dd is reading.
Remove the direct I/O option of dd and execute it again
- echo 3 > /proc/sys/vm/drop_caches
- dd if=file of=/dev/null bs=1M
- 512+0 records in
- 512+0 records out
- 536870912 bytes (537 MB, 512 MiB) copied, 4.91158 s, 109 MB/s
-
-
- cachetop
- PID UID CMD HITS MISSES DIRTIES READ_HIT% WRITE_HIT%
- 4397 root python 2 0 0 100.0% 0.0%
- 4398 root dd 34198 33027 0 50.9% 49.1%
This time dd shows 34198 HITS per 3-second interval: 34198 * 4 / 1024 / 3 ≈ 44.5 MB/s read through the page cache (same conversion as before, sketched below), so the reads now behave normally.
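Both conversions use the same formula: each HIT is one 4 KB page, counted over cachetop's sampling interval. A small sketch, assuming the 3-second interval used in these runs:
- #include <stdio.h>
-
- /* Each cachetop HIT is one 4 KB page, counted over the sampling
-    interval (3 seconds in the runs above). */
- static double hits_to_mb_per_sec(long hits, double interval_sec) {
-     return hits * 4.0 / 1024.0 / interval_sec;
- }
-
- int main(void) {
-     printf("O_DIRECT run: %.2f MB/s\n", hits_to_mb_per_sec(518, 3.0));   /* ~0.67 */
-     printf("buffered run: %.2f MB/s\n", hits_to_mb_per_sec(34198, 3.0)); /* ~44.5 */
-     return 0;
- }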
A note on the O_DIRECT flag (from the open(2) man page)
- O_DIRECT (since Linux 2.4.10)
- Try to minimize cache effects of the I/O to and from this
- file. In general this will degrade performance, but it is
- useful in special situations, such as when applications do
- their own caching. File I/O is done directly to/from user-
- space buffers. The O_DIRECT flag on its own makes an effort
- to transfer data synchronously, but does not give the
- guarantees of the O_SYNC flag that data and necessary metadata
- are transferred. To guarantee synchronous I/O, O_SYNC must be
- used in addition to O_DIRECT. See NOTES below for further
- discussion.
Direct I/O is typically used when the upper-layer application maintains its own cache, making operating-system-level caching redundant.
Reading and writing the disk directly is common in storage systems such as databases and file systems, which bypass the kernel's page cache on reads and writes (a sketch follows).
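As a minimal sketch of how an application issues direct I/O itself, assuming the file produced by the dd tests above and a 4096-byte alignment (the actual alignment requirement depends on the device and filesystem):
- #define _GNU_SOURCE            /* exposes O_DIRECT */
- #include <fcntl.h>
- #include <stdio.h>
- #include <stdlib.h>
- #include <unistd.h>
-
- /* Minimal sketch: read one 4 KB block while bypassing the page cache.
-    O_DIRECT requires the buffer, offset and length to be aligned;
-    4096 is assumed here, but the real requirement is device-dependent. */
- int main(void) {
-     int fd = open("file", O_RDONLY | O_DIRECT);
-     if (fd < 0) { perror("open"); return 1; }
-
-     void *buf;
-     if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
-
-     ssize_t n = read(fd, buf, 4096);
-     if (n < 0) perror("read");  /* EINVAL usually means bad alignment */
-     else printf("read %zd bytes directly from disk\n", n);
-
-     free(buf);
-     close(fd);
-     return 0;
- }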
Memory leak detection
When the system allocates memory to a process, the user-space address range contains several different segments, such as the read-only segment, the data segment, the heap, the stack, and the memory-mapped segment. These segments are the basic ways an application uses memory.
Local variables defined in a program, such as int a or char data[64], live on the stack. Stack memory is allocated and managed automatically by the system: once execution leaves the variable's scope, the stack memory is reclaimed, so it cannot leak.
Heap memory, by contrast, is allocated and managed by the application itself and is not released automatically unless the program exits; the program must explicitly call free(). If heap memory is not released correctly, it leaks.
Leak behavior of each segment
1. Read-only segment: program code and constants. Because it is read-only, no new memory is ever allocated here, so no leaks occur.
2. Data segment: global and static variables, whose sizes are fixed when they are defined, so no leaks occur here either.
3. Memory-mapped segment: dynamic libraries and shared memory. Shared memory is dynamically allocated and managed by the program; forgetting to reclaim it leaks memory just like the heap does (see the sketch after this list).
Although a leaking process can eventually be killed by the OOM mechanism, a chain of trouble usually starts well before OOM fires: other processes that need memory may fail to allocate it, and the shortage then triggers the system's cache reclaim and swap mechanisms, which in turn cause I/O performance problems.
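For point 3, here is a minimal sketch of the shared-memory variant using System V shared memory (the key and size are arbitrary): detaching the segment is not enough; it leaks system-wide until it is marked for removal.
- #include <stdio.h>
- #include <sys/ipc.h>
- #include <sys/shm.h>
-
- /* A System V shared-memory segment outlives the process unless it is
-    explicitly removed; forgetting shmctl(IPC_RMID) leaks it system-wide,
-    analogous to forgetting free() on the heap. */
- int main(void) {
-     int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
-     if (id < 0) { perror("shmget"); return 1; }
-
-     void *p = shmat(id, NULL, 0);       /* map the segment */
-     if (p == (void *) -1) { perror("shmat"); return 1; }
-
-     shmdt(p);                           /* detaching is NOT enough ... */
-     shmctl(id, IPC_RMID, NULL);         /* ... without this, it leaks  */
-     return 0;
- }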
A problematic program
- #include <stdio.h>
- #include <stdlib.h>
- #include <pthread.h>
- #include <unistd.h>
-
- long long *fibonacci(long long *n0, long long *n1) {
-     long long *v = (long long *) calloc(1024, sizeof(long long));
-     *v = *n0 + *n1;
-     return v;
- }
-
- void *child(void *arg) {
-     long long n0 = 0;
-     long long n1 = 1;
-     long long *v = NULL;
-     int n = 2;
-     for (n = 2; n > 0; n++) {   /* runs until n overflows */
-         v = fibonacci(&n0, &n1);
-         n0 = n1;
-         n1 = *v;
-         printf("%dth => %lld\n", n, *v);
-         sleep(1);
-         /* the bug: v is never freed */
-         //free(v);
-     }
-     return NULL;
- }
-
- int main(void) {
-     pthread_t tid;
-     pthread_create(&tid, NULL, child, NULL);
-     pthread_join(tid, NULL);
-     printf("main thread exit\n");
-     return 0;
- }
-
- // output of the run:
- 2th => 1
- 3th => 2
- 4th => 3
- 5th => 5
- 6th => 8
- 7th => 13
- 8th => 21
- 9th => 34
- 10th => 55
- 11th => 89
- 12th => 144
- 13th => 233
- 14th => 377
- 15th => 610
- 16th => 987
- 17th => 1597
- 18th => 2584
- 19th => 4181
- 20th => 6765
- 21th => 10946
- 22th => 17711
- 23th => 28657
- 24th => 46368
- 25th => 75025
- 26th => 121393
- 27th => 196418
- 28th => 317811
- 29th => 514229
- 30th => 832040
- 31th => 1346269
- 32th => 2178309
- 33th => 3524578
- 34th => 5702887
- 35th => 9227465
- 36th => 14930352
Compile and run this code (link with -lpthread, e.g. gcc -o hehe hehe.c -lpthread, matching the binary name hehe in the memleak stacks below), then observe it with vmstat and memleak:
- vmstat 1
- procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
-  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
-  0  0      0 3049700  96684 806428    0    0    25     5   53   99  0  0 100  0  0
-  0  0      0 3049692  96684 806464    0    0     0     0  151  238  0  0 100  0  0
-  0  0      0 3049692  96692 806456    0    0     0    36  148  232  0  0 100  0  0
-  0  0      0 3049436  96692 806464    0    0     0     0  156  243  0  0 100  0  0
-  0  0      0 3049436  96692 806464    0    0     0     0  177  262  1  0 100  0  0
-  0  0      0 3049468  96692 806464    0    0     0     0  126  222  0  0 100  0  0
- ...
-  0  0      0 3049376  96700 806456    0    0     0    16  146  243  0  0 100  1  0
- ...
-  1  0      0 3049392  96700 806480    0    0     0     0  160  246  0  0 100  0  0
- ...
-  0  0      0 3049392  96700 806480    0    0     0     0  163  257  0  0 100  0  0
-  0  0      0 3049040  96700 806480    0    0     0     0  175  287  0  1 100  0  0
-  0  0      0 3049144  96700 806480    0    0     0     0  138  234  1  0 100  0  0
- ...
-  0  0      0 3049176  96700 806480    0    0     0     0  169  267  1  0 100  0  0
Note how free creeps steadily downward while buff and cache stay essentially flat, which hints at a slow leak. memleak can attribute the outstanding allocations to their call stacks:
- memleak -p 7438 -a
- Attaching to pid 7438, Ctrl+C to quit.
- [13:24:11] Top 10 stacks with outstanding allocations:
- addr = 7f1ec401d010 size = 8192
- addr = 7f1ec4021030 size = 8192
- addr = 7f1ec401b000 size = 8192
- addr = 7f1ec401f020 size = 8192
- 32768 bytes in 4 allocations from stack
- fibonacci+0x1f [hehe]
- child+0x56 [hehe]
- start_thread+0xdb [libpthread-2.27.so]
- [13:24:16] Top 10 stacks with outstanding allocations:
- addr = 7f1ec401d010 size = 8192
- addr = 7f1ec402b080 size = 8192
- addr = 7f1ec4027060 size = 8192
- addr = 7f1ec4029070 size = 8192
- addr = 7f1ec4021030 size = 8192
- addr = 7f1ec401b000 size = 8192
- addr = 7f1ec4023040 size = 8192
- addr = 7f1ec4025050 size = 8192
- addr = 7f1ec401f020 size = 8192
- 73728 bytes in 9 allocations from stack
- fibonacci+0x1f [hehe]
- child+0x56 [hehe]
- start_thread+0xdb [libpthread-2.27.so]
- [13:24:21] Top 10 stacks with outstanding allocations:
- addr = 7f1ec401d010 size = 8192
- addr = 7f1ec402b080 size = 8192
- addr = 7f1ec4027060 size = 8192
- addr = 7f1ec4029070 size = 8192
- addr = 7f1ec402d090 size = 8192
- addr = 7f1ec40350d0 size = 8192
- addr = 7f1ec4021030 size = 8192
- addr = 7f1ec401b000 size = 8192
- addr = 7f1ec402f0a0 size = 8192
- addr = 7f1ec40310b0 size = 8192
- addr = 7f1ec4023040 size = 8192
- addr = 7f1ec40330c0 size = 8192
- addr = 7f1ec4025050 size = 8192
- addr = 7f1ec401f020 size = 8192
- 114688 bytes in 14 allocations from stack
- fibonacci+0x1f [hehe]
- child+0x56 [hehe]
- start_thread+0xdb [libpthread-2.27.so]
- [13:24:26] Top 10 stacks with outstanding allocations:
- addr = 7f1ec401d010 size = 8192
- addr = 7f1ec402b080 size = 8192
- addr = 7f1ec4027060 size = 8192
- addr = 7f1ec403b100 size = 8192
- addr = 7f1ec40390f0 size = 8192
- addr = 7f1ec4029070 size = 8192
- addr = 7f1ec402d090 size = 8192
- addr = 7f1ec403f120 size = 8192
- addr = 7f1ec40350d0 size = 8192
- addr = 7f1ec403d110 size = 8192
- addr = 7f1ec4021030 size = 8192
- addr = 7f1ec401b000 size = 8192
- addr = 7f1ec402f0a0 size = 8192
- addr = 7f1ec40310b0 size = 8192
- addr = 7f1ec40370e0 size = 8192
- addr = 7f1ec4023040 size = 8192
- addr = 7f1ec40330c0 size = 8192
- addr = 7f1ec4025050 size = 8192
- addr = 7f1ec401f020 size = 8192
- 155648 bytes in 19 allocations from stack
- fibonacci+0x1f [hehe]
- child+0x56 [hehe]
- start_thread+0xdb [libpthread-2.27.so]
Real cases are much more complicated than this example. For instance, malloc and free do not always appear in neat pairs; memory must be released on every error-handling path as well as the success path. In multi-threaded programs, memory allocated in one thread may be accessed and freed in another. And to complicate matters further, memory implicitly allocated inside third-party library functions may need to be explicitly freed by the application.
To avoid memory leaks, the most important thing is to develop good coding habits: right after writing an allocation, write the matching release code before developing any other logic, as the sketch below shows.
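Applied to the example above, that habit means pairing the calloc() inside fibonacci() with a release as soon as the value has been consumed, i.e. re-enabling the commented-out free(v):
- /* child()'s loop with the leak fixed: each allocation is released
-    right after its value has been consumed. */
- for (n = 2; n > 0; n++) {
-     v = fibonacci(&n0, &n1);
-     n0 = n1;
-     n1 = *v;
-     printf("%dth => %lld\n", n, *v);
-     free(v);        /* pairs the calloc() inside fibonacci() */
-     sleep(1);
- }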
References
[CentOS 7] bcc-tools installation