Linux performance optimization - memory problem troubleshooting

Contents

  - Related commands
  - Test for cache hits
  - Test direct I/O
  - References

Related commands

The fields shared by cachestat and cachetop are described in their man pages as follows:

  - TIME: timestamp.
  - HITS: number of page cache hits.
  - MISSES: number of page cache misses.
  - DIRTIES: number of dirty pages added to the page cache.
  - READ_HIT%: read hit percent of page cache usage.
  - WRITE_HIT%: write hit percent of page cache usage.
  - BUFFERS_MB: buffers size taken from /proc/meminfo.
  - CACHED_MB: cached amount of data in the current page cache, taken from /proc/meminfo.

Install both tools on Ubuntu (binary packages)

    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD
    echo "deb https://repo.iovisor.org/apt/$(lsb_release -cs) $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/iovisor.list
    sudo apt-get update
    sudo apt-get install bcc-tools libbcc-examples linux-headers-$(uname -r)

Install bcc on CentOS

    # install ELRepo
    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
    # install a new kernel
    yum remove -y kernel-headers kernel-tools kernel-tools-libs
    yum --enablerepo="elrepo-kernel" install -y kernel-ml kernel-ml-devel kernel-ml-headers kernel-ml-tools kernel-ml-tools-libs kernel-ml-tools-libs-devel
    # update grub, then reboot
    grub2-mkconfig -o /boot/grub2/grub.cfg
    grub2-set-default 0
    reboot
    # confirm the kernel was upgraded to 4.20.0-1.el7.elrepo.x86_64 after restarting
    uname -r
    # install bcc-tools
    yum install -y bcc-tools
    # add the tools to PATH
    export PATH=$PATH:/usr/share/bcc/tools
    # verify the installation
    cachestat

Install pcstat from a prebuilt binary

    if [ $(uname -m) == "x86_64" ]; then
        curl -L -o pcstat https://github.com/tobert/pcstat/raw/2014-05-02-01/pcstat.x86_64
    else
        curl -L -o pcstat https://github.com/tobert/pcstat/raw/2014-05-02-01/pcstat.x86_32
    fi
    chmod 755 pcstat

Results of executing pcstat

    pcstat /bin/cat hehe.log
    |----------+----------------+------------+-----------+---------|
    | Name     | Size           | Pages      | Cached    | Percent |
    |----------+----------------+------------+-----------+---------|
    | /bin/cat | 35064          | 9          | 0         | 000.000 |
    | hehe.log | 25             | 1          | 0         | 000.000 |
    |----------+----------------+------------+-----------+---------|
    cat hehe.log
    aaaaaaa
    bbbbbbbbbb
    ccccc
    # run pcstat a second time: the data is now cached
    pcstat /bin/cat hehe.log
    |----------+----------------+------------+-----------+---------|
    | Name     | Size           | Pages      | Cached    | Percent |
    |----------+----------------+------------+-----------+---------|
    | /bin/cat | 35064          | 9          | 9         | 100.000 |
    | hehe.log | 25             | 1          | 1         | 100.000 |
    |----------+----------------+------------+-----------+---------|

/bin/cat is 35064 bytes, and a page is 4 KiB, so 35064 / (4 * 1024.0) ≈ 8.56, which rounds up to 9 occupied pages.

 

 

Test for cache hits

Write a file with dd and read the file repeatedly

    dd if=/dev/sda1 of=file bs=1M count=512
    echo 3 > /proc/sys/vm/drop_caches
    # the cache is now empty
    pcstat file
    |----------+----------------+------------+-----------+---------|
    | Name     | Size           | Pages      | Cached    | Percent |
    |----------+----------------+------------+-----------+---------|
    | file     | 536870912      | 131072     | 0         | 000.000 |
    |----------+----------------+------------+-----------+---------|

Read the data and observe the cache

    dd if=file of=/dev/null bs=1M
    512+0 records in
    512+0 records out
    536870912 bytes (537 MB, 512 MiB) copied, 5.04981 s, 106 MB/s
    cachetop
    PID    UID   CMD     HITS    MISSES  DIRTIES READ_HIT% WRITE_HIT%
    3928   root  python  5       0       0       100.0%    0.0%
    3972   root  python  5       0       0       100.0%    0.0%
    4066   root  dd      86868   85505   0       50.4%     49.6%
    # second read
    dd if=file of=/dev/null bs=1M
    512+0 records in
    512+0 records out
    536870912 bytes (537 MB, 512 MiB) copied, 0.182855 s, 2.9 GB/s
    cachetop
    PID    UID   CMD     HITS    MISSES  DIRTIES READ_HIT% WRITE_HIT%
    4079   root  bash    197     0       0       100.0%    0.0%
    4079   root  dd      131605  0       0       100.0%    0.0%

The second read is dramatically faster (0.18 s vs. 5.05 s, roughly a 27x speedup) because it is served entirely from the page cache. pcstat confirms the file is now fully cached:

    pcstat file
    |----------+----------------+------------+-----------+---------|
    | Name     | Size           | Pages      | Cached    | Percent |
    |----------+----------------+------------+-----------+---------|
    | file     | 536870912      | 131072     | 131072    | 100.000 |
    |----------+----------------+------------+-----------+---------|

 

 

Test direct I/O

Read a file with dd and add the direct flag

    dd if=file of=/dev/null bs=1M iflag=direct
    512+0 records in
    512+0 records out
    536870912 bytes (537 MB, 512 MiB) copied, 4.91659 s, 109 MB/s

Observe the run with cachetop

    cachetop 3
    14:14:13 Buffers MB: 9 / Cached MB: 614 / Sort: HITS / Order: ascending
    PID    UID   CMD     HITS    MISSES  DIRTIES READ_HIT% WRITE_HIT%
    4161   root  python  1       0       0       100.0%    0.0%
    4162   root  dd      518     0       0       100.0%    0.0%

cachetop is sampling every 3 seconds here, and dd shows only 518 HITS in that interval: 518 * 4 / 1024.0 / 3.0 ≈ 0.67 MB/s of page-cache traffic, far below the 109 MB/s that dd reported. The reads are bypassing the cache. Running dd under strace confirms that it opens the file with the O_DIRECT flag:

    openat(AT_FDCWD, "file", O_RDONLY|O_DIRECT) = 3
    dup2(3, 0)                              = 0
    close(3)                                = 0
    lseek(0, 0, SEEK_CUR)                   = 0
    openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

dstat also shows high iowait while dd is reading, consistent with every read going to disk.

 

Remove the direct I/O option of dd and execute it again

    echo 3 > /proc/sys/vm/drop_caches
    dd if=file of=/dev/null bs=1M
    512+0 records in
    512+0 records out
    536870912 bytes (537 MB, 512 MiB) copied, 4.91158 s, 109 MB/s
    cachetop
    PID    UID   CMD     HITS    MISSES  DIRTIES READ_HIT% WRITE_HIT%
    4397   root  python  2       0       0       100.0%    0.0%
    4398   root  dd      34198   33027   0       50.9%     49.1%

This time dd shows 34198 HITS in the 3-second sampling interval: 34198 * 4 / 1024.0 / 3.0 ≈ 44.5 MB/s of page-cache traffic, which is what a buffered read should look like.

A note on the O_DIRECT flag, from man 2 open:

    O_DIRECT (since Linux 2.4.10)
        Try to minimize cache effects of the I/O to and from this
        file. In general this will degrade performance, but it is
        useful in special situations, such as when applications do
        their own caching. File I/O is done directly to/from user-
        space buffers. The O_DIRECT flag on its own makes an effort
        to transfer data synchronously, but does not give the
        guarantees of the O_SYNC flag that data and necessary metadata
        are transferred. To guarantee synchronous I/O, O_SYNC must be
        used in addition to O_DIRECT. See NOTES below for further
        discussion.

Direct I/O is typically used when the upper-layer application maintains its own cache, making an operating-system-level cache redundant.

Reading and writing the disk directly, bypassing the kernel's file-system caching layer, is common in storage systems such as databases.

 

 

Memory leak detection

When the system allocates memory to a process, the user-space memory is divided into several segments: the read-only segment, the data segment, the heap, the stack, and the memory-mapping segment. These segments are the basic ways an application uses memory.

Local variables defined in a program, such as int a or char data[64], live on the stack. Stack memory is allocated and managed automatically by the system: once execution leaves the variable's scope, the memory is reclaimed, so the stack cannot leak.

Heap memory, by contrast, is allocated and managed by the application itself. Unless the process exits, the system never releases it automatically; the program must free it explicitly by calling free(). Failing to do so causes a memory leak.

How each segment can leak:
1. Read-only segment (program code and constants): read-only, never allocates new memory, so it cannot leak.
2. Data segment (global and static variables): sizes are fixed when the variables are defined, so it cannot leak.
3. Memory-mapping segment (dynamic libraries and shared memory): shared memory is dynamically allocated and managed by the program, so forgetting to release it leaks, just like the heap.

A leaking process can eventually be killed by the OOM mechanism, but a chain of problems may appear long before OOM fires: other processes may fail to allocate new memory, and once memory runs short the system's cache reclamation and swap mechanisms kick in, which in turn cause I/O performance problems.

A problematic program

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <unistd.h>

    long long *fibonacci(long long *n0, long long *n1) {
        long long *v = (long long *)calloc(1024, sizeof(long long));
        *v = *n0 + *n1;
        return v;
    }

    void *child(void *arg) {
        long long n0 = 0;
        long long n1 = 1;
        long long *v = NULL;
        int n;
        for (n = 2; n > 0; n++) {
            v = fibonacci(&n0, &n1);
            n0 = n1;
            n1 = *v;
            printf("%dth => %lld\n", n, *v);
            sleep(1);
            /* free() is never called */
            //free(v);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, child, NULL);
        pthread_join(tid, NULL);
        printf("main thread exit\n");
        return 0;
    }
    // program output
    2th => 1
    3th => 2
    4th => 3
    5th => 5
    6th => 8
    7th => 13
    8th => 21
    9th => 34
    10th => 55
    11th => 89
    12th => 144
    13th => 233
    14th => 377
    15th => 610
    16th => 987
    17th => 1597
    18th => 2584
    19th => 4181
    20th => 6765
    21th => 10946
    22th => 17711
    23th => 28657
    24th => 46368
    25th => 75025
    26th => 121393
    27th => 196418
    28th => 317811
    29th => 514229
    30th => 832040
    31th => 1346269
    32th => 2178309
    33th => 3524578
    34th => 5702887
    35th => 9227465
    36th => 14930352

Compile the program (add -lpthread), run it, and observe it with vmstat and memleak:

    vmstat 1
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 3049700  96684 806428    0    0    25     5   53   99  0  0 100  0  0
     0  0      0 3049692  96684 806464    0    0     0     0  151  238  0  0 100  0  0
     0  0      0 3049692  96692 806456    0    0     0    36  148  232  0  0 100  0  0
     0  0      0 3049436  96692 806464    0    0     0     0  156  243  0  0 100  0  0
     0  0      0 3049436  96692 806464    0    0     0     0  177  262  1  0 100  0  0
     0  0      0 3049468  96692 806464    0    0     0     0  126  222  0  0 100  0  0
    ......
     0  0      0 3049376  96700 806456    0    0     0    16  146  243  0  0 100  1  0
    ......
     1  0      0 3049392  96700 806480    0    0     0     0  160  246  0  0 100  0  0
    ......
     0  0      0 3049392  96700 806480    0    0     0     0  163  257  0  0 100  0  0
     0  0      0 3049040  96700 806480    0    0     0     0  175  287  0  1 100  0  0
     0  0      0 3049144  96700 806480    0    0     0     0  138  234  1  0 100  0  0
    ......
     0  0      0 3049176  96700 806480    0    0     0     0  169  267  1  0 100  0  0

Note how the free column slowly shrinks while buff and cache stay flat: memory is being consumed, but not by the page cache.
    memleak -p 7438 -a
    Attaching to pid 7438, Ctrl+C to quit.
    [13:24:11] Top 10 stacks with outstanding allocations:
    addr = 7f1ec401d010 size = 8192
    addr = 7f1ec4021030 size = 8192
    addr = 7f1ec401b000 size = 8192
    addr = 7f1ec401f020 size = 8192
    32768 bytes in 4 allocations from stack
        fibonacci+0x1f [hehe]
        child+0x56 [hehe]
        start_thread+0xdb [libpthread-2.27.so]
    [13:24:16] Top 10 stacks with outstanding allocations:
    addr = 7f1ec401d010 size = 8192
    addr = 7f1ec402b080 size = 8192
    addr = 7f1ec4027060 size = 8192
    addr = 7f1ec4029070 size = 8192
    addr = 7f1ec4021030 size = 8192
    addr = 7f1ec401b000 size = 8192
    addr = 7f1ec4023040 size = 8192
    addr = 7f1ec4025050 size = 8192
    addr = 7f1ec401f020 size = 8192
    73728 bytes in 9 allocations from stack
        fibonacci+0x1f [hehe]
        child+0x56 [hehe]
        start_thread+0xdb [libpthread-2.27.so]
    [13:24:21] Top 10 stacks with outstanding allocations:
    addr = 7f1ec401d010 size = 8192
    addr = 7f1ec402b080 size = 8192
    addr = 7f1ec4027060 size = 8192
    addr = 7f1ec4029070 size = 8192
    addr = 7f1ec402d090 size = 8192
    addr = 7f1ec40350d0 size = 8192
    addr = 7f1ec4021030 size = 8192
    addr = 7f1ec401b000 size = 8192
    addr = 7f1ec402f0a0 size = 8192
    addr = 7f1ec40310b0 size = 8192
    addr = 7f1ec4023040 size = 8192
    addr = 7f1ec40330c0 size = 8192
    addr = 7f1ec4025050 size = 8192
    addr = 7f1ec401f020 size = 8192
    114688 bytes in 14 allocations from stack
        fibonacci+0x1f [hehe]
        child+0x56 [hehe]
        start_thread+0xdb [libpthread-2.27.so]
    [13:24:26] Top 10 stacks with outstanding allocations:
    addr = 7f1ec401d010 size = 8192
    addr = 7f1ec402b080 size = 8192
    addr = 7f1ec4027060 size = 8192
    addr = 7f1ec403b100 size = 8192
    addr = 7f1ec40390f0 size = 8192
    addr = 7f1ec4029070 size = 8192
    addr = 7f1ec402d090 size = 8192
    addr = 7f1ec403f120 size = 8192
    addr = 7f1ec40350d0 size = 8192
    addr = 7f1ec403d110 size = 8192
    addr = 7f1ec4021030 size = 8192
    addr = 7f1ec401b000 size = 8192
    addr = 7f1ec402f0a0 size = 8192
    addr = 7f1ec40310b0 size = 8192
    addr = 7f1ec40370e0 size = 8192
    addr = 7f1ec4023040 size = 8192
    addr = 7f1ec40330c0 size = 8192
    addr = 7f1ec4025050 size = 8192
    addr = 7f1ec401f020 size = 8192
    155648 bytes in 19 allocations from stack
        fibonacci+0x1f [hehe]
        child+0x56 [hehe]
        start_thread+0xdb [libpthread-2.27.so]

Real programs are far messier than this example. malloc and free usually do not appear as neat pairs: memory must be released on every error-handling path as well as on the success path. In multi-threaded programs, memory allocated in one thread may be accessed and freed in another. To complicate matters further, memory allocated implicitly inside third-party library functions may need to be freed explicitly by the application. To avoid leaks, the most important thing is to develop good habits: immediately after writing an allocation, write the matching release, and only then develop the rest of the logic.

 

 

 

References

  - bcc (GitHub)
  - pcstat (GitHub)
  - cachetop source
  - [CentOS 7] bcc tools installation
  - bcc tools installation
  - man open
  - bcc /docs/INSTALL-CENTOS
