Proc
The /proc file system doesn't contain 'real' files. Most of the 'files' within /proc have a file size of 0.
/proc simply acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system (such as memory, mounted disks, hardware configuration, etc.) and to change certain kernel parameters at runtime (sysctl).
It not only allows access to process data but also allows you to request the kernel status by reading files in the hierarchy.
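Runtime kernel parameters live under /proc/sys, and a dotted sysctl name maps onto a path there by replacing the dots with slashes (for example, vm.swappiness corresponds to /proc/sys/vm/swappiness). A minimal Python sketch of that mapping, with hypothetical helper names and ignoring edge cases such as interface names that themselves contain dots; writing usually requires root:

```python
from pathlib import Path

PROC_SYS = Path("/proc/sys")

def sysctl_path(name: str) -> Path:
    """Map a dotted sysctl name such as 'vm.swappiness' to its /proc/sys file."""
    return PROC_SYS.joinpath(*name.split("."))

def read_sysctl(name: str) -> str:
    # Values are plain text; strip the trailing newline.
    return sysctl_path(name).read_text().strip()

def write_sysctl(name: str, value: str) -> None:
    # Takes effect immediately, but does not persist across reboots.
    sysctl_path(name).write_text(f"{value}\n")
```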
Process-Specific Sub-directories
The /proc directory contains (among other things) one sub-directory for each process running on the system, which is named after the process ID (PID).
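Because the process sub-directories are named by numeric PID, the running processes can be enumerated by listing /proc and keeping the all-digit names. A minimal Python sketch (the function name is hypothetical; a process may exit while the listing runs, in which case its entry is simply skipped):

```python
import os

def list_pids(proc_root: str = "/proc") -> list[int]:
    """Return the PIDs that have a sub-directory under proc_root, sorted."""
    pids = []
    for name in os.listdir(proc_root):
        # Process directories are named by their numeric PID; other entries
        # (meminfo, sys, the 'self' symlink, ...) are skipped.
        if name.isdigit() and os.path.isdir(os.path.join(proc_root, name)):
            pids.append(int(name))
    return sorted(pids)
```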
Each process sub-directory has the following entries.
File | Content |
---|---|
clear_refs | Clears page referenced bits shown in smaps output. |
cmdline | Command line arguments. |
cpu | Current and last CPU on which it was executed (2.4) (smp). |
cwd | Link to the current working directory. |
environ | Values of environment variables. |
exe | Link to the executable of this process. |
fd | Directory containing all file descriptors. |
maps | Memory maps to executables and library files (2.4). |
mem | Memory held by this process. |
root | Link to the root directory of this process. |
stat | Process status. |
statm | Process memory status information. |
status | Process status in human readable form. |
wchan | If CONFIG_KALLSYMS is set, a pre-decoded wchan. |
pagemap | Page table. |
stack | Report full stack trace, enable via CONFIG_STACKTRACE. |
smaps | An extension based on maps, showing the memory consumption of each mapping. |
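Two of these entries, cmdline and environ, hold NUL-separated strings (environ entries have the form NAME=value and reflect the initial environment the program was started with), while cwd, exe and root are symbolic links. A minimal Python sketch for reading them, with hypothetical helper names; reading another user's environ usually fails with PermissionError:

```python
import os

def split_nul(data: bytes) -> list[str]:
    """Split a NUL-separated /proc buffer (cmdline, environ) into strings."""
    if data.endswith(b"\0"):
        data = data[:-1]   # drop the terminator after the last entry
    if not data:
        return []          # e.g. kernel threads have an empty cmdline
    return [part.decode(errors="replace") for part in data.split(b"\0")]

def read_cmdline(pid: str = "self") -> list[str]:
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return split_nul(f.read())

def read_environ(pid: str = "self") -> dict[str, str]:
    with open(f"/proc/{pid}/environ", "rb") as f:
        entries = split_nul(f.read())
    env = {}
    for entry in entries:
        name, _, value = entry.partition("=")
        env[name] = value
    return env

def read_cwd(pid: str = "self") -> str:
    # cwd is a symbolic link, so readlink() returns the directory path.
    return os.readlink(f"/proc/{pid}/cwd")
```

Empty arguments are preserved (only the final terminator is dropped), so a command such as `prog ""` still yields two entries.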
Kernel data
Similar to the process entries, the kernel data files give information about the running kernel. The files used to obtain this information are contained in /proc and are listed here. Not all of these will be present on your system: which files are present and which are missing depends on the kernel configuration and the loaded modules.
File | Content |
---|---|
apm | Advanced power management info. |
buddyinfo | Kernel memory allocator information (2.5). |
bus | Directory containing bus specific information. |
cmdline | Kernel command line. |
cpuinfo | Info about the CPU. |
devices | Available devices (block and character). |
dma | Used DMA channels. |
filesystems | Supported filesystems. |
driver | Various drivers grouped here, currently rtc (2.4). |
execdomains | Execdomains, related to security (2.4). |
fb | Frame Buffer devices (2.4). |
fs | File system parameters, currently nfs/exports (2.4). |
ide | Directory containing info about the IDE subsystem. |
interrupts | Interrupt usage. |
iomem | Memory map (2.4). |
ioports | I/O port usage. |
irq | Masks for irq to cpu affinity (2.4) (smp?). |
isapnp | ISA PnP (Plug&Play) Info (2.4). |
kcore | Kernel core image (can be ELF or A.OUT (deprecated in 2.4)). |
kmsg | Kernel messages. |
ksyms | Kernel symbol table. |
loadavg | Load average of last 1, 5 & 15 minutes. |
locks | Kernel locks. |
meminfo | Memory info. |
misc | Miscellaneous. |
modules | List of loaded modules. |
mounts | Mounted filesystems. |
net | Networking info. |
pagetypeinfo | Additional page allocator information (2.5). |
parports | Parallel Ports |
partitions | Table of partitions known to the system. |
pci | Deprecated info of PCI bus (new way → /proc/bus/pci/, decoupled by lspci) (2.4). |
rtc | Real time clock. |
scsi | SCSI info. |
slabinfo | Slab pool info. |
softirqs | softirq usage. |
stat | Overall statistics. |
swaps | Swap space utilization. |
sys | System info. |
sysvipc | Info of SysVIPC Resources (msg, sem, shm) (2.4). |
tty | Info of tty drivers. |
uptime | System uptime. |
version | Kernel version. |
video | bttv info of video resources (2.4). |
vmallocinfo | Show vmalloced areas. |
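As an example of reading one of these files, /proc/loadavg is a single line whose first three fields are the 1-, 5- and 15-minute load averages (proc(5) documents the remaining fields: runnable/total scheduling entities and the most recently created PID). A minimal Python sketch with hypothetical helper names that ignores those extra fields:

```python
def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

def read_loadavg(path: str = "/proc/loadavg") -> tuple[float, float, float]:
    with open(path) as f:
        return parse_loadavg(f.read())
```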
Per Process Parameters
oom-killer score
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj adjust the oom-killer score.
These files can be used to adjust the badness heuristic used to select which process gets killed in out-of-memory conditions.
The badness heuristic assigns each candidate task a value ranging from 0 (never kill) to 1000 (always kill) to determine which process is targeted. The units are roughly a proportion along that range of the allowed memory the process may allocate from, based on an estimate of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000; if it is using half of its allowed memory, its score will be 500.
There is an additional factor included in the badness score: root processes are given 3% extra memory over other tasks.
The amount of “allowed” memory depends on the context in which the oom killer was called. If it is due to the memory assigned to the allocating task's cpuset being exhausted, the allowed memory represents the set of mems assigned to that cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed memory represents the set of mempolicy nodes. If it is due to a memory limit (or swap limit) being reached, the allowed memory is that configured limit. Finally, if it is due to the entire system being out of memory, the allowed memory represents all allocatable resources.
The value of /proc/<pid>/oom_score_adj is added to the badness score before it is used to determine which task to kill. Acceptable values range from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to polarize the preference for oom killing either by always preferring a certain task or completely disabling it. The lowest possible value, -1000, is equivalent to disabling oom killing entirely for that task since it will always report a badness score of 0.
Consequently, it is very simple for userspace to define the amount of memory to consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for example, is roughly equivalent to allowing the remainder of tasks sharing the same system, cpuset, mempolicy, or memory controller resources to use at least 50% more memory. A value of -500, on the other hand, would be roughly equivalent to discounting 50% of the task's allowed memory from being considered as scoring against the task.
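The arithmetic described above can be sketched as a simplified Python model. It is not the kernel's oom_badness() code: it assumes memory use and the allowed amount are already known in the same units, treats root as a boolean flag, and applies the steps from this section (proportion scaled to 1000, 3% root bonus, oom_score_adj added, result capped at 1000). One detail is an assumption taken from kernel sources of this era rather than from the text: a task that is not disabled never scores 0, so it stays a candidate.

```python
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000

def badness(used: int, allowed: int, is_root: bool = False,
            oom_score_adj: int = 0) -> int:
    """Simplified model of the oom-killer badness score (0..1000)."""
    if not OOM_SCORE_ADJ_MIN <= oom_score_adj <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0                      # oom killing disabled for this task
    points = used * 1000 // allowed   # share of allowed memory, scaled to 1000
    if is_root:
        points -= 30                  # root bonus: 3% of the 0..1000 range
    points += oom_score_adj           # userspace bias
    # Assumption: an eligible task never scores 0, so it stays a candidate.
    return max(1, min(1000, points))
```

For instance, a task at half its allowed memory scores 500; with oom_score_adj set to +500 it already reaches the maximum of 1000.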
For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also be used to tune the badness score. Its acceptable values range from -16 (OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 (OOM_DISABLE) to disable oom killing entirely for that task. Its value is scaled linearly with /proc/<pid>/oom_score_adj.
Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the other with its scaled value.
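The linear scaling between the two files can be sketched in Python as follows. This assumes the integer arithmetic used in fs/proc/base.c by kernels of this era (C-style division truncating toward zero, with the maximum special-cased so that +15 and +1000 map onto each other); the exact rules of a given kernel should be checked against its source.

```python
OOM_DISABLE = -17
OOM_ADJUST_MAX = 15
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000

def _div_trunc(n: int, d: int) -> int:
    """Integer division truncating toward zero, like C (assumes d > 0)."""
    q = abs(n) // d
    return q if n >= 0 else -q

def oom_adj_to_score_adj(oom_adj: int) -> int:
    if not OOM_DISABLE <= oom_adj <= OOM_ADJUST_MAX:
        raise ValueError("oom_adj must be in [-17, 15]")
    if oom_adj == OOM_ADJUST_MAX:
        return OOM_SCORE_ADJ_MAX      # keep the maximum attainable
    # -17 (OOM_DISABLE) scales to exactly -1000 (OOM_SCORE_ADJ_MIN).
    return _div_trunc(oom_adj * OOM_SCORE_ADJ_MAX, -OOM_DISABLE)

def score_adj_to_oom_adj(score_adj: int) -> int:
    if not OOM_SCORE_ADJ_MIN <= score_adj <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    if score_adj == OOM_SCORE_ADJ_MAX:
        return OOM_ADJUST_MAX
    return _div_trunc(score_adj * -OOM_DISABLE, OOM_SCORE_ADJ_MAX)
```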
NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed; see Documentation/feature-removal-schedule.txt.
Caveat: When a parent task is selected, the oom killer will sacrifice any first-generation children with separate address spaces instead, if possible. This prevents servers and important system daemons from being killed and loses the minimal amount of work.
/proc/<pid>/oom_score displays the current oom-killer score. It can be used to check the score the oom-killer currently assigns to any given <pid>. Use it together with /proc/<pid>/oom_score_adj (or the deprecated /proc/<pid>/oom_adj) to tune which process should be killed in an out-of-memory situation.
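Reading the score and adjusting oom_score_adj from a program can be sketched like this in Python (the helper names are hypothetical). The range check mirrors the documented -1000..+1000 limits; lowering the value may be refused without privileges (CAP_SYS_RESOURCE), in which case the write raises PermissionError.

```python
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000

def read_oom_score(pid: str = "self") -> int:
    with open(f"/proc/{pid}/oom_score") as f:
        return int(f.read())

def set_oom_score_adj(value: int, pid: str = "self") -> None:
    """Write oom_score_adj for pid after checking the documented range."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(f"{value}\n")
```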
IO accounting fields
/proc/<pid>/io displays the IO accounting fields, which are IO statistics for each running process.
```
$ dd if=/dev/zero of=/tmp/test.dat &
[1] 3828
$ cat /proc/3828/io
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
```
where:
- rchar is the count of characters read: the number of bytes which this task has caused to be read from storage. This is simply the sum of bytes which this process passed to read() and pread(). It includes things like tty IO, and it is unaffected by whether or not actual physical disk IO was required (the read might have been satisfied from pagecache).
- wchar is the count of characters written: the number of bytes which this task has caused, or shall cause, to be written to disk. Similar caveats apply here as with rchar.
- syscr is the count of read syscalls. It attempts to count the number of read I/O operations, i.e. syscalls like read() and pread().
- syscw is the count of write syscalls. It attempts to count the number of write I/O operations, i.e. syscalls like write() and pwrite().
- read_bytes is the count of bytes read. It attempts to count the number of bytes which this process really did cause to be fetched from the storage layer. This is done at the submit_bio() level, so it is accurate for block-backed filesystems. TODO: Add status regarding NFS and CIFS at a later time.
- write_bytes is the count of bytes written. It attempts to count the number of bytes which this process caused to be sent to the storage layer. This is done at page-dirtying time.
- cancelled_write_bytes is the number of bytes which this process caused to not happen, by truncating pagecache. It exists because truncate is the big inaccuracy in write_bytes: if a process writes 1MB to a file and then deletes the file, it will in fact perform no writeout, but it will have been accounted as having caused 1MB of write. A task can therefore cause “negative” IO too: if it truncates some dirty pagecache, some IO for which another task has been accounted (in its write_bytes) will not happen. That amount _could_ just be subtracted from the truncating task's write_bytes, but there is information loss in doing that.
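A minimal Python sketch for reading these counters, assuming the "name: value" line format shown in the example above; the helper names are hypothetical, and reading another user's process usually requires privileges. Since the counters are cumulative, the difference between two snapshots gives the activity over that interval.

```python
def parse_io(text: str) -> dict[str, int]:
    """Parse /proc/<pid>/io 'name: value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            counters[name.strip()] = int(value)  # int() ignores the spaces
    return counters

def read_io(pid: str = "self") -> dict[str, int]:
    with open(f"/proc/{pid}/io") as f:
        return parse_io(f.read())

def io_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-counter difference between two snapshots of the same process."""
    return {k: after[k] - before[k] for k in after if k in before}

def net_write_bytes(c: dict[str, int]) -> int:
    # Rough estimate only: subtracting cancelled writes loses information.
    return c["write_bytes"] - c["cancelled_write_bytes"]
```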
WARNING: In its current implementation, this is a bit racy on 32-bit machines: if process A reads process B's /proc/<pid>/io while process B is updating one of those 64-bit counters, process A could see an intermediate result.
More information about this can be found within the taskstats documentation in Documentation/accounting.