Kernel - Memory Policy

What is Linux Memory Policy?

In the Linux kernel, “memory policy” determines from which node the kernel will allocate memory in a NUMA (Non-Uniform Memory Access) system or in an emulated NUMA system.

Linux has supported platforms with Non-Uniform Memory Access architectures since 2.4.?.

The current memory policy support was added to Linux 2.6 around May 2004.

This document attempts to describe the concepts and APIs of the 2.6 memory policy support.

NOTE: Memory policies should not be confused with cpusets, which are an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes.

Memory policies are a programming interface that a NUMA-aware application can take advantage of.

When both cpusets and policies are applied to a task, the restrictions of the cpuset take priority.

See “Memory Policies and Cpusets” below for more details.


Memory Policy Concepts

Scope of Memory Policies

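The Linux kernel supports _scopes_ of memory policy, described here from most general to most specific:

  1. System default policy: hard coded into the kernel. It governs all page allocations that are not controlled by one of the more specific scopes below.

  2. Task/process policy: an optional policy that a task installs for itself. When none is installed, the task falls back to the system default policy.

  3. VMA policy: an optional policy that applies to a range of a task's virtual address space.

  4. Shared policy: a policy that applies to a memory object mapped shared into one or more tasks' address spaces.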


Components of Memory Policies

A Linux memory policy consists of a “mode”, optional mode flags, and an optional set of nodes. The mode determines the behavior of the policy, the optional mode flags determine the behavior of the mode, and the optional set of nodes can be viewed as the arguments to the policy behavior.
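As a rough illustration of how the three components travel together, the sketch below packs a mode and its optional mode flags into the single integer that the system calls take as their 'mode' argument, and splits that integer apart again. The names policy_desc, policy_pack_mode, policy_unpack_mode and EXAMPLE_MODE_FLAGS are hypothetical and exist only for this example; only the MPOL_* constants come from <linux/mempolicy.h>. A single unsigned long is assumed to be wide enough for the node set.

```c
#include <linux/mempolicy.h>   /* MPOL_* modes and MPOL_F_* mode flags */

/* Hypothetical user-space description of a policy: mode, flags, nodes. */
struct policy_desc {
    int mode;                /* e.g. MPOL_BIND or MPOL_INTERLEAVE */
    int flags;               /* optional MPOL_F_* mode flags, or 0 */
    unsigned long nodes;     /* optional node set, one bit per node id */
};

/* Mode flags handled by this sketch (an assumed subset). */
#define EXAMPLE_MODE_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

/* Combine mode and flags into the value passed as the 'mode' argument. */
static int policy_pack_mode(const struct policy_desc *p)
{
    return p->mode | (p->flags & EXAMPLE_MODE_FLAGS);
}

/* Split a combined value back into its mode and flag parts. */
static void policy_unpack_mode(int raw, struct policy_desc *p)
{
    p->flags = raw & EXAMPLE_MODE_FLAGS;
    p->mode = raw & ~EXAMPLE_MODE_FLAGS;
}
```

The node set stays separate from the packed value because the system calls take it through their own 'nmask' and 'maxnode' arguments.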

Internally, memory policies are implemented by a reference counted structure, struct mempolicy. Details of this structure will be discussed in context, below, as required to explain the behavior.

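Linux memory policy supports the following 4 behavioral modes:

  1. MPOL_DEFAULT: removes any non-default policy, so that allocation falls back to the next most general policy scope.

  2. MPOL_BIND: allocates only from the nodes in the policy's set of nodes.

  3. MPOL_PREFERRED: tries a single preferred node first and falls back to other nodes when that allocation fails.

  4. MPOL_INTERLEAVE: spreads allocations, page by page, across the nodes in the policy's set of nodes.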

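Linux memory policy supports the following optional mode flags:

  1. MPOL_F_STATIC_NODES: the set of nodes passed by the user is not remapped when the task's set of allowed nodes changes.

  2. MPOL_F_RELATIVE_NODES: the set of nodes passed by the user is interpreted relative to the task's set of allowed nodes.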


Memory Policy Reference Counting

To resolve use/free races, struct mempolicy contains an atomic reference count field. The internal interfaces mpol_get() and mpol_put() increment and decrement this reference count, respectively. mpol_put() will only free the structure back to the mempolicy kmem cache when the reference count goes to zero.

When a new memory policy is allocated, its reference count is initialized to '1', representing the reference held by the task that is installing the new policy. When a pointer to a memory policy structure is stored in another structure, another reference is added, as the task's reference will be dropped on completion of the policy installation.
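The get/put discipline described above can be sketched in user space with C11 atomics. This is not the kernel's code: example_policy and the example_policy_* functions are hypothetical stand-ins for struct mempolicy, mpol_get() and mpol_put(), and malloc()/free() stand in for the mempolicy kmem cache.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mempolicy; only the count matters here. */
struct example_policy {
    atomic_int refcnt;
    int mode;
};

/* Allocate a policy; the count starts at 1 for the installing task. */
static struct example_policy *example_policy_new(int mode)
{
    struct example_policy *pol = malloc(sizeof(*pol));

    if (!pol)
        return NULL;
    atomic_init(&pol->refcnt, 1);
    pol->mode = mode;
    return pol;
}

/* Analogue of mpol_get(): take one more reference. */
static void example_policy_get(struct example_policy *pol)
{
    atomic_fetch_add(&pol->refcnt, 1);
}

/*
 * Analogue of mpol_put(): drop one reference and free the structure when
 * the last reference goes away. Returns 1 if freed, 0 otherwise.
 */
static int example_policy_put(struct example_policy *pol)
{
    if (atomic_fetch_sub(&pol->refcnt, 1) == 1) {
        free(pol);
        return 1;
    }
    return 0;
}
```

In this sketch, storing the pointer in another structure corresponds to a call to example_policy_get(), and the installing task's final example_policy_put() leaves the stored reference as the only one keeping the policy alive.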

During run-time “usage” of the policy, we attempt to minimize atomic operations on the reference count, as this can lead to cache lines bouncing between cpus and NUMA nodes. “Usage” here means one of the following:

  1. Querying of the policy, either by the task itself [using the get_mempolicy() API discussed below] or by another task using the /proc/<pid>/numa_maps interface.

  2. Examination of the policy to determine the policy mode and associated node or node lists, if any, for page allocation. This is considered a “hot path”. Note that for MPOL_BIND, the “usage” extends across the entire allocation process, which may sleep during page reclamation, because the BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as follows:

  1. We never need to get/free the system default policy as this is never changed nor freed, once the system is up and running.

  2. For querying the policy, we do not need to take an extra reference on the target task's task policy nor vma policies because we always acquire the task's mm's mmap_sem for read during the query. The set_mempolicy() and mbind() APIs [see below] always acquire the mmap_sem for write when installing or replacing task or vma policies. Thus, there is no possibility of a task or thread freeing a policy while another task or thread is querying it.

  3. Page allocation usage of task or vma policy occurs in the fault path where we hold the mmap_sem for read. Again, because replacing the task or vma policy requires that the mmap_sem be held for write, the policy can't be freed out from under us while we're using it for page allocation.

  4. Shared policies require special consideration. One task can replace a shared memory policy while another task, with a distinct mmap_sem, is querying or allocating a page based on the policy. To resolve this potential race, the shared policy infrastructure adds an extra reference to the shared policy during lookup while holding a spin lock on the shared policy management structure. This requires that we drop this extra reference when we're finished “using” the policy. We must drop the extra reference on shared policies in the same query/allocation paths used for non-shared policies. For this reason, shared policies are marked as such, and the extra reference is dropped “conditionally”, i.e. only for shared policies.

Because of this extra reference counting, and because we must look up shared policies in a tree structure under spinlock, shared policies are more expensive to use in the page allocation path. This is especially true for shared policies on shared memory regions shared by tasks running on different NUMA nodes. This extra overhead can be avoided by always falling back to task or system default policy for shared memory regions, or by prefaulting the entire shared memory region into memory and locking it down. However, this might not be appropriate for all applications.
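The lookup-with-extra-reference and the conditional drop described for shared policies can be sketched as follows. Everything here is hypothetical: demo_policy, demo_store, demo_shared_lookup and demo_cond_put are illustrative names, a single pointer replaces the kernel's tree of shared policies, a C11 atomic_flag stands in for the spin lock, and freeing on the last reference is left out to keep the sketch short.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical policy object; 'shared' marks policies whose lookup
 * takes an extra reference. */
struct demo_policy {
    atomic_int refcnt;
    bool shared;
};

/* Hypothetical container standing in for the shared policy tree. */
struct demo_store {
    atomic_flag lock;            /* very simple spin lock */
    struct demo_policy *pol;
};

/*
 * Look up the shared policy and take the extra reference while holding
 * the lock, so that a concurrent replacement cannot release the policy
 * between the lookup and its use.
 */
static struct demo_policy *demo_shared_lookup(struct demo_store *s)
{
    struct demo_policy *pol;

    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;                        /* spin until the lock is free */
    pol = s->pol;
    if (pol)
        atomic_fetch_add(&pol->refcnt, 1);
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
    return pol;
}

/*
 * Conditional put: only shared policies carry a lookup reference, so
 * only they are decremented. Returns the count after the call.
 */
static int demo_cond_put(struct demo_policy *pol)
{
    if (pol->shared)
        return atomic_fetch_sub(&pol->refcnt, 1) - 1;
    return atomic_load(&pol->refcnt);
}
```

The extra atomic increment and decrement on every lookup, plus the lock, are the per-allocation costs that the paragraph above attributes to shared policies.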


Memory Policy APIs

Linux supports 3 system calls for controlling memory policy. These APIs always affect only the calling task, the calling task's address space, or some shared object mapped into the calling task's address space.

NOTE: The headers that define these APIs and the parameter data types for user space applications reside in a package that is not part of the Linux kernel. The kernel system call interfaces, with the 'sys_' prefix, are defined in <linux/syscalls.h>; the mode and flag definitions are defined in <linux/mempolicy.h>.


Set [Task] Memory Policy:

long set_mempolicy(int mode, const unsigned long *nmask, unsigned long maxnode);

Sets the calling task's “task/process memory policy” to the mode specified by the 'mode' argument and the set of nodes defined by 'nmask'. 'nmask' points to a bit mask of node ids containing at least 'maxnode' ids. Optional mode flags may be passed by combining the 'mode' argument with the flag (for example: MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.
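A minimal sketch of a set_mempolicy() call, assuming a Linux system whose C library exposes SYS_set_mempolicy through <sys/syscall.h>. The raw syscall() interface is used so that no wrapper library is required. The helper names nodemask_set and interleave_on_nodes_0_and_1, and the choice of nodes 0 and 1, are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <limits.h>            /* CHAR_BIT */
#include <stddef.h>            /* size_t */
#include <unistd.h>            /* syscall() */
#include <sys/syscall.h>       /* SYS_set_mempolicy */
#include <linux/mempolicy.h>   /* MPOL_INTERLEAVE, MPOL_F_STATIC_NODES */

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: set the bit for 'node' in a node mask made of
 * 'nwords' unsigned longs. Returns 0 on success, -1 if out of range.
 */
int nodemask_set(unsigned long *mask, size_t nwords, size_t node)
{
    if (node >= nwords * BITS_PER_ULONG)
        return -1;
    mask[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
    return 0;
}

/*
 * Hypothetical wrapper: ask for the calling task's pages to be
 * interleaved over nodes 0 and 1, with the mode flag OR-ed into 'mode'.
 * Returns 0 on success, or -1 with errno set (for example when a node
 * does not exist or the call is not permitted).
 */
long interleave_on_nodes_0_and_1(void)
{
    unsigned long mask[1] = { 0 };

    nodemask_set(mask, 1, 0);
    nodemask_set(mask, 1, 1);
    /* 'maxnode' tells the kernel how many bits of 'mask' to consider. */
    return syscall(SYS_set_mempolicy,
                   MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
                   mask, (unsigned long)BITS_PER_ULONG);
}
```

A caller would typically check the return value and report errno, since the call can fail on systems without NUMA support or inside restricted sandboxes.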


Get [Task] Memory Policy or Related Information:

long get_mempolicy(int *mode,
                   unsigned long *nmask, unsigned long maxnode,
                   void *addr, int flags);

Queries the “task/process memory policy” of the calling task, or the policy or location of a specified virtual address, depending on the 'flags' argument.

See the get_mempolicy(2) man page for more details.
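A minimal sketch of querying the task policy, assuming SYS_get_mempolicy is available through <sys/syscall.h>. With 'flags' set to 0 and no address or node mask supplied, the call is expected to report the calling task's policy mode. The names query_task_policy and policy_mode_name are hypothetical, and the name table covers only the four modes named in this document.

```c
#define _GNU_SOURCE
#include <stddef.h>            /* NULL */
#include <unistd.h>            /* syscall() */
#include <sys/syscall.h>       /* SYS_get_mempolicy */
#include <linux/mempolicy.h>   /* MPOL_* constants */

/*
 * Hypothetical wrapper: store the calling task's policy mode in *mode.
 * Returns 0 on success, or -1 with errno set.
 */
long query_task_policy(int *mode)
{
    return syscall(SYS_get_mempolicy, mode, NULL, 0UL, NULL, 0UL);
}

/*
 * Hypothetical helper: printable name for a mode value. The two mode
 * flags named in this document are masked off first; other flag bits
 * are not handled by this sketch.
 */
const char *policy_mode_name(int mode)
{
    switch (mode & ~(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)) {
    case MPOL_DEFAULT:    return "MPOL_DEFAULT";
    case MPOL_PREFERRED:  return "MPOL_PREFERRED";
    case MPOL_BIND:       return "MPOL_BIND";
    case MPOL_INTERLEAVE: return "MPOL_INTERLEAVE";
    default:              return "unknown";
    }
}
```

A typical use would call query_task_policy() and, on success, print policy_mode_name() of the returned mode.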


Install VMA/Shared Policy for a Range of Task's Address Space

long mbind(void *start, unsigned long len, int mode,
           const unsigned long *nmask, unsigned long maxnode,
           unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a VMA policy for the range of the calling task's address space specified by the 'start' and 'len' arguments. Additional actions may be requested via the 'flags' argument.

See the mbind(2) man page for more details.
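A minimal sketch of installing a VMA policy with mbind(), assuming SYS_mbind is available through <sys/syscall.h>. The helper names round_up_to_page and alloc_bound_to_node0, and the choice of node 0, are assumptions for illustration. The mapping is rounded up to a whole number of pages before the policy is applied.

```c
#define _GNU_SOURCE
#include <limits.h>            /* CHAR_BIT */
#include <stddef.h>            /* size_t, NULL */
#include <unistd.h>            /* syscall(), sysconf() */
#include <sys/mman.h>          /* mmap(), munmap() */
#include <sys/syscall.h>       /* SYS_mbind */
#include <linux/mempolicy.h>   /* MPOL_BIND */

/* Hypothetical helper: round 'len' up to a multiple of 'page', which
 * must be a power of two. */
size_t round_up_to_page(size_t len, size_t page)
{
    return (len + page - 1) & ~(page - 1);
}

/*
 * Hypothetical helper: map anonymous memory and request that its pages
 * be allocated from node 0 only. Returns the mapping, or NULL on
 * failure (errno is left as set by mmap() or mbind()).
 */
void *alloc_bound_to_node0(size_t len)
{
    unsigned long mask = 1UL;  /* bit 0 selects node 0 */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t sz = round_up_to_page(len, page);
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    /* No extra 'flags' are requested, so only later faults are affected. */
    if (syscall(SYS_mbind, p, (unsigned long)sz, (unsigned long)MPOL_BIND,
                &mask, (unsigned long)(sizeof(mask) * CHAR_BIT), 0UL) != 0) {
        munmap(p, sz);
        return NULL;
    }
    return p;
}
```

Because the policy is attached to the address range rather than to the task, other mappings in the same task keep whatever policy applied to them before.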


Memory Policy Command Line Interface

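Although not strictly part of the Linux implementation of memory policy, a command line tool, numactl(8), exists that allows one to:

  1. set the task policy for a specified program via set_mempolicy(2), fork(2) and exec(2)

  2. set the shared policy for a shared memory segment via mbind(2)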

The numactl(8) tool is packaged with the run-time version of the library containing the memory policy system call wrappers. Some distributions package the headers and compile-time libraries in a separate development package.


Memory Policies and Cpusets

Memory policies work within cpusets as described above. For memory policies that require a node or set of nodes, the nodes are restricted to the set of nodes whose memories are allowed by the cpuset constraints. If the nodemask specified for the policy contains nodes that are not allowed by the cpuset and MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes specified for the policy and the set of nodes with memory is used. If the result is the empty set, the policy is considered invalid and cannot be installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped onto and folded into the task's set of allowed nodes as previously described.
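The two ways of combining a requested node set with the cpuset's allowed nodes can be sketched on single-word bit masks. This is an illustration rather than the kernel's implementation: restrict_nodes and relative_nodes are hypothetical names, node sets are assumed to fit in one unsigned long, and "mapped onto and folded into" is read here as "bit i of the request selects the (i mod N)-th allowed node, where N is the number of allowed nodes".

```c
#include <limits.h>   /* CHAR_BIT */

#define NODE_BITS ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Without MPOL_F_RELATIVE_NODES: keep only the requested nodes that are
 * also allowed. A result of 0 means the policy would be invalid.
 */
unsigned long restrict_nodes(unsigned long requested, unsigned long allowed)
{
    return requested & allowed;
}

/*
 * With MPOL_F_RELATIVE_NODES (as read above): bit i of 'requested'
 * selects the (i mod N)-th allowed node, counting from the lowest
 * allowed node id. Returns 0 when no node is allowed.
 */
unsigned long relative_nodes(unsigned long requested, unsigned long allowed)
{
    unsigned int map[sizeof(unsigned long) * CHAR_BIT];
    unsigned int n = 0, i;
    unsigned long result = 0;

    /* Record the ids of the allowed nodes in ascending order. */
    for (i = 0; i < NODE_BITS; i++)
        if (allowed & (1UL << i))
            map[n++] = i;
    if (n == 0)
        return 0;
    /* Fold each requested index into [0, n) and map it to a real node. */
    for (i = 0; i < NODE_BITS; i++)
        if (requested & (1UL << i))
            result |= 1UL << map[i % n];
    return result;
}
```

For example, with allowed nodes {2, 3}, a request for nodes {0, 1} has an empty intersection and would be rejected, while the same request interpreted relatively selects nodes {2, 3}.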

The interaction of memory policies and cpusets can be problematic when tasks in two cpusets share access to a memory region, such as shared memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If any of the tasks install shared policy on the region, only nodes whose memories are allowed in both cpusets may be used in the policies. Obtaining this information requires “stepping outside” the memory policy APIs to use the cpuset information and requires that one know which cpusets the other tasks that might attach to the shared region belong to. Furthermore, if the cpusets' allowed memory sets are disjoint, “local” allocation is the only valid policy.


References

https://www.cyberciti.biz/files/linux-kernel/Documentation/vm/numa_memory_policy.txt