X-Git-Url: http://ftp.safe.ca/?a=blobdiff_plain;f=Documentation%2Fvm%2Fnuma_memory_policy.txt;h=bad16d3f6a473afe913b0be76e783bcb946bc937;hb=08ce5f16ee466ffc5bf243800deeecd77d9eaf50;hp=27b9507a3769ac650f14f4837b1e20103a736bec;hpb=45c4745af381851b0406d8e4db99e62e265691c2;p=safe%2Fjmp%2Flinux-2.6 diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 27b9507..bad16d3 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt @@ -147,35 +147,18 @@ Components of Memory Policies Linux memory policy supports the following 4 behavioral modes: - Default Mode--MPOL_DEFAULT: The behavior specified by this mode is - context or scope dependent. - - As mentioned in the Policy Scope section above, during normal - system operation, the System Default Policy is hard coded to - contain the Default mode. - - In this context, default mode means "local" allocation--that is - attempt to allocate the page from the node associated with the cpu - where the fault occurs. If the "local" node has no memory, or the - node's memory can be exhausted [no free pages available], local - allocation will "fallback to"--attempt to allocate pages from-- - "nearby" nodes, in order of increasing "distance". - - Implementation detail -- subject to change: "Fallback" uses - a per node list of sibling nodes--called zonelists--built at - boot time, or when nodes or memory are added or removed from - the system [memory hotplug]. These per node zonelist are - constructed with nodes in order of increasing distance based - on information provided by the platform firmware. - - When a task/process policy or a shared policy contains the Default - mode, this also means "local allocation", as described above. - - In the context of a VMA, Default mode means "fall back to task - policy"--which may or may not specify Default mode. Thus, Default - mode can not be counted on to mean local allocation when used - on a non-shared region of the address space. However, see - MPOL_PREFERRED below. + Default Mode--MPOL_DEFAULT: This mode is only used in the memory + policy APIs. Internally, MPOL_DEFAULT is converted to the NULL + memory policy in all policy scopes. Any existing non-default policy + will simply be removed when MPOL_DEFAULT is specified. As a result, + MPOL_DEFAULT means "fall back to the next most specific policy scope." + + For example, a NULL or default task policy will fall back to the + system default policy. A NULL or default vma policy will fall + back to the task policy. + + When specified in one of the memory policy APIs, the Default mode + does not use the optional set of nodes. It is an error for the set of nodes specified for this policy to be non-empty. @@ -187,19 +170,17 @@ Components of Memory Policies MPOL_PREFERRED: This mode specifies that the allocation should be attempted from the single node specified in the policy. If that - allocation fails, the kernel will search other nodes, exactly as - it would for a local allocation that started at the preferred node - in increasing distance from the preferred node. "Local" allocation - policy can be viewed as a Preferred policy that starts at the node + allocation fails, the kernel will search other nodes, in order of + increasing distance from the preferred node based on information + provided by the platform firmware. containing the cpu where the allocation takes place. Internally, the Preferred policy uses a single node--the - preferred_node member of struct mempolicy. A "distinguished - value of this preferred_node, currently '-1', is interpreted - as "the node containing the cpu where the allocation takes - place"--local allocation. This is the way to specify - local allocation for a specific range of addresses--i.e. for - VMA policies. + preferred_node member of struct mempolicy. When the internal + mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and + the policy is interpreted as local allocation. "Local" allocation + policy can be viewed as a Preferred policy that starts at the node + containing the cpu where the allocation takes place. It is possible for the user to specify that local allocation is always preferred by passing an empty nodemask with this mode. @@ -311,6 +292,74 @@ Components of Memory Policies MPOL_PREFERRED policies that were created with an empty nodemask (local allocation). +MEMORY POLICY REFERENCE COUNTING + +To resolve use/free races, struct mempolicy contains an atomic reference +count field. Internal interfaces, mpol_get()/mpol_put() increment and +decrement this reference count, respectively. mpol_put() will only free +the structure back to the mempolicy kmem cache when the reference count +goes to zero. + +When a new memory policy is allocated, it's reference count is initialized +to '1', representing the reference held by the task that is installing the +new policy. When a pointer to a memory policy structure is stored in another +structure, another reference is added, as the task's reference will be dropped +on completion of the policy installation. + +During run-time "usage" of the policy, we attempt to minimize atomic operations +on the reference count, as this can lead to cache lines bouncing between cpus +and NUMA nodes. "Usage" here means one of the following: + +1) querying of the policy, either by the task itself [using the get_mempolicy() + API discussed below] or by another task using the /proc//numa_maps + interface. + +2) examination of the policy to determine the policy mode and associated node + or node lists, if any, for page allocation. This is considered a "hot + path". Note that for MPOL_BIND, the "usage" extends across the entire + allocation process, which may sleep during page reclaimation, because the + BIND policy nodemask is used, by reference, to filter ineligible nodes. + +We can avoid taking an extra reference during the usages listed above as +follows: + +1) we never need to get/free the system default policy as this is never + changed nor freed, once the system is up and running. + +2) for querying the policy, we do not need to take an extra reference on the + target task's task policy nor vma policies because we always acquire the + task's mm's mmap_sem for read during the query. The set_mempolicy() and + mbind() APIs [see below] always acquire the mmap_sem for write when + installing or replacing task or vma policies. Thus, there is no possibility + of a task or thread freeing a policy while another task or thread is + querying it. + +3) Page allocation usage of task or vma policy occurs in the fault path where + we hold them mmap_sem for read. Again, because replacing the task or vma + policy requires that the mmap_sem be held for write, the policy can't be + freed out from under us while we're using it for page allocation. + +4) Shared policies require special consideration. One task can replace a + shared memory policy while another task, with a distinct mmap_sem, is + querying or allocating a page based on the policy. To resolve this + potential race, the shared policy infrastructure adds an extra reference + to the shared policy during lookup while holding a spin lock on the shared + policy management structure. This requires that we drop this extra + reference when we're finished "using" the policy. We must drop the + extra reference on shared policies in the same query/allocation paths + used for non-shared policies. For this reason, shared policies are marked + as such, and the extra reference is dropped "conditionally"--i.e., only + for shared policies. + + Because of this extra reference counting, and because we must lookup + shared policies in a tree structure under spinlock, shared policies are + more expensive to use in the page allocation path. This is expecially + true for shared policies on shared memory regions shared by tasks running + on different NUMA nodes. This extra overhead can be avoided by always + falling back to task or system default policy for shared memory regions, + or by prefaulting the entire shared memory region into memory and locking + it down. However, this might not be appropriate for all applications. + MEMORY POLICY APIs Linux supports 3 system calls for controlling memory policy. These APIS