2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Attaching processes
+ 2.3 Mounting hierarchies by name
+ 2.4 Notification API
3. Kernel API
3.1 Overview
3.2 Synchronization
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.
-At any one time there may be multiple active hierachies of task
+At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.
User level code may create and destroy cgroups by name in an
/ \
Prof (15%) students (5%)
-Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
into NFS network class.
-At the same time firefox/lynx will share an appropriate CPU/Memory class
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
depending on who launched it (prof/student).
With the ability to classify tasks differently for different resources
Each cgroup is represented by a directory in the cgroup file system
containing the following files describing that cgroup:
- - tasks: list of tasks (by pid) attached to that cgroup
+ - tasks: list of tasks (by pid) attached to that cgroup. This list
+ is not guaranteed to be sorted. Writing a thread id into this file
+ moves the thread into this cgroup.
+ - cgroup.procs: list of tgids in the cgroup. This list is not
+ guaranteed to be sorted or free of duplicate tgids, and userspace
+ should sort/uniquify the list if this property is required.
+ This is a read-only file, for now.
- notify_on_release flag: run the release agent on exit?
- release_agent: the path to use for release notifications (this file
exists in the top cgroup only)
Creating, modifying, using the cgroups can be done through the cgroup
virtual filesystem.
-To mount a cgroup hierarchy will all available subsystems, type:
+To mount a cgroup hierarchy with all available subsystems, type:
# mount -t cgroup xxx /dev/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in
To mount a cgroup hierarchy with just the cpuset and numtasks
subsystems, type:
-# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
+# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup
To change the set of subsystems bound to a mounted hierarchy, just
remount with different options:
+# mount -o remount,cpuset,ns hier1 /dev/cgroup
-# mount -o remount,cpuset,ns /dev/cgroup
+Now memory is removed from the hierarchy and ns is added.
+
+Note this will add ns to the hierarchy but won't remove memory or
+cpuset, because the new options are appended to the old ones:
+# mount -o remount,ns /dev/cgroup
+
+To Specify a hierarchy's release_agent:
+# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+ xxx /dev/cgroup
+
+Note that specifying 'release_agent' more than once will return failure.
Note that changing the set of subsystems is currently only supported
when the hierarchy consists of a single (root) cgroup. Supporting
tree of the cgroups in the system. For instance, /dev/cgroup
is the cgroup that holds the whole system.
+If you want to change the value of release_agent:
+# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
+
+It can also be changed via remount.
+
If you want to create a new cgroup under /dev/cgroup:
# cd /dev/cgroup
# mkdir my_cgroup
In this directory you can find several files:
# ls
-notify_on_release tasks
+cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)
Now attach your shell to this cgroup:
# echo 0 > tasks
+2.3 Mounting hierarchies by name
+--------------------------------
+
+Passing the name=<x> option when mounting a cgroups hierarchy
+associates the given name with the hierarchy. This can be used when
+mounting a pre-existing hierarchy, in order to refer to it by name
+rather than by its set of active subsystems. Each hierarchy is either
+nameless, or has a unique name.
+
+The name should match [\w.-]+
+
+When passing a name=<x> option for a new hierarchy, you need to
+specify subsystems manually; the legacy behaviour of mounting all
+subsystems when none are explicitly specified is not supported when
+you give a subsystem a name.
+
+The name of the subsystem appears as part of the hierarchy description
+in /proc/mounts and /proc/<pid>/cgroups.
+
+2.4 Notification API
+--------------------
+
+There is mechanism which allows to get notifications about changing
+status of a cgroup.
+
+To register new notification handler you need:
+ - create a file descriptor for event notification using eventfd(2);
+ - open a control file to be monitored (e.g. memory.usage_in_bytes);
+ - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
+ Interpretation of args is defined by control file implementation;
+
+eventfd will be woken up by control file implementation or when the
+cgroup is removed.
+
+To unregister notification handler just close eventfd.
+
+NOTE: Support of notifications should be implemented for the control
+file. See documentation for the subsystem.
+
3. Kernel API
=============
- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_subsys
+If a subsystem can be compiled as a module, it should also have in its
+module initcall a call to cgroup_load_subsys(), and in its exitcall a
+call to cgroup_unload_subsys(). It should also set its_subsys.module =
+THIS_MODULE in its .c file.
+
Each subsystem may export the following methods. The only mandatory
methods are create/destroy. Any others that are null are presumed to
be successful no-ops.
newly-created cgroup if an error occurs after this subsystem's
create() method has been called for the new cgroup).
-void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
Called before checking the reference count on each subsystem. This may
be useful for subsystems which have some extra references even if
-there are not tasks in the cgroup.
+there are not tasks in the cgroup. If pre_destroy() returns error code,
+rmdir() will fail with it. From this behavior, pre_destroy() can be
+called multiple times against a cgroup.
int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct task_struct *task)
+ struct task_struct *task, bool threadgroup)
(cgroup_mutex held by caller)
Called prior to moving a task into a cgroup; if the subsystem
task is passed, then a successful result indicates that *any*
unspecified task can be moved into the cgroup. Note that this isn't
called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex.
+remain valid while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in future. If threadgroup is
+true, then a successful result indicates that all threads in the given
+thread's threadgroup can be moved together.
+
+void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ struct task_struct *task, bool threadgroup)
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has succeeded.
+A subsystem whose can_attach() has some side-effects should provide this
+function, so that the subsytem can implement a rollback. If not, not necessary.
+This will be called only about subsystems whose can_attach() operation have
+succeeded.
void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
- struct cgroup *old_cgrp, struct task_struct *task)
+ struct cgroup *old_cgrp, struct task_struct *task,
+ bool threadgroup)
(cgroup_mutex held by caller)
Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking.
+If threadgroup is true, the subsystem should take care of all threads
+in the specified thread's threadgroup. Currently does not support any
+subsystem that might need the old_cgrp for every thread in the group.
void fork(struct cgroup_subsy *ss, struct task_struct *task)
void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
(cgroup_mutex held by caller)
-Called at the end of cgroup_clone() to do any paramater
+Called at the end of cgroup_clone() to do any parameter
initialization which might be required before a task could attach. For
example in cpusets, no task may attach before 'cpus' and 'mems' are set
up.