safe/jmp/linux-2.6
16 years agosched: clean up pick_next_highest_task_rt()
Ingo Molnar [Fri, 25 Jan 2008 20:08:14 +0000 (21:08 +0100)]
sched: clean up pick_next_highest_task_rt()

clean up pick_next_highest_task_rt().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: RT-balance on new task
Steven Rostedt [Fri, 25 Jan 2008 20:08:14 +0000 (21:08 +0100)]
sched: RT-balance on new task

rt-balance when creating new tasks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: RT-balance, optimize cpu search
Steven Rostedt [Fri, 25 Jan 2008 20:08:13 +0000 (21:08 +0100)]
sched: RT-balance, optimize cpu search

This patch removes several cpumask operations by keeping track
of the first of the CPUS that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
is cleared.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: RT-balance, optimize
Gregory Haskins [Fri, 25 Jan 2008 20:08:13 +0000 (21:08 +0100)]
sched: RT-balance, optimize

We can cheaply track the number of bits set in the cpumask for the lowest
priority CPUs.  Therefore, compute the mask's weight and use it to skip
the optimal domain search logic when there is only one CPU available.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: break out early if RT task cannot be migrated
Gregory Haskins [Fri, 25 Jan 2008 20:08:13 +0000 (21:08 +0100)]
sched: break out early if RT task cannot be migrated

We don't need to bother searching if the task cannot be migrated

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: RT-balance, avoid overloading
Steven Rostedt [Fri, 25 Jan 2008 20:08:12 +0000 (21:08 +0100)]
sched: RT-balance, avoid overloading

This patch changes the searching for a run queue by a waking RT task
to try to pick another runqueue if the currently running task
is an RT task.

The reason is that RT tasks behave different than normal
tasks. Preempting a normal task to run a RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again unless the
migration thread comes along and pulls it off.

RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will then quickly try to run on another runqueue.
This will cause that lower RT task to replace its nice
hot cache (and TLB) with a completely cold one. This is
for the hope that the new high priority RT task will keep
 its cache hot.

Remeber that this high priority RT task was just woken up.
So it may likely have been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run till
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: wake-balance fixes
Gregory Haskins [Fri, 25 Jan 2008 20:08:12 +0000 (21:08 +0100)]
sched: wake-balance fixes

We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away.  So we add support
for considering this new information.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: optimize RT affinity
Gregory Haskins [Fri, 25 Jan 2008 20:08:11 +0000 (21:08 +0100)]
sched: optimize RT affinity

The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally.  In the real world, there are
various toplogies and affinities that govern where a task is best suited to
run with the smallest amount of overhead.  NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.

Fortunately, linux is already structured to represent these topologies via
the sched_domains interface.  So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: pre-route RT tasks on wakeup
Gregory Haskins [Fri, 25 Jan 2008 20:08:10 +0000 (21:08 +0100)]
sched: pre-route RT tasks on wakeup

In the original patch series that Steven Rostedt and I worked on together,
we both took different approaches to low-priority wakeup path.  I utilized
"pre-routing" (push the task away to a less important RQ before activating)
approach, while Steve utilized a "post-routing" approach.  The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens.  The advantage of Steve's method is
that it neatly solves an issue preventing a "pull" optimization from being
deployed.

In the end, we ended up deploying Steve's idea.  But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.

The idea is simple:  Use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task.  And we also retain the
post-routing push logic to clean up any inaccuracies caused by a condition
of "priority mistargeting" caused by the lightweight lookup.  Most of the
time, the pre-routing should work and yield lower overhead.  In the cases
where it doesnt, the post-router will bat cleanup.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: RT balancing: include current CPU
Gregory Haskins [Fri, 25 Jan 2008 20:08:10 +0000 (21:08 +0100)]
sched: RT balancing: include current CPU

It doesn't hurt if we allow the current CPU to be included in the
search.  We will just simply skip it later if the current CPU turns out
to be the lowest.

We will use this later in the series

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: break out search for RT tasks
Gregory Haskins [Fri, 25 Jan 2008 20:08:10 +0000 (21:08 +0100)]
sched: break out search for RT tasks

Isolate the search logic into a function so that it can be used later
in places other than find_locked_lowest_rq().

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: de-SCHED_OTHER-ize the RT path
Gregory Haskins [Fri, 25 Jan 2008 20:08:09 +0000 (21:08 +0100)]
sched: de-SCHED_OTHER-ize the RT path

The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations.  The problem is that
these calculations are only relevant to SCHED_OTHER tasks where load is king.
For RT tasks, priority is king.  So the load calculation is completely wasted
bandwidth.

Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions and move the load calculation as a function
of CFS task's class.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: clean up this_rq use in kernel/sched_rt.c
Gregory Haskins [Fri, 25 Jan 2008 20:08:09 +0000 (21:08 +0100)]
sched: clean up this_rq use in kernel/sched_rt.c

"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)").  So clean up the usage of this_rq to be
more consistent with the rest of the code.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: add RT-balance cpu-weight
Gregory Haskins [Fri, 25 Jan 2008 20:08:07 +0000 (21:08 +0100)]
sched: add RT-balance cpu-weight

Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time.  Consider, for instance, softirq_timer and softirq_sched.  A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run.  So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.

This causes two problems in the current code:

1) If a high-priority bound task and a low-priority unbounded task queue
   up behind the running task, we will fail to ever relocate the unbounded
   task because we terminate the search on the first unmovable task.

2) We spend precious futile cycles in the fast-path trying to pull
   overloaded tasks over.  It is therefore optimial to strive to avoid the
   overhead all together if we can cheaply detect the condition before
   overload even occurs.

This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask.  A weight of 1 indicates that
the task cannot be migrated.  We will then utilize this information to
skip non-migratable tasks and to eliminate uncessary rebalance attempts.

We introduce a per-rq variable to count the number of migratable tasks
that are currently running.  We only go into overload if we have more
than one rt task, AND at least one of them is migratable.

In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: disable standard balancer for RT tasks
Steven Rostedt [Fri, 25 Jan 2008 20:08:07 +0000 (21:08 +0100)]
sched: disable standard balancer for RT tasks

Since we now take an active approach to load balancing, we don't need to
balance RT tasks via the normal task balancer. In fact, this code was
found to pull RT tasks away from CPUS that the active movement performed,
resulting in large latencies.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: push RT tasks from overloaded CPUs
Steven Rostedt [Fri, 25 Jan 2008 20:08:07 +0000 (21:08 +0100)]
sched: push RT tasks from overloaded CPUs

This patch adds pushing of overloaded RT tasks from a runqueue that is
having tasks (most likely RT tasks) added to the run queue.

TODO: We don't cover the case of waking of new RT tasks (yet).

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: pull RT tasks from overloaded runqueues
Steven Rostedt [Fri, 25 Jan 2008 20:08:07 +0000 (21:08 +0100)]
sched: pull RT tasks from overloaded runqueues

This patch adds the algorithm to pull tasks from RT overloaded runqueues.

When a pull RT is initiated, all overloaded runqueues are examined for
a RT task that is higher in prio than the highest prio task queued on the
target runqueue. If another runqueue holds a RT task that is of higher
prio than the highest prio task on the target runqueue is found it is pulled
to the target runqueue.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: add rt-overload tracking
Steven Rostedt [Fri, 25 Jan 2008 20:08:06 +0000 (21:08 +0100)]
sched: add rt-overload tracking

This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded. That is that it
is a candidate to have RT tasks pulled from it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: add RT task pushing
Steven Rostedt [Fri, 25 Jan 2008 20:08:05 +0000 (21:08 +0100)]
sched: add RT task pushing

This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.

When more than one RT task is added to a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority.  The way this works is that the highest
RT task that is not running is looked at and we examine the runqueues on
the CPUS for that tasks affinity mask. We find the runqueue with the lowest
prio in the CPU affinity of the picked task, and if it is lower in prio than
the picked task, we push the task onto that CPU runqueue.

We continue pushing RT tasks off the current runqueue until we don't push any
more.  The algorithm stops when the next highest RT task can't preempt any
other processes on other CPUS.

TODO: The algorithm may stop when there are still RT tasks that can be
 migrated. Specifically, if the highest non running RT task CPU affinity
 is restricted to CPUs that are running higher priority tasks, there may
 be a lower priority task queued that has an affinity with a CPU that is
 running a lower priority task that it could be migrated to.  This
 patch set does not address this issue.

Note: checkpatch reveals two over 80 character instances. I'm not sure
 that breaking them up will help visually, so I left them as is.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: track highest prio task queued
Steven Rostedt [Fri, 25 Jan 2008 20:08:04 +0000 (21:08 +0100)]
sched: track highest prio task queued

This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.

This information will be used for later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: count # of queued RT tasks
Steven Rostedt [Fri, 25 Jan 2008 20:08:03 +0000 (21:08 +0100)]
sched: count # of queued RT tasks

This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosoftlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
Ingo Molnar [Fri, 25 Jan 2008 20:08:02 +0000 (21:08 +0100)]
softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks

this patch extends the soft-lockup detector to automatically
detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
printed the following way:

 ------------------>
 INFO: task prctl:3042 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
 prctl         D fd5e3793     0  3042   2997
        f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
        f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
        f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
 Call Trace:
  [<c04883a5>] schedule_timeout+0x6d/0x8b
  [<c04883d8>] schedule_timeout_uninterruptible+0x15/0x17
  [<c0133a76>] msleep+0x10/0x16
  [<c0138974>] sys_prctl+0x30/0x1e2
  [<c0104c52>] sysenter_past_esp+0x5f/0xa5
  =======================
 2 locks held by prctl/3042:
 #0:  (&sb->s_type->i_mutex_key#5){--..}, at: [<c0197d11>] do_fsync+0x38/0x7a
 #1:  (jbd_handle){--..}, at: [<c01ca3d2>] journal_start+0xc7/0xe9
 <------------------

the current default timeout is 120 seconds. Such messages are printed
up to 10 times per bootup. If the system has crashed already then the
messages are not printed.

if lockdep is enabled then all held locks are printed as well.

this feature is a natural extension to the softlockup-detector (kernel
locked up without scheduling) and to the NMI watchdog (kernel locked up
with IRQs disabled).

[ Gautham R Shenoy <ego@in.ibm.com>: CPU hotplug fixes. ]
[ Andrew Morton <akpm@linux-foundation.org>: build warning fix. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
16 years agocpu-hotplug: fix build on !CONFIG_SMP
Ingo Molnar [Fri, 25 Jan 2008 20:08:02 +0000 (21:08 +0100)]
cpu-hotplug: fix build on !CONFIG_SMP

fix build on !CONFIG_SMP.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agocpu-hotplug: replace per-subsystem mutexes with get_online_cpus()
Gautham R Shenoy [Fri, 25 Jan 2008 20:08:02 +0000 (21:08 +0100)]
cpu-hotplug: replace per-subsystem mutexes with get_online_cpus()

This patch converts the known per-subsystem mutexes to get_online_cpus
put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and
CPU_LOCK_RELEASE hotplug notification events.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agocpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus()
Gautham R Shenoy [Fri, 25 Jan 2008 20:08:02 +0000 (21:08 +0100)]
cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus()

Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
get_online_cpus and put_online_cpus instead as it highlights the
refcount semantics in these operations.

The new API guarantees protection against the cpu-hotplug operation, but
it doesn't guarantee serialized access to any of the local data
structures. Hence the changes needs to be reviewed.

In case of pseries_add_processor/pseries_remove_processor, use
cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
cpu_present_map there.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agocpu-hotplug: refcount based cpu hotplug
Gautham R Shenoy [Fri, 25 Jan 2008 20:08:01 +0000 (21:08 +0100)]
cpu-hotplug: refcount based cpu hotplug

This patch implements a Refcount + Waitqueue based model for
cpu-hotplug.

Now, a thread which wants to prevent cpu-hotplug, will bump up a global
refcount and the thread which wants to perform a cpu-hotplug operation
will block till the global refcount goes to zero.

The readers, if any, during an ongoing cpu-hotplug operation are blocked
until the cpu-hotplug operation is over.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com> [For !CONFIG_HOTPLUG_CPU ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
Srivatsa Vaddagiri [Fri, 25 Jan 2008 20:08:00 +0000 (21:08 +0100)]
sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups

The current load balancing scheme isn't good enough for precise
group fairness.

For example: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------------------
Col1  | Col2    | Col3  |  Col4
------|---------|-------|-------------------------------------------------------
a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
        percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1  | Col2    | Col3  |  Col4
------|---------|-------|-------------------------------------------------------
a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
--------------------------------------------------------------------------------

Deviation from desired group fairness is as below:

a = +0.67%
b = -0.83%
c = +0.17%

which is far better IMO. Most of other runs have yielded a deviation within
+-2% at the most, which is good.

Why do we see bad (group) fairness with current scheuler?
=========================================================

Currently cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:

CPU0 CPU1
---------------------------
A (10)  B(5)
C(5)
---------------------------

Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).

The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:

A = 50%
B = 25%
C = 25%

which is not the desired result.

What's changing in the patch?
=============================

- How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is
  defined (see below)
- API Change
- Two tunables introduced in sysfs (under SCHED_DEBUG) to
  control the frequency at which the load balance monitor
  thread runs.

The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. Its now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.

Let,
W(K,i)  = Weight of group K on cpu i
T(K,i)  = Task load present in group K's cfs_rq on cpu i
T(K)    = Total task load of group K across various cpus
S(K)  = Shares allocated to group K
NRCPUS = Number of online cpus in the scheduler domain to
    which group K is assigned.

Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)

A load balance monitor thread is created at bootup, which periodically
runs and adjusts group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate
at which it runs.

Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>

- don't start the load_balance_monitor when there is only a single cpu.
- rename the kthread because its currently longer than TASK_COMM_LEN

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: introduce a mutex and corresponding API to serialize access to doms_curarray
Srivatsa Vaddagiri [Fri, 25 Jan 2008 20:08:00 +0000 (21:08 +0100)]
sched: introduce a mutex and corresponding API to serialize access to doms_curarray

doms_cur[] array represents various scheduling domains which are
mutually exclusive. Currently cpusets code can modify this array (by
calling partition_sched_domains()) as a result of user modifying
sched_load_balance flag for various cpusets.

This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read
the doms_cur[] array w/o worrying abt concurrent modifications to the
array.

The fair group scheduler code (introduced in next patch of this series)
makes use of this mutex to walk thr' doms_cur[] array while rebalancing
shares of task groups across cpus.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: group scheduling, change how cpu load is calculated
Srivatsa Vaddagiri [Fri, 25 Jan 2008 20:08:00 +0000 (21:08 +0100)]
sched: group scheduling, change how cpu load is calculated

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now
a summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch has a minor impact on code size, but should have
no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: group scheduling, minor fixes
Srivatsa Vaddagiri [Fri, 25 Jan 2008 20:07:59 +0000 (21:07 +0100)]
sched: group scheduling, minor fixes

Minor bug fixes for the group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
  changing shares of a task group. Use the same mutex when printing
  cfs_rq debugging stats for various task groups.

- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
  walking task group list)

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: group scheduling code cleanup
Srivatsa Vaddagiri [Fri, 25 Jan 2008 20:07:59 +0000 (21:07 +0100)]
sched: group scheduling code cleanup

Minor cleanups:

- Fix coding style
- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: remove printk_clock references from ia64
Ingo Molnar [Fri, 25 Jan 2008 20:07:59 +0000 (21:07 +0100)]
sched: remove printk_clock references from ia64

remove remaining printk_clock references from ia64.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: remove printk_clock()
Ingo Molnar [Fri, 25 Jan 2008 20:07:59 +0000 (21:07 +0100)]
sched: remove printk_clock()

printk_clock() is obsolete - it has been replaced with cpu_clock().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agosched: fix CONFIG_PRINT_TIME's reliance on sched_clock()
Ingo Molnar [Fri, 25 Jan 2008 20:07:58 +0000 (21:07 +0100)]
sched: fix CONFIG_PRINT_TIME's reliance on sched_clock()

Stefano Brivio reported weird printk timestamp behavior during
CPU frequency changes:

  http://bugzilla.kernel.org/show_bug.cgi?id=9475

fix CONFIG_PRINT_TIME's reliance on sched_clock() and use cpu_clock()
instead.

Reported-and-bisected-by: Stefano Brivio <stefano.brivio@polimi.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
16 years agoprintk: make printk more robust by not allowing recursion
Ingo Molnar [Fri, 25 Jan 2008 20:07:58 +0000 (21:07 +0100)]
printk: make printk more robust by not allowing recursion

make printk more robust by allowing recursion only if there's a crash
going on. Also add recursion detection.

I've tested it with an artificially injected printk recursion - instead
of a lockup or spontaneous reboot or other crash, the output was a well
controlled:

[   41.057335] SysRq : <2>BUG: recent printk recursion!
[   41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks

also do all this printk-debug logic with irqs disabled.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Nick Piggin <npiggin@suse.de>
16 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris...
Linus Torvalds [Fri, 25 Jan 2008 16:44:29 +0000 (08:44 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jmorris/selinux-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
  selinux: make mls_compute_sid always polyinstantiate
  security/selinux: constify function pointer tables and fields
  security: add a secctx_to_secid() hook
  security: call security_file_permission from rw_verify_area
  security: remove security_sb_post_mountroot hook
  Security: remove security.h include from mm.h
  Security: remove security_file_mmap hook sparse-warnings (NULL as 0).
  Security: add get, set, and cloning of superblock security information
  security/selinux: Add missing "space"

16 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen...
Linus Torvalds [Fri, 25 Jan 2008 16:40:02 +0000 (08:40 -0800)]
Merge branch 'for-linus' of git://git./linux/kernel/git/hskinnemoen/avr32-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6:
  [AVR32] extint: Set initial irq type to low level
  [AVR32] extint: change set_irq_type() handling
  [AVR32] NMI debugging
  [AVR32] constify function pointer tables
  [AVR32] ATNGW100: Update defconfig
  [AVR32] ATSTK1002: Update defconfig
  [AVR32] Kconfig: Choose daughterboard instead of CPU
  [AVR32] Add support for ATSTK1003 and ATSTK1004
  [AVR32] Clean up external DAC setup code
  [AVR32] ATSTK1000: Move gpio-leds setup to setup.c
  [AVR32] Add support for AT32AP7001 and AT32AP7002
  [AVR32] Provide more CPU information in /proc/cpuinfo and dmesg
  [AVR32] Oprofile support
  [AVR32] Include instrumentation menu
  Disable VGA text console for AVR32 architecture
  [AVR32] Enable debugging only when needed
  ptrace: Call arch_ptrace_attach() when request=PTRACE_TRACEME
  [AVR32] Remove redundant try_to_freeze() call from do_signal()
  [AVR32] Drop GFP_COMP for DMA memory allocations

16 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
Linus Torvalds [Fri, 25 Jan 2008 16:39:18 +0000 (08:39 -0800)]
Merge git://git./linux/kernel/git/steve/gfs2-2.6-nmw

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (56 commits)
  [GFS2] Allow journal recovery on read-only mount
  [GFS2] Lockup on error
  [GFS2] Fix page_mkwrite truncation race path
  [GFS2] Fix typo
  [GFS2] Fix write alloc required shortcut calculation
  [GFS2] gfs2_alloc_required performance
  [GFS2] Remove unneeded i_spin
  [GFS2] Reduce inode size by moving i_alloc out of line
  [GFS2] Fix assert in log code
  [GFS2] Fix problems relating to execution of files on GFS2
  [GFS2] Initialize extent_list earlier
  [GFS2] Allow page migration for writeback and ordered pages
  [GFS2] Remove unused variable
  [GFS2] Fix log block mapper
  [GFS2] Minor correction
  [GFS2] Eliminate the no longer needed sd_statfs_mutex
  [GFS2] Incremental patch to fix compiler warning
  [GFS2] Function meta_read optimization
  [GFS2] Only fetch the dinode once in block_map
  [GFS2] Reorganize function gfs2_glmutex_lock
  ...

16 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 25 Jan 2008 16:38:25 +0000 (08:38 -0800)]
Merge git://git./linux/kernel/git/herbert/crypto-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (125 commits)
  [CRYPTO] twofish: Merge common glue code
  [CRYPTO] hifn_795x: Fixup container_of() usage
  [CRYPTO] cast6: inline bloat--
  [CRYPTO] api: Set default CRYPTO_MINALIGN to unsigned long long
  [CRYPTO] tcrypt: Make xcbc available as a standalone test
  [CRYPTO] xcbc: Remove bogus hash/cipher test
  [CRYPTO] xcbc: Fix algorithm leak when block size check fails
  [CRYPTO] tcrypt: Zero axbuf in the right function
  [CRYPTO] padlock: Only reset the key once for each CBC and ECB operation
  [CRYPTO] api: Include sched.h for cond_resched in scatterwalk.h
  [CRYPTO] salsa20-asm: Remove unnecessary dependency on CRYPTO_SALSA20
  [CRYPTO] tcrypt: Add select of AEAD
  [CRYPTO] salsa20: Add x86-64 assembly version
  [CRYPTO] salsa20_i586: Salsa20 stream cipher algorithm (i586 version)
  [CRYPTO] gcm: Introduce rfc4106
  [CRYPTO] api: Show async type
  [CRYPTO] chainiv: Avoid lock spinning where possible
  [CRYPTO] seqiv: Add select AEAD in Kconfig
  [CRYPTO] scatterwalk: Handle zero nbytes in scatterwalk_map_and_copy
  [CRYPTO] null: Allow setkey on digest_null
  ...

16 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6
Linus Torvalds [Fri, 25 Jan 2008 16:34:42 +0000 (08:34 -0800)]
Merge git://git./linux/kernel/git/gregkh/driver-2.6

This can be broken down into these major areas:
 - Documentation updates (language translations and fixes, as
   well as kobject and kset documenatation updates.)
 - major kset/kobject/ktype rework and fixes.  This cleans up the
   kset and kobject and ktype relationship and architecture,
   making sense of things now, and good documenation and samples
   are provided for others to use.  Also the attributes for
   kobjects are much easier to handle now.  This cleaned up a LOT
   of code all through the kernel, making kobjects easier to use
   if you want to.
 - struct bus_type has been reworked to now handle the lifetime
   rules properly, as the kobject is properly dynamic.
 - struct driver has also been reworked, and now the lifetime
   issues are resolved.
 - the block subsystem has been converted to use struct device
   now, and not "raw" kobjects.  This patch has been in the -mm
   tree for over a year now, and finally all the issues are
   worked out with it.  Older distros now properly work with new
   kernels, and no userspace updates are needed at all.
 - nozomi driver is added.  This has also been in -mm for a long
   time, and many people have asked for it to go in.  It is now
   in good enough shape to do so.
 - lots of class_device conversions to use struct device instead.
   The tree is almost all cleaned up now, only SCSI and IB is the
   remaining code to fix up...

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (196 commits)
  Driver core: coding style fixes
  Kobject: fix coding style issues in kobject c files
  Kobject: fix coding style issues in kobject.h
  Driver core: fix coding style issues in device.h
  spi: use class iteration api
  scsi: use class iteration api
  rtc: use class iteration api
  power supply : use class iteration api
  ieee1394: use class iteration api
  Driver Core: add class iteration api
  Driver core: Cleanup get_device_parent() in device_add() and device_move()
  UIO: constify function pointer tables
  Driver Core: constify the name passed to platform_device_register_simple
  driver core: fix build with SYSFS=n
  sysfs: make SYSFS_DEPRECATED depend on SYSFS
  Driver core: use LIST_HEAD instead of call to INIT_LIST_HEAD in __init
  kobject: add sample code for how to use ksets/ktypes/kobjects
  kobject: add sample code for how to use kobjects in a simple manner.
  kobject: update the kobject/kset documentation
  kobject: remove old, outdated documentation.
  ...

16 years agoslab: fix bootstrap on memoryless node
Pekka Enberg [Fri, 25 Jan 2008 06:20:51 +0000 (08:20 +0200)]
slab: fix bootstrap on memoryless node

If the node we're booting on doesn't have memory, bootstrapping kmalloc()
caches resorts to fallback_alloc() which requires ->nodelists set for all
nodes.  Fix that by calling set_up_list3s() for CACHE_CACHE in
kmem_cache_init().

As kmem_getpages() is called with GFP_THISNODE set, this used to work before
because of breakage in 2.6.22 and before with GFP_THISNODE returning pages from
the wrong node if a node had no memory. So it may have worked accidentally and
in an unsafe manner because the pages would have been associated with the wrong
node which could trigger bug ons and locking troubles.

Tested-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
[ With additional one-liner by Olaf Hering  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agofix oops on rmmod capidrv
Karsten Keil [Fri, 25 Jan 2008 10:55:28 +0000 (11:55 +0100)]
fix oops on rmmod capidrv

Fix overwriting the stack with the version string
(it is currently 10 bytes + zero) when unloading the
capidrv module. Safeguard against overwriting it
should the version string grow in the future.

Should fix Kernel Bug Tracker Bug 9696.

Signed-off-by: Gerd v. Egidy <gerd.von.egidy@intra2net.com>
Acked-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago[GFS2] Allow journal recovery on read-only mount
Abhijith Das [Fri, 18 Jan 2008 20:06:37 +0000 (14:06 -0600)]
[GFS2] Allow journal recovery on read-only mount

This patch allows gfs2 to perform journal recovery even if it is mounted
read-only. Strictly speaking, a read-only mount should not be writing to
the filesystem, but we do this only to perform journal recovery. A
read-only mount will fail if we don't recover the dirty journal. Also,
when gfs2 is used as a root filesystem, it will be mounted read-only
before being mounted read-write during the boot sequence. A failed
read-only mount will panic the machine during bootup.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Lockup on error
Bob Peterson [Sun, 20 Jan 2008 03:50:24 +0000 (21:50 -0600)]
[GFS2] Lockup on error

I spotted this bug while I was digging around.  Looks like it could cause
a lockup in some rare error condition.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix page_mkwrite truncation race path
Steven Whitehouse [Thu, 17 Jan 2008 15:12:03 +0000 (15:12 +0000)]
[GFS2] Fix page_mkwrite truncation race path

There was a bug in the truncation/invalidation race path for
->page_mkwrite for gfs2. It ought to return 0 so that the effect is the
same as if the page was truncated at any of the other points at which
the page_lock is dropped. This will result in the restart of the whole
page fault path. If it was due to a real truncation (as opposed to an
invalidate because we let a glock go) then the ->fault path will pick
that up when it gets called again.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix typo
Bob Peterson [Wed, 16 Jan 2008 14:45:39 +0000 (08:45 -0600)]
[GFS2] Fix typo

This patch fixes a minor typo.  Surprisingly, it still compiled.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix write alloc required shortcut calculation
Steven Whitehouse [Wed, 16 Jan 2008 14:24:05 +0000 (14:24 +0000)]
[GFS2] Fix write alloc required shortcut calculation

The comparison was being made against the wrong quantity.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] gfs2_alloc_required performance
Bob Peterson [Fri, 11 Jan 2008 19:44:50 +0000 (13:44 -0600)]
[GFS2] gfs2_alloc_required performance

This is a small I/O performance enhancement to gfs2.  (Actually, it is a rework of
an earlier version I got wrong).  The idea here is to check if the write extends
past the last block in the file.  If so, the function can save itself a lot of
time and trouble because it knows an allocate will be required.  Benchmarks like
iozone should see better performance.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove unneeded i_spin
Bob Peterson [Fri, 11 Jan 2008 19:31:12 +0000 (13:31 -0600)]
[GFS2] Remove unneeded i_spin

This patch removes a vestigial variable "i_spin" from the gfs2_inode
structure.  This not only saves us memory (>300000 of these in memory
for the oom test) it also saves us time because we don't have to
spend time initializing it (i.e. slightly better performance).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Reduce inode size by moving i_alloc out of line
Steven Whitehouse [Thu, 10 Jan 2008 15:18:55 +0000 (15:18 +0000)]
[GFS2] Reduce inode size by moving i_alloc out of line

It is possible to reduce the size of GFS2 inodes by taking the i_alloc
structure out of the gfs2_inode. This patch allocates the i_alloc
structure whenever its needed, and frees it afterward. This decreases
the amount of low memory we use at the expense of requiring a memory
allocation for each page or partial page that we write. A quick test
with postmark shows that the overhead is not measurable and I also note
that OCFS2 use the same approach.

In the future I'd like to solve the problem by shrinking down the size
of the members of the i_alloc structure, but for now, this reduces the
immediate problem of using too much low-memory on x86 and doesn't add
too much overhead.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix assert in log code
Steven Whitehouse [Thu, 10 Jan 2008 14:49:43 +0000 (14:49 +0000)]
[GFS2] Fix assert in log code

Although the values were all being calculated correctly, there was a
race in the assert due to the way it was using atomic variables. This
changes the value we assert on so that we get the same effect by testing
a different variable. This prevents the assert triggering when it shouldn't.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix problems relating to execution of files on GFS2
Steven Whitehouse [Tue, 8 Jan 2008 08:14:30 +0000 (08:14 +0000)]
[GFS2] Fix problems relating to execution of files on GFS2

This patch fixes a couple of problems which affected the execution of files
on GFS2. The first is that there was a corner case where inodes were not
always uptodate at the point at which permissions checks were being carried
out, this was resulting in refusal of execute permission, but only on the
first lookup, subsequent requests worked correctly. The second was a problem
relating to incorrect updating of file sizes which was introduced with the
write_begin/end code for GFS2 a little while back.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
16 years ago[GFS2] Initialize extent_list earlier
Bob Peterson [Thu, 3 Jan 2008 15:24:53 +0000 (09:24 -0600)]
[GFS2] Initialize extent_list earlier

Here is a patch for the latest upstream GFS2 code:
The journal extent map needs to be initialized sooner than it
currently is.  Otherwise failed mount attempts (e.g. not enough
journals, etc.) may panic trying to access the uninitialized list.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Allow page migration for writeback and ordered pages
Steven Whitehouse [Thu, 3 Jan 2008 11:31:38 +0000 (11:31 +0000)]
[GFS2] Allow page migration for writeback and ordered pages

To improve performance on NUMA, we use the VM's standard page
migration for writeback and ordered pages. Probably we could
also do the same for journaled data, but that would need a
careful audit of the code, so will be the subject of a later
patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove unused variable
Steven Whitehouse [Wed, 2 Jan 2008 10:16:56 +0000 (10:16 +0000)]
[GFS2] Remove unused variable

The go_drop_th function is never called or referenced.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix log block mapper
Steven Whitehouse [Fri, 14 Dec 2007 14:04:34 +0000 (14:04 +0000)]
[GFS2] Fix log block mapper

A missing offset in the calculation.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Minor correction
Bob Peterson [Wed, 12 Dec 2007 23:52:13 +0000 (17:52 -0600)]
[GFS2] Minor correction

This is a small correction to my previously posted patch1.
It just changes a divide to a shift.  It's faster and doesn't
introduce odd dependencies on 32-bit compiles.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Eliminate the no longer needed sd_statfs_mutex
Bob Peterson [Wed, 12 Dec 2007 17:44:41 +0000 (11:44 -0600)]
[GFS2] Eliminate the no longer needed sd_statfs_mutex

This patch eliminates the unneeded sd_statfs_mutex mutex but preserves
the ordering as discussed.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Incremental patch to fix compiler warning
Bob Peterson [Wed, 12 Dec 2007 15:24:08 +0000 (09:24 -0600)]
[GFS2] Incremental patch to fix compiler warning

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Function meta_read optimization
Bob Peterson [Wed, 12 Dec 2007 01:29:17 +0000 (19:29 -0600)]
[GFS2] Function meta_read optimization

This patch optimizes function gfs2_meta_read.  Basically, gfs2_meta_wait
was being called regardless of whether a disk read was requested.
This just pulls that wait into the if that triggers the read.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Only fetch the dinode once in block_map
Bob Peterson [Wed, 12 Dec 2007 01:16:09 +0000 (19:16 -0600)]
[GFS2] Only fetch the dinode once in block_map

Function gfs2_block_map was often looking up the disk inode twice.
This optimizes it so that only does it once.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Reorganize function gfs2_glmutex_lock
Bob Peterson [Wed, 12 Dec 2007 01:13:54 +0000 (19:13 -0600)]
[GFS2] Reorganize function gfs2_glmutex_lock

This patch optimizes the function gfs2_glmutex_lock.
The basic theory is: Why bother initializing a holder, setting up
wait bits and then waiting on them, if you know the glock can be
yours.  So the holder stuff is placed inside the if checking if the
glock is locked.  This one needs careful scrutiny because changing
anything to do with locking should strike terror into one's heart.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Run through full bitmaps quicker in gfs2_bitfit
Bob Peterson [Wed, 12 Dec 2007 01:00:16 +0000 (19:00 -0600)]
[GFS2] Run through full bitmaps quicker in gfs2_bitfit

I eliminated the passing of an unused parameter into gfs2_bitfit called rgd.

This also changes the gfs2_bitfit code that searches for free (or used) blocks.
Before, the code was trying to check for bytes that indicated 4 blocks in
the undesired state.  The problem is, it was spending more time trying to
do this than it actually was saving.  This version only optimizes the case
where we're looking for free blocks, and it checks a machine word at a time.
So on 32-bit machines, it will check 32-bits (16 blocks) and on 64-bit
machines, it will check 64-bits (32 blocks) at a time.  The compiler
optimizes that quite well and we save some time, especially when running
through full bitmaps (like the bitmaps allocated for the journals).

There's probably a more elegant or optimized way to do this, but I haven't
thought of it yet.  I'm open to suggestions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Get rid of useless "found" variable in quota.c
Bob Peterson [Wed, 12 Dec 2007 00:51:25 +0000 (18:51 -0600)]
[GFS2] Get rid of useless "found" variable in quota.c

This just eliminates an unused variable from the quota code.
Not likely to be a time saver.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Journal extent mapping
Bob Peterson [Wed, 12 Dec 2007 00:49:21 +0000 (18:49 -0600)]
[GFS2] Journal extent mapping

This patch saves a little time when gfs2 writes to the journals by
keeping a mapping between logical and physical blocks on disk.
That's better than constantly looking up indirect pointers in
buffers, when the journals are several levels of indirection
(which they typically are).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove function gfs2_get_block
Bob Peterson [Mon, 10 Dec 2007 20:13:27 +0000 (14:13 -0600)]
[GFS2] Remove function gfs2_get_block

This patch is just a cleanup.  Function gfs2_get_block() just calls
function gfs2_block_map reversing the last two parameters.  By
reversing the parameters, gfs2_block_map() may be called directly
and function gfs2_get_block may be eliminated altogether.
Since this function is done for every block operation,
this streamlines the code and makes it a little bit more efficient.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] use pid for plock owner for nfs clients
David Teigland [Thu, 6 Dec 2007 15:35:25 +0000 (09:35 -0600)]
[GFS2] use pid for plock owner for nfs clients

The fl_owner is that of lockd when posix locks arrive from nfs
clients, so it can't be used to distinguish between lock holders.
Use fl_pid as owner instead; it's the pid of the process on the
nfs client.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove unused variable
Steven Whitehouse [Fri, 30 Nov 2007 08:17:15 +0000 (08:17 +0000)]
[GFS2] Remove unused variable

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] patch to check for recursive lock requests in gfs2_rename code path
Abhijith Das [Thu, 29 Nov 2007 20:13:54 +0000 (14:13 -0600)]
[GFS2] patch to check for recursive lock requests in gfs2_rename code path

A certain scenario in the rename code path triggers a kernel BUG()
because it accidentally does recursive locking The first lock is
requested to unlink an already existing inode (replacing a file) and the
second lock is requested when the destination directory needs to alloc
some space. It is rare that these two
events happen during the same rename call, and even more rare that these
two instances try to lock the same rgrp. It is, however, possible.
https://bugzilla.redhat.com/show_bug.cgi?id=404711

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove lock methods for lock_nolock protocol
Wendy Cheng [Thu, 29 Nov 2007 22:56:51 +0000 (17:56 -0500)]
[GFS2] Remove lock methods for lock_nolock protocol

GFS2 supports two modes of locking - lock_nolock for single node filesystem
and lock_dlm for cluster mode locking. The gfs2 lock methods are removed from
file operation table for lock_nolock protocol. This would allow VFS to handle
posix lock and flock logics just like other in-tree filesystems without
duplication.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove unrequired code
Fabio M. Di Nitto [Wed, 28 Nov 2007 15:22:09 +0000 (16:22 +0100)]
[GFS2] Remove unrequired code

Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix build warnings
Fabio Massimo Di Nitto [Tue, 27 Nov 2007 05:16:42 +0000 (06:16 +0100)]
[GFS2] Fix build warnings

Hi Steven,

Steven Whitehouse wrote:
> Hi,
>
> Now in the -nmw git tree. Thanks,
>
> Steve.
>
> On Wed, 2007-11-21 at 11:54 -0600, Ryan O'Hara wrote:

this patch introduces a bunch of build warnings by leaving around

struct inode *inode = &ip->i_inode;

The patch in attachment cleans them up. Please apply.

Signed-off-by: Fabio Massimo Di Nitto <fabbione@ubuntu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] remove unnecessary permission checks
Ryan O'Hara [Wed, 21 Nov 2007 17:54:54 +0000 (11:54 -0600)]
[GFS2] remove unnecessary permission checks

Remove read/write permission() checks from xattr operations.
VFS layer is already handling permission for xattrs via the
xattr_permission() call, so there is no need for gfs2 to
check permissions. Futhermore, using permission() for SELinux
xattrs ops is incorrect.

Signed-off-by: Ryan O'Hara <rohara@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Fix runtime issue with UP kernels
Fabio Massimo Di Nitto [Fri, 16 Nov 2007 09:50:40 +0000 (09:50 +0000)]
[GFS2] Fix runtime issue with UP kernels

The issue is indeed UP vs SMP and it is totally random.

spin_is_locked() is a bad assertion because there is no correct answer on UP.
on UP spin_is_locked() has to return either one value or another, always.

This means that in my setup I am lucky enough to trigger the issue and your you
are lucky enough not to.

the patch in attachment removes the bogus calls to BUG_ON and according to David
(in CC and thanks for the long explanation on the problem) we can rely upon
things like lockdep to find problem that might be trying to catch.

Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] tidy up error message
David Teigland [Thu, 15 Nov 2007 15:01:13 +0000 (09:01 -0600)]
[GFS2] tidy up error message

Print error with log_error() to be consistent with others.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Check for installation of mount helpers for DLM mounts
Fabio Massimo Di Nitto [Thu, 15 Nov 2007 13:48:52 +0000 (13:48 +0000)]
[GFS2] Check for installation of mount helpers for DLM mounts

The patch is a fix to abort mount if the mount.gfs* and possible
umount.* are missing from /sbin.

While we do what we can to guarantee that they are installed properly in
userland (CVS HEAD), we want to make sure that mount still aborts properly.

The only sign of missing helpers is that lock_dlm will receive no mount options
at all. According to David the problem does not exist for lock_nolock as the
helpers are not required.

The patch has been tested for both gfs and gfs2 and it works as expected. The
lack of mount.gfs* will generate an error that is propagated to mount:

oot@node1:~# mount -t  gfs2 /dev/nbd2 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/nbd2,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

[ 3513.303346] GFS2: fsid=: Trying to join cluster "lock_dlm", "gutsy:gfs2"
[ 3513.304546] DLM/GFS2/GFS ERROR: (u)mount helpers are not installed properly!
[ 3513.306290] GFS2: fsid=: can't mount proto=lock_dlm, table=gutsy:gfs2, hostdata=

You might want to notice that it will also avoid mount to hang or fail silently
or with strange errors that will require the cluster to reboot/restart before
you can actually mount the filesystem again.

Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Don't periodically update the jindex
Steven Whitehouse [Fri, 9 Nov 2007 10:07:21 +0000 (10:07 +0000)]
[GFS2] Don't periodically update the jindex

We only care about the content of the jindex in two cases,
one is when we mount the fs and the other is when we need
to recover another journal. In both cases we have to update
the jindex anyway, so there is no point in updating it
periodically between times, so this removes it to simplify
gfs2_logd.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Move gfs2_logd into log.c
Steven Whitehouse [Fri, 9 Nov 2007 10:01:41 +0000 (10:01 +0000)]
[GFS2] Move gfs2_logd into log.c

This means that we can mark gfs2_ail1_empty static and prepares
the way for further changes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Use atomic_t for journal free blocks counter
Steven Whitehouse [Thu, 8 Nov 2007 14:55:03 +0000 (14:55 +0000)]
[GFS2] Use atomic_t for journal free blocks counter

This patch changes the counter which keeps track of the free
blocks in the journal to an atomic_t in preparation for the
following patch which will update the log reservation code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Don't add glocks to the journal
Steven Whitehouse [Thu, 8 Nov 2007 14:25:12 +0000 (14:25 +0000)]
[GFS2] Don't add glocks to the journal

The only reason for adding glocks to the journal was to keep track
of which locks required a log flush prior to release. We add a
flag to the glock to allow this check to be made in a simpler way.

This reduces the size of a glock (by 12 bytes on i386, 24 on x86_64)
and means that we can avoid extra work during the journal flush.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] check kthread_should_stop when waiting
David Teigland [Wed, 7 Nov 2007 15:03:56 +0000 (09:03 -0600)]
[GFS2] check kthread_should_stop when waiting

Use wait_event_interruptible() in the lock_dlm thread instead
of an open coded equivalent, and include a kthread_should_stop()
check in the wait test so we don't miss a kthread_stop().

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Given device ID rather than s_id in "id" sysfs file
Bob Peterson [Fri, 2 Nov 2007 14:37:15 +0000 (09:37 -0500)]
[GFS2] Given device ID rather than s_id in "id" sysfs file

This patch changes the /sys/fs/gfs2/<s_id>/id file to give the device
id "major:minor" rather than the s_id.  That enables gfs2_tool to
match devices properly (by id, not name) when locating the tuning files.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove flags no longer required
Steven Whitehouse [Fri, 2 Nov 2007 09:14:31 +0000 (09:14 +0000)]
[GFS2] Remove flags no longer required

The HIF_MUTEX and HIF_PROMOTE flags were set on the glock holders
depending upon which of the two waiters lists they were going to
be queued upon. They were then tested when the holders were taken
off the lists to ensure that the right type of holder was being
dequeued.

Since we are already using separate lists, there doesn't seem a
lot of point having these flags as well, and since setting them
and testing them is in the fast path for locking and unlocking
glock, this patch removes them.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Reorder writeback for glock sync
Steven Whitehouse [Fri, 2 Nov 2007 08:39:34 +0000 (08:39 +0000)]
[GFS2] Reorder writeback for glock sync

Previously we were doing (write data, wait for data, write metadata, wait
for metadata). After this patch we so (write metadata, write data, wait for
data, wait for metadata) which should be more efficient.

Also I noticed that the drop_bh and xmote_bh functions were almost
identical. In fact the only difference was a single test, and that
test is such that in the drop_bh case, it would always evaluate to
the correct result. As such we can use the xmote_bh functions in
all the places where we were using the drop_bh function and remove
the drop_bh functions.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Add sync_page to metadata address space operations
Steven Whitehouse [Thu, 1 Nov 2007 09:34:14 +0000 (09:34 +0000)]
[GFS2] Add sync_page to metadata address space operations

This set of address space operations was missing a sync_page
operation.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove "reclaim limit"
Steven Whitehouse [Thu, 1 Nov 2007 09:26:54 +0000 (09:26 +0000)]
[GFS2] Remove "reclaim limit"

This call to reclaim glocks is not needed, and in particular we don't want it
in the fast path for locking glocks. The limit was entirely arbitrary anyway
and we can't expect users to adjust things like this, the remaining code will
do the right thing on its own.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove unused variables
Steven Whitehouse [Wed, 31 Oct 2007 14:24:33 +0000 (14:24 +0000)]
[GFS2] Remove unused variables

These haven't been used for some time, remove them.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Use correct include file in ops_address.c
Steven Whitehouse [Thu, 18 Oct 2007 10:15:50 +0000 (11:15 +0100)]
[GFS2] Use correct include file in ops_address.c

Something changed in the upstream kernel, and it needs this
one-liner to allow ops_address.c to build.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Don't hold page lock when starting transaction
Steven Whitehouse [Wed, 17 Oct 2007 13:05:41 +0000 (14:05 +0100)]
[GFS2] Don't hold page lock when starting transaction

This is an addendum to the new AOPs work which moves the point
at which we take the page lock so that we don't get it until
the last possible moment. This resolves a conflict between
starting transactions and the page lock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Add writepages for GFS2 jdata
Steven Whitehouse [Wed, 17 Oct 2007 08:04:24 +0000 (09:04 +0100)]
[GFS2] Add writepages for GFS2 jdata

This patch resolves a lock ordering issue where we had been getting
a transaction lock in the wrong order with respect to the page lock.
By using writepages rather than just writepage, it is then possible
to start a transaction before locking the page, and thus matching the
locking order elsewhere in the code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Split gfs2_writepage into three cases
Steven Whitehouse [Fri, 28 Sep 2007 12:49:05 +0000 (13:49 +0100)]
[GFS2] Split gfs2_writepage into three cases

This patch splits gfs2_writepage into separate functions for each of
the three cases: writeback, ordered and journalled. As a result
it becomes a lot easier to see what each one is doing. The common
code is moved into gfs2_writepage_common.

This fixes a performance bug where we were doing more work than
strictly required in the ordered write case.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Introduce gfs2_set_aops()
Steven Whitehouse [Wed, 17 Oct 2007 07:47:38 +0000 (08:47 +0100)]
[GFS2] Introduce gfs2_set_aops()

Just like ext3 we now have three sets of address space operations
to cover the cases of writeback, ordered and journalled data
writes. This means that the individual operations can now become
less complicated as we are able to remove some of the tests for
file data mode from the code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Add gfs2_is_writeback()
Steven Whitehouse [Wed, 17 Oct 2007 07:35:19 +0000 (08:35 +0100)]
[GFS2] Add gfs2_is_writeback()

This adds a function "gfs2_is_writeback()" along the lines of the
existing "gfs2_is_jdata()" in order to clean up the code and make
the various tests for the inode mode more obvious. It also fixes
the PageChecked() logic where we were resetting the flag too early
in the case of an error path.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove unused field in struct gfs2_inode
Steven Whitehouse [Tue, 16 Oct 2007 10:47:04 +0000 (11:47 +0100)]
[GFS2] Remove unused field in struct gfs2_inode

Removes a field that is not used.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Remove useless i_cache from inodes
Steven Whitehouse [Mon, 15 Oct 2007 15:29:05 +0000 (16:29 +0100)]
[GFS2] Remove useless i_cache from inodes

The i_cache was designed to keep references to the indirect blocks
used during block mapping so that they didn't have to be looked
up continually. The idea failed because there are too many places
where the i_cache needs to be freed, and this has in the past been
the cause of many bugs.

In addition there was no performance benefit being gained since the
disk blocks in question were cached anyway. So this patch removes
it in order to simplify the code to prepare for other changes which
would otherwise have had to add further support for this feature.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Use ->page_mkwrite() for mmap()
Steven Whitehouse [Mon, 15 Oct 2007 14:40:33 +0000 (15:40 +0100)]
[GFS2] Use ->page_mkwrite() for mmap()

This cleans up the mmap() code path for GFS2 by implementing the
page_mkwrite function for GFS2. We are thus able to use the
generic filemap_fault function for our ->fault() implementation.

This now means that shared writable mappings will be much more
efficiently shared across the cluster if there is a reasonable
proportion of read activity (the greater proportion, the better).

As a side effect, it also reduces the size of the code, removes
special cases from readpage and readpages, and makes the code
path easier to follow.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[GFS2] Clean up internal read function
Steven Whitehouse [Mon, 15 Oct 2007 13:42:35 +0000 (14:42 +0100)]
[GFS2] Clean up internal read function

As requested by Christoph, this patch cleans up GFS2's internal
read function so that it no longer uses the do_generic_mapping_read
function. This function is obsolete and GFS2 is the last user of it.

As a side effect the internal read code gets smaller and easier
to read and gfs2_readpage is split into two. One function has the locking
and the other function has the rest of the logic.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
16 years ago[GFS2] Handle multiple glock demote requests
Wendy Cheng [Fri, 5 Oct 2007 04:27:58 +0000 (00:27 -0400)]
[GFS2] Handle multiple glock demote requests

Fix a race condition where multiple glock demote requests are sent to
a node back-to-back. This patch does a check inside handle_callback()
to see whether a demote request is in progress. If true, it sets a flag
to make sure run_queue() will loop again to handle the new request,
instead of erronously setting gl_demote_state to a different state.

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
16 years ago[AVR32] extint: Set initial irq type to low level
Haavard Skinnemoen [Thu, 24 Jan 2008 15:56:53 +0000 (16:56 +0100)]
[AVR32] extint: Set initial irq type to low level

David Brownell pointed out a mismatch in the avr32 extint code:

> I noticed a small glitch that's not fixed by this patch:  the
> initial type is falling edge, but IRQ_TYPE_NONE is mapped to
> IRQ_TYPE_LEVEL_LOW.  Potentially surprising.

Fix it by setting the initial type (and handler) to low level,
matching the meaning of IRQ_TYPE_NONE.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
16 years ago[AVR32] extint: change set_irq_type() handling
David Brownell [Wed, 19 Dec 2007 04:50:28 +0000 (20:50 -0800)]
[AVR32] extint: change set_irq_type() handling

Update the AVR32 EIC code to use the new __set_irq_handler_unlocked()
call, getting rid of one more instance of this widespread problem.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>