Commit graph

992 commits

Author SHA1 Message Date
Thomas Gleixner
a5fd9733a3 tick/powerclamp: Remove tick_nohz_idle abuse
commit 4dbd27711c "tick: export nohz tick idle symbols for module
use" was merged via the thermal tree without an explicit ack from the
relevant maintainers.

The exports are abused by the intel powerclamp driver which implements
a fake idle state from a sched FIFO task. This causes all kinds of
wreckage in the NOHZ core code which rightfully assumes that
tick_nohz_idle_enter/exit() are only called from the idle task itself.

Recent changes in the NOHZ core lead to a failure of the powerclamp
driver and now people try to hack completely broken and backwards
workarounds into the NOHZ core code. This is completely unacceptable
and just papers over the real problem. There are way more subtle
issues lurking around the corner.

The real solution is to fix the powerclamp driver by rewriting it with
a sane concept, but that's beyond the scope of this.

So the only solution for now is to remove the calls into the core NOHZ
code from the powerclamp trainwreck along with the exports. 

Fixes: d6d71ee4a1 "PM: Introduce Intel PowerClamp Driver"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
Cc: LKP <lkp@01.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-12-19 14:05:52 +01:00
Linus Torvalds
d82012695e Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more 2038 timer work from Thomas Gleixner:
 "Two more patches for the ongoing 2038 work:

   - New accessors to clock MONOTONIC and REALTIME seconds

  This is a seperate branch as Arnd has follow up work depending on
  this"

* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME
  timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
2014-12-10 10:13:28 -08:00
Linus Torvalds
a157508c97 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
 "The time(r) departement provides:

   - more infrastructure work on the year 2038 issue

   - a few fixes in the Armada SoC timers

   - the usual pile of fixlets and improvements"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: armada-370-xp: Use the reference clock on A375 SoC
  watchdog: orion: Use the reference clock on Armada 375 SoC
  clocksource: armada-370-xp: Add missing clock enable
  time: Fix sign bug in NTP mult overflow warning
  time: Remove timekeeping_inject_sleeptime()
  rtc: Update suspend/resume timing to use 64bit time
  rtc/lib: Provide y2038 safe rtc_tm_to_time()/rtc_time_to_tm() replacement
  time: Fixup comments to reflect usage of timespec64
  time: Expose get_monotonic_coarse64() for in-kernel uses
  time: Expose getrawmonotonic64 for in-kernel uses
  time: Provide y2038 safe mktime() replacement
  time: Provide y2038 safe timekeeping_inject_sleeptime() replacement
  time: Provide y2038 safe do_settimeofday() replacement
  time: Complete NTP adjustment threshold judging conditions
  time: Avoid possible NTP adjustment mult overflow.
  time: Rename udelay_test.c to test_udelay.c
  clocksource: sirf: Remove hard-coded clock rate
2014-12-10 08:18:32 -08:00
Linus Torvalds
c30110608c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "These are the main changes in this cycle:

    - Streamline RCU's use of per-CPU variables, shifting from "cpu"
      arguments to functions to "this_"-style per-CPU variable
      accessors.

    - signal-handling RCU updates.

    - real-time updates.

    - torture-test updates.

    - miscellaneous fixes.

    - documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  rcu: Fix FIXME in rcu_tasks_kthread()
  rcu: More info about potential deadlocks with rcu_read_unlock()
  rcu: Optimize cond_resched_rcu_qs()
  rcu: Add sparse check for RCU_INIT_POINTER()
  documentation: memory-barriers.txt: Correct example for reorderings
  documentation: Add atomic_long_t to atomic_ops.txt
  documentation: Additional restriction for control dependencies
  documentation: Document RCU self test boot params
  rcutorture: Fix rcu_torture_cbflood() memory leak
  rcutorture: Remove obsolete kversion param in kvm.sh
  rcutorture: Remove stale test configurations
  rcutorture: Enable RCU self test in configs
  rcutorture: Add early boot self tests
  torture: Run Linux-kernel binary out of results directory
  cpu: Avoid puts_pending overflow
  rcu: Remove "cpu" argument to rcu_cleanup_after_idle()
  rcu: Remove "cpu" argument to rcu_prepare_for_idle()
  rcu: Remove "cpu" argument to rcu_needs_cpu()
  rcu: Remove "cpu" argument to rcu_note_context_switch()
  rcu: Remove "cpu" argument to rcu_preempt_check_callbacks()
  ...
2014-12-09 20:23:19 -08:00
John Stultz
cb2aa63469 time: Fix sign bug in NTP mult overflow warning
In commit 6067dc5a8c ("time: Avoid possible NTP adjustment
mult overflow") a new check was added to watch for adjustments
that could cause a mult overflow.

Unfortunately the check compares a signed with unsigned value
and ignored the case where the adjustment was negative, which
causes spurious warn-ons on some systems (and seems like it
would result in problematic time adjustments there as well, due
to the early return).

Thus this patch adds a check to make sure the adjustment is
positive before we check for an overflow, and resovles the issue
in my testing.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Debugged-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1416890145-30048-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-25 07:18:34 +01:00
John Stultz
5322e4c264 time: Fixup comments to reflect usage of timespec64
Fix up a few comments that weren't updated when the
functions were converted to use timespec64 structures.

Acked-by: Arnd Bergmann <arnd.bergmann@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:59 -08:00
John Stultz
334334b5f5 time: Expose get_monotonic_coarse64() for in-kernel uses
Adds a timespec64 based get_monotonic_coarse64() implementation
that can be used as we convert internal users of
get_monotonic_coarse away from using timespecs.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:59 -08:00
John Stultz
cdba2ec538 time: Expose getrawmonotonic64 for in-kernel uses
Adds a timespec64 based getrawmonotonic64() implementation
that can be used as we convert internal users of
getrawmonotonic away from using timespecs.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:58 -08:00
pang.xunlei
90b6ce9c40 time: Provide y2038 safe mktime() replacement
As part of addressing "y2038 problem" for in-kernel uses, this
patch adds safe mktime64() using time64_t.

After this patch, mktime() is deprecated and all its call sites
will be fixed using mktime64(), after that it can be removed.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:58 -08:00
pang.xunlei
04d9089086 time: Provide y2038 safe timekeeping_inject_sleeptime() replacement
As part of addressing "y2038 problem" for in-kernel uses, this
patch adds timekeeping_inject_sleeptime64() using timespec64.

After this patch, timekeeping_inject_sleeptime() is deprecated
and all its call sites will be fixed using the new interface,
after that it can be removed.

NOTE: timekeeping_inject_sleeptime() is safe actually, but we
want to eliminate timespec eventually, so comes this patch.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:57 -08:00
pang.xunlei
21f7eca555 time: Provide y2038 safe do_settimeofday() replacement
The kernel uses 32-bit signed value(time_t) for seconds elapsed
1970-01-01:00:00:00, thus it will overflow at 2038-01-19 03:14:08
on 32-bit systems. This is widely known as the y2038 problem.

As part of addressing "y2038 problem" for in-kernel uses, this patch
adds safe do_settimeofday64() using timespec64.

After this patch, do_settimeofday() is deprecated and all its call
sites will be fixed using do_settimeofday64(), after that it can be
removed.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:57 -08:00
pang.xunlei
659bc17b80 time: Complete NTP adjustment threshold judging conditions
The clocksource mult-adjustment threshold is [mult-maxadj, mult+maxadj],
timekeeping_adjust() only deals with the upper threshold, but misses the
lower threshold.

This patch adds the lower threshold judging condition.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Minor fix for > 80 char line]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:56 -08:00
pang.xunlei
6067dc5a8c time: Avoid possible NTP adjustment mult overflow.
Ideally, __clocksource_updatefreq_scale, selects the largest shift
value possible for a clocksource. This results in the mult memember of
struct clocksource being particularly large, although not so large
that NTP would adjust the clock to cause it to overflow.

That said, nothing actually prohibits an overflow from occuring, its
just that it "shouldn't" occur.

So while very unlikely, and so far never observed, the value of
(cs->mult+cs->maxadj) may have a chance to reach very near 0xFFFFFFFF,
so there is a possibility it may overflow when doing NTP positive
adjustment

See the following detail: When NTP slewes the clock, kernel goes
through update_wall_time()->...->timekeeping_apply_adjustment():
	tk->tkr.mult += mult_adj;

Since there is no guard against it, its possible tk->tkr.mult may
overflow during this operation.

This patch avoids any possible mult overflow by judging the overflow
case before adding mult_adj to mult, also adds the WARNING message
when capturing such case.

Signed-off-by: pang.xunlei <pang.xunlei@linaro.org>
[jstultz: Reworded commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:56 -08:00
John Stultz
fd866e2b11 time: Rename udelay_test.c to test_udelay.c
Kees requested that this test module be renamed for consistency sake,
so this patch renames the udelay_test.c file (recently added to
tip/timers/core for 3.17) to test_udelay.c

Cc: Kees Cook <keescook@chromium.org>
Cc: Greg KH <greg@kroah.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linux-Next <linux-next@vger.kernel.org>
Cc: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-11-21 11:59:55 -08:00
Ingo Molnar
d360b78f99 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Streamline RCU's use of per-CPU variables, shifting from "cpu"
   arguments to functions to "this_"-style per-CPU variable accessors.

 - Signal-handling RCU updates.

 - Real-time updates.

 - Torture-test updates.

 - Miscellaneous fixes.

 - Documentation updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-20 08:57:58 +01:00
Peter Zijlstra
23cfa361f3 sched/cputime: Fix cpu_timer_sample_group() double accounting
While looking over the cpu-timer code I found that we appear to add
the delta for the calling task twice, through:

  cpu_timer_sample_group()
    thread_group_cputimer()
      thread_group_cputime()
        times->sum_exec_runtime += task_sched_runtime();

    *sample = cputime.sum_exec_runtime + task_delta_exec();

Which would make the sample run ahead, making the sleep short.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:04:18 +01:00
Paul E. McKenney
aa6da5140b rcu: Remove "cpu" argument to rcu_needs_cpu()
The "cpu" argument to rcu_needs_cpu() is always the current CPU, so drop
it.  This in turn allows the "cpu" argument to rcu_cpu_has_callbacks()
to be removed, which allows the uses of "cpu" in both functions to be
replaced with a this_cpu_ptr().  Again, the anticipated cross-CPU uses
of these functions has been replaced by NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:43 -08:00
Paul E. McKenney
c3377c2da6 rcu: Remove "cpu" argument to rcu_check_callbacks()
The "cpu" argument was kept around on the off-chance that RCU might
offload scheduler-clock interrupts.  However, this offload approach
has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
from qualifying CPUs.  It is therefore time to remove the "cpu" argument
to rcu_check_callbacks(), which this commit does.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:11 -08:00
Heena Sirwani
dbe7aa622d timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME
ktime_get_real_seconds() is the replacement function for get_seconds()
returning the seconds portion of CLOCK_REALTIME in a time64_t. For
64bit the function is equivivalent to get_seconds(), but for 32bit it
protects the readout with the timekeeper sequence count. This is
required because 32-bit machines cannot access 64-bit tk->xtime_sec
variable atomically.

[tglx: Massaged changelog and added docbook comment ]

Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/7adcfaa8962b8ad58785d9a2456c3f77d93c0ffb.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-29 15:15:40 +01:00
Heena Sirwani
9e3680b175 timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
This is the counterpart to get_seconds() based on CLOCK_MONOTONIC. The
use case for this interface are kernel internal coarse grained
timestamps which do neither require the nanoseconds fraction of
current time nor the CLOCK_REALTIME properties. Such timestamps can
currently only retrieved by calling ktime_get_ts64() and using the
tv_sec field of the returned timespec64. That's inefficient as it
involves the read of the clocksource, math operations and must be
protected by the timekeeper sequence counter.

To avoid the sequence counter protection we restrict the return value
to unsigned 32bit on 32bit machines. This covers ~136 years of uptime
and therefor an overflow is not expected to hit anytime soon.

To avoid math in the function we calculate the current seconds portion
of CLOCK_MONOTONIC when the timekeeper gets updated in
tk_update_ktime_data() similar to the CLOCK_REALTIME counterpart
xtime_sec.

[ tglx: Massaged changelog, simplified and commented the update
  	function, added docbook comment ]

Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/da0b63f4bdf3478909f92becb35861197da3a905.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-29 15:15:40 +01:00
Thomas Gleixner
10632008b9 clockevents: Prevent shift out of bounds
Andrey reported that on a kernel with UBSan enabled he found:

     UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34

     I guess it should be 1ULL here instead of 1U:
            (!ismax || evt->mult <= (1U << evt->shift)))

That's indeed the correct solution because shift might be 32.

Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-25 10:43:15 +02:00
Mathias Krause
6891c4509c posix-timers: Fix stack info leak in timer_create()
If userland creates a timer without specifying a sigevent info, we'll
create one ourself, using a stack local variable. Particularly will we
use the timer ID as sival_int. But as sigev_value is a union containing
a pointer and an int, that assignment will only partially initialize
sigev_value on systems where the size of a pointer is bigger than the
size of an int. On such systems we'll copy the uninitialized stack bytes
from the timer_create() call to userland when the timer actually fires
and we're going to deliver the signal.

Initialize sigev_value with 0 to plug the stack info leak.

Found in the PaX patch, written by the PaX Team.

Fixes: 5a9fa73072 ("posix-timers: kill ->it_sigev_signo and...")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: PaX Team <pageexec@freemail.hu>
Cc: <stable@vger.kernel.org>	# v2.6.28+
Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-10-25 10:43:15 +02:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Linus Torvalds
1ee07ef6b5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "This patch set contains the main portion of the changes for 3.18 in
  regard to the s390 architecture.  It is a bit bigger than usual,
  mainly because of a new driver and the vector extension patches.

  The interesting bits are:
   - Quite a bit of work on the tracing front.  Uprobes is enabled and
     the ftrace code is reworked to get some of the lost performance
     back if CONFIG_FTRACE is enabled.
   - To improve boot time with CONFIG_DEBIG_PAGEALLOC, support for the
     IPTE range facility is added.
   - The rwlock code is re-factored to improve writer fairness and to be
     able to use the interlocked-access instructions.
   - The kernel part for the support of the vector extension is added.
   - The device driver to access the CD/DVD on the HMC is added, this
     will hopefully come in handy to improve the installation process.
   - Add support for control-unit initiated reconfiguration.
   - The crypto device driver is enhanced to enable the additional AP
     domains and to allow the new crypto hardware to be used.
   - Bug fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
  s390/ftrace: simplify enabling/disabling of ftrace_graph_caller
  s390/ftrace: remove 31 bit ftrace support
  s390/kdump: add support for vector extension
  s390/disassembler: add vector instructions
  s390: add support for vector extension
  s390/zcrypt: Toleration of new crypto hardware
  s390/idle: consolidate idle functions and definitions
  s390/nohz: use a per-cpu flag for arch_needs_cpu
  s390/vtime: do not reset idle data on CPU hotplug
  s390/dasd: add support for control unit initiated reconfiguration
  s390/dasd: fix infinite loop during format
  s390/mm: make use of ipte range facility
  s390/setup: correct 4-level kernel page table detection
  s390/topology: call set_sched_topology early
  s390/uprobes: architecture backend for uprobes
  s390/uprobes: common library for kprobes and uprobes
  s390/rwlock: use the interlocked-access facility 1 instructions
  s390/rwlock: improve writer fairness
  s390/rwlock: remove interrupt-enabling rwlock variant.
  s390/mm: remove change bit override support
  ...
2014-10-14 03:47:00 +02:00
Linus Torvalds
faafcba3b5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
     Hansen)

   - Various sched/idle refinements for better idle handling (Nicolas
     Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

   - sched/numa updates and optimizations (Rik van Riel)

   - sysbench speedup (Vincent Guittot)

   - capacity calculation cleanups/refactoring (Vincent Guittot)

   - Various cleanups to thread group iteration (Oleg Nesterov)

   - Double-rq-lock removal optimization and various refactorings
     (Kirill Tkhai)

   - various sched/deadline fixes

  ... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
  sched/fair: Delete resched_cpu() from idle_balance()
  sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
  sched: Improve sysbench performance by fixing spurious active migration
  sched/x86: Fix up typo in topology detection
  x86, sched: Add new topology for multi-NUMA-node CPUs
  sched/rt: Use resched_curr() in task_tick_rt()
  sched: Use rq->rd in sched_setaffinity() under RCU read lock
  sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
  sched: Use dl_bw_of() under RCU read lock
  sched/fair: Remove duplicate code from can_migrate_task()
  sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
  sched: print_rq(): Don't use tasklist_lock
  sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
  sched: Fix the task-group check in tg_has_rt_tasks()
  sched/fair: Leverage the idle state info when choosing the "idlest" cpu
  sched: Let the scheduler see CPU idle states
  sched/deadline: Fix inter- exclusive cpusets migrations
  sched/deadline: Clear dl_entity params when setscheduling to different class
  sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
  ...
2014-10-13 16:23:15 +02:00
Linus Torvalds
47137c6ba1 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Nothing really exciting this time:

   - a few fixlets in the NOHZ code

   - a new ARM SoC timer abomination.  One should expect that we have
     enough of them already, but they insist on inventing new ones.

   - the usual bunch of ARM SoC timer updates.  That feels like herding
     cats"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
  clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
  clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
  clocksource: sirf: Disable counter before re-setting it
  clocksource: cadence_ttc: Add support for 32bit mode
  clocksource: tcb_clksrc: Sanitize IRQ request
  clocksource: arm_arch_timer: Discard unavailable timers correctly
  clocksource: vf_pit_timer: Support shutdown mode
  ARM: meson6: clocksource: Add Meson6 timer support
  ARM: meson: documentation: Add timer documentation
  clocksource: sh_tmu: Document r8a7779 binding
  clocksource: sh_mtu2: Document r7s72100 binding
  clocksource: sh_cmt: Document SoC specific bindings
  timerfd: Remove an always true check
  nohz: Avoid tick's double reprogramming in highres mode
  nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
2014-10-09 06:35:05 -04:00
Linus Torvalds
afa3536be8 Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Main changes:

  - Fix the deadlock reported by Dave Jones et al
  - Clean up and fix nohz_full interaction with arch abilities
  - nohz init code consolidation/cleanup"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: nohz full depends on irq work self IPI support
  nohz: Consolidate nohz full init code
  arm64: Tell irq work about self IPI support
  arm: Tell irq work about self IPI support
  x86: Tell irq work about self IPI support
  irq_work: Force raised irq work to run on irq work interrupt
  irq_work: Introduce arch_irq_work_has_interrupt()
  nohz: Move nohz full init call to tick init
2014-10-09 06:30:57 -04:00
Martin Schwidefsky
fe0f49768d s390/nohz: use a per-cpu flag for arch_needs_cpu
Move the nohz_delay bit from the s390_idle data structure to the
per-cpu flags. Clear the nohz delay flag in __cpu_disable and
remove the cpu hotplug notifier that used to do this.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-10-09 09:14:02 +02:00
Kirill Tkhai
f139caf2e8 sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule()
schedule(), io_schedule() and schedule_timeout() always return
with TASK_RUNNING state set, so one more setting is unnecessary.

(All places in patch are visible good, only exception is
 kiblnd_scheduler() from:

      drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

 Its schedule() is one line above standard 3 lines of unified diff)

No places where set_current_state() is used for mb().

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Anil Belur <askb23@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: David Airlie <airlied@linux.ie>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry Eremin <dmitry.eremin@intel.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Isaac Huang <he.huang@intel.com>
Cc: James E.J. Bottomley <JBottomley@parallels.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Liang Zhen <liang.zhen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masaru Nomura <massa.nomura@gmail.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleg Drokin <green@linuxhacker.ru>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: devel@driverdev.osuosl.org
Cc: dm-devel@redhat.com
Cc: dri-devel@lists.freedesktop.org
Cc: fcoe-devel@open-fcoe.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-afs@lists.infradead.org
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: qla2xxx-upstream@qlogic.com
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: user-mode-linux-user@lists.sourceforge.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:17 +02:00
Frederic Weisbecker
9b01f5bf39 nohz: nohz full depends on irq work self IPI support
The nohz full functionality depends on IRQ work to trigger its own
interrupts. As it's used to restart the tick, we can't rely on the tick
fallback for irq work callbacks, ie: we can't use the tick to restart
the tick itself.

Lets reject the full dynticks initialization if that arch support isn't
available.

As a side effect, this makes sure that nohz kick is never called from
the tick. That otherwise would result in illegal hrtimer self-cancellation
and lockup.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:46:41 +02:00
Frederic Weisbecker
4327b15f64 nohz: Consolidate nohz full init code
The supports for CONFIG_NO_HZ_FULL_ALL=y and the nohz_full= kernel
parameter both have their own way to do the same thing: allocate
full dynticks cpumasks, fill them and initialize some state variables.

Lets consolidate that all in the same place.

While at it, convert some regular printk message to warnings when
fundamental allocations fail.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:46:40 +02:00
Frederic Weisbecker
76a33061b9 irq_work: Force raised irq work to run on irq work interrupt
The nohz full kick, which restarts the tick when any resource depend
on it, can't be executed anywhere given the operation it does on timers.
If it is called from the scheduler or timers code, chances are that
we run into a deadlock.

This is why we run the nohz full kick from an irq work. That way we make
sure that the kick runs on a virgin context.

However if that's the case when irq work runs in its own dedicated
self-ipi, things are different for the big bunch of archs that don't
support the self triggered way. In order to support them, irq works are
also handled by the timer interrupt as fallback.

Now when irq works run on the timer interrupt, the context isn't blank.
More precisely, they can run in the context of the hrtimer that runs the
tick. But the nohz kick cancels and restarts this hrtimer and cancelling
an hrtimer from itself isn't allowed. This is why we run in an endless
loop:

	Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
	CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
	Workqueue: btrfs-endio-write normal_work_helper [btrfs]
	 ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
	 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
	 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
	Call Trace:
	 <NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
	 [<ffffffff8a7ef928>] panic+0xd4/0x207
	 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
	 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
	 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
	 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
	 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
	 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
	 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
	 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
	 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
	 [<ffffffff8a008268>] do_nmi+0xb8/0x100
	 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
	 <<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
	 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
	 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
	 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
	 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
	 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
	 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
	 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
	 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
	 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
	 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
	 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
	 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
	 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
	 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
	 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
	 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
	 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
	 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80

To fix this we force non-lazy irq works to run on irq work self-IPIs
when available. That ability of the arch to trigger irq work self IPIs
is available with arch_irq_work_has_interrupt().

Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:38:15 +02:00
Frederic Weisbecker
a80e49e2cc nohz: Move nohz full init call to tick init
This way we unbloat a bit main.c and more importantly we initialize
nohz full after init_IRQ(). This dependency will be needed in further
patches because nohz full needs irq work to raise its own IRQ.
Information about the support for this ability on ARM64 is obtained on
init_IRQ() which initialize the pointer to __smp_call_function.

Since tick_init() is called right after init_IRQ(), this is a good place
to call tick_nohz_init() and prepare for that dependency.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-13 18:34:44 +02:00
Richard Larocque
474e941bed alarmtimer: Lock k_itimer during timer callback
Locks the k_itimer's it_lock member when handling the alarm timer's
expiry callback.

The regular posix timers defined in posix-timers.c have this lock held
during timout processing because their callbacks are routed through
posix_timer_fn().  The alarm timers follow a different path, so they
ought to grab the lock somewhere else.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:12 -07:00
Richard Larocque
265b81d23a alarmtimer: Do not signal SIGEV_NONE timers
Avoids sending a signal to alarm timers created with sigev_notify set to
SIGEV_NONE by checking for that special case in the timeout callback.

The regular posix timers avoid sending signals to SIGEV_NONE timers by
not scheduling any callbacks for them in the first place.  Although it
would be possible to do something similar for alarm timers, it's simpler
to handle this as a special case in the timeout.

Prior to this patch, the alarm timer would ignore the sigev_notify value
and try to deliver signals to the process anyway.  Even worse, the
sanity check for the value of sigev_signo is skipped when SIGEV_NONE was
specified, so the signal number could be bogus.  If sigev_signo was an
unitialized value (as it often would be if SIGEV_NONE is used), then
it's hard to predict which signal will be sent.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:12 -07:00
Richard Larocque
e86fea7649 alarmtimer: Return relative times in timer_gettime
Returns the time remaining for an alarm timer, rather than the time at
which it is scheduled to expire.  If the timer has already expired or it
is not currently scheduled, the it_value's members are set to zero.

This new behavior matches that of the other posix-timers and the POSIX
specifications.

This is a change in user-visible behavior, and may break existing
applications.  Hopefully, few users rely on the old incorrect behavior.

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
[jstultz: minor style tweak]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:11 -07:00
Andrew Hunter
d78c9300c5 jiffies: Fix timeval conversion to jiffies
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly repeatedly stopping/starting an itimer:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.)  Doing
this repeatedly would cause unbounded growth in val.  So fix the math.

Here's what was wrong with the conversion: we essentially computed
(eliding seconds)

jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies.  This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.

Tested: the following program:

int main() {
  struct itimerval zero = {{0, 0}, {0, 0}};
  /* Initially set to 10 ms. */
  struct itimerval initial = zero;
  initial.it_interval.tv_usec = 10000;
  setitimer(ITIMER_PROF, &initial, NULL);
  /* Save and restore several times. */
  for (size_t i = 0; i < 10; ++i) {
    struct itimerval prev;
    setitimer(ITIMER_PROF, &zero, &prev);
    /* on old kernels, this goes up by TICK_USEC every iteration */
    printf("previous value: %ld %ld %ld %ld\n",
           prev.it_interval.tv_sec, prev.it_interval.tv_usec,
           prev.it_value.tv_sec, prev.it_value.tv_usec);
    setitimer(ITIMER_PROF, &prev, NULL);
  }
    return 0;
}

Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Reported-by: Aaron Jacobs <jacobsa@google.com>
Signed-off-by: Andrew Hunter <ahh@google.com>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12 13:59:03 -07:00
Rik van Riel
e78c349679 time, signal: Protect resource use statistics with seqlock
Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-08 08:17:01 +02:00
Thomas Gleixner
9bf2419fa7 timekeeping: Update timekeeper before updating vsyscall and pvclock
The update_walltime() code works on the shadow timekeeper to make the
seqcount protected region as short as possible. But that update to the
shadow timekeeper does not update all timekeeper fields because it's
sufficient to do that once before it becomes life. One of these fields
is tkr.base_mono. That stays stale in the shadow timekeeper unless an
operation happens which copies the real timekeeper to the shadow.

The update function is called after the update calls to vsyscall and
pvclock. While not correct, it did not cause any problems because none
of the invoked update functions used base_mono.

commit cbcf2dd3b3 (x86: kvm: Make kvm_get_time_and_clockread()
nanoseconds based) changed that in the kvm pvclock update function, so
the stale mono_base value got used and caused kvm-clock to malfunction.

Put the update where it belongs and fix the issue.

Reported-by: Chris J Arges <chris.j.arges@canonical.com>
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-06 12:58:18 +02:00
Frederic Weisbecker
40bea03959 nohz: Restore NMI safe local irq work for local nohz kick
The local nohz kick is currently used by perf which needs it to be
NMI-safe. Recent commit though (7d1311b93e)
changed its implementation to fire the local kick using the remote kick
API. It was convenient to make the code more generic but the remote kick
isn't NMI-safe.

As a result:

	WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
	CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
	0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
	0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
	0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
	Call Trace:
	<NMI>  [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a
	[<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20
	[<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140
	[<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90
	[<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350
	[<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0
	[<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
	[<ffffffff9a187934>] perf_event_overflow+0x14/0x20
	[<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410
	[<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130
	[<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50
	[<ffffffff9a007b72>] nmi_handle+0xd2/0x390
	[<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390
	[<ffffffff9a0d131b>] ? lock_release+0xab/0x330
	[<ffffffff9a008062>] default_do_nmi+0x72/0x1c0
	[<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200
	[<ffffffff9a008268>] do_nmi+0xb8/0x100

Lets fix this by restoring the use of local irq work for the nohz local
kick.

Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-04 22:35:59 +02:00
Christoph Lameter
4a32fea9d7 scheduler: Replace __get_cpu_var with this_cpu_ptr
Convert all uses of __get_cpu_var for address calculation to use
this_cpu_ptr instead.

[Uses of __get_cpu_var with cpumask_var_t are no longer
handled by this patch]

Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:45 -04:00
Christoph Lameter
dc5df73b3a time: Convert a bunch of &__get_cpu_var introduced in the 3.16 merge period
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:44 -04:00
Christoph Lameter
22127e93c5 time: Replace __get_cpu_var uses
Convert uses of __get_cpu_var for creating a address from a percpu
offset to this_cpu_ptr.

The two cases where get_cpu_var is used to actually access a percpu
variable are changed to use this_cpu_read/raw_cpu_read.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:44 -04:00
Viresh Kumar
2a16fc93d2 nohz: Avoid tick's double reprogramming in highres mode
In highres mode, the tick reschedules itself unconditionally to the
next jiffies.

However while this clock reprogramming is relevant when the tick is
in periodic mode, it's not that interesting when we run in dynticks mode
because irq exit is likely going to overwrite the next tick to some
randomly deferred future.

So lets just get rid of this tick self rescheduling in dynticks mode.
This way we can avoid some clockevents double write in favourable
scenarios like when we stop the tick completely in idle while no other
hrtimer is pending.

Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-08-22 18:47:35 +02:00
Viresh Kumar
b5e995e671 nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
When we reach the end of the tick handler, we unconditionally reschedule
the next tick to the next jiffy. Then on irq exit, the nohz code
overrides that setting if needed and defers the next tick as far away in
the future as possible.

Now in the best dynticks case, when we actually don't need any tick in
the future (ie: expires == KTIME_MAX), low-res and high-res behave
differently. What we want in this case is to cancel the next tick
programmed by the previous one. That's what we do in high-res mode. OTOH
we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't
do anything in this case and the next tick remains scheduled to jiffies + 1.

As a result, in low-res mode, when the dynticks code determines that no
tick is needed in the future, we can recursively get a spurious tick
every jiffy because then the next tick is always reprogrammed from the
tick handler and is never cancelled. And this can happen indefinetly
until some subsystem actually needs a precise tick in the future and only
then we eventually overwrite the previous tick handler setting to defer
the next tick.

We are fixing this by introducing the ONESHOT_STOPPED mode which will
let us pause a clockevent when no further interrupt is needed. Meanwhile
we can't expect all drivers to support this new mode.

So lets reduce much of the symptoms by skipping the nohz-blind tick
rescheduling from the tick-handler when the CPU is in dynticks mode.
That tick rescheduling wrongly assumed periodicity and the low-res
dynticks code can't cancel such decision. This breaks the recursive (and
thus the worst) part of the problem. In the worst case now, we'll get
only one extra tick due to uncancelled tick scheduled before we entered
dynticks mode.

This also removes a needless clockevent write on idle ticks. Since those
clock write are usually considered to be slow, it's a general win.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-08-22 18:46:49 +02:00
John Stultz
0680eb1f48 timekeeping: Another fix to the VSYSCALL_OLD update_vsyscall
Benjamin Herrenschmidt pointed out that I further missed modifying
update_vsyscall after the wall_to_mono value was changed to a
timespec64.  This causes issues on powerpc32, which expects a 32bit
timespec.

This patch fixes the problem by properly converting from a timespec64 to
a timespec before passing the value on to the arch-specific vsyscall
logic.

[ Thomas is currently on vacation, but reviewed it and wanted me to send
  this fix on to you directly. ]

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-14 11:04:11 -06:00
Linus Torvalds
e7fda6c4c3 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and time updates from Thomas Gleixner:
 "A rather large update of timers, timekeeping & co

   - Core timekeeping code is year-2038 safe now for 32bit machines.
     Now we just need to fix all in kernel users and the gazillion of
     user space interfaces which rely on timespec/timeval :)

   - Better cache layout for the timekeeping internal data structures.

   - Proper nanosecond based interfaces for in kernel users.

   - Tree wide cleanup of code which wants nanoseconds but does hoops
     and loops to convert back and forth from timespecs.  Some of it
     definitely belongs into the ugly code museum.

   - Consolidation of the timekeeping interface zoo.

   - A fast NMI safe accessor to clock monotonic for tracing.  This is a
     long standing request to support correlated user/kernel space
     traces.  With proper NTP frequency correction it's also suitable
     for correlation of traces accross separate machines.

   - Checkpoint/restart support for timerfd.

   - A few NOHZ[_FULL] improvements in the [hr]timer code.

   - Code move from kernel to kernel/time of all time* related code.

   - New clocksource/event drivers from the ARM universe.  I'm really
     impressed that despite an architected timer in the newer chips SoC
     manufacturers insist on inventing new and differently broken SoC
     specific timers.

[ Ed. "Impressed"? I don't think that word means what you think it means ]

   - Another round of code move from arch to drivers.  Looks like most
     of the legacy mess in ARM regarding timers is sorted out except for
     a few obnoxious strongholds.

   - The usual updates and fixlets all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  timekeeping: Fixup typo in update_vsyscall_old definition
  clocksource: document some basic timekeeping concepts
  timekeeping: Use cached ntp_tick_length when accumulating error
  timekeeping: Rework frequency adjustments to work better w/ nohz
  timekeeping: Minor fixup for timespec64->timespec assignment
  ftrace: Provide trace clocks monotonic
  timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
  seqcount: Add raw_write_seqcount_latch()
  seqcount: Provide raw_read_seqcount()
  timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
  timekeeping: Create struct tk_read_base and use it in struct timekeeper
  timekeeping: Restructure the timekeeper some more
  clocksource: Get rid of cycle_last
  clocksource: Move cycle_last validation to core code
  clocksource: Make delta calculation a function
  wireless: ath9k: Get rid of timespec conversions
  drm: vmwgfx: Use nsec based interfaces
  drm: i915: Use nsec based interfaces
  timekeeping: Provide ktime_get_raw()
  hangcheck-timer: Use ktime_get_ns()
  ...
2014-08-05 17:46:42 -07:00
Linus Torvalds
53ee983378 Staging driver patches for 3.17-rc1
Here's the big pull request for the staging driver tree for 3.17-rc1.
 
 Lots of things in here, over 2000 patches, but the best part is this:
  1480 files changed, 39070 insertions(+), 254659 deletions(-)
 
 Thanks to the great work of Kristina Martšenko, 14 different staging
 drivers have been removed from the tree as they were obsolete and no one
 was willing to work on cleaning them up.  Other than the driver
 removals, loads of cleanups are in here (comedi, lustre, etc.) as well
 as the usual IIO driver updates and additions.
 
 All of this has been in the linux-next tree for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlPf1wYACgkQMUfUDdst+ykrNwCgswPkRSAPQ3C8WvLhzUYRZZ/L
 AqEAoJP0Q8Fz8unXjlSMcx7pgcqUaJ8G
 =mrTQ
 -----END PGP SIGNATURE-----

Merge tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver updates from Greg KH:
 "Here's the big pull request for the staging driver tree for 3.17-rc1.

  Lots of things in here, over 2000 patches, but the best part is this:
   1480 files changed, 39070 insertions(+), 254659 deletions(-)

  Thanks to the great work of Kristina Martšenko, 14 different staging
  drivers have been removed from the tree as they were obsolete and no
  one was willing to work on cleaning them up.  Other than the driver
  removals, loads of cleanups are in here (comedi, lustre, etc.) as well
  as the usual IIO driver updates and additions.

  All of this has been in the linux-next tree for a while"

* tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (2199 commits)
  staging: comedi: addi_apci_1564: remove diagnostic interrupt support code
  staging: comedi: addi_apci_1564: add subdevice to check diagnostic status
  staging: wlan-ng: coding style problem fix
  staging: wlan-ng: fixing coding style problems
  staging: comedi: ii_pci20kc: request and ioremap memory
  staging: lustre: bitwise vs logical typo
  staging: dgnc: Remove unneeded dgnc_trace.c and dgnc_trace.h
  staging: dgnc: rephrase comment
  staging: comedi: ni_tio: remove some dead code
  staging: rtl8723au: Fix static symbol sparse warning
  staging: rtl8723au: usb_dvobj_init(): Remove unused variable 'pdev_desc'
  staging: rtl8723au: Do not duplicate kernel provided USB macros
  staging: rtl8723au: Remove never set struct pwrctrl_priv.bHWPowerdown
  staging: rtl8723au: Remove two never set variables
  staging: rtl8723au: RSSI_test is never set
  staging:r8190: coding style: Fixed checkpatch reported Error
  staging:r8180: coding style: Fixed too long lines
  staging:r8180: coding style: Fixed commenting style
  staging: lustre: ptlrpc: lproc_ptlrpc.c - fix dereferenceing user space buffer
  staging: lustre: ldlm: ldlm_resource.c - fix dereferenceing user space buffer
  ...
2014-08-04 18:36:12 -07:00
Linus Torvalds
98959948a7 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
   from Frederic Weisbecker.

  This necessiated quite some background infrastructure rework,
  including:

   * Clean up some irq-work internals
   * Implement remote irq-work
   * Implement nohz kick on top of remote irq-work
   * Move full dynticks timer enqueue notification to new kick
   * Move multi-task notification to new kick
   * Remove unecessary barriers on multi-task notification

 - Remove proliferation of wait_on_bit() action functions and allow
   wait_on_bit_action() functions to support a timeout.  (Neil Brown)

 - Another round of sched/numa improvements, cleanups and fixes.  (Rik
   van Riel)

 - Implement fast idling of CPUs when the system is partially loaded,
   for better scalability.  (Tim Chen)

 - Restructure and fix the CPU hotplug handling code that may leave
   cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
   cpu.  (Kirill Tkhai)

 - Robustify the sched topology setup code.  (Peterz Zijlstra)

 - Improve sched_feat() handling wrt.  static_keys (Jason Baron)

 - Misc fixes.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  sched/fair: Fix 'make xmldocs' warning caused by missing description
  sched: Use macro for magic number of -1 for setparam
  sched: Robustify topology setup
  sched: Fix sched_setparam() policy == -1 logic
  sched: Allow wait_on_bit_action() functions to support a timeout
  sched: Remove proliferation of wait_on_bit() action functions
  sched/numa: Revert "Use effective_load() to balance NUMA loads"
  sched: Fix static_key race with sched_feat()
  sched: Remove extra static_key*() function indirection
  sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
  sched: Transform resched_task() into resched_curr()
  sched/deadline: Kill task_struct->pi_top_task
  sched: Rework check_for_tasks()
  sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
  sched/fair: Disable runtime_enabled on dying rq
  sched/numa: Change scan period code to match intent
  sched/numa: Rework best node setting in task_numa_migrate()
  sched/numa: Examine a task move when examining a task swap
  sched/numa: Simplify task_numa_compare()
  sched/numa: Use effective_load() to balance NUMA loads
  ...
2014-08-04 16:23:30 -07:00
Linus Torvalds
5bda4f638f Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molar:
 "The main changes:

   - torture-test updates
   - callback-offloading changes
   - maintainership changes
   - update RCU documentation
   - miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
  rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
  rcu: Fix a sparse warning in rcu_initiate_boost()
  rcu: Fix __rcu_reclaim() to use true/false for bool
  rcu: Remove CONFIG_PROVE_RCU_DELAY
  rcu: Use __this_cpu_read() instead of per_cpu_ptr()
  rcu: Don't use NMIs to dump other CPUs' stacks
  rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
  rcu: Simplify priority boosting by putting rt_mutex in rcu_node
  rcu: Check both root and current rcu_node when setting up future grace period
  rcu: Allow post-unlock reference for rt_mutex
  rcu: Loosen __call_rcu()'s rcu_head alignment constraint
  rcu: Eliminate read-modify-write ACCESS_ONCE() calls
  rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
  rcu: Make rcu node arrays static const char * const
  signal: Explain local_irq_save() call
  rcu: Handle obsolete references to TINY_PREEMPT_RCU
  rcu: Document deadlock-avoidance information for rcu_read_unlock()
  scripts: Teach get_maintainer.pl about the new "R:" tag
  rcu: Update rcu torture maintainership filename patterns
  ...
2014-08-04 15:55:08 -07:00