Commit graph

156 commits

Author SHA1 Message Date
Anna-Maria Behnsen
1f32cab0db timers: Use only bucket expiry for base->next_expiry value
The bucket expiry time is the effective expriy time of timers and is
greater than or equal to the requested timer expiry time. This is due
to the guarantee that timers never expire early and the reduced expiry
granularity in the secondary wheel levels.

When a timer is enqueued, trigger_dyntick_cpu() checks whether the
timer is the new first timer. This check compares next_expiry with
the requested timer expiry value and not with the effective expiry
value of the bucket into which the timer was queued.

Storing the requested timer expiry value in base->next_expiry can lead
to base->clk going backwards if the requested timer expiry value is
smaller than base->clk. Commit 30c66fc30e ("timer: Prevent base->clk
from moving backward") worked around this by preventing the store when
timer->expiry is before base->clk, but did not fix the underlying
problem.

Use the expiry value of the bucket into which the timer is queued to
do the new first timer check. This fixes the base->clk going backward
problem.

The workaround of commit 30c66fc30e ("timer: Prevent base->clk from
moving backward") in trigger_dyntick_cpu() is not longer necessary as the
timers bucket expiry is guaranteed to be greater than or equal base->clk.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200717140551.29076-4-frederic@kernel.org
2020-07-17 21:55:21 +02:00
Frederic Weisbecker
3d2e83a2a6 timers: Preserve higher bits of expiration on index calculation
The higher bits of the timer expiration are cropped while calling
calc_index() due to the implicit cast from unsigned long to unsigned int.

This loss shouldn't have consequences on the current code since all the
computation to calculate the index is done on the lower 32 bits.

However to prepare for returning the actual bucket expiration from
calc_index() in order to properly fix base->next_expiry updates, the higher
bits need to be preserved.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200717140551.29076-3-frederic@kernel.org
2020-07-17 21:55:21 +02:00
Frederic Weisbecker
e2a71bdea8 timer: Fix wheel index calculation on last level
When an expiration delta falls into the last level of the wheel, that delta
has be compared against the maximum possible delay and reduced to fit in if
necessary.

However instead of comparing the delta against the maximum, the code
compares the actual expiry against the maximum. Then instead of fixing the
delta to fit in, it sets the maximum delta as the expiry value.

This can result in various undesired outcomes, the worst possible one
being a timer expiring 15 days ahead to fire immediately.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200717140551.29076-2-frederic@kernel.org
2020-07-17 21:44:05 +02:00
Frederic Weisbecker
30c66fc30e timer: Prevent base->clk from moving backward
When a timer is enqueued with a negative delta (ie: expiry is below
base->clk), it gets added to the wheel as expiring now (base->clk).

Yet the value that gets stored in base->next_expiry, while calling
trigger_dyntick_cpu(), is the initial timer->expires value. The
resulting state becomes:

	base->next_expiry < base->clk

On the next timer enqueue, forward_timer_base() may accidentally
rewind base->clk. As a possible outcome, timers may expire way too
early, the worst case being that the highest wheel levels get spuriously
processed again.

To prevent from that, make sure that base->next_expiry doesn't get below
base->clk.

Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200703010657.2302-1-frederic@kernel.org
2020-07-09 11:56:57 +02:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Linus Torvalds
dbb381b619 timekeeping and timer updates:
Core:
 
   - Consolidation of the vDSO build infrastructure to address the
     difficulties of cross-builds for ARM64 compat vDSO libraries by
     restricting the exposure of header content to the vDSO build.
 
     This is achieved by splitting out header content into separate
     headers. which contain only the minimaly required information which is
     necessary to build the vDSO. These new headers are included from the
     kernel headers and the vDSO specific files.
 
   - Enhancements to the generic vDSO library allowing more fine grained
     control over the compiled in code, further reducing architecture
     specific storage and preparing for adopting the generic library by PPC.
 
   - Cleanup and consolidation of the exit related code in posix CPU timers.
 
   - Small cleanups and enhancements here and there
 
  Drivers:
 
   - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
 
   - Correct the clock rate of PIT64b global clock
 
   - setup_irq() cleanup
 
   - Preparation for PWM and suspend support for the TI DM timer
 
   - Expand the fttmr010 driver to support ast2600 systems
 
   - The usual small fixes, enhancements and cleanups all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B+QETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofJ5D/94s5fpaqiuNcaAsLq2D3DRIrTnqxx7
 yEeAOPcbYV1bM1SgY/M83L5yGc2S8ny787e26abwRTCZhZV3eAmRTphIFFIZR0Xk
 xS+i67odscbdJTRtztKj3uQ9rFxefszRuphyaa89pwSY9nnyMWLcahGSQOGs0LJK
 hvmgwPjyM1drNfPxgPiaFg7vDr2XxNATpQr/FBt+BhelvVan8TlAfrkcNPiLr++Y
 Axz925FP7jMaRRbZ1acji34gLiIAZk0jLCUdbix7YkPrqDB4GfO+v8Vez+fGClbJ
 uDOYeR4r1+Be/BtSJtJ2tHqtsKCcAL6agtaE2+epZq5HbzaZFRvBFaxgFNF8WVcn
 3FFibdEMdsRNfZTUVp5wwgOLN0UIqE/7LifE12oLEL2oFB5H2PiNEUw3E02XHO11
 rL3zgHhB6Ke1sXKPCjSGdmIQLbxZmV5kOlQFy7XuSeo5fmRapVzKNffnKcftIliF
 1HNtZbgdA+3tdxMFCqoo1QX+kotl9kgpslmdZ0qHAbaRb3xqLoSskbqEjFRMuSCC
 8bjJrwboD9T5GPfwodSCgqs/58CaSDuqPFbIjCay+p90Fcg6wWAkZtyG04ZLdPRc
 GgNNdN4gjTD9bnrRi8cH47z1g8OO4vt4K4SEbmjo8IlDW+9jYMxuwgR88CMeDXd7
 hu7aKsr2I2q/WQ==
 =5o9G
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timekeeping and timer updates from Thomas Gleixner:
 "Core:

   - Consolidation of the vDSO build infrastructure to address the
     difficulties of cross-builds for ARM64 compat vDSO libraries by
     restricting the exposure of header content to the vDSO build.

     This is achieved by splitting out header content into separate
     headers. which contain only the minimaly required information which
     is necessary to build the vDSO. These new headers are included from
     the kernel headers and the vDSO specific files.

   - Enhancements to the generic vDSO library allowing more fine grained
     control over the compiled in code, further reducing architecture
     specific storage and preparing for adopting the generic library by
     PPC.

   - Cleanup and consolidation of the exit related code in posix CPU
     timers.

   - Small cleanups and enhancements here and there

  Drivers:

   - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support

   - Correct the clock rate of PIT64b global clock

   - setup_irq() cleanup

   - Preparation for PWM and suspend support for the TI DM timer

   - Expand the fttmr010 driver to support ast2600 systems

   - The usual small fixes, enhancements and cleanups all over the
     place"

* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
  Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
  vdso: Fix clocksource.h macro detection
  um: Fix header inclusion
  arm64: vdso32: Enable Clang Compilation
  lib/vdso: Enable common headers
  arm: vdso: Enable arm to use common headers
  x86/vdso: Enable x86 to use common headers
  mips: vdso: Enable mips to use common headers
  arm64: vdso32: Include common headers in the vdso library
  arm64: vdso: Include common headers in the vdso library
  arm64: Introduce asm/vdso/processor.h
  arm64: vdso32: Code clean up
  linux/elfnote.h: Replace elf.h with UAPI equivalent
  scripts: Fix the inclusion order in modpost
  common: Introduce processor.h
  linux/ktime.h: Extract common header for vDSO
  linux/jiffies.h: Extract common header for vDSO
  linux/time64.h: Extract common header for vDSO
  linux/time32.h: Extract common header for vDSO
  linux/time.h: Extract common header for vDSO
  ...
2020-03-30 18:51:47 -07:00
Eric Dumazet
90c018942c timer: Use hlist_unhashed_lockless() in timer_pending()
The timer_pending() function is mostly used in lockless contexts, so
Without proper annotations, KCSAN might detect a data-race [1].

Using hlist_unhashed_lockless() instead of hand-coding it seems
appropriate (as suggested by Paul E. McKenney).

[1]

BUG: KCSAN: data-race in del_timer / detach_if_pending

write to 0xffff88808697d870 of 8 bytes by task 10 on cpu 0:
 __hlist_del include/linux/list.h:764 [inline]
 detach_timer kernel/time/timer.c:815 [inline]
 detach_if_pending+0xcd/0x2d0 kernel/time/timer.c:832
 try_to_del_timer_sync+0x60/0xb0 kernel/time/timer.c:1226
 del_timer_sync+0x6b/0xa0 kernel/time/timer.c:1365
 schedule_timeout+0x2d2/0x6e0 kernel/time/timer.c:1896
 rcu_gp_fqs_loop+0x37c/0x580 kernel/rcu/tree.c:1639
 rcu_gp_kthread+0x143/0x230 kernel/rcu/tree.c:1799
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffff88808697d870 of 8 bytes by task 12060 on cpu 1:
 del_timer+0x3b/0xb0 kernel/time/timer.c:1198
 sk_stop_timer+0x25/0x60 net/core/sock.c:2845
 inet_csk_clear_xmit_timers+0x69/0xa0 net/ipv4/inet_connection_sock.c:523
 tcp_clear_xmit_timers include/net/tcp.h:606 [inline]
 tcp_v4_destroy_sock+0xa3/0x3f0 net/ipv4/tcp_ipv4.c:2096
 inet_csk_destroy_sock+0xf4/0x250 net/ipv4/inet_connection_sock.c:836
 tcp_close+0x6f3/0x970 net/ipv4/tcp.c:2497
 inet_release+0x86/0x100 net/ipv4/af_inet.c:427
 __sock_release+0x85/0x160 net/socket.c:590
 sock_close+0x24/0x30 net/socket.c:1268
 __fput+0x1e1/0x520 fs/file_table.c:280
 ____fput+0x1f/0x30 fs/file_table.c:313
 task_work_run+0xf6/0x130 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_usermode_loop+0x2b4/0x2c0 arch/x86/entry/common.c:163

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 12060 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine,

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ paulmck: Pulled in Eric's later amendments. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Alexander Popov
6e317c32fd timer: Improve the comment describing schedule_timeout()
When working commit 6dcd5d7a7a, a mistake was noticed by Linus:
schedule_timeout() was called without setting the task state to anything
particular.

It calls the scheduler, but doesn't delay anything, because the task stays
runnable. That happens because sched_submit_work() does nothing for tasks
in TASK_RUNNING state.

That turned out to be the intended behavior. Adding a WARN() is not useful
as the task could be woken up right after setting the state and before
reaching schedule_timeout().

Improve the comment about schedule_timeout() and describe that more
explicitly.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200117225900.16340-1-alex.popov@linux.com
2020-02-17 20:12:19 +01:00
Li RongQing
e430d802d6 timer: Read jiffies once when forwarding base clk
The timer delayed for more than 3 seconds warning was triggered during
testing.

  Workqueue: events_unbound sched_tick_remote
  RIP: 0010:sched_tick_remote+0xee/0x100
  ...
  Call Trace:
   process_one_work+0x18c/0x3a0
   worker_thread+0x30/0x380
   kthread+0x113/0x130
   ret_from_fork+0x22/0x40

The reason is that the code in collect_expired_timers() uses jiffies
unprotected:

    if (next_event > jiffies)
        base->clk = jiffies;

As the compiler is allowed to reload the value base->clk can advance
between the check and the store and in the worst case advance farther than
next event. That causes the timer expiry to be delayed until the wheel
pointer wraps around.

Convert the code to use READ_ONCE()

Fixes: 236968383c ("timers: Optimize collect_expired_timers() for NOHZ")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1568894687-14499-1-git-send-email-lirongqing@baidu.com
2019-09-19 17:50:11 +02:00
Thomas Gleixner
dce3e8fd03 posix-cpu-timers: Remove tsk argument from run_posix_cpu_timers()
It's always current. Don't give people wrong ideas.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.945469967@linutronix.de
2019-08-21 20:27:16 +02:00
Anna-Maria Gleixner
030dcdd197 timers: Prepare support for PREEMPT_RT
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted.  If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling del_timer_sync() can lead to two issues:

  - If the caller is on a remote CPU then it has to spin wait for the timer
    handler to complete. This can result in unbound priority inversion.

  - If the caller originates from the task which preempted the timer
    handler on the same CPU, then spin waiting for the timer handler to
    complete is never going to end.

To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If del_timer_sync() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.

This addresses both the priority inversion and the life lock issues.

This mechanism is not used for timers which are marked IRQSAFE as for those
preemption is disabled accross the callback and therefore this situation
cannot happen. The callbacks for such timers need to be individually
audited for RT compliance.

The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
del_timer_sync() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.

As the softirq thread can be preempted with PREEMPT_RT=y, the SMP variant
of del_timer_sync() needs to be used on UP as well.

[ tglx: Refactored it for mainline ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.832418500@linutronix.de
2019-08-01 20:51:22 +02:00
Linus Torvalds
0968621917 Printk changes for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
 lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
 EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
 r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
 FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
 uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
 lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
 waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
 wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
 =WZfC
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow state reset of printk_once() calls.

 - Prevent crashes when dereferencing invalid pointers in vsprintf().
   Only the first byte is checked for simplicity.

 - Make vsprintf warnings consistent and inlined.

 - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
   modifiers.

 - Some clean up of vsprintf and test_printf code.

* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Make function pointer_string static
  vsprintf: Limit the length of inlined error messages
  vsprintf: Avoid confusion between invalid address and value
  vsprintf: Prevent crash when dereferencing invalid pointers
  vsprintf: Consolidate handling of unknown pointer specifiers
  vsprintf: Factor out %pO handler as kobject_string()
  vsprintf: Factor out %pV handler as va_format()
  vsprintf: Factor out %p[iI] handler as ip_addr_string()
  vsprintf: Do not check address of well-known strings
  vsprintf: Consistent %pK handling for kptr_restrict == 0
  vsprintf: Shuffle restricted_pointer()
  printk: Tie printk_once / printk_deferred_once into .data.once for reset
  treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
  lib/test_printf: Switch to bitmap_zalloc()
2019-05-07 09:18:12 -07:00
Sakari Ailus
d75f773c86 treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-09 14:19:06 +02:00
Anna-Maria Gleixner
f28d3d5346 timer/trace: Improve timer tracing
Timers are added to the timer wheel off by one. This is required in
case a timer is queued directly before incrementing jiffies to prevent
early timer expiry.

When reading a timer trace and relying only on the expiry time of the timer
in the timer_start trace point and on the now in the timer_expiry_entry
trace point, it seems that the timer fires late. With the current
timer_expiry_entry trace point information only now=jiffies is printed but
not the value of base->clk. This makes it impossible to draw a conclusion
to the index of base->clk and makes it impossible to examine timer problems
without additional trace points.

Therefore add the base->clk value to the timer_expire_entry trace
point, to be able to calculate the index the timer base is located at
during collecting expired timers.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-5-anna-maria@linutronix.de
2019-03-24 20:29:33 +01:00
Anna-Maria Gleixner
dc1e7dc5ac timer: Move trace point to get proper index
When placing the timer_start trace point before the timer wheel bucket
index is calculated, the index information in the trace point is useless.

It is not possible to simply move the debug_activate() call after the index
calculation, because debug_object_activate() needs to be called before
touching the object.

Therefore split debug_activate() and move the trace point into
enqueue_timer() after the new index has been calculated. The
debug_object_activate() call remains at the original place.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20190321120921.16463-3-anna-maria@linutronix.de
2019-03-24 20:29:32 +01:00
Linus Torvalds
3717f613f4 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main RCU related changes in this cycle were:

   - Additional cleanups after RCU flavor consolidation

   - Grace-period forward-progress cleanups and improvements

   - Documentation updates

   - Miscellaneous fixes

   - spin_is_locked() conversions to lockdep

   - SPDX changes to RCU source and header files

   - SRCU updates

   - Torture-test updates, including nolibc updates and moving nolibc to
     tools/include"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  locking/locktorture: Convert to SPDX license identifier
  linux/torture: Convert to SPDX license identifier
  torture: Convert to SPDX license identifier
  linux/srcu: Convert to SPDX license identifier
  linux/rcutree: Convert to SPDX license identifier
  linux/rcutiny: Convert to SPDX license identifier
  linux/rcu_sync: Convert to SPDX license identifier
  linux/rcu_segcblist: Convert to SPDX license identifier
  linux/rcupdate: Convert to SPDX license identifier
  linux/rcu_node_tree: Convert to SPDX license identifier
  rcu/update: Convert to SPDX license identifier
  rcu/tree: Convert to SPDX license identifier
  rcu/tiny: Convert to SPDX license identifier
  rcu/sync: Convert to SPDX license identifier
  rcu/srcu: Convert to SPDX license identifier
  rcu/rcutorture: Convert to SPDX license identifier
  rcu/rcu_segcblist: Convert to SPDX license identifier
  rcu/rcuperf: Convert to SPDX license identifier
  rcu/rcu.h: Convert to SPDX license identifier
  RCU/torture.txt: Remove section MODULE PARAMETERS
  ...
2019-03-05 14:49:11 -08:00
Gustavo A. R. Silva
75b710af71 timers: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where fall through is indeed expected.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190123081413.GA3949@embeddedor
2019-01-29 20:08:42 +01:00
Paul E. McKenney
c98cac603f rcu: Rename rcu_check_callbacks() to rcu_sched_clock_irq()
The name rcu_check_callbacks() arguably made sense back in the early
2000s when RCU was quite a bit simpler than it is today, but it has
become quite misleading, especially with the advent of dyntick-idle
and NO_HZ_FULL.  The rcu_check_callbacks() function is RCU's hook into
the scheduling-clock interrupt, and is now but one of many ways that
callbacks get promoted to invocable state.

This commit therefore changes the name to rcu_sched_clock_irq(),
which is the same number of characters and clearly indicates this
function's relation to the rest of the Linux kernel.  In addition, for
the sake of consistency, rcu_flavor_check_callbacks() is also renamed
to rcu_flavor_sched_clock_irq().

While in the area, the header comments for both functions are reworked.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-01-25 15:35:21 -08:00
Thomas Gleixner
35728b8209 time: Add SPDX license identifiers
Update the time(r) core files files with the correct SPDX license
identifier based on the license text in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of the
full boiler plate text.

This work is based on a script and data from Philippe Ombredanne, Kate
Stewart and myself. The data has been created with two independent license
scanners and manual inspection.

The following files do not contain any direct license information and have
been omitted from the big initial SPDX changes:

  timeconst.bc: The .bc files were not touched
  time.c, timer.c, timekeeping.c: Licence was deduced from EXPORT_SYMBOL_GPL

As those files do not contain direct license references they fall under the
project license, i.e. GPL V2 only.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Corey Minyard <cminyard@mvista.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: David Riley <davidriley@chromium.org>
Cc: Colin Cross <ccross@android.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20181031182252.879109557@linutronix.de
2018-11-23 11:51:20 +01:00
Thomas Gleixner
58c5fc2b96 time: Remove useless filenames in top level comments
Remove the pointless filenames in the top level comments. They have no
value at all and just occupy space. While at it tidy up some of the
comments and remove a stale one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Corey Minyard <cminyard@mvista.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Riley <davidriley@chromium.org>
Cc: Colin Cross <ccross@android.com>
Cc: Mark Brown <broonie@kernel.org>
Link: https://lkml.kernel.org/r/20181031182252.794898238@linutronix.de
2018-11-23 11:51:20 +01:00
Gaurav Kohli
363e934d88 timers: Clear timer_base::must_forward_clk with timer_base::lock held
timer_base::must_forward_clock is indicating that the base clock might be
stale due to a long idle sleep.

The forwarding of the base clock takes place in the timer softirq or when a
timer is enqueued to a base which is idle. If the enqueue of timer to an
idle base happens from a remote CPU, then the following race can happen:

  CPU0					CPU1
  run_timer_softirq			mod_timer

					base = lock_timer_base(timer);
  base->must_forward_clk = false
					if (base->must_forward_clk)
				       	    forward(base); -> skipped

					enqueue_timer(base, timer, idx);
					-> idx is calculated high due to
					   stale base
					unlock_timer_base(timer);
  base = lock_timer_base(timer);
  forward(base);

The root cause is that timer_base::must_forward_clk is cleared outside the
timer_base::lock held region, so the remote queuing CPU observes it as
cleared, but the base clock is still stale. This can cause large
granularity values for timers, i.e. the accuracy of the expiry time
suffers.

Prevent this by clearing the flag with timer_base::lock held, so that the
forwarding takes place before the cleared flag is observable by a remote
CPU.

Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: linux-arm-msm@vger.kernel.org
Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org
2018-08-02 12:52:38 +02:00
Yi Wang
3058758925 timer: Fix coding style
The call to wake_up_nohz_cpu() is incorrectly indented. Remove the surplus TAB.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: zhong.weidong@zte.com.cn
CC: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: https://lkml.kernel.org/r/1531721337-30284-1-git-send-email-wang.yi59@zte.com.cn
2018-07-19 16:52:40 +02:00
Mauro Carvalho Chehab
bf9c96bec7 timers: Adjust a kernel-doc comment
Those three warnings can easily solved by using :: to indicate a
code block:

	./kernel/time/timer.c:1259: WARNING: Unexpected indentation.
	./kernel/time/timer.c:1261: WARNING: Unexpected indentation.
	./kernel/time/timer.c:1262: WARNING: Block quote ends without a blank line; unexpected unindent.

While here, align the lines at the block.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/f02e6a0ce27f3b5e33415d92d07a40598904b3ee.1525684985.git.mchehab%2Bsamsung@kernel.org
2018-05-13 15:55:43 +02:00
Lingutla Chandrasekhar
c52232a49e timers: Forward timer base before migrating timers
On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
live CPU. This happens from the control thread which initiated the unplug.

If the CPU on which the control thread runs came out from a longer idle
period then the base clock of that CPU might be stale because the control
thread runs prior to any event which forwards the clock.

In such a case the timers from the unplugged CPU are queued on the live CPU
based on the stale clock which can cause large delays due to increased
granularity of the outer timer wheels which are far away from base:;clock.

But there is a worse problem than that. The following sequence of events
illustrates it:

 - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.

   The timer is queued at wheel level 2, with resulting expiry time = 60032
   (due to level granularity).

 - CPU1 enters idle @60007, with next timer expiry @60020.

 - CPU0 is hotplugged at @60009

 - CPU1 exits idle and runs the control thread which migrates the
   timers from CPU0

   timer1 is now queued in level 0 for immediate handling in the next
   softirq because the requested expiry time 59969 is before CPU1 base->clk
   60007

 - CPU1 runs code which forwards the base clock which succeeds because the
   next expiring timer. which was collected at idle entry time is still set
   to 60020.

   So it forwards beyond 60007 and therefore misses to expire the migrated
   timer1. That timer gets expired when the wheel wraps around again, which
   takes between 63 and 630ms depending on the HZ setting.

Address both problems by invoking forward_timer_base() for the control CPUs
timer base. All other places, which might run into a similar problem
(mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
that.

[ tglx: Massaged comment and changelog ]

Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: linux-arm-msm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
2018-02-28 23:34:33 +01:00
Anna-Maria Gleixner
14c803419d hrtimer: Unify remote enqueue handling
hrtimer_reprogram() is conditionally invoked from hrtimer_start_range_ns()
when hrtimer_cpu_base.hres_active is true.

In the !hres_active case there is a special condition for the nohz_active
case:

  If the newly enqueued timer expires before the first expiring timer on a
  remote CPU then the remote CPU needs to be notified and woken up from a
  NOHZ idle sleep to take the new first expiring timer into account.

Previous changes have already established the prerequisites to make the
remote enqueue behaviour the same whether high resolution mode is active or
not:

  If the to be enqueued timer expires before the first expiring timer on a
  remote CPU, then it cannot be enqueued there.

This was done for the high resolution mode because there is no way to
access the remote CPU timer hardware. The same is true for NOHZ, but was
handled differently by unconditionally enqueuing the timer and waking up
the remote CPU so it can reprogram its timer. Again there is no compelling
reason for this difference.

hrtimer_check_target(), which makes the 'can remote enqueue' decision is
already unconditional, but not yet functional because nothing updates
hrtimer_cpu_base.expires_next in the !hres_active case.

To unify this the following changes are required:

 1) Make the store of the new first expiry time unconditonal in
    hrtimer_reprogram() and check __hrtimer_hres_active() before proceeding
    to the actual hardware access. This check also lets the compiler
    eliminate the rest of the function in case of CONFIG_HIGH_RES_TIMERS=n.

 2) Invoke hrtimer_reprogram() unconditionally from
    hrtimer_start_range_ns()

 3) Remove the remote wakeup special case for the !high_res && nohz_active
    case.

Confine the timers_nohz_active static key to timer.c which is the only user
now.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-21-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:58 +01:00
Thomas Gleixner
ae67badaa1 hrtimer: Optimize the hrtimer code by using static keys for migration_enable/nohz_active
The hrtimer_cpu_base::migration_enable and ::nohz_active fields
were originally introduced to avoid accessing global variables
for these decisions.

Still that results in a (cache hot) load and conditional branch,
which can be avoided by using static keys.

Implement it with static keys and optimize for the most critical
case of high performance networking which tends to disable the
timer migration functionality.

No change in functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1801142327490.2371@nanos
Link: https://lkml.kernel.org/r/20171221104205.7269-2-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Thomas Gleixner
ed4bbf7910 timers: Unconditionally check deferrable base
When the timer base is checked for expired timers then the deferrable base
must be checked as well. This was missed when making the deferrable base
independent of base::nohz_active.

Fixes: ced6d5c11d ("timers: Use deferrable base independent of base::nohz_active")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
2018-01-14 23:25:33 +01:00
Thomas Gleixner
fd45bb77ad timers: Invoke timer_start_debug() where it makes sense
The timer start debug function is called before the proper timer base is
set. As a consequence the trace data contains the stale CPU and flags
values.

Call the debug function after setting the new base and flags.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: https://lkml.kernel.org/r/20171222145337.792907137@linutronix.de
2017-12-29 23:13:10 +01:00
Thomas Gleixner
26456f87ac timers: Reinitialize per cpu bases on hotplug
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves
them with a potentially stale clk and next_expiry valuem, which can cause
trouble then the CPU is plugged.

Add a prepare callback which forwards the clock, sets next_expiry to far in
the future and reset the control flags to a known state.

Set base->must_forward_clk so the first timer which is queued will try to
forward the clock to current jiffies.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
2017-12-29 23:13:09 +01:00
Anna-Maria Gleixner
ced6d5c11d timers: Use deferrable base independent of base::nohz_active
During boot and before base::nohz_active is set in the timer bases, deferrable
timers are enqueued into the standard timer base. This works correctly as
long as base::nohz_active is false.

Once it base::nohz_active is set and a timer which was enqueued before that
is accessed the lock selector code choses the lock of the deferred
base. This causes unlocked access to the standard base and in case the
timer is removed it does not clear the pending flag in the standard base
bitmap which causes get_next_timer_interrupt() to return bogus values.

To prevent that, the deferrable timers must be enqueued in the deferrable
base, even when base::nohz_active is not set. Those deferrable timers also
need to be expired unconditional.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20171222145337.633328378@linutronix.de
2017-12-29 23:13:09 +01:00
Kees Cook
188665b2d6 timer: Pass function down to initialization routines
In preparation for removing more macros, pass the function down to the
initialization routines instead of doing it in macros.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:14 -08:00
Kees Cook
354b46b1a0 timer: Switch callback prototype to take struct timer_list * argument
Since all callbacks have been converted, we can switch the core
prototype to "struct timer_list *" now too.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:13 -08:00
Kees Cook
c1eba5bcb6 timer: Pass timer_list pointer to callbacks unconditionally
Now that all timer callbacks are already taking their struct timer_list
pointer as the callback argument, just do this unconditionally and remove
the .data field.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:12 -08:00
David Howells
b24591e2fc timers: Add a function to start/reduce a timer
Add a function, similar to mod_timer(), that will start a timer if it isn't
running and will modify it if it is running and has an expiry time longer
than the new time.  If the timer is running with an expiry time that's the
same or sooner, no change is made.

The function looks like:

	int timer_reduce(struct timer_list *timer, unsigned long expires);

This can be used by code such as networking code to make it easier to share
a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
the rxrpc_call struct will maintain a number of timeouts:

	unsigned long	ack_at;
	unsigned long	resend_at;
	unsigned long	ping_at;
	unsigned long	expect_rx_by;
	unsigned long	expect_req_by;
	unsigned long	expect_term_by;

each of which is set independently of the others.  With timer reduction
available, when the code needs to set one of the timeouts, it only needs to
look at that timeout and then call timer_reduce() to modify the timer,
starting it or bringing it forward if necessary.  There is no need to refer
to the other timeouts to see which is earliest and no need to take any lock
other than, potentially, the timer lock inside timer_reduce().

Note, that this does not protect against concurrent invocations of any of
the timer functions.

As an example, the expect_rx_by timeout above, which terminates a call if
we don't get a packet from the server within a certain time window, would
be set something like this:

	unsigned long now = jiffies;
	unsigned long expect_rx_by = now + packet_receive_timeout;
	WRITE_ONCE(call->expect_rx_by, expect_rx_by);
	timer_reduce(&call->timer, expect_rx_by);

The timer service code (which might, say, be in a work function) would then
check all the timeouts to see which, if any, had triggered, deal with
those:

	t = READ_ONCE(call->ack_at);
	if (time_after_eq(now, t)) {
		cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
		set_bit(RXRPC_CALL_EV_ACK, &call->events);
	}

and then restart the timer if necessary by finding the soonest timeout that
hasn't yet passed and then calling timer_reduce().

The disadvantage of doing things this way rather than comparing the timers
each time and calling mod_timer() is that you *will* take timer events
unless you can finish what you're doing and delete the timer in time.

The advantage of doing things this way is that you don't need to use a lock
to work out when the next timer should be set, other than the timer's own
lock - which you might not have to take.

[ tglx: Fixed weird formatting and adopted it to pending changes ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keyrings@vger.kernel.org
Cc: linux-afs@lists.infradead.org
Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk
2017-11-12 15:10:27 +01:00
Thomas Gleixner
ba16490eac timer: Convert stub timer to timer_setup()
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch to using the new timer_setup() and
from_timer() to pass the timer pointer explicitly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
2017-10-18 17:04:25 +02:00
Zhenzhong Duan
c310ce4dcb timers: Avoid an unnecessary iteration in __run_timers()
If the base clock is behind jiffies in the soft irq expiry code then the
next timer is retrieved by get_next_timer_interrupt() to avoid incrementing
base clock one by one. If the next timer interrupt is past current jiffies
then the base clock is set to jiffies - 1. At the call site this is
incremented and another iteration through the expiry loop is executed which
checks empty hash buckets.

That's a pointless excercise because it's already known that the next timer
is past jiffies.

Set the base clock in that case to jiffies directly so it gets incremented
to jiffies + 1 at the call site resulting in immediate termination of the
expiry loop.

[ tglx: Massaged changelog and added comment to the code ]

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: sboyd@codeaurora.org
Cc: Srinivas Reddy Eeda <srinivas.eeda@oracle.com>
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/7086a857-f90c-4616-bbe8-f7696f21626c@default
2017-10-18 15:29:33 +02:00
Kees Cook
58e1177b4c timer: Convert schedule_timeout() to use from_timer()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new from_timer() helper and passing
the timer pointer explicitly. Since this special timer is on the stack, it
needs to have a wrapper structure to carry state once .data is eliminated.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mips@linux-mips.org
Cc: Petr Mladek <pmladek@suse.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux1394-devel@lists.sourceforge.net
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: linux-s390@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Harish Patil <harish.patil@cavium.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Manish Chopra <manish.chopra@cavium.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-pm@vger.kernel.org
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mark Gross <mark.gross@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: linux-watchdog@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Michael Reed <mdr@sgi.com>
Cc: netdev@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lkml.kernel.org/r/1507159627-127660-2-git-send-email-keescook@chromium.org
2017-10-05 15:01:16 +02:00
Nicholas Piggin
2fe59f507a timers: Fix excessive granularity of new timers after a nohz idle
When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.

However there are several problems:

- If an existing timer is modified, the base is forwarded only after
  the index is calculated.

- The base is not forwarded by add_timer_on.

- There is a window after a timer is restarted from a nohz idle, after
  it is marked not-idle and before the timer tick on this CPU, where a
  timer may be added but the ancient base does not get forwarded.

These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.

Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.

There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.

This has been tested and found to fix the RCU softlockup messages.

Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:

             max     avg      std
upstream   506.0    1.20     4.68
patched      2.0    1.08     0.15

The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:

	     max     avg      std
upstream     9.0    1.05     0.19
patched      2.0    1.04     0.11

Fixes: Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Cc: dzickus@redhat.com
Cc: sfr@canb.auug.org.au
Cc: mpe@ellerman.id.au
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: linuxarm@huawei.com
Cc: abdhalee@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: akpm@linux-foundation.org
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
2017-08-24 11:40:18 +02:00
Matija Glavinic Pecotic
34f41c0316 timers: Fix overflow in get_next_timer_interrupt
For e.g. HZ=100, timer being 430 jiffies in the future, and 32 bit
unsigned int, there is an overflow on unsigned int right-hand side
of the expression which results with wrong values being returned.

Type cast the multiplier to 64bit to avoid that issue.

Fixes: 46c8f0b077 ("timers: Fix get_next_timer_interrupt() computation")
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Cc: khilman@baylibre.com
Cc: akpm@linux-foundation.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/a7900f04-2a21-c9fd-67be-ab334d459ee5@nokia.com
2017-08-01 14:20:53 +02:00
Sebastian Andrzej Siewior
2287d8664f timers: Make the cpu base lock raw
The timers cpu base lock could not be converted to a raw spinlock becaue
the lock held time was non-deterministic due to cascading and long lasting
timer wheel traversals.

The rework of the timer wheel to the new non-cascading model removed also
the wheel traversals and the lock held times are deterministic now. This
allows to make the lock raw and thereby unbreaks NOHz* on preempt-RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20170627161538.30257-1-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-29 00:27:24 +02:00
Peter Meerwald-Stadler
d15bc69aff timers: Fix parameter description of try_to_del_timer_sync()
Signed-off-by: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Link: http://lkml.kernel.org/r/20170530194103.7454-1-pmeerw@pmeerw.net
Cc: John Stultz <john.stultz@linaro.org>
Cc: trivial@rustcorp.com.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-20 21:33:55 +02:00
Linus Torvalds
89c9fea3c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tty: fix comment for __tty_alloc_driver()
  init/main: properly align the multi-line comment
  init/main: Fix double "the" in comment
  Fix dead URLs to ftp.kernel.org
  drivers: Clean up duplicated email address
  treewide: Fix typo in xml/driver-api/basics.xml
  tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
  selftests/timers: Spelling s/privledges/privileges/
  HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
  net: phy: dp83848: Fix Typo
  UBI: Fix typos
  Documentation: ftrace.txt: Correct nice value of 120 priority
  net: fec: Fix typo in error msg and comment
  treewide: Fix typos in printk
2017-05-02 19:09:35 -07:00
Myungho Jung
b94bf594cf timer/sysclt: Restrict timer migration sysctl values to 0 and 1
timer_migration sysctl acts as a boolean switch, so the allowed values
should be restricted to 0 and 1.

Add the necessary extra fields to the sysctl table entry to enforce that.

[ tglx: Rewrote changelog ]

Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-20 14:56:59 +02:00
Masanari Iida
0ba42a599f treewide: Fix typo in xml/driver-api/basics.xml
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-24 15:47:57 +01:00
Ingo Molnar
b17b01533b sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
370c91355c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/nohz.h>
We are going to split <linux/sched/nohz.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/nohz.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Kees Cook
dfb4357da6 time: Remove CONFIG_TIMER_STATS
Currently CONFIG_TIMER_STATS exposes process information across namespaces:

kernel/time/timer_list.c print_timer():

        SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);

/proc/timer_list:

 #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570

Given that the tracer can give the same information, this patch entirely
removes CONFIG_TIMER_STATS.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Xing Gao <xgao01@email.wm.edu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jessica Frazelle <me@jessfraz.com>
Cc: kernel-hardening@lists.openwall.com
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 11:15:08 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Nicolas Pitre
baa73d9e47 posix-timers: Make them configurable
Some embedded systems have no use for them.  This removes about
25KB from the kernel binary size when configured out.

Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime: timer_getoverrun, timer_settime, timer_delete,
clock_adjtime, setitimer, getitimer, alarm.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-7-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:35 +01:00
Nicolas Pitre
74ba181e61 timer: Move sys_alarm from timer.c to itimer.c
Move the only user of alarm_setitimer to itimer.c where it is defined.
This allows for making alarm_setitimer static, and dropping it from the
build when __ARCH_WANT_SYS_ALARM is not defined.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: linux-kbuild@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Edward Cree <ecree@solarflare.com>
Link: http://lkml.kernel.org/r/1478841010-28605-5-git-send-email-nicolas.pitre@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16 09:26:34 +01:00
Douglas Anderson
4b7e9cf9c8 timers: Fix documentation for schedule_timeout() and similar
The documentation for schedule_timeout(), schedule_hrtimeout(), and
schedule_hrtimeout_range() all claim that the routines couldn't possibly
return early if the task state was TASK_UNINTERRUPTIBLE. This is simply
not true since wake_up_process() will cause those routines to exit early.

We cannot make schedule_[hr]timeout() loop until the timeout expires if the
task state is uninterruptible because we have users which rely on the
existing and designed behaviour.

Make the documentation match the (correct) implementation.

schedule_hrtimeout() returns -EINTR even when a uninterruptible task was
woken up. This might look strange, but making the return code depend on the
state is too much of an effort as it would affect all the call sites. There
is no value in doing so, but we spell it out clearly in the documentation.

Suggested-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: huangtao@rock-chips.com
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-2-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 13:14:47 +02:00
Douglas Anderson
6c5e905969 timers: Fix usleep_range() in the context of wake_up_process()
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter. However, nothing in the code ensures
this, when the sleeping task is woken by wake_up_process() or any other
mechanism which can wake a task from uninterruptible state.

Neither usleep_range() nor schedule_hrtimeout_range*() have any protection
against wakeups. schedule_hrtimeout_range*() is designed this way despite
the fact that the API documentation does not mention it.

msleep() already has code to handle this case since it will loop as long
as there was still time left.  usleep_range() has no such loop, add it.

Presumably this problem was not detected before because usleep_range() is
only used in a few places and the function is mostly used in contexts which
are not exposed to wakeups of any form.

An effort was made to look for users relying on the old behavior by
looking for usleep_range() in the same file as wake_up_process().
No problems were found by this search, though it is conceivable that
someone could have put the sleep and wakeup in two different files.

An effort was made to ask several upstream maintainers if they were aware
of people relying on wake_up_process() to wake up usleep_range(). No
maintainers were aware of that but they were aware of many people relying
on usleep_range() never returning before the minimum.

Reported-by: Tao Huang <huangtao@rock-chips.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: djkurtz@chromium.org
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 13:14:46 +02:00
Thomas Gleixner
6bad6bccf2 timers: Prevent base clock corruption when forwarding
When a timer is enqueued we try to forward the timer base clock. This
mechanism has two issues:

1) Forwarding a remote base unlocked

The forwarding function is called from get_target_base() with the current
timer base lock held. But if the new target base is a different base than
the current base (can happen with NOHZ, sigh!) then the forwarding is done
on an unlocked base. This can lead to corruption of base->clk.

Solution is simple: Invoke the forwarding after the target base is locked.

2) Possible corruption due to jiffies advancing

This is similar to the issue in get_net_timer_interrupt() which was fixed
in the previous patch. jiffies can advance between check and assignement
and therefore advancing base->clk beyond the next expiry value.

So we need to read jiffies into a local variable once and do the checks and
assignment with the local copy.

Fixes: a683f390b93f("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-25 16:32:50 +02:00
Thomas Gleixner
041ad7bc75 timers: Prevent base clock rewind when forwarding clock
Ashton and Michael reported, that kernel versions 4.8 and later suffer from
USB timeouts which are caused by the timer wheel rework.

This is caused by a bug in the base clock forwarding mechanism, which leads
to timers expiring early. The scenario which leads to this is:

run_timers()
  while (jiffies >= base->clk) {
    collect_expired_timers();
    base->clk++;
    expire_timers();
  }          

So base->clk = jiffies + 1. Now the cpu goes idle:

idle()
  get_next_timer_interrupt()
    nextevt = __next_time_interrupt();
    if (time_after(nextevt, base->clk))
       	base->clk = jiffies;

jiffies has not advanced since run_timers(), so this assignment effectively
decrements base->clk by one.

base->clk is the index into the timer wheel arrays. So let's assume the
following state after the base->clk increment in run_timers():

 jiffies = 0
 base->clk = 1

A timer gets enqueued with an expiry delta of 63 ticks (which is the case
with the USB timeout and HZ=250) so the resulting bucket index is:

  base->clk + delta = 1 + 63 = 64

The timer goes into the first wheel level. The array size is 64 so it ends
up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
to index into bucket 0 again.

If the cpu goes idle before jiffies advance, then the bug in the forwarding
mechanism sets base->clk back to 0, so the next invocation of run_timers()
at the next tick will index into bucket 0 and therefore expire the timer 62
ticks too early.

Instead of blindly setting base->clk to jiffies we must make the forwarding
conditional on jiffies > base->clk, but we cannot use jiffies for this as
we might run into the following issue:

  if (time_after(jiffies, base->clk) {
    if (time_after(nextevt, base->clk))
       base->clk = jiffies;

jiffies can increment between the check and the assigment far enough to
advance beyond nextevt. So we need to use a stable value for checking.

get_next_timer_interrupt() has the basej argument which is the jiffies
value snapshot taken in the calling code. So we can just that.

Thanks to Ashton for bisecting and providing trace data!

Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <scoopta@gmail.com>
Reported-by: Michael Thayer <michael.thayer@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michal Necasek <michal.necasek@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: knut.osmundsen@oracle.com
Cc: stable@vger.kernel.org
Cc: stern@rowland.harvard.edu
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-25 16:32:50 +02:00
Thomas Gleixner
4da9152a43 timers: Lock base for same bucket optimization
Linus stumbled over the unlocked modification of the timer expiry value in
mod_timer() which is an optimization for timers which stay in the same
bucket - due to the bucket granularity - despite their expiry time getting
updated.

The optimization itself still makes sense even if we take the lock, because
in case that the bucket stays the same, we avoid the pointless
queue/enqueue dance.

Make the check and the modification of timer->expires protected by the base
lock and shuffle the remaining code around so we can keep the lock held
when we actually have to requeue the timer to a different bucket.

Fixes: f00c0afdfa ("timers: Implement optimization for same expiry time in mod_timer()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-10-25 16:27:39 +02:00
Thomas Gleixner
b831275a35 timers: Plug locking race vs. timer migration
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
timer flags. As a consequence the compiler is allowed to reload the flags
between the initial check for TIMER_MIGRATION and the following timer base
computation and the spin lock of the base.

While this has not been observed (yet), we need to make sure that it never
happens.

Fixes: 0eeda71bc3 ("timer: Replace timer base by a cpu index")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-10-25 16:27:39 +02:00
Emese Revfy
0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Chris Metcalf
46c8f0b077 timers: Fix get_next_timer_interrupt() computation
The tick_nohz_stop_sched_tick() routine is not properly
canceling the sched timer when nothing is pending, because
get_next_timer_interrupt() is no longer returning KTIME_MAX in
that case.  This causes periodic interrupts when none are needed.

When determining the next interrupt time, we first use
__next_timer_interrupt() to get the first expiring timer in the
timer wheel.  If no timer is found, we return the base clock value
plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
timer wheel.

Back in get_next_timer_interrupt(), we set the "expires" value
by converting the timer wheel expiry (in ticks) to a nsec value.
But we don't want to do this if the timer wheel expiry value
indicates no timer; we want to return KTIME_MAX.

Prior to commit 500462a9de ("timers: Switch to a non-cascading
wheel") we checked base->active_timers to see if any timers
were active, and if not, we didn't touch the expiry value and so
properly returned KTIME_MAX.  Now we don't have active_timers.

To fix this, we now just check the timer wheel expiry value to
see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
try to compute a new value based on it, but instead simply let the
KTIME_MAX value in expires remain.

Fixes: 500462a9de "timers: Switch to a non-cascading wheel"
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-08-09 09:31:55 +02:00
Richard Cochran
24f73b9971 timers/core: Convert to hotplug state machine
When tearing down, call timers_dead_cpu() before notify_dead().
There is a hidden dependency between:

 - timers
 - block multiqueue
 - rcutree

If timers_dead_cpu() comes later than blk_mq_queue_reinit_notify()
that latter function causes a RCU stall.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153337.566790058@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:41:42 +02:00
Ingo Molnar
4b4b20852d Merge branch 'timers/fast-wheel' into timers/core 2016-07-07 10:35:28 +02:00
Anna-Maria Gleixner
f00c0afdfa timers: Implement optimization for same expiry time in mod_timer()
The existing optimization for same expiry time in mod_timer() checks whether
the timer expiry time is the same as the new requested expiry time. In the old
timer wheel implementation this does not take the slack batching into account,
neither does the new implementation evaluate whether the new expiry time will
requeue the timer to the same bucket.

To optimize that, we can calculate the resulting bucket and check if the new
expiry time is different from the current expiry time. This calculation
happens outside the base lock held region. If the resulting bucket is the same
we can avoid taking the base lock and requeueing the timer.

If the timer needs to be requeued then we have to check under the base lock
whether the base time has changed between the lockless calculation and taking
the lock. If it has changed we need to recalculate under the lock.

This optimization takes effect for timers which are enqueued into the less
granular wheel levels (1 and above). With a simple test case the functionality
has been verified:

            Before        After
 Match:       5.5%        86.6%
 Requeue:    94.5%        13.4%
 Recalc:                  <0.01%

In the non optimized case the timer is requeued in 94.5% of the cases. With
the index optimization in place the requeue rate drops to 13.4%. The case
where the lockless index calculation has to be redone is less than 0.01%.

With a real world test case (networking) we observed the following changes:

            Before        After
 Match:      97.8%        99.7%
 Requeue:     2.2%         0.3%
 Recalc:                  <0.001%

That means two percent fewer lock/requeue/unlock operations done in one of
the hot path use cases of timers.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:12 +02:00
Anna-Maria Gleixner
ffdf047728 timers: Split out index calculation
For further optimizations we need to seperate index calculation
from queueing. No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:12 +02:00
Thomas Gleixner
4e85876a9d timers: Only wake softirq if necessary
With the wheel forwading in place and with the HZ=1000 4ms folding we can
avoid running the softirq at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:11 +02:00
Thomas Gleixner
a683f390b9 timers: Forward the wheel clock whenever possible
The wheel clock is stale when a CPU goes into a long idle sleep. This has the
side effect that timers which are queued end up in the outer wheel levels.
That results in coarser granularity.

To solve this, we keep track of the idle state and forward the wheel clock
whenever possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:11 +02:00
Anna-Maria Gleixner
236968383c timers: Optimize collect_expired_timers() for NOHZ
After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
There might be expired timers so the current code loops and checks the expired
buckets for timers. This can take quite some time for long NOHZ idle periods.

The pending bitmask in the timer base allows us to do a quick search for the
next expiring timer and therefore a fast forward of the base time which
prevents pointless long lasting loops.

For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:10 +02:00
Anna-Maria Gleixner
73420fea80 timers: Move __run_timers() function
Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
in preparation for __run_timers() NOHZ optimization.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner
53bf837b78 timers: Remove set_timer_slack() leftovers
We now have implicit batching in the timer wheel. The slack API is no longer
used, so remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrew F. Davis <afd@ti.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jaehoon Chung <jh80.chung@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Nyman <mathias.nyman@intel.com>
Cc: Pali Rohár <pali.rohar@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner
500462a9de timers: Switch to a non-cascading wheel
The current timer wheel has some drawbacks:

1) Cascading:

   Cascading can be an unbound operation and is completely pointless in most
   cases because the vast majority of the timer wheel timers are canceled or
   rearmed before expiration. (They are used as timeout safeguards, not as
   real timers to measure time.)

2) No fast lookup of the next expiring timer:

   In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
   must fast forward the base time to the current value of jiffies. As we
   have no way to find the next expiring timer fast, the code loops linearly
   and increments the base time one by one and checks for expired timers
   in each step. This causes unbound overhead spikes exactly in the moment
   when we should wake up as fast as possible.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kinds of timeouts the
accuracy of the timeout is not really a concern. Timeouts are very often
approximate worst-case values and in case the timeout fires, we already
waited for a long time and performance is down the drain already.

The few timers which actually expire can be split into two categories:

 1) Short expiry times which expect halfways accurate expiry

 2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or
cancelation. Timers which are armed with a timeout larger than the wheel
capacity are no longer cascaded. We expire them with the longest possible
timeout (6+ days). We have not observed such timeouts in our data collection,
but at least we handle them, applying the rule of the least surprise.

To avoid extending the wheel levels for HZ=1000 so we can accomodate the
longest observed timeouts (5 days in the network conntrack code) we reduce the
first level granularity on HZ=1000 to 4ms, which effectively is the same as
the HZ=250 behaviour. From our data analysis there is nothing which relies on
that 1ms granularity and as a side effect we get better batching and timer
locality for the networking code as well.

Contrary to the classic wheel the granularity of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel.

So for HZ=250 we end up with the following granularity levels:

 Level Offset   Granularity                  Range
     0      0          4 ms                 0 ms -        252 ms
     1     64         32 ms               256 ms -       2044 ms (256ms - ~2s)
     2    128        256 ms              2048 ms -      16380 ms (~2s   - ~16s)
     3    192       2048 ms (~2s)       16384 ms -     131068 ms (~16s  - ~2m)
     4    256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m   - ~17m)
     5    320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m  - ~2h)
     6    384    1048576 ms (~17m)    8388608 ms -   67108863 ms (~2h   - ~18h)
     7    448    8388608 ms (~2h)    67108864 ms -  536870911 ms (~18h  - ~6d)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level.

So the new wheel concept addresses the old issues:

1) Cascading is avoided completely

2) By keeping the timers in the bucket until expiry/cancelation we can track
   the buckets which have timers enqueued in a bucket bitmap and therefore can
   look up the next expiring timer very fast and O(1).

A further benefit of the concept is that the slack calculation which is done
on every timer start is no longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This patch does not address the 'fast lookup' issue as we wanted to make sure
that there is no regression introduced by the wheel redesign. The
optimizations are in follow up patches.

This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner
494af3ed78 timers: Give a few structs and members proper names
Some of the names in the internal implementation of the timer code
are not longer correct and others are simply too long to type.

Clean it up before we switch the wheel implementation over to
the new scheme.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:08 +02:00
Thomas Gleixner
177ec0a0a5 timers: Remove the deprecated mod_timer_pinned() API
We switched all users to initialize the timers as pinned and call
mod_timer(). Remove the now unused timer API function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:06 +02:00
Thomas Gleixner
e675447bda timers: Make 'pinned' a timer property
We want to move the timer migration logic from a 'push' to a 'pull' model.

Under the current 'push' model pinned timers are handled via
a runtime API variant: mod_timer_pinned().

The 'pull' model requires us to store the pinned attribute of a timer
in the timer_list structure itself, as a new TIMER_PINNED bit in
timer->flags.

This flag must be set at initialization time and the timer APIs
recognize the flag.

This patch:

 - Implements the new flag and associated new-style initialization
   methods

 - makes mod_timer() recognize new-style pinned timers,

 - and adds some migration helper facility to allow
   step by step conversion of old-style to new-style
   pinned timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:25:13 +02:00
Bjorn Helgaas
b5227d03b7 timers: Clarify usleep_range() function comment
Update the usleep_range() function comment to make it clear that it can
only be used in non-atomic context.

Previously we claimed usleep_range() was a drop-in replacement for udelay()
where wakeup is flexible.  But that's only true in non-atomic contexts,
where it's possible to sleep instead of delay.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20160531212302.28502.44995.stgit@bhelgaas-glaptop2.roam.corp.google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-10 00:59:14 +02:00
Du, Changbin
b9fdac7f66 debugobjects: insulate non-fixup logic related to static obj from fixup callbacks
When activating a static object we need make sure that the object is
tracked in the object tracker.  If it is a non-static object then the
activation is illegal.

In previous implementation, each subsystem need take care of this in
their fixup callbacks.  Actually we can put it into debugobjects core.
Thus we can save duplicated code, and have *pure* fixup callbacks.

To achieve this, a new callback "is_static_object" is introduced to let
the type specific code decide whether a object is static or not.  If
yes, we take it into object tracker, otherwise give warning and invoke
fixup callback.

This change has paassed debugobjects selftest, and I also do some test
with all debugobjects supports enabled.

At last, I have a concern about the fixups that can it change the object
which is in incorrect state on fixup? Because the 'addr' may not point
to any valid object if a non-static object is not tracked.  Then Change
such object can overwrite someone's memory and cause unexpected
behaviour.  For example, the timer_fixup_activate bind timer to function
stub_timer.

Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
[changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
  Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
Signed-off-by: Du, Changbin <changbin.du@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Triplett <josh@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Du, Changbin
e3252464da timer: update debugobjects fixup callbacks return type
Update the return type to use bool instead of int, corresponding to
cheange (debugobjects: make fixup functions return bool instead of int).

Signed-off-by: Du, Changbin <changbin.du@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Triplett <josh@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Andrew Morton
69b27baf00 sched: add schedule_timeout_idle()
This will be needed in the patch "mm, oom: introduce oom reaper".

Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
John Stultz
da8b44d5a9 timer: convert timer_slack_ns from unsigned long to u64
This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems.  It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:

$ time sleep 1

real    0m10.747s
user    0m0.001s
sys     0m0.005s

The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s.  Let me
know if it makes sense to break that up more or not.

Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently a unsigned
long.  This means that on 32bit applications, the maximum slack is just
over 4 seconds.  However, on 64bit machines, its much much larger (~500
years).

This disparity could make application development a little (as well as
the default_slack) to a u64.  This means both 32bit and 64bit systems
have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.

This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Tejun Heo
22b886dd10 timers: Use proper base migration in add_timer_on()
Regardless of the previous CPU a timer was on, add_timer_on()
currently simply sets timer->flags to the new CPU.  As the caller must
be seeing the timer as idle, this is locally fine, but the timer
leaving the old base while unlocked can lead to race conditions as
follows.

Let's say timer was on cpu 0.

  cpu 0					cpu 1
  -----------------------------------------------------------------------------
  del_timer(timer) succeeds
					del_timer(timer)
					  lock_timer_base(timer) locks cpu_0_base
  add_timer_on(timer, 1)
    spin_lock(&cpu_1_base->lock)
    timer->flags set to cpu_1_base
    operates on @timer			  operates on @timer

This triggered with mod_delayed_work_on() which contains
"if (del_timer()) add_timer_on()" sequence eventually leading to the
following oops.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
  ...
  Workqueue: wqthrash wqthrash_workfunc [wqthrash]
  task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
  RIP: 0010:[<ffffffff810ca6e9>]  [<ffffffff810ca6e9>] detach_if_pending+0x69/0x1a0
  ...
  Call Trace:
   [<ffffffff810cb0b4>] del_timer+0x44/0x60
   [<ffffffff8106e836>] try_to_grab_pending+0xb6/0x160
   [<ffffffff8106e913>] mod_delayed_work_on+0x33/0x80
   [<ffffffffa0000081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
   [<ffffffff8106dba8>] process_one_work+0x1e8/0x650
   [<ffffffff8106e05e>] worker_thread+0x4e/0x450
   [<ffffffff810746af>] kthread+0xef/0x110
   [<ffffffff8185980f>] ret_from_fork+0x3f/0x70

Fix it by updating add_timer_on() to perform proper migration as
__mod_timer() does.

Reported-and-tested-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Worley <chris.worley@primarydata.com>
Cc: bfields@fieldses.org
Cc: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: kernel-team@fb.com
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20151029103113.2f893924@tlielax.poochiereds.net
Link: http://lkml.kernel.org/r/20151104171533.GI5749@mtj.duckdns.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-04 20:23:19 +01:00
Rasmus Villemoes
9fc4468d54 timers: Use __fls in apply_slack()
In apply_slack(), find_last_bit() is applied to a bitmask consisting
of precisely BITS_PER_LONG bits. Since mask is non-zero, we might as
well eliminate the function call and use __fls() directly. On x86_64,
this shaves 23 bytes of the only caller, mod_timer().

This also gets rid of Coverity CID 1192106, but that is a false
positive: Coverity is not aware that mask != 0 implies that
find_last_bit will not return BITS_PER_LONG.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1443771931-6284-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-11 22:13:46 +02:00
Dmitry Vyukov
3ed769bdb2 timers: Fix data race in timer_stats_account_timer()
timer_stats_account_timer() reads timer->start_site, then checks it
for NULL and then re-reads it again, while
timer_stats_timer_clear_start_info() can concurrently reset
timer->start_site to NULL. This should not lead to crashes, but can
double number of entries in timer stats as start_site is used during
comparison, the doubled entries will have unuseful NULL start_site.

Read timer->start_site only once in timer_stats_account_timer().

The data race was found with KernelThreadSanitizer (KTSAN).

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: andreyknvl@google.com
Cc: glider@google.com
Cc: kcc@google.com
Cc: ktsan@googlegroups.com
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1442584463-69553-1-git-send-email-dvyukov@google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-22 15:43:18 +02:00
Eric Dumazet
d0023a1448 timer: Write timer->flags atomically
lock_timer_base() cannot prevent the following :

CPU1 ( in __mod_timer()
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
// The next line clears TIMER_MIGRATING
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  see timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                       return base; // oops, wrong base
timer->flags |= base->cpu // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9 ("timer: Reduce timer migration overhead if disabled")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Christopherson <jon@jons.org>
Cc: David Miller <davem@davemloft.net>
Cc: xen-devel@lists.xen.org
Cc: david.vrabel@citrix.com
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Link: http://lkml.kernel.org/r/1439831928.32680.11.camel@edumazet-glaptop2.roam.corp.google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
2015-08-18 15:31:16 +02:00
Thomas Gleixner
24bfcb1009 timer: Fix hotplug regression
The recent timer wheel rework removed the get/put_cpu_var() pair in
the hotplug migration code, which results in:

BUG: using smp_processor_id() in preemptible [00000000] code: hib.sh/2845
...
[<ffffffff810d4fa3>] timer_cpu_notify+0x53/0x12

That hunk is a leftover from an earlier iteration and went unnoticed
so far.

Restore the previous code which was obviously correct.

Fixes: 0eeda71bc3 'timer: Replace timer base by a cpu index'
Reported-and_tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-26 22:58:06 +02:00
Thomas Gleixner
683be13a28 timer: Minimize nohz off overhead
If nohz is disabled on the kernel command line the [hr]timer code
still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
avoid the overhead.

Before:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.73%  hog       [.] main
  15.36%  [kernel]  [k] _raw_spin_lock_irqsave
   9.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.61%  [kernel]  [k] lock_timer_base.isra.38
   6.42%  [kernel]  [k] mod_timer
   3.90%  [kernel]  [k] detach_if_pending
   3.76%  [kernel]  [k] del_timer
   2.41%  [kernel]  [k] internal_add_timer
   1.39%  [kernel]  [k] __internal_add_timer
   0.76%  [kernel]  [k] timerfn

We probably should have a cached value for nohz full in the per cpu
bases as well to avoid the cpumask check. The base cache line is hot
already, the cpumask not necessarily.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:28 +02:00
Thomas Gleixner
bc7a34b8b9 timer: Reduce timer migration overhead if disabled
Eric reported that the timer_migration sysctl is not really nice
performance wise as it needs to check at every timer insertion whether
the feature is enabled or not. Further the check does not live in the
timer code, so we have an extra function call which checks an extra
cache line to figure out that it is disabled.

We can do better and store that information in the per cpu (hr)timer
bases. I pondered to use a static key, but that's a nightmare to
update from the nohz code and the timer base cache line is hot anyway
when we select a timer base.

The old logic enabled the timer migration unconditionally if
CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
line.

With this modification, we start off with migration disabled. The user
visible sysctl is still set to enabled. If the kernel switches to NOHZ
migration is enabled, if the user did not disable it via the sysctl
prior to the switch. If nohz=off is on the kernel command line,
migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu


Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:28 +02:00
Thomas Gleixner
c74441a17e timer: Stats: Simplify the flags handling
Simplify the handling of the flag storage for the timer statistics. No
intermediate storage anymore. Just hand over the flags field.

I left the printout of 'deferrable' for now because changing this
would be an ABI update and I have no idea how strong people feel about
that. OTOH, I wonder whether we should kill the whole timer stats
stuff because all of that information can be retrieved via ftrace/perf
as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:27 +02:00
Thomas Gleixner
0eeda71bc3 timer: Replace timer base by a cpu index
Instead of storing a pointer to the per cpu tvec_base we can simply
cache a CPU index in the timer_list and use that to get hold of the
correct per cpu tvec_base. This is only used in lock_timer_base() and
the slightly larger code is peanuts versus the spinlock operation and
the d-cache foot print of the timer wheel.

Aside of that this allows to get rid of following nuisances:

 - boot_tvec_base

   That statically allocated 4k bss data is just kept around so the
   timer has a home when it gets statically initialized. It serves no
   other purpose.

   With the CPU index we assign the timer to CPU0 at static
   initialization time and therefor can avoid the whole boot_tvec_base
   dance.  That also simplifies the init code, which just can use the
   per cpu base.

   Before:
     text	   data	    bss	    dec	    hex	filename
    17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o
   After:
     text	   data	    bss	    dec	    hex	filename
    17440	   9193	      0	  26633	   6809	../build/kernel/time/timer.o

 - Overloading the base pointer with various flags

   The CPU index has enough space to hold the flags (deferrable,
   irqsafe) so we can get rid of the extra masking and bit fiddling
   with the base pointer.

As a benefit we reduce the size of struct timer_list on 64 bit
machines. 4 - 8 bytes, a size reduction up to 15% per struct timer_list,
which is a real win as we have tons of them embedded in other structs.

This changes also the newly added deferrable printout of the timer
start trace point to capture and print all timer->flags, which allows
us to decode the target cpu of the timer as well.

We might have used bitfields for this, but that would change the
static initializers and the init function for no value to accomodate
big endian bitfields.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Badhri Jagan Sridharan <Badhri@google.com>
Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:27 +02:00
Thomas Gleixner
1dabbcec2c timer: Use hlist for the timer wheel hash buckets
This reduces the size of struct tvec_base by 50% and results in
slightly smaller code as well.

Before:
   struct tvec_base: size: 8256, cachelines: 129

   text	   data	    bss	    dec	    hex	filename
  17698	  13297	   8256	  39251	   9953	../build/kernel/time/timer.o

After:
  struct tvec_base: 4160, cachelines: 65

   text	   data	    bss	    dec	    hex	filename
  17491	   9201	   4160	  30852	   7884	../build/kernel/time/timer.o

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:27 +02:00
Thomas Gleixner
1bd04bf6f6 timer: Remove FIFO "guarantee"
The FIFO guarantee is only there if two timers are queued into the
same bucket at the same jiffie on the same cpu:

 - The slack value depends on the delta between expiry and enqueue
   time, so the resulting expiry time can be different for timers
   which are queued in different jiffies.

 - Timers which are queued into the secondary array end up after a
   later queued timer which was queued into the primary array due to
   cascading.

 - Timers can end up on different cpus due to the NOHZ target moving
   around. Obviously there is no guarantee of expiry ordering between
   cpus.

So anything which relies on FIFO behaviour of the timer wheel is
broken already.

This is a preparatory patch for converting the timer wheel to hlist
which reduces the memory foot print of the wheel by 50%.

It's a seperate patch so any (unlikely to happen) regression caused by
this can be identified clearly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Cc: George Spelvin <linux@horizon.com>
Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:27 +02:00
Thomas Gleixner
3bb475a344 timers: Sanitize catchup_timer_jiffies() usage
catchup_timer_jiffies() has been applied blindly to several functions
without looking for possible better ways to do it.

1) internal_add_timer()

   Move the update to base->all_timers before we actually insert the
   timer into the wheel.

2) detach_if_pending()

   Again the update to base->all_timers allows us to explicitely do
   the timer_jiffies update in place, if this was the last timer which
   got removed.

3) __run_timers()

   We only check on entry, which is silly, because base->timer_jiffies
   can be behind - especially on NOHZ kernels - and if there is a
   single deferrable timer somewhere between base->timer_jiffies and
   jiffies we expire it and then loop until base->timer_jiffies ==
   jiffies.

   Move it into the loop.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 15:18:27 +02:00
Badhri Jagan Sridharan
4e413e8526 tracing: timer: Add deferrable flag to timer_start
The timer_start event now shows whether the timer is
deferrable in case of a low-res timer. The debug_activate
function now includes a deferrable flag while calling
the trace_timer_start event.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Badhri Jagan Sridharan <Badhri@google.com>
[jstultz: Fixed minor whitespace and grammer tweaks
 pointed out by Ingo]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-05-22 10:36:06 -07:00
Joonwoo Park
781978e6e1 timer: Use timer->base for flag checks
At present, internal_add_timer() examines flags with 'base' which doesn't
contain flags.  Examine with 'timer->base' to avoid unnecessary waking up
of nohz CPU when timer base has TIMER_DEFERRABLE set.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Cc: sboyd@codeaurora.org
Cc: skannan@codeaurora.org
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1430187709-21087-1-git-send-email-joonwoop@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-05 10:40:43 +02:00
Thomas Gleixner
2ad5d3272d timer: Put usleep_range into the __sched section
do_usleep_range() and schedule_hrtimeout_range() are __sched as
well. So it makes no sense to have the exported function in a
different section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.833709502@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner
6deba083e1 timer: Remove pointless return value of do_usleep_range()
The only user ignores it anyway and rightfully so.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.756060258@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:52 +02:00
Thomas Gleixner
c1ad348b45 tick: Nohz: Rework next timer evaluation
The evaluation of the next timer in the nohz code is based on jiffies
while all the tick internals are nano seconds based. We have also to
convert hrtimer nanoseconds to jiffies in the !highres case. That's
just wrong and introduces interesting corner cases.

Turn it around and convert the next timer wheel timer expiry and the
rcu event to clock monotonic and base all calculations on
nanoseconds. That identifies the case where no timer is pending
clearly with an absolute expiry value of KTIME_MAX.

Makes the code more readable and gets rid of the jiffies magic in the
nohz code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Thomas Gleixner
c6eb3f70d4 hrtimer: Get rid of hrtimer softirq
hrtimer softirq is a leftover from the initial implementation and
serves only the purpose to handle the enqueueing of already expired
timers in the high resolution timer mode. We discussed whether we
change the return value and force all start sites to handle that the
timer is already expired, but that would be a Herculean task and I'm
not sure whether its a good idea to enforce that handling on
everyone.

A simpler solution is to enforce a timer interrupt instead of raising
and scheduling a softirq. Just use the existing infrastructure to do
so and remove all the softirq leftovers.

The HRTIMER softirq enum is now unused, but kept around because trace
parsers rely on the existing numbering.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203501.840834708@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:50 +02:00
Peter Zijlstra
3650b57fdf timer: Further simplify the SMP and HOTPLUG logic
Remove one CONFIG_HOTPLUG_CPU #ifdef in trade for introducing one
CONFIG_SMP #ifdef.

The CONFIG_SMP ifdef avoids declaring the per-CPU __tvec_bases storage
on UP systems since they already have boot_tvec_bases.

Also (re)add a runtime check on the base alignment -- for the paranoid
amongst us :-)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fdd2d35e169bdc554ffa3fe77f77716298c75ada.1427814611.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:46:21 +02:00
Viresh Kumar
8def906044 timer: Don't initialize 'tvec_base' on hotplug
There is no need to call init_timers_cpu() on every CPU hotplug event,
there is not much we need to reset.

 - Timer-lists are already empty at the end of migrate_timers().
 - timer_jiffies will be refreshed while adding a new timer, after the
   CPU is online again.
 - active_timers and all_timers can be reset from migrate_timers().

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54a1c30ea7b805af55beb220cadf5a07a21b0a4d.1427814611.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:46:01 +02:00
Peter Zijlstra
b337a9380f timer: Allocate per-cpu tvec_base's statically
Memory for the 'tvec_base' array is allocated separately for the boot CPU (statically)
and non-boot CPUs (dynamically).

The reason is because __TIMER_INITIALIZER() needs to set ->base to a
valid pointer (because we've made NULL special, hint: lock_timer_base())
and we cannot get a compile time pointer to per-cpu entries because we
don't know where we'll map the section, even for the boot cpu.

This can be simplified a bit by statically allocating per-cpu memory.
The only disadvantage is that memory for one of the structures will stay
unused, i.e. for the boot CPU, which uses boot_tvec_bases.

This will also guarantee that tvec_base is cacheline aligned. Even
though tvec_base has ____cacheline_aligned stuck on, kzalloc_node() does
not actually respect that (but guarantees a minimum u64 alignment).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/17cdf560f2727f687ab159707d0aa591f8a2f82d.1427814611.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:46:00 +02:00
Paul E. McKenney
c3377c2da6 rcu: Remove "cpu" argument to rcu_check_callbacks()
The "cpu" argument was kept around on the off-chance that RCU might
offload scheduler-clock interrupts.  However, this offload approach
has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
from qualifying CPUs.  It is therefore time to remove the "cpu" argument
to rcu_check_callbacks(), which this commit does.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-11-03 19:20:11 -08:00