Commit graph

26931 commits

Author SHA1 Message Date
Steven Rostedt (VMware)
6be7fa3c74 ftrace, orc, x86: Handle ftrace dynamically allocated trampolines
The function tracer can create a dynamically allocated trampoline that is
called by the function mcount or fentry hook that is used to call the
function callback that is registered. The problem is that the orc undwinder
will bail if it encounters one of these trampolines. This breaks the stack
trace of function callbacks, which include the stack tracer and setting the
stack trace for individual functions.

Since these dynamic trampolines are basically copies of the static ftrace
trampolines defined in ftrace_*.S, we do not need to create new orc entries
for the dynamic trampolines. Finding the return address on the stack will be
identical as the functions that were copied to create the dynamic
trampolines. When encountering a ftrace dynamic trampoline, we can just use
the orc entry of the ftrace static function that was copied for that
trampoline.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-23 15:56:55 -05:00
David S. Miller
5ca114400d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
en_rx_am.c was deleted in 'net-next' but had a bug fixed in it in
'net'.

The esp{4,6}_offload.c conflicts were overlapping changes.
The 'out' label is removed so we just return ERR_PTR(-EINVAL)
directly.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-23 13:51:56 -05:00
Yonghong Song
2310035fa0 bpf: fix incorrect kmalloc usage in lpm_trie MAP_GET_NEXT_KEY rcu region
In commit b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map"),
the implemented MAP_GET_NEXT_KEY callback function is guarded with rcu read lock.
In the function body, "kmalloc(size, GFP_USER | __GFP_NOWARN)" is used which may
sleep and violate rcu read lock region requirements. This patch fixed the issue
by using GFP_ATOMIC instead to avoid blocking kmalloc. Tested with
CONFIG_DEBUG_ATOMIC_SLEEP=y as suggested by Eric Dumazet.

Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Signed-off-by: Yonghong Song <yhs@fb.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-23 17:33:57 +01:00
Eric W. Biederman
f71dd7dc2d signal/ptrace: Add force_sig_ptrace_errno_trap and use it where needed
There are so many places that build struct siginfo by hand that at
least one of them is bound to get it wrong.  A handful of cases in the
kernel arguably did just that when using the errno field of siginfo to
pass no errno values to userspace.  The usage is limited to a single
si_code so at least does not mess up anything else.

Encapsulate this questionable pattern in a helper function so
that the userspace ABI is preserved.

Update all of the places that use this pattern to use the new helper
function.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-22 19:07:11 -06:00
Eric W. Biederman
382467358a signal: Helpers for faults with specialized siginfo layouts
The helpers added are:
send_sig_mceerr
force_sig_mceerr
force_sig_bnderr
force_sig_pkuerr

Filling out siginfo properly can ge tricky.  Especially for these
specialized cases where the temptation is to share code with other
cases which use a different subset of siginfo fields.  Unfortunately
that code sharing frequently results in bugs with the wrong siginfo
fields filled in, and makes it harder to verify that the siginfo
structure was properly initialized.

Provide these helpers instead that get all of the details right, and
guarantee that siginfo is properly initialized.

send_sig_mceerr and force_sig_mceer are a little special as two si
codes BUS_MCEERR_AO and BUS_MCEER_AR both use the same extended
signinfo layout.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-22 19:07:10 -06:00
Eric W. Biederman
f8ec66014f signal: Add send_sig_fault and force_sig_fault
The vast majority of signals sent from architecture specific code are
simple faults.  Encapsulate this reality with two helper functions so
that the nit-picky implementation of preparing a siginfo does not need
to be repeated many times on each architecture.

As only some architectures support the trapno field, make the trapno
arguement only present on those architectures.

Similary as ia64 has three fields: imm, flags, and isr that
are specific to it.  Have those arguments always present on ia64
and no where else.

This ensures the architecture specific code always remembers which
fields it needs to pass into the siginfo structure.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-22 19:07:09 -06:00
Eric W. Biederman
3b10db2b06 signal: Replace memset(info,...) with clear_siginfo for clarity
The function clear_siginfo is just a nice wrapper around memset so
this results in no functional change.  This change makes mistakes
a little more difficult and it makes it clearer what is going on.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-22 19:07:08 -06:00
Eric W. Biederman
5f74972ce6 signal: Don't use structure initializers for struct siginfo
The siginfo structure has all manners of holes with the result that a
structure initializer is not guaranteed to initialize all of the bits.
As we have to copy the structure to userspace don't even try to use
a structure initializer.  Instead use clear_siginfo followed by initializing
selected fields.  This gives a guarantee that uninitialized kernel memory
is not copied to userspace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-22 19:07:08 -06:00
Petr Mladek
bb4f552a59 Merge branch 'for-4.16-console-waiter-logic' into for-4.16 2018-01-22 10:40:59 +01:00
Petr Mladek
51ccbb0ae8 Merge branch 'for-4.16-print-symbol' into for-4.16 2018-01-22 10:40:50 +01:00
Petr Mladek
3ccdc5190f Merge branch 'for-4.16-deprecate-printk-pf' into for-4.16 2018-01-22 10:40:32 +01:00
Sergey Senozhatsky
6fd78a1a99 printk: drop redundant devkmsg_log_str memsets
We copy in null terminated strings "on" and "off", no
need to zero out devkmsg_log_str in control_devkmsg().

Link: http://lkml.kernel.org/r/20180119043901.1728-1-sergey.senozhatsky@gmail.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-22 10:33:04 +01:00
Linus Torvalds
66f8162418 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "A single fix for the new matrix allocator to prevent vector exhaustion
  by certain network drivers which allocate gazillions of unused vectors
  which cannot be put into reservation mode due to MSI and the lack of
  MSI entry masking.

  The fix/workaround is to spread the vectors across CPUs by searching
  the supplied target CPU mask for the CPU with the smallest number of
  allocated vectors"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irq/matrix: Spread interrupts on allocation
2018-01-21 10:39:58 -08:00
David S. Miller
ea9722e265 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2018-01-19

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) bpf array map HW offload, from Jakub.

2) support for bpf_get_next_key() for LPM map, from Yonghong.

3) test_verifier now runs loaded programs, from Alexei.

4) xdp cpumap monitoring, from Jesper.

5) variety of tests, cleanups and small x64 JIT optimization, from Daniel.

6) user space can now retrieve HW JITed program, from Jiong.

Note there is a minor conflict between Russell's arm32 JIT fixes
and removal of bpf_jit_enable variable by Daniel which should
be resolved by keeping Russell's comment and removing that variable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-20 22:03:46 -05:00
David S. Miller
8565d26bcb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The BPF verifier conflict was some minor contextual issue.

The TUN conflict was less trivial.  Cong Wang fixed a memory leak of
tfile->tx_array in 'net'.  This is an skb_array.  But meanwhile in
net-next tun changed tfile->tx_arry into tfile->tx_ring which is a
ptr_ring.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 22:59:33 -05:00
Daniel Borkmann
4bd95f4b99 bpf: add upper complexity limit to verifier log
Given the limit could potentially get further adjustments in the
future, add it to the log so it becomes obvious what the current
limit is w/o having to check the source first. This may also be
helpful for debugging complexity related issues on kernels that
backport from upstream.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-19 18:37:00 -08:00
Daniel Borkmann
fa9dd599b4 bpf: get rid of pure_initcall dependency to enable jits
Having a pure_initcall() callback just to permanently enable BPF
JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave
a small race window in future where JIT is still disabled on boot.
Since we know about the setting at compilation time anyway, just
initialize it properly there. Also consolidate all the individual
bpf_jit_enable variables into a single one and move them under one
location. Moreover, don't allow for setting unspecified garbage
values on them.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-19 18:37:00 -08:00
Daniel Borkmann
9013341594 bpf, verifier: detect misconfigured mem, size argument pair
I've seen two patch proposals now for helper additions that used
ARG_PTR_TO_MEM or similar in reg_X but no corresponding ARG_CONST_SIZE
in reg_X+1. Verifier won't complain in such case, but it will omit
verifying the memory passed to the helper thus ending up badly.
Detect such buggy helper function signature and bail out during
verification rather than finding them through review.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-19 18:36:59 -08:00
Jan H. Schönherr
77dd66a3c6 mm: Fix devm_memremap_pages() collision handling
If devm_memremap_pages() detects a collision while adding entries
to the radix-tree, we call pgmap_radix_release(). Unfortunately,
the function removes *all* entries for the range -- including the
entries that caused the collision in the first place.

Modify pgmap_radix_release() to take an additional argument to
indicate where to stop, so that only newly added entries are removed
from the tree.

Cc: <stable@vger.kernel.org>
Fixes: 9476df7d80 ("mm: introduce find_dev_pagemap()")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-19 16:29:56 -08:00
Jan H. Schönherr
10a0cd6e49 mm: Fix memory size alignment in devm_memremap_pages_release()
The functions devm_memremap_pages() and devm_memremap_pages_release() use
different ways to calculate the section-aligned amount of memory. The
latter function may use an incorrect size if the memory region is small
but straddles a section border.

Use the same code for both.

Cc: <stable@vger.kernel.org>
Fixes: 5f29a77cd9 ("mm: fix mixed zone detection in devm_memremap_pages")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-19 16:29:48 -08:00
Yonghong Song
b471f2f1de bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
command. This command is handy when users want to enumerate
keys. Otherwise, a different map which supports key
enumeration may be required to store the keys. If the
map data is sparse and all map data are to be deleted without
closing file descriptor, using MAP_GET_NEXT_KEY to find
all keys is much faster than enumerating all key space.

This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
If user provided key pointer is NULL or the key does not have
an exact match in the trie, the first key will be returned.
Otherwise, the next key will be returned.

In this implemenation, key enumeration follows a postorder
traversal of internal trie. More specific keys
will be returned first than less specific ones, given
a sequence of MAP_GET_NEXT_KEY syscalls.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-19 23:26:41 +01:00
Linus Torvalds
ec835f8104 Two more small fixes
- The conversion of enums into their actual numbers to display
    in the event format file had an off-by-one bug, that could cause
    an enum not to be converted, and break user space parsing tools.
 
  - A fix to a previous fix to bring back the context recursion checks.
    The interrupt case checks for NMI, IRQ and softirq, but the softirq
    returned the same number regardless if it was set or not, although
    the logic would force it to be set if it were hit.
 -----BEGIN PGP SIGNATURE-----
 
 iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpiEd8UHHJvc3RlZHRA
 Z29vZG1pcy5vcmcACgkQ1a05Y9njSUndmQv9EdmMa/CXyz6IFTEHaQHizNHak5yx
 hyMsKgcyylY+GlS9p1285Gn6d5bdOpa3rbv311YVu4/8lTPlZmWQdgVrmTt/jdiK
 NbKZSG5mXYP8F6n3phenM9PFIFHkrrrS2RKq7J17gwUbE6YTT3HFxCAkK+i9UmRo
 VoRAWSEBDONma1dgh1ezkEFLQ9fucFZMIur3PiHJlZmtuhSGsImzQQBzZO0Gj1gi
 iBPPf90+55IOzPAaSAlj5RafjPFsaVgJC9I+KtSLL+waPWNTEHc9wg8XvogM5mJv
 t2Q8A91NcM4AQ37HP+Tyhot32RprZVrEWBZ6fnrC7RXFLk0HwVk6mCTwBYhoWJoN
 QiHtx5h6BtQTzNMjKBweMrEZlqWds9+y81ZrP5g/EA67yT2FQFC57IuuyH6fzfc6
 ECn6XNyGVafBjUZWhxUG6uwp2P6lsRSvFnT+tAOdocgoUqnDySoyEUIZh6Wa7CQW
 ZtmYjRISjc9220i8m5HWTTj+0imYm0p2W3w8
 =O7m4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.15-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two more small fixes

   - The conversion of enums into their actual numbers to display in the
     event format file had an off-by-one bug, that could cause an enum
     not to be converted, and break user space parsing tools.

   - A fix to a previous fix to bring back the context recursion checks.
     The interrupt case checks for NMI, IRQ and softirq, but the softirq
     returned the same number regardless if it was set or not, although
     the logic would force it to be set if it were hit"

* tag 'trace-v4.15-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix converting enum's from the map in trace_event_eval_update()
  ring-buffer: Fix duplicate results in mapping context to bits in recursive lock
2018-01-19 11:38:19 -08:00
Linus Torvalds
8b335c7d22 Merge branch 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "cgroup.threads should be delegatable (ie. a container should be able
  to write to it from inside) but was missing the flag.

  The change is very low risk"

* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: make cgroup.threads delegatable
2018-01-19 11:25:17 -08:00
Linus Torvalds
a2c9c1c035 Merge branch 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixlet from Tejun Heo:
 "One patch to add touch_nmi_watchdog() while dumping workqueue debug
  messages to avoid triggering the lockup detector spuriously.

  The change is very low risk"

* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: avoid hard lockups in show_workqueue_state()
2018-01-19 11:23:39 -08:00
Linus Torvalds
726ba84b50 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix BPF divides by zero, from Eric Dumazet and Alexei Starovoitov.

 2) Reject stores into bpf context via st and xadd, from Daniel
    Borkmann.

 3) Fix a memory leak in TUN, from Cong Wang.

 4) Disable RX aggregation on a specific troublesome configuration of
    r8152 in a Dell TB16b dock.

 5) Fix sw_ctx leak in tls, from Sabrina Dubroca.

 6) Fix program replacement in cls_bpf, from Daniel Borkmann.

 7) Fix uninitialized station_info structures in cfg80211, from Johannes
    Berg.

 8) Fix miscalculation of transport header offset field in flow
    dissector, from Eric Dumazet.

 9) Fix LPM tree leak on failure in mlxsw driver, from Ido Schimmel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
  ibmvnic: Fix IPv6 packet descriptors
  ibmvnic: Fix IP offload control buffer
  ipv6: don't let tb6_root node share routes with other node
  ip6_gre: init dev->mtu and dev->hard_header_len correctly
  mlxsw: spectrum_router: Free LPM tree upon failure
  flow_dissector: properly cap thoff field
  fm10k: mark PM functions as __maybe_unused
  cfg80211: fix station info handling bugs
  netlink: reset extack earlier in netlink_rcv_skb
  can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
  can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
  bpf: mark dst unknown on inconsistent {s, u}bounds adjustments
  bpf: fix cls_bpf on filter replace
  Net: ethernet: ti: netcp: Fix inbound ping crash if MTU size is greater than 1500
  tls: reset crypto_info when do_tls_setsockopt_tx fails
  tls: return -EBUSY if crypto_info is already set
  tls: fix sw_ctx leak
  net/tls: Only attach to sockets in ESTABLISHED state
  net: fs_enet: do not call phy_stop() in interrupts
  r8152: disable RX aggregation on Dell TB16 dock
  ...
2018-01-19 09:30:33 -08:00
Tejun Heo
08a77676f9 string: drop __must_check from strscpy() and restore strscpy() usages in cgroup
e7fd37ba12 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copied, because strscpy() is marked __must_check, it led to the
following warnings.

  kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
  kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
     strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
	       ^

To avoid the warnings, 50034ed496 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().

strlcpy() is worse than strlcpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was because it was lacking __must_check, which doesn't
reflect any material differences between the two function.  It's just
that someone added __must_check to strscpy() and not to strlcpy().

These basic string copy operations are used in variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value.  The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.

Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
2018-01-19 08:51:36 -08:00
Jakub Kicinski
52775b33bb bpf: offload: report device information about offloaded maps
Tell user space about device on which the map was created.
Unfortunate reality of user ABI makes sharing this code
with program offload difficult but the information is the
same.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 22:54:25 +01:00
Jakub Kicinski
7a0ef69395 bpf: offload: allow array map offload
The special handling of different map types is left to the driver.
Allow offload of array maps by simply adding it to accepted types.
For nfp we have to make sure array elements are not deleted.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 22:54:25 +01:00
Jakub Kicinski
32852649ba bpf: arraymap: use bpf_map_init_from_attr()
Arraymap was not converted to use bpf_map_init_from_attr()
to avoid merge conflicts with emergency fixes.  Do it now.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 22:54:25 +01:00
Jakub Kicinski
ad46061fca bpf: arraymap: move checks out of alloc function
Use the new callback to perform allocation checks for array maps.
The fd maps don't need a special allocation callback, they only
need a special check callback.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 22:54:25 +01:00
Alexei Starovoitov
61f3c964df bpf: allow socket_filter programs to use bpf_prog_test_run
in order to improve test coverage allow socket_filter program type
to be run via bpf_prog_test_run command.
Since such programs can be loaded by non-root tighten
permissions for bpf_prog_test_run to be root only
to avoid surprises.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 22:37:58 +01:00
Steven Rostedt (VMware)
1ebe1eaf2f tracing: Fix converting enum's from the map in trace_event_eval_update()
Since enums do not get converted by the TRACE_EVENT macro into their values,
the event format displaces the enum name and not the value. This breaks
tools like perf and trace-cmd that need to interpret the raw binary data. To
solve this, an enum map was created to convert these enums into their actual
numbers on boot up. This is done by TRACE_EVENTS() adding a
TRACE_DEFINE_ENUM() macro.

Some enums were not being converted. This was caused by an optization that
had a bug in it.

All calls get checked against this enum map to see if it should be converted
or not, and it compares the call's system to the system that the enum map
was created under. If they match, then they call is processed.

To cut down on the number of iterations needed to find the maps with a
matching system, since calls and maps are grouped by system, when a match is
made, the index into the map array is saved, so that the next call, if it
belongs to the same system as the previous call, could start right at that
array index and not have to scan all the previous arrays.

The problem was, the saved index was used as the variable to know if this is
a call in a new system or not. If the index was zero, it was assumed that
the call is in a new system and would keep incrementing the saved index
until it found a matching system. The issue arises when the first matching
system was at index zero. The next map, if it belonged to the same system,
would then think it was the first match and increment the index to one. If
the next call belong to the same system, it would begin its search of the
maps off by one, and miss the first enum that should be converted. This left
a single enum not converted properly.

Also add a comment to describe exactly what that index was for. It took me a
bit too long to figure out what I was thinking when debugging this issue.

Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com

Cc: stable@vger.kernel.org
Fixes: 0c564a538a ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Teste-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-18 15:53:10 -05:00
Steven Rostedt (VMware)
0164e0d7e8 ring-buffer: Fix duplicate results in mapping context to bits in recursive lock
In bringing back the context checks, the code checks first if its normal
(non-interrupt) context, and then for NMI then IRQ then softirq. The final
check is redundant. Since the if branch is only hit if the context is one of
NMI, IRQ, or SOFTIRQ, if it's not NMI or IRQ there's no reason to check if
it is SOFTIRQ. The current code returns the same result even if its not a
SOFTIRQ. Which is confusing.

  pc & SOFTIRQ_OFFSET ? 2 : RB_CTX_SOFTIRQ

Is redundant as RB_CTX_SOFTIRQ *is* 2!

Fixes: a0e3a18f4b ("ring-buffer: Bring back context level recursive checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-18 15:45:48 -05:00
David S. Miller
7155f8f391 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-01-18

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a divide by zero due to wrong if (src_reg == 0) check in
   64-bit mode. Properly handle this in interpreter and mask it
   also generically in verifier to guard against similar checks
   in JITs, from Eric and Alexei.

2) Fix a bug in arm64 JIT when tail calls are involved and progs
   have different stack sizes, from Daniel.

3) Reject stores into BPF context that are not expected BPF_STX |
   BPF_MEM variant, from Daniel.

4) Mark dst reg as unknown on {s,u}bounds adjustments when the
   src reg has derived bounds from dead branches, from Daniel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-18 09:17:04 -05:00
Matthew Wilcox
08f36ff642 lockdep: Make lockdep checking constant
There are several places in the kernel which would like to pass a const
pointer to lockdep_is_held().  Constify the entire path so nobody has to
trick the compiler.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20180117151414.23686-3-willy@infradead.org
2018-01-18 11:56:48 +01:00
Matthew Wilcox
64f29d1bc9 lockdep: Assign lock keys on registration
Lockdep is assigning lock keys when a lock was looked up.  This is
unnecessary; if the lock has never been registered then it is known that it
is not locked.  It also complicates the calling convention.  Switch to
assigning the lock key in register_lock_class().

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/20180117151414.23686-2-willy@infradead.org
2018-01-18 11:56:48 +01:00
Thomas Gleixner
a0c9259dc4 irq/matrix: Spread interrupts on allocation
Keith reported an issue with vector space exhaustion on a server machine
which is caused by the i40e driver allocating 168 MSI interrupts when the
driver is initialized, even when most of these interrupts are not used at
all.

The x86 vector allocation code tries to avoid the immediate allocation with
the reservation mode, but the card uses MSI and does not support MSI entry
masking, which prevents reservation mode and requires immediate vector
allocation.

The matrix allocator is a bit naive and prefers the first CPU in the
cpumask which describes the possible target CPUs for an allocation. That
results in allocating all 168 vectors on CPU0 which later causes vector
space exhaustion when the NVMe driver tries to allocate managed interrupts
on each CPU for the per CPU queues.

Avoid this by finding the CPU which has the lowest vector allocation count
to spread out the non managed interrupt accross the possible target CPUs.

Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Keith Busch <keith.busch@intel.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801171557330.1777@nanos
2018-01-18 11:38:41 +01:00
Rafael J. Wysocki
bcaea4678f Merge branches 'acpi-pm' and 'pm-sleep'
* acpi-pm:
  platform/x86: surfacepro3: Support for wakeup from suspend-to-idle
  ACPI / PM: Use Low Power S0 Idle on more systems
  ACPI / PM: Make it possible to ignore the system sleep blacklist

* pm-sleep:
  PM / hibernate: Drop unused parameter of enough_swap
  block, scsi: Fix race between SPI domain validation and system suspend
  PM / sleep: Make lock/unlock_system_sleep() available to kernel modules
  PM: hibernate: Do not subtract NR_FILE_MAPPED in minimum_image_size()
2018-01-18 02:55:28 +01:00
Rafael J. Wysocki
f9b736f64a Merge branches 'pm-domains', 'pm-kconfig', 'pm-cpuidle' and 'powercap'
* pm-domains:
  PM / genpd: Stop/start devices without pm_runtime_force_suspend/resume()
  PM / domains: Don't skip driver's ->suspend|resume_noirq() callbacks
  PM / Domains: Remove obsolete "samsung,power-domain" check

* pm-kconfig:
  bus: simple-pm-bus: convert bool SIMPLE_PM_BUS to tristate
  PM: Provide a config snippet for disabling PM

* pm-cpuidle:
  cpuidle: Avoid NULL argument in cpuidle_switch_governor()

* powercap:
  powercap: intel_rapl: Fix trailing semicolon
  powercap: add suspend and resume mechanism for SOC power limit
  powercap: Simplify powercap_init()
2018-01-18 02:54:45 +01:00
Yonghong Song
eefa864a81 bpf: change fake_ip for bpf_trace_printk helper
Currently, for bpf_trace_printk helper, fake ip address 0x1
is used with comments saying that fake ip will not be printed.
This is indeed true for 4.12 and earlier version, but for
4.13 and later version, the ip address will be printed if
it cannot be resolved with kallsym. Running samples/bpf/tracex5
program and you will have the following in the debugfs
trace_pipe output:
  ...
  <...>-1819  [003] ....   443.497877: 0x00000001: mmap
  <...>-1819  [003] ....   443.498289: 0x00000001: syscall=102 (one of get/set uid/pid/gid)
  ...

The kernel commit changed this behavior is:
  commit feaf1283d1
  Author: Steven Rostedt (VMware) <rostedt@goodmis.org>
  Date:   Thu Jun 22 17:04:55 2017 -0400

      tracing: Show address when function names are not found
  ...

This patch changed the comment and also altered the fake ip
address to 0x0 as users may think 0x1 has some special meaning
while it doesn't. The new output:
  ...
  <...>-1799  [002] ....    25.953576: 0: mmap
  <...>-1799  [002] ....    25.953865: 0: read(fd=0, buf=00000000053936b5, size=512)
  ...

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:51:42 +01:00
Jiong Wang
fcfb126def bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info
For host JIT, there are "jited_len"/"bpf_func" fields in struct bpf_prog
used by all host JIT targets to get jited image and it's length. While for
offload, targets are likely to have different offload mechanisms that these
info are kept in device private data fields.

Therefore, BPF_OBJ_GET_INFO_BY_FD syscall needs an unified way to get JIT
length and contents info for offload targets.

One way is to introduce new callback to parse device private data then fill
those fields in bpf_prog_info. This might be a little heavy, the other way
is to add generic fields which will be initialized by all offload targets.

This patch follow the second approach to introduce two new fields in
struct bpf_dev_offload and teach bpf_prog_get_info_by_fd about them to fill
correct jited_prog_len and jited_prog_insns in bpf_prog_info.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:26:15 +01:00
Daniel Borkmann
6f16101e6a bpf: mark dst unknown on inconsistent {s, u}bounds adjustments
syzkaller generated a BPF proglet and triggered a warning with
the following:

  0: (b7) r0 = 0
  1: (d5) if r0 s<= 0x0 goto pc+0
   R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  2: (1f) r0 -= r1
   R0=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  verifier internal error: known but bad sbounds

What happens is that in the first insn, r0's min/max value
are both 0 due to the immediate assignment, later in the jsle
test the bounds are updated for the min value in the false
path, meaning, they yield smin_val = 1, smax_val = 0, and when
ctx pointer is subtracted from r0, verifier bails out with the
internal error and throwing a WARN since smin_val != smax_val
for the known constant.

For min_val > max_val scenario it means that reg_set_min_max()
and reg_set_min_max_inv() (which both refine existing bounds)
demonstrated that such branch cannot be taken at runtime.

In above scenario for the case where it will be taken, the
existing [0, 0] bounds are kept intact. Meaning, the rejection
is not due to a verifier internal error, and therefore the
WARN() is not necessary either.

We could just reject such cases in adjust_{ptr,scalar}_min_max_vals()
when either known scalars have smin_val != smax_val or
umin_val != umax_val or any scalar reg with bounds
smin_val > smax_val or umin_val > umax_val. However, there
may be a small risk of breakage of buggy programs, so handle
this more gracefully and in adjust_{ptr,scalar}_min_max_vals()
just taint the dst reg as unknown scalar when we see ops with
such kind of src reg.

Reported-by: syzbot+6d362cadd45dc0a12ba4@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-17 16:23:17 -08:00
Linus Torvalds
9a4ba2ab08 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "A delayacct statistics correctness fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  delayacct: Account blkio completion on the correct task
2018-01-17 12:28:22 -08:00
Linus Torvalds
b8c22594b1 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Two futex fixes: a input parameters robustness fix, and futex race
  fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent overflow by strengthen input validation
  futex: Avoid violating the 10th rule of futex
2018-01-17 12:24:42 -08:00
Linus Torvalds
dd43f3465d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A one-liner fix which prevents deferrable timers becoming stale when
  the system does not switch into NOHZ mode"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Unconditionally check deferrable base
2018-01-17 11:43:42 -08:00
Ingo Molnar
7a7368a5f2 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-17 17:20:08 +01:00
Kyungsik Lee
8ffdfe35b8 PM / hibernate: Drop unused parameter of enough_swap
Parameter flags is no longer used, remove it.

Signed-off-by: Kyungsik Lee <kyungsik.lee@lge.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-17 12:41:22 +01:00
David Howells
e1e871aff3 Expand INIT_STRUCT_PID and remove
Expand INIT_STRUCT_PID in the single place that uses it and then remove it.
There doesn't seem any point in the macro.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Will Deacon <will.deacon@arm.com> (arm64)
Tested-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2018-01-17 11:30:16 +00:00
David S. Miller
c02b3741eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes all over.

The mini-qdisc bits were a little bit tricky, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 00:10:42 -05:00
Jakub Kicinski
48b3256326 bpf: annotate bpf_insn_print_t with __printf
Functions of type bpf_insn_print_t take printf-like format
string, mark the type accordingly.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-17 01:15:05 +01:00
Jakub Kicinski
0a2d28ff51 bpf: offload: make bpf_offload_dev_match() reject host+host case
Daniel suggests it would be more logical for bpf_offload_dev_match()
to return false is either the program or the map are not offloaded,
rather than treating the both not offloaded case as a "matching
CPU/host device".

This makes no functional difference today, since verifier only calls
bpf_offload_dev_match() when one of the objects is offloaded.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-17 01:15:05 +01:00
Wei Yongjun
0fe875c5c7 bpf: cpumap: make some functions static
Fixes the following sparse warnings:

kernel/bpf/cpumap.c:146:6: warning:
 symbol '__cpu_map_queue_destructor' was not declared. Should it be static?
kernel/bpf/cpumap.c:225:16: warning:
 symbol 'cpu_map_build_skb' was not declared. Should it be static?
kernel/bpf/cpumap.c:340:26: warning:
 symbol '__cpu_map_entry_alloc' was not declared. Should it be static?
kernel/bpf/cpumap.c:398:6: warning:
 symbol '__cpu_map_entry_free' was not declared. Should it be static?
kernel/bpf/cpumap.c:441:6: warning:
 symbol '__cpu_map_entry_replace' was not declared. Should it be static?
kernel/bpf/cpumap.c:454:5: warning:
 symbol 'cpu_map_delete_elem' was not declared. Should it be static?
kernel/bpf/cpumap.c:467:5: warning:
 symbol 'cpu_map_update_elem' was not declared. Should it be static?
kernel/bpf/cpumap.c:505:6: warning:
 symbol 'cpu_map_free' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-17 00:12:58 +01:00
Daniel Borkmann
f37a8cb84c bpf: reject stores into ctx via st and xadd
Alexei found that verifier does not reject stores into context
via BPF_ST instead of BPF_STX. And while looking at it, we
also should not allow XADD variant of BPF_STX.

The context rewriter is only assuming either BPF_LDX_MEM- or
BPF_STX_MEM-type operations, thus reject anything other than
that so that assumptions in the rewriter properly hold. Add
test cases as well for BPF selftests.

Fixes: d691f9e8d4 ("bpf: allow programs to write to certain skb fields")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-16 15:04:58 -08:00
Linus Torvalds
b45a53be53 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Two read past end of buffer fixes in AF_KEY, from Eric Biggers.

 2) Memory leak in key_notify_policy(), from Steffen Klassert.

 3) Fix overflow with bpf arrays, from Daniel Borkmann.

 4) Fix RDMA regression with mlx5 due to mlx5 no longer using
    pci_irq_get_affinity(), from Saeed Mahameed.

 5) Missing RCU read locking in nl80211_send_iface() when it calls
    ieee80211_bss_get_ie(), from Dominik Brodowski.

 6) cfg80211 should check dev_set_name()'s return value, from Johannes
    Berg.

 7) Missing module license tag in 9p protocol, from Stephen Hemminger.

 8) Fix crash due to too small MTU in udp ipv6 sendmsg, from Mike
    Maloney.

 9) Fix endless loop in netlink extack code, from David Ahern.

10) TLS socket layer sets inverted error codes, resulting in an endless
    loop. From Robert Hering.

11) Revert openvswitch erspan tunnel support, it's mis-designed and we
    need to kill it before it goes into a real release. From William Tu.

12) Fix lan78xx failures in full speed USB mode, from Yuiko Oshino.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits)
  net, sched: fix panic when updating miniq {b,q}stats
  qed: Fix potential use-after-free in qed_spq_post()
  nfp: use the correct index for link speed table
  lan78xx: Fix failure in USB Full Speed
  sctp: do not allow the v4 socket to bind a v4mapped v6 address
  sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
  sctp: reinit stream if stream outcnt has been change by sinit in sendmsg
  ibmvnic: Fix pending MAC address changes
  netlink: extack: avoid parenthesized string constant warning
  ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
  net: Allow neigh contructor functions ability to modify the primary_key
  sh_eth: fix dumping ARSTR
  Revert "openvswitch: Add erspan tunnel support."
  net/tls: Fix inverted error codes to avoid endless loop
  ipv6: ip6_make_skb() needs to clear cork.base.dst
  sctp: avoid compiler warning on implicit fallthru
  net: ipv4: Make "ip route get" match iif lo rules again.
  netlink: extack needs to be reset each time through loop
  tipc: fix a memory leak in tipc_nl_node_get_link()
  ipv6: fix udpv6 sendmsg crash caused by too small MTU
  ...
2018-01-16 12:45:30 -08:00
Linus Torvalds
921d4f67bf This includes two fixes
- Bring back context level recursive protection in ring buffer.
    The simpler counter protection failed, due to a path when
    tracing with trace_clock_global() as it could not be reentrant
    and depended on the ring buffer recursive protection to keep that
    from happening.
 
  - Prevent branch profiling when FORTIFY_SOURCE is enabled. It causes
    50 - 60 MB in warning messages. Branch profiling should never be
    run on production systems, so there's no reason that it needs to
    be enabled with FORTIFY_SOURCE.
 -----BEGIN PGP SIGNATURE-----
 
 iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpdXGsUHHJvc3RlZHRA
 Z29vZG1pcy5vcmcACgkQ1a05Y9njSUmTIwv+JZPXGYdzZs2nAdCNKfC0bR0sM+MQ
 sqBN9En1UVOWn60LEt1lULTWh2AXzUKEDU3CkiDWwY44z9U3HZ4i/gbxIuxFLBUt
 oNKxIGkAz9F1yqqH5oZN+YeAgwc5XijglkrxmTUuSodW9ezCVT8gMr3sQpmnR9nk
 Jz2P/3XK2pbjLQQD8cmW4ZlWF2LMAnDGck8bWA1nFfe3TsZVCs56FK5vvAW+cUdF
 +cJHbfa3xfj93xHbUcpPJwdNjK8e14qsmNA56ukFp2zlq62Jnh8kDz3MXpOYGkYO
 fbXuPsopjtAZET3iQZEcwD5NygFF8E2cxB1qGbATXWTXYWbTO7GKQuUGSHrZGxnY
 DyWkEFSvL1oS3X54sGbnTiyE7iaSwWS2QAMeC/JkmsAQjgQ4H8KqxKLsyMpdZSEK
 IcoKGeLXqcSmu1mjrZ2PK+EKdKdONRm0bJXmjwkSTSQ12KXa5ubE3ahv3xpo7WD7
 Lye2b1FupNPZc6sjq0iyx83t0yfaUl01AOWm
 =qWqD
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.15-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Bring back context level recursive protection in ring buffer.

   The simpler counter protection failed, due to a path when tracing
   with trace_clock_global() as it could not be reentrant and depended
   on the ring buffer recursive protection to keep that from happening.

 - Prevent branch profiling when FORTIFY_SOURCE is enabled.

   It causes 50 - 60 MB in warning messages. Branch profiling should
   never be run on production systems, so there's no reason that it
   needs to be enabled with FORTIFY_SOURCE.

* tag 'trace-v4.15-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Prevent PROFILE_ALL_BRANCHES when FORTIFY_SOURCE=y
  ring-buffer: Bring back context level recursive checks
2018-01-16 12:09:36 -08:00
Eric W. Biederman
0752d7bf62 ptrace: Use copy_siginfo in setsiginfo and getsiginfo
Now that copy_siginfo copies all of the fields this is safe, safer (as
all of the bits are guaranteed to be copied), clearer, and less error
prone than using a structure copy.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-16 12:48:30 -06:00
Sergey Senozhatsky
fd5f7cde1b printk: Never set console_may_schedule in console_trylock()
This patch, basically, reverts commit 6b97a20d3a ("printk:
set may_schedule for some of console_trylock() callers").
That commit was a mistake, it introduced a big dependency
on the scheduler, by enabling preemption under console_sem
in printk()->console_unlock() path, which is rather too
critical. The patch did not significantly reduce the
possibilities of printk() lockups, but made it possible to
stall printk(), as has been reported by Tetsuo Handa [1].

Another issues is that preemption under console_sem also
messes up with Steven Rostedt's hand off scheme, by making
it possible to sleep with console_sem both in console_unlock()
and in vprintk_emit(), after acquiring the console_sem
ownership (anywhere between printk_safe_exit_irqrestore() in
console_trylock_spinning() and printk_safe_enter_irqsave()
in console_unlock()). This makes hand off less likely and,
at the same time, may result in a significant amount of
pending logbuf messages. Preempted console_sem owner makes
it impossible for other CPUs to emit logbuf messages, but
does not make it impossible for other CPUs to append new
messages to the logbuf.

Reinstate the old behavior and make printk() non-preemptible.
Should any printk() lockup reports arrive they must be handled
in a different way.

[1] http://lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL%20()%20I-love%20!%20SAKURA%20!%20ne%20!%20jp
Fixes: 6b97a20d3a ("printk: set may_schedule for some of console_trylock() callers")
Link: http://lkml.kernel.org/r/20180116044716.GE6607@jagdpanzerIV
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-16 17:21:28 +01:00
Petr Mladek
c162d5b433 printk: Hide console waiter logic into helpers
The commit ("printk: Add console owner and waiter logic to load balance
console writes") made vprintk_emit() and console_unlock() even more
complicated.

This patch extracts the new code into 3 helper functions. They should
help to keep it rather self-contained. It will be easier to use and
maintain.

This patch just shuffles the existing code. It does not change
the functionality.

Link: http://lkml.kernel.org/r/20180112160837.GD24497@linux.suse
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: rostedt@home.goodmis.org
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-16 17:21:16 +01:00
Steven Rostedt (VMware)
dbdda842fe printk: Add console owner and waiter logic to load balance console writes
This patch implements what I discussed in Kernel Summit. I added
lockdep annotation (hopefully correctly), and it hasn't had any splats
(since I fixed some bugs in the first iterations). It did catch
problems when I had the owner covering too much. But now that the owner
is only set when actively calling the consoles, lockdep has stayed
quiet.

Here's the design again:

I added a "console_owner" which is set to a task that is actively
writing to the consoles. It is *not* the same as the owner of the
console_lock. It is only set when doing the calls to the console
functions. It is protected by a console_owner_lock which is a raw spin
lock.

There is a console_waiter. This is set when there is an active console
owner that is not current, and waiter is not set. This too is protected
by console_owner_lock.

In printk() when it tries to write to the consoles, we have:

	if (console_trylock())
		console_unlock();

Now I added an else, which will check if there is an active owner, and
no current waiter. If that is the case, then console_waiter is set, and
the task goes into a spin until it is no longer set.

When the active console owner finishes writing the current message to
the consoles, it grabs the console_owner_lock and sees if there is a
waiter, and clears console_owner.

If there is a waiter, then it breaks out of the loop, clears the waiter
flag (because that will release the waiter from its spin), and exits.
Note, it does *not* release the console semaphore. Because it is a
semaphore, there is no owner. Another task may release it. This means
that the waiter is guaranteed to be the new console owner! Which it
becomes.

Then the waiter calls console_unlock() and continues to write to the
consoles.

If another task comes along and does a printk() it too can become the
new waiter, and we wash rinse and repeat!

By Petr Mladek about possible new deadlocks:

The thing is that we move console_sem only to printk() call
that normally calls console_unlock() as well. It means that
the transferred owner should not bring new type of dependencies.
As Steven said somewhere: "If there is a deadlock, it was
there even before."

We could look at it from this side. The possible deadlock would
look like:

CPU0                            CPU1

console_unlock()

  console_owner = current;

				spin_lockA()
				  printk()
				    spin = true;
				    while (...)

    call_console_drivers()
      spin_lockA()

This would be a deadlock. CPU0 would wait for the lock A.
While CPU1 would own the lockA and would wait for CPU0
to finish calling the console drivers and pass the console_sem
owner.

But if the above is true than the following scenario was
already possible before:

CPU0

spin_lockA()
  printk()
    console_unlock()
      call_console_drivers()
	spin_lockA()

By other words, this deadlock was there even before. Such
deadlocks are prevented by using printk_deferred() in
the sections guarded by the lock A.

By Steven Rostedt:

To demonstrate the issue, this module has been shown to lock up a
system with 4 CPUs and a slow console (like a serial console). It is
also able to lock up a 8 CPU system with only a fast (VGA) console, by
passing in "loops=100". The changes in this commit prevent this module
from locking up the system.

 #include <linux/module.h>
 #include <linux/delay.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
 #include <linux/hrtimer.h>

 static bool stop_testing;
 static unsigned int loops = 1;

 static void preempt_printk_workfn(struct work_struct *work)
 {
 	int i;

 	while (!READ_ONCE(stop_testing)) {
 		for (i = 0; i < loops && !READ_ONCE(stop_testing); i++) {
 			preempt_disable();
 			pr_emerg("%5d%-75s\n", smp_processor_id(),
 				 " XXX NOPREEMPT");
 			preempt_enable();
 		}
 		msleep(1);
 	}
 }

 static struct work_struct __percpu *works;

 static void finish(void)
 {
 	int cpu;

 	WRITE_ONCE(stop_testing, true);
 	for_each_online_cpu(cpu)
 		flush_work(per_cpu_ptr(works, cpu));
 	free_percpu(works);
 }

 static int __init test_init(void)
 {
 	int cpu;

 	works = alloc_percpu(struct work_struct);
 	if (!works)
 		return -ENOMEM;

 	/*
 	 * This is just a test module. This will break if you
 	 * do any CPU hot plugging between loading and
 	 * unloading the module.
 	 */

 	for_each_online_cpu(cpu) {
 		struct work_struct *work = per_cpu_ptr(works, cpu);

 		INIT_WORK(work, &preempt_printk_workfn);
 		schedule_work_on(cpu, work);
 	}

 	return 0;
 }

 static void __exit test_exit(void)
 {
 	finish();
 }

 module_param(loops, uint, 0);
 module_init(test_init);
 module_exit(test_exit);
 MODULE_LICENSE("GPL");

Link: http://lkml.kernel.org/r/20180110132418.7080-2-pmladek@suse.com
Cc: akpm@linux-foundation.org
Cc: linux-mm@kvack.org
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[pmladek@suse.com: Commit message about possible deadlocks]
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-16 17:21:09 +01:00
Sergey Senozhatsky
d2279c9d7f kallsyms: remove print_symbol() function
No more print_symbol()/__print_symbol() users left, remove these
symbols.

It was a very old API that encouraged people use continuous lines.
It had been obsoleted by %pS format specifier in a normal printk()
call.

Link: http://lkml.kernel.org/r/20180105102538.GC471@jagdpanzerIV
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-am33-list@redhat.com
Cc: linux-sh@vger.kernel.org
Cc: linux-edac@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Joe Perches <joe@perches.com>
[pmladek@suse.com: updated commit message]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-16 16:59:55 +01:00
Christian Borntraeger
1a5c79125a kvm_config: add CONFIG_S390_GUEST
make kvmconfig currently does not select CONFIG_S390_GUEST. Since
the virtio-ccw transport depends on CONFIG_S390_GUEST, we want
to add CONFIG_S390_GUEST to kvmconfig.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
2018-01-16 16:15:18 +01:00
Anna-Maria Gleixner
42f42da41b hrtimer: Implement SOFT/HARD clock base selection
All prerequisites to handle hrtimers for expiry in either hard or soft
interrupt context are in place.

Add the missing bit in hrtimer_init() which associates the timer to the
hard or the softirq clock base.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-30-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 09:51:22 +01:00
Anna-Maria Gleixner
5da7016046 hrtimer: Implement support for softirq based hrtimers
hrtimer callbacks are always invoked in hard interrupt context. Several
users in tree require soft interrupt context for their callbacks and
achieve this by combining a hrtimer with a tasklet. The hrtimer schedules
the tasklet in hard interrupt context and the tasklet callback gets invoked
in softirq context later.

That's suboptimal and aside of that the real-time patch moves most of the
hrtimers into softirq context. So adding native support for hrtimers
expiring in softirq context is a valuable extension for both mainline and
the RT patch set.

Each valid hrtimer clock id has two associated hrtimer clock bases: one for
timers expiring in hardirq context and one for timers expiring in softirq
context.

Implement the functionality to associate a hrtimer with the hard or softirq
related clock bases and update the relevant functions to take them into
account when the next expiry time needs to be evaluated.

Add a check into the hard interrupt context handler functions to check
whether the first expiring softirq based timer has expired. If it's expired
the softirq is raised and the accounting of softirq based timers to
evaluate the next expiry time for programming the timer hardware is skipped
until the softirq processing has finished. At the end of the softirq
processing the regular processing is resumed.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-29-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 09:51:22 +01:00
Josh Snyder
c96f5471ce delayacct: Account blkio completion on the correct task
Before commit:

  e33a9bba85 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

delayacct_blkio_end() was called after context-switching into the task which
completed I/O.

This resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.

With e33a9bba85, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete.

But delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
so that it can update the statistics of the correct task.

Signed-off-by: Josh Snyder <joshs@netflix.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Brendan Gregg <bgregg@netflix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-block@vger.kernel.org
Fixes: e33a9bba85 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 03:29:36 +01:00
Anna-Maria Gleixner
c458b1d102 hrtimer: Prepare handling of hard and softirq based hrtimers
The softirq based hrtimer can utilize most of the existing hrtimers
functions, but need to operate on a different data set.

Add an 'active_mask' parameter to various functions so the hard and soft bases
can be selected. Fixup the existing callers and hand in the ACTIVE_HARD
mask.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-28-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 03:01:20 +01:00
Anna-Maria Gleixner
98ecadd430 hrtimer: Add clock bases and hrtimer mode for softirq context
Currently hrtimer callback functions are always executed in hard interrupt
context. Users of hrtimers, which need their timer function to be executed
in soft interrupt context, make use of tasklets to get the proper context.

Add additional hrtimer clock bases for timers which must expire in softirq
context, so the detour via the tasklet can be avoided. This is also
required for RT, where the majority of hrtimer is moved into softirq
hrtimer context.

The selection of the expiry mode happens via a mode bit. Introduce
HRTIMER_MODE_SOFT and the matching combinations with the ABS/REL/PINNED
bits and update the decoding of hrtimer_mode in tracepoints.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-27-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 03:00:50 +01:00
Anna-Maria Gleixner
dd934aa8ad hrtimer: Use irqsave/irqrestore around __run_hrtimer()
__run_hrtimer() is called with the hrtimer_cpu_base.lock held and
interrupts disabled. Before invoking the timer callback the base lock is
dropped, but interrupts stay disabled.

The upcoming support for softirq based hrtimers requires that interrupts
are enabled before the timer callback is invoked.

To avoid code duplication, take hrtimer_cpu_base.lock with
raw_spin_lock_irqsave(flags) at the call site and hand in the flags as
a parameter. So raw_spin_unlock_irqrestore() before the callback invocation
will either keep interrupts disabled in interrupt context or restore to
interrupt enabled state when called from softirq context.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-26-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 03:00:47 +01:00
Anna-Maria Gleixner
ad38f596d8 hrtimer: Factor out __hrtimer_next_event_base()
Preparatory patch for softirq based hrtimers to avoid code duplication.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-25-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 03:00:43 +01:00
Eric W. Biederman
ea64d5acc8 signal: Unify and correct copy_siginfo_to_user32
Among the existing architecture specific versions of
copy_siginfo_to_user32 there are several different implementation
problems.  Some architectures fail to handle all of the cases in in
the siginfo union.  Some architectures perform a blind copy of the
siginfo union when the si_code is negative.  A blind copy suggests the
data is expected to be in 32bit siginfo format, which means that
receiving such a signal via signalfd won't work, or that the data is
in 64bit siginfo and the code is copying nonsense to userspace.

Create a single instance of copy_siginfo_to_user32 that all of the
architectures can share, and teach it to handle all of the cases in
the siginfo union correctly, with the assumption that siginfo is
stored internally to the kernel is 64bit siginfo format.

A special case is made for x86 x32 format.  This is needed as presence
of both x32 and ia32 on x86_64 results in two different 32bit signal
formats.  By allowing this small special case there winds up being
exactly one code base that needs to be maintained between all of the
architectures.  Vastly increasing the testing base and the chances of
finding bugs.

As the x86 copy of copy_siginfo_to_user32 the call of the x86
signal_compat_build_tests were moved into sigaction_compat_abi, so
that they will keep running.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 19:56:20 -06:00
Anna-Maria Gleixner
138a6b7ae4 hrtimer: Factor out __hrtimer_start_range_ns()
Preparatory patch for softirq based hrtimers to avoid code duplication,
factor out the __hrtimer_start_range_ns() function from hrtimer_start_range_ns().

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-24-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:59 +01:00
Anna-Maria Gleixner
3ec7a3ee9f hrtimer: Remove the 'base' parameter from hrtimer_reprogram()
hrtimer_reprogram() must have access to the hrtimer_clock_base of the new
first expiring timer to access hrtimer_clock_base.offset for adjusting the
expiry time to CLOCK_MONOTONIC. This is required to evaluate whether the
new left most timer in the hrtimer_clock_base is the first expiring timer
of all clock bases in a hrtimer_cpu_base.

The only user of hrtimer_reprogram() is hrtimer_start_range_ns(), which has
a pointer to hrtimer_clock_base() already and hands it in as a parameter. But
hrtimer_start_range_ns() will be split for the upcoming support for softirq
based hrtimers to avoid code duplication and will lose the direct access to
the clock base pointer.

Instead of handing in timer and timer->base as a parameter remove the base
parameter from hrtimer_reprogram() instead and retrieve the clock base internally.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-23-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:59 +01:00
Anna-Maria Gleixner
2ac2dccce9 hrtimer: Make remote enqueue decision less restrictive
The current decision whether a timer can be queued on a remote CPU checks
for timer->expiry <= remote_cpu_base.expires_next.

This is too restrictive because a timer with the same expiry time as an
existing timer will be enqueued on right-hand size of the existing timer
inside the rbtree, i.e. behind the first expiring timer.

So its safe to allow enqueuing timers with the same expiry time as the
first expiring timer on a remote CPU base.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-22-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:58 +01:00
Anna-Maria Gleixner
14c803419d hrtimer: Unify remote enqueue handling
hrtimer_reprogram() is conditionally invoked from hrtimer_start_range_ns()
when hrtimer_cpu_base.hres_active is true.

In the !hres_active case there is a special condition for the nohz_active
case:

  If the newly enqueued timer expires before the first expiring timer on a
  remote CPU then the remote CPU needs to be notified and woken up from a
  NOHZ idle sleep to take the new first expiring timer into account.

Previous changes have already established the prerequisites to make the
remote enqueue behaviour the same whether high resolution mode is active or
not:

  If the to be enqueued timer expires before the first expiring timer on a
  remote CPU, then it cannot be enqueued there.

This was done for the high resolution mode because there is no way to
access the remote CPU timer hardware. The same is true for NOHZ, but was
handled differently by unconditionally enqueuing the timer and waking up
the remote CPU so it can reprogram its timer. Again there is no compelling
reason for this difference.

hrtimer_check_target(), which makes the 'can remote enqueue' decision is
already unconditional, but not yet functional because nothing updates
hrtimer_cpu_base.expires_next in the !hres_active case.

To unify this the following changes are required:

 1) Make the store of the new first expiry time unconditonal in
    hrtimer_reprogram() and check __hrtimer_hres_active() before proceeding
    to the actual hardware access. This check also lets the compiler
    eliminate the rest of the function in case of CONFIG_HIGH_RES_TIMERS=n.

 2) Invoke hrtimer_reprogram() unconditionally from
    hrtimer_start_range_ns()

 3) Remove the remote wakeup special case for the !high_res && nohz_active
    case.

Confine the timers_nohz_active static key to timer.c which is the only user
now.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-21-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:58 +01:00
Anna-Maria Gleixner
61bb4bcb79 hrtimer: Unify hrtimer removal handling
When the first hrtimer on the current CPU is removed,
hrtimer_force_reprogram() is invoked but only when
CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active is set.

hrtimer_force_reprogram() updates hrtimer_cpu_base.expires_next and
reprograms the clock event device. When CONFIG_HIGH_RES_TIMERS=y and
hrtimer_cpu_base.hres_active is set, a pointless hrtimer interrupt can be
prevented.

hrtimer_check_target() makes the 'can remote enqueue' decision. As soon as
hrtimer_check_target() is unconditionally available and
hrtimer_cpu_base.expires_next is updated by hrtimer_reprogram(),
hrtimer_force_reprogram() needs to be available unconditionally as well to
prevent the following scenario with CONFIG_HIGH_RES_TIMERS=n:

- the first hrtimer on this CPU is removed and hrtimer_force_reprogram() is
  not executed

- CPU goes idle (next timer is calculated and hrtimers are taken into
  account)

- a hrtimer is enqueued remote on the idle CPU: hrtimer_check_target()
  compares expiry value and hrtimer_cpu_base.expires_next. The expiry value
  is after expires_next, so the hrtimer is enqueued. This timer will fire
  late, if it expires before the effective first hrtimer on this CPU and
  the comparison was with an outdated expires_next value.

To prevent this scenario, make hrtimer_force_reprogram() unconditional
except the effective reprogramming part, which gets eliminated by the
compiler in the CONFIG_HIGH_RES_TIMERS=n case.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-20-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:58 +01:00
Anna-Maria Gleixner
ebba2c723f hrtimer: Make hrtimer_force_reprogramm() unconditionally available
hrtimer_force_reprogram() needs to be available unconditionally for softirq
based hrtimers. Move the function and all required struct members out of
the CONFIG_HIGH_RES_TIMERS #ifdef.

There is no functional change because hrtimer_force_reprogram() is only
invoked when hrtimer_cpu_base.hres_active is true and
CONFIG_HIGH_RES_TIMERS=y.

Making it unconditional increases the text size for the
CONFIG_HIGH_RES_TIMERS=n case slightly, but avoids replication of that code
for the upcoming softirq based hrtimers support. Most of the code gets
eliminated in the CONFIG_HIGH_RES_TIMERS=n case by the compiler.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-19-anna-maria@linutronix.de
[ Made it build on !CONFIG_HIGH_RES_TIMERS ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:28 +01:00
Anna-Maria Gleixner
11a9fe069e hrtimer: Make hrtimer_reprogramm() unconditional
hrtimer_reprogram() needs to be available unconditionally for softirq based
hrtimers. Move the function and all required struct members out of the
CONFIG_HIGH_RES_TIMERS #ifdef.

There is no functional change because hrtimer_reprogram() is only invoked
when hrtimer_cpu_base.hres_active is true. Making it unconditional
increases the text size for the CONFIG_HIGH_RES_TIMERS=n case, but avoids
replication of that code for the upcoming softirq based hrtimers support.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-18-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
eb27926ba0 hrtimer: Make hrtimer_cpu_base.next_timer handling unconditional
hrtimer_cpu_base.next_timer stores the pointer to the next expiring timer
in a CPU base.

This pointer cannot be dereferenced and is solely used to check whether a
hrtimer which is removed is the hrtimer which is the first to expire in the
CPU base. If this is the case, then the timer hardware needs to be
reprogrammed to avoid an extra interrupt for nothing.

Again, this is conditional functionality, but there is no compelling reason
to make this conditional. As a preparation, hrtimer_cpu_base.next_timer
needs to be available unconditonally.

Aside of that the upcoming support for softirq based hrtimers requires access
to this pointer unconditionally as well, so our motivation is not entirely
simplicity based.

Make the update of hrtimer_cpu_base.next_timer unconditional and remove the
#ifdef cruft. The impact on CONFIG_HIGH_RES_TIMERS=n && CONFIG_NOHZ=n is
marginal as it's just a store on an already dirtied cacheline.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-17-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
07a9a7eae8 hrtimer: Make the remote enqueue check unconditional
hrtimer_cpu_base.expires_next is used to cache the next event armed in the
timer hardware. The value is used to check whether an hrtimer can be
enqueued remotely. If the new hrtimer is expiring before expires_next, then
remote enqueue is not possible as the remote hrtimer hardware cannot be
accessed for reprogramming to an earlier expiry time.

The remote enqueue check is currently conditional on
CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active. There is no
compelling reason to make this conditional.

Move hrtimer_cpu_base.expires_next out of the CONFIG_HIGH_RES_TIMERS=y
guarded area and remove the conditionals in hrtimer_check_target().

The check is currently a NOOP for the CONFIG_HIGH_RES_TIMERS=n and the
!hrtimer_cpu_base.hres_active case because in these cases nothing updates
hrtimer_cpu_base.expires_next yet. This will be changed with later patches
which further reduce the #ifdef zoo in this code.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-16-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
851cff8caf hrtimer: Use accesor functions instead of direct access
__hrtimer_hres_active() is now available unconditionally, so replace open
coded direct accesses to hrtimer_cpu_base.hres_active.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-15-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
28bfd18bf3 hrtimer: Make the hrtimer_cpu_base::hres_active field unconditional, to simplify the code
The hrtimer_cpu_base::hres_active_member field depends on CONFIG_HIGH_RES_TIMERS=y
currently, and all related functions to this member are conditional as well.

To simplify the code make it unconditional and set it to zero during initialization.

(This will also help with the upcoming softirq based hrtimers code.)

The conditional code sections can be avoided by adding IS_ENABLED(HIGHRES)
conditionals into common functions, which ensures dead code elimination.

There is no functional change.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-14-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
3f0b9e8eec hrtimer: Store running timer in hrtimer_clock_base
The pointer to the currently running timer is stored in hrtimer_cpu_base
before the base lock is dropped and the callback is invoked.

This results in two levels of indirections and the upcoming support for
softirq based hrtimer requires splitting the "running" storage into soft
and hard IRQ context expiry.

Storing both in the cpu base would require conditionals in all code paths
accessing that information.

It's possible to have a per clock base sequence count and running pointer
without changing the semantics of the related mechanisms because the timer
base pointer cannot be changed while a timer is running the callback.

Unfortunately this makes cpu_clock base larger than 32 bytes on 32-bit
kernels. Instead of having huge gaps due to alignment, remove the alignment
and let the compiler pack CPU base for 32-bit kernels. The resulting cache access
patterns are fortunately not really different from the current
behaviour. On 64-bit kernels the 64-byte alignment stays and the behaviour is
unchanged. This was determined by analyzing the resulting layout and
looking at the number of cache lines involved for the frequently used
clocks.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-12-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
c272ca58c3 hrtimer: Switch 'for' loop to _ffs() evaluation
Looping over all clock bases to find active bits is suboptimal if not all
bases are active.

Avoid this by converting it to a __ffs() evaluation. The functionallity is
outsourced into its own function and is called via a macro as suggested by
Peter Zijlstra.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-11-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
63e2ed3659 tracing/hrtimer: Print the hrtimer mode in the 'hrtimer_start' tracepoint
The 'hrtimer_start' tracepoint lacks the mode information. The mode is
important because consecutive starts can switch from ABS to REL or from
PINNED to non PINNED.

Append the mode field.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-10-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
48d0c9becc hrtimer: Ensure POSIX compliance (relative CLOCK_REALTIME hrtimers)
The POSIX specification defines that relative CLOCK_REALTIME timers are not
affected by clock modifications. Those timers have to use CLOCK_MONOTONIC
to ensure POSIX compliance.

The introduction of the additional HRTIMER_MODE_PINNED mode broke this
requirement for pinned timers.

There is no user space visible impact because user space timers are not
using pinned mode, but for consistency reasons this needs to be fixed.

Check whether the mode has the HRTIMER_MODE_REL bit set instead of
comparing with HRTIMER_MODE_ABS.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Fixes: 597d027573 ("timers: Framework for identifying pinned timers")
Link: http://lkml.kernel.org/r/20171221104205.7269-7-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:45 +01:00
Anna-Maria Gleixner
6de6250c75 hrtimer: Fix hrtimer_start[_range_ns]() function descriptions
The hrtimer_start[_range_ns]() functions start a timer reliably on this CPU only
when HRTIMER_MODE_PINNED is set.

Furthermore the HRTIMER_MODE_PINNED mode is not considered when a hrtimer is initialized.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-6-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:45 +01:00
Anna-Maria Gleixner
907777136f hrtimer: Clean up the 'int clock' parameter of schedule_hrtimeout_range_clock()
schedule_hrtimeout_range_clock() uses an 'int clock' parameter for the
clock ID, instead of the customary predefined "clockid_t" type.

In hrtimer coding style the canonical variable name for the clock ID is
'clock_id', therefore change the name of the parameter here as well
to make it all consistent.

While at it, clean up the description for the 'clock_id' and 'mode'
function parameters. The clock modes and the clock IDs are not
restricted as the comment suggests.

Fix the mode description as well for the callers of schedule_hrtimeout_range_clock().

No functional changes intended.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-5-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Thomas Gleixner
d05ca13b8d hrtimer: Correct blatantly incorrect comment
The protection of a hrtimer which runs its callback against migration to a
different CPU has nothing to do with hard interrupt context.

The protection against migration of a hrtimer running the expiry callback
is the pointer in the cpu_base which holds a pointer to the currently
running timer. This pointer is evaluated in the code which potentially
switches the timer base and makes sure it's kept on the CPU on which the
callback is running.

Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-3-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Thomas Gleixner
ae67badaa1 hrtimer: Optimize the hrtimer code by using static keys for migration_enable/nohz_active
The hrtimer_cpu_base::migration_enable and ::nohz_active fields
were originally introduced to avoid accessing global variables
for these decisions.

Still that results in a (cache hot) load and conditional branch,
which can be avoided by using static keys.

Implement it with static keys and optimize for the most critical
case of high performance networking which tends to disable the
timer migration functionality.

No change in functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1801142327490.2371@nanos
Link: https://lkml.kernel.org/r/20171221104205.7269-2-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Ingo Molnar
57957fb519 Merge branch 'timers/urgent' into timers/core, to pick up dependent fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:33:42 +01:00
Eric W. Biederman
eb5346c379 signal: Remove the code to clear siginfo before calling copy_siginfo_from_user32
The new unified copy_siginfo_from_user32 takes care of this.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 18:01:19 -06:00
Eric W. Biederman
212a36a17e signal: Unify and correct copy_siginfo_from_user32
The function copy_siginfo_from_user32 is used for two things, in ptrace
since the dawn of siginfo for arbirarily modifying a signal that
user space sees, and in sigqueueinfo to send a signal with arbirary
siginfo data.

Create a single copy of copy_siginfo_from_user32 that all architectures
share, and teach it to handle all of the cases in the siginfo union.

In the generic version of copy_siginfo_from_user32 ensure that all
of the fields in siginfo are initialized so that the siginfo structure
can be safely copied to userspace if necessary.

When copying the embedded sigval union copy the si_int member.  That
ensures the 32bit values passes through the kernel unchanged.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:55:59 -06:00
Eric W. Biederman
71ee78d538 signal/blackfin: Move the blackfin specific si_codes to asm-generic/siginfo.h
Having si_codes in many different files simply encourages duplicate definitions
that can cause problems later.  To avoid that merge the blackfin specific si_codes
into uapi/asm-generic/siginfo.h

Update copy_siginfo_to_user to copy with the absence of BUS_MCEERR_AR that blackfin
defines to be something else.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:35 -06:00
Kees Cook
5905429ad8 fork: Provide usercopy whitelisting for task_struct
While the blocked and saved_sigmask fields of task_struct are copied to
userspace (via sigmask_to_save() and setup_rt_frame()), it is always
copied with a static length (i.e. sizeof(sigset_t)).

The only portion of task_struct that is potentially dynamically sized and
may be copied to userspace is in the architecture-specific thread_struct
at the end of task_struct.

cache object allocation:
    kernel/fork.c:
        alloc_task_struct_node(...):
            return kmem_cache_alloc_node(task_struct_cachep, ...);

        dup_task_struct(...):
            ...
            tsk = alloc_task_struct_node(node);

        copy_process(...):
            ...
            dup_task_struct(...)

        _do_fork(...):
            ...
            copy_process(...)

example usage trace:

    arch/x86/kernel/fpu/signal.c:
        __fpu__restore_sig(...):
            ...
            struct task_struct *tsk = current;
            struct fpu *fpu = &tsk->thread.fpu;
            ...
            __copy_from_user(&fpu->state.xsave, ..., state_size);

        fpu__restore_sig(...):
            ...
            return __fpu__restore_sig(...);

    arch/x86/kernel/signal.c:
        restore_sigcontext(...):
            ...
            fpu__restore_sig(...)

This introduces arch_thread_struct_whitelist() to let an architecture
declare specifically where the whitelist should be within thread_struct.
If undefined, the entire thread_struct field is left whitelisted.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: "Mickaël Salaün" <mic@digikod.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
2018-01-15 12:08:04 -08:00
David Windsor
f9d29946c5 fork: Define usercopy region in thread_stack slab caches
In support of usercopy hardening, this patch defines a region in the
thread_stack slab caches in which userspace copy operations are allowed.
Since the entire thread_stack needs to be available to userspace, the
entire slab contents are whitelisted. Note that the slab-based thread
stack is only present on systems with THREAD_SIZE < PAGE_SIZE and
!CONFIG_VMAP_STACK.

cache object allocation:
    kernel/fork.c:
        alloc_thread_stack_node(...):
            return kmem_cache_alloc_node(thread_stack_cache, ...)

        dup_task_struct(...):
            ...
            stack = alloc_thread_stack_node(...)
            ...
            tsk->stack = stack;

        copy_process(...):
            ...
            dup_task_struct(...)

        _do_fork(...):
            ...
            copy_process(...)

This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split patch, provide usage trace]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
2018-01-15 12:08:03 -08:00
David Windsor
07dcd7fe89 fork: Define usercopy region in mm_struct slab caches
In support of usercopy hardening, this patch defines a region in the
mm_struct slab caches in which userspace copy operations are allowed.
Only the auxv field is copied to userspace.

cache object allocation:
    kernel/fork.c:
        #define allocate_mm()     (kmem_cache_alloc(mm_cachep, GFP_KERNEL))

        dup_mm():
            ...
            mm = allocate_mm();

        copy_mm(...):
            ...
            dup_mm();

        copy_process(...):
            ...
            copy_mm(...)

        _do_fork(...):
            ...
            copy_process(...)

example usage trace:

    fs/binfmt_elf.c:
        create_elf_tables(...):
            ...
            elf_info = (elf_addr_t *)current->mm->saved_auxv;
            ...
            copy_to_user(..., elf_info, ei_index * sizeof(elf_addr_t))

        load_elf_binary(...):
            ...
            create_elf_tables(...);

This region is known as the slab cache's usercopy region. Slab caches
can now check that each dynamically sized copy operation involving
cache-managed memory falls entirely within the slab's usercopy region.

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split patch, provide usage trace]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
2018-01-15 12:08:02 -08:00
Namit Gupta
1323eac7fd ftrace/module: Move ftrace_release_mod() to ddebug_cleanup label
ftrace_module_init happen after dynamic_debug_setup, it is desired that
cleanup should be called after this label however in current implementation
it is called in free module label,ie:even though ftrace in not initialized,
from so many fail case ftrace_release_mod() will be called and unnecessary
traverse the whole list.
In below patch we moved ftrace_release_mod() from free_module label to
ddebug_cleanup label. that is the best possible location, other solution
is to make new label to ftrace_release_mod() but since ftrace_module_init()
is not return with minimum changes it should be in ddebug_cleanup label.

Signed-off-by: Namit Gupta <gupta.namit@samsung.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2018-01-15 20:44:21 +01:00
Randy Dunlap
68e76e034b tracing: Prevent PROFILE_ALL_BRANCHES when FORTIFY_SOURCE=y
I regularly get 50 MB - 60 MB files during kernel randconfig builds.
These large files mostly contain (many repeats of; e.g., 124,594):

In file included from ../include/linux/string.h:6:0,
                 from ../include/linux/uuid.h:20,
                 from ../include/linux/mod_devicetable.h:13,
                 from ../scripts/mod/devicetable-offsets.c:3:
../include/linux/compiler.h:64:4: warning: '______f' is static but declared in inline function 'strcpy' which is not static [enabled by default]
    ______f = {     \
    ^
../include/linux/compiler.h:56:23: note: in expansion of macro '__trace_if'
                       ^
../include/linux/string.h:425:2: note: in expansion of macro 'if'
  if (p_size == (size_t)-1 && q_size == (size_t)-1)
  ^

This only happens when CONFIG_FORTIFY_SOURCE=y and
CONFIG_PROFILE_ALL_BRANCHES=y, so prevent PROFILE_ALL_BRANCHES if
FORTIFY_SOURCE=y.

Link: http://lkml.kernel.org/r/9199446b-a141-c0c3-9678-a3f9107f2750@infradead.org

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-15 14:15:31 -05:00
Steven Rostedt (VMware)
a0e3a18f4b ring-buffer: Bring back context level recursive checks
Commit 1a149d7d3f ("ring-buffer: Rewrite trace_recursive_(un)lock() to be
simpler") replaced the context level recursion checks with a simple counter.
This would prevent the ring buffer code from recursively calling itself more
than the max number of contexts that exist (Normal, softirq, irq, nmi). But
this change caused a lockup in a specific case, which was during suspend and
resume using a global clock. Adding a stack dump to see where this occurred,
the issue was in the trace global clock itself:

  trace_buffer_lock_reserve+0x1c/0x50
  __trace_graph_entry+0x2d/0x90
  trace_graph_entry+0xe8/0x200
  prepare_ftrace_return+0x69/0xc0
  ftrace_graph_caller+0x78/0xa8
  queued_spin_lock_slowpath+0x5/0x1d0
  trace_clock_global+0xb0/0xc0
  ring_buffer_lock_reserve+0xf9/0x390

The function graph tracer traced queued_spin_lock_slowpath that was called
by trace_clock_global. This pointed out that the trace_clock_global() is not
reentrant, as it takes a spin lock. It depended on the ring buffer recursive
lock from letting that happen.

By removing the context detection and adding just a max number of allowable
recursions, it allowed the trace_clock_global() to be entered again and try
to retake the spinlock it already held, causing a deadlock.

Fixes: 1a149d7d3f ("ring-buffer: Rewrite trace_recursive_(un)lock() to be simpler")
Reported-by: David Weinehall <david.weinehall@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-01-15 12:28:06 -05:00
NeilBrown
6106c0f824 staging: lustre: lnet: convert selftest to use workqueues
Instead of the cfs workitem library, use workqueues.

As lnet wants to provide a cpu mask of allowed cpus, it
needs to be a WQ_UNBOUND work queue so that tasks can
run on cpus other than where they were submitted.

This patch also exported apply_workqueue_attrs() which is
a documented part of the workqueue API, that isn't currently
exported.  lustre needs it to allow workqueue thread to be limited
to a subset of CPUs.

Acked-by: Tejun Heo <tj@kernel.org> (for export of apply_workqueue_attrs)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-15 15:44:08 +01:00
Jakub Kicinski
a38845729e bpf: offload: add map offload infrastructure
BPF map offload follow similar path to program offload.  At creation
time users may specify ifindex of the device on which they want to
create the map.  Map will be validated by the kernel's
.map_alloc_check callback and device driver will be called for the
actual allocation.  Map will have an empty set of operations
associated with it (save for alloc and free callbacks).  The real
device callbacks are kept in map->offload->dev_ops because they
have slightly different signatures.  Map operations are called in
process context so the driver may communicate with HW freely,
msleep(), wait() etc.

Map alloc and free callbacks are muxed via existing .ndo_bpf, and
are always called with rtnl lock held.  Maps and programs are
guaranteed to be destroyed before .ndo_uninit (i.e. before
unregister_netdev() returns).  Map callbacks are invoked with
bpf_devs_lock *read* locked, drivers must take care of exclusive
locking if necessary.

All offload-specific branches are marked with unlikely() (through
bpf_map_is_dev_bound()), given that branch penalty will be
negligible compared to IO anyway, and we don't want to penalize
SW path unnecessarily.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:30 +01:00
Jakub Kicinski
5bc2d55c6a bpf: offload: factor out netdev checking at allocation time
Add a helper to check if netdev could be found and whether it
has .ndo_bpf callback.  There is no need to check the callback
every time it's invoked, ndos can't reasonably be swapped for
a set without .ndp_bpf while program is loaded.

bpf_dev_offload_check() will also be used by map offload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:29 +01:00
Jakub Kicinski
0a9c1991f2 bpf: rename bpf_dev_offload -> bpf_prog_offload
With map offload coming, we need to call program offload structure
something less ambiguous.  Pure rename, no functional changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:29 +01:00
Jakub Kicinski
bd475643d7 bpf: add helper for copying attrs to struct bpf_map
All map types reimplement the field-by-field copy of union bpf_attr
members into struct bpf_map.  Add a helper to perform this operation.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:29 +01:00
Jakub Kicinski
9328e0d1bc bpf: hashtab: move checks out of alloc function
Use the new callback to perform allocation checks for hash maps.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:29 +01:00
Jakub Kicinski
daffc5a2e6 bpf: hashtab: move attribute validation before allocation
Number of attribute checks are currently performed after hashtab
is already allocated.  Move them to be able to split them out to
the check function later on.  Checks have to now be performed on
the attr union directly instead of the members of bpf_map, since
bpf_map will be allocated later.  No functional changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:29 +01:00
Jakub Kicinski
1110f3a9bc bpf: add map_alloc_check callback
.map_alloc callbacks contain a number of checks validating user-
-provided map attributes against constraints of a particular map
type.  For offloaded maps we will need to check map attributes
without actually allocating any memory on the host.  Add a new
callback for validating attributes before any memory is allocated.
This callback can be selectively implemented by map types for
sharing code with offloads, or simply to separate the logical
steps of validation and allocation.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:36:29 +01:00
Thomas Gleixner
ed4bbf7910 timers: Unconditionally check deferrable base
When the timer base is checked for expired timers then the deferrable base
must be checked as well. This was missed when making the deferrable base
independent of base::nohz_active.

Fixes: ced6d5c11d ("timers: Use deferrable base independent of base::nohz_active")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
2018-01-14 23:25:33 +01:00
Alexei Starovoitov
68fda450a7 bpf: fix 32-bit divide by zero
due to some JITs doing if (src_reg == 0) check in 64-bit mode
for div/mod operations mask upper 32-bits of src register
before doing the check

Fixes: 622582786c ("net: filter: x86: internal BPF JIT")
Fixes: 7a12b5031c ("sparc64: Add eBPF JIT.")
Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-14 23:05:33 +01:00
Max R. P. Grossmann
a9445e47d8 posix-cpu-timers: Make set_process_cpu_timer() more robust
Because the return value of cpu_timer_sample_group() is not checked,
compilers and static checkers can legitimately warn about a potential use
of the uninitialized variable 'now'. This is not a runtime issue as all call
sites hand in valid clock ids.

Also cpu_timer_sample_group() is invoked unconditionally even when the
result is not used because *oldval is NULL.

Make the invocation conditional and check the return value.

[ tglx: Massage changelog ]

Signed-off-by: Max R. P. Grossmann <m@max.pm>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/20180108190157.10048-1-m@max.pm
2018-01-14 20:50:59 +01:00
Li Jinyue
fbe0e839d1 futex: Prevent overflow by strengthen input validation
UBSAN reports signed integer overflow in kernel/futex.c:

 UBSAN: Undefined behaviour in kernel/futex.c:2041:18
 signed integer overflow:
 0 - -2147483648 cannot be represented in type 'int'

Add a sanity check to catch negative values of nr_wake and nr_requeue.

Signed-off-by: Li Jinyue <lijinyue@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: dvhart@infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
2018-01-14 18:55:03 +01:00
Peter Zijlstra
c1e2f0eaf0 futex: Avoid violating the 10th rule of futex
Julia reported futex state corruption in the following scenario:

   waiter                                  waker                                            stealer (prio > waiter)

   futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
         timeout=[N ms])
      futex_wait_requeue_pi()
         futex_wait_queue_me()
            freezable_schedule()
            <scheduled out>
                                           futex(LOCK_PI, uaddr2)
                                           futex(CMP_REQUEUE_PI, uaddr,
                                                 uaddr2, 1, 0)
                                              /* requeues waiter to uaddr2 */
                                           futex(UNLOCK_PI, uaddr2)
                                                 wake_futex_pi()
                                                    cmp_futex_value_locked(uaddr2, waiter)
                                                    wake_up_q()
           <woken by waker>
           <hrtimer_wakeup() fires,
            clears sleeper->task>
                                                                                           futex(LOCK_PI, uaddr2)
                                                                                              __rt_mutex_start_proxy_lock()
                                                                                                 try_to_take_rt_mutex() /* steals lock */
                                                                                                    rt_mutex_set_owner(lock, stealer)
                                                                                              <preempted>
         <scheduled in>
         rt_mutex_wait_proxy_lock()
            __rt_mutex_slowlock()
               try_to_take_rt_mutex() /* fails, lock held by stealer */
               if (timeout && !timeout->task)
                  return -ETIMEDOUT;
            fixup_owner()
               /* lock wasn't acquired, so,
                  fixup_pi_state_owner skipped */

   return -ETIMEDOUT;

   /* At this point, we've returned -ETIMEDOUT to userspace, but the
    * futex word shows waiter to be the owner, and the pi_mutex has
    * stealer as the owner */

   futex_lock(LOCK_PI, uaddr2)
     -> bails with EDEADLK, futex word says we're owner.

And suggested that what commit:

  73d786bd04 ("futex: Rework inconsistent rt_mutex/futex_q state")

removes from fixup_owner() looks to be just what is needed. And indeed
it is -- I completely missed that requeue_pi could also result in this
case. So we need to restore that, except that subsequent patches, like
commit:

  16ffa12d74 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")

changed all the locking rules. Even without that, the sequence:

-               if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-                       locked = 1;
-                       goto out;
-               }

-               raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-               owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-               if (!owner)
-                       owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-               raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-               ret = fixup_pi_state_owner(uaddr, q, owner);

already suggests there were races; otherwise we'd never have to look
at next_owner.

So instead of doing 3 consecutive wait_lock sections with who knows
what races, we do it all in a single section. Additionally, the usage
of pi_state->owner in fixup_owner() was only safe because only the
rt_mutex owner would modify it, which this additional case wrecks.

Luckily the values can only change away and not to the value we're
testing, this means we can do a speculative test and double check once
we have the wait_lock.

Fixes: 73d786bd04 ("futex: Rework inconsistent rt_mutex/futex_q state")
Reported-by: Julia Cartwright <julia@ni.com>
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Julia Cartwright <julia@ni.com>
Tested-by: Gratian Crisan <gratian.crisan@ni.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
2018-01-14 18:49:16 +01:00
Eric Dumazet
c366287ebd bpf: fix divides by zero
Divides by zero are not nice, lets avoid them if possible.

Also do_div() seems not needed when dealing with 32bit operands,
but this seems a minor detail.

Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-14 09:03:43 -08:00
Linus Torvalds
ed93de8420 Merge branch 'akpm' (patches from Andrew)
Merge misc fixlets from Andrew Morton:
 "4 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  tools/objtool/Makefile: don't assume sync-check.sh is executable
  kdump: write correct address of mem_section into vmcoreinfo
  kmemleak: allow to coexist with fault injection
  MAINTAINERS, nilfs2: change project home URLs
2018-01-13 11:07:55 -08:00
Kirill A. Shutemov
a0b1280368 kdump: write correct address of mem_section into vmcoreinfo
Depending on configuration mem_section can now be an array or a pointer
to an array allocated dynamically.  In most cases, we can continue to
refer to it as 'mem_section' regardless of what it is.

But there's one exception: '&mem_section' means "address of the array"
if mem_section is an array, but if mem_section is a pointer, it would
mean "address of the pointer".

We've stepped onto this in kdump code.  VMCOREINFO_SYMBOL(mem_section)
writes down address of pointer into vmcoreinfo, not array as we wanted.

Let's introduce VMCOREINFO_SYMBOL_ARRAY() that would handle the
situation correctly for both cases.

Link: http://lkml.kernel.org/r/20180112162532.35896-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-13 10:42:48 -08:00
Masami Hiramatsu
4b1a29a7f5 error-injection: Support fault injection framework
Support in-kernel fault-injection framework via debugfs.
This allows you to inject a conditional error to specified
function using debugfs interfaces.

Here is the result of test script described in
Documentation/fault-injection/fault-injection.txt

  ===========
  # ./test_fail_function.sh
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227404 s, 46.1 MB/s
  btrfs-progs v4.4
  See http://btrfs.wiki.kernel.org for more information.

  Label:              (null)
  UUID:               bfa96010-12e9-4360-aed0-42eec7af5798
  Node size:          16384
  Sector size:        4096
  Filesystem size:    1001.00MiB
  Block group profiles:
    Data:             single            8.00MiB
    Metadata:         DUP              58.00MiB
    System:           DUP              12.00MiB
  SSD detected:       no
  Incompat features:  extref, skinny-metadata
  Number of devices:  1
  Devices:
     ID        SIZE  PATH
      1  1001.00MiB  /dev/loop2

  mount: mount /dev/loop2 on /opt/tmpmnt failed: Cannot allocate memory
  SUCCESS!
  ===========

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:38 -08:00
Masami Hiramatsu
540adea380 error-injection: Separate error-injection from kprobe
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.

Some differences has been made:

- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
  ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
  feature. It is automatically enabled if the arch supports
  error injection feature for kprobe or ftrace etc.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:38 -08:00
Masami Hiramatsu
66665ad2f1 tracing/kprobe: bpf: Compare instruction pointer with original one
Compare instruction pointer with original one on the
stack instead using per-cpu bpf_kprobe_override flag.

This patch also consolidates reset_current_kprobe() and
preempt_enable_no_resched() blocks. Those can be done
in one place.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:38 -08:00
Masami Hiramatsu
b4da3340ea tracing/kprobe: bpf: Check error injectable event is on function entry
Check whether error injectable event is on function entry or not.
Currently it checks the event is ftrace-based kprobes or not,
but that is wrong. It should check if the event is on the entry
of target function. Since error injection will override a function
to just return with modified return value, that operation must
be done before the target function starts making stackframe.

As a side effect, bpf error injection is no need to depend on
function-tracer. It can work with sw-breakpoint based kprobe
events too.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-12 17:33:37 -08:00
Eric W. Biederman
0326e7ef05 signal: Remove unnecessary ifdefs now that there is only one struct siginfo
Remove HAVE_ARCH_SIGINFO_T
Remove __ARCH_SIGSYS

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:34:49 -06:00
Eric W. Biederman
30073566ca signal/ia64: switch the last arch-specific copy_siginfo_to_user() to generic version
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:34:47 -06:00
Eric W. Biederman
9943d3accb signal: Clear si_sys_private before copying siginfo to userspace
In preparation for unconditionally copying the whole of siginfo
to userspace clear si_sys_private.  So this kernel internal
value is guaranteed not to make it to userspace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:34:45 -06:00
Eric W. Biederman
aba1be2f8c signal: Ensure no siginfo union member increases the size of struct siginfo
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:34:45 -06:00
Eric W. Biederman
faf1f22b61 signal: Ensure generic siginfos the kernel sends have all bits initialized
Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.

This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.

This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:21:07 -06:00
Eric W. Biederman
526c3ddb6a signal/arm64: Document conflicts with SI_USER and SIGFPE,SIGTRAP,SIGBUS
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER.  Posix and common sense requires
that SI_USER not be a signal specific si_code.  As such this use of 0
for the si_code is a pretty horribly broken ABI.

Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL with the result
that uid and pid fields are copied and which might copying the si_addr
field by accident but certainly not by design.  Making this a very
flakey implementation.

Utilizing FPE_FIXME, BUS_FIXME, TRAP_FIXME siginfo_layout will now return
SIL_FAULT and the appropriate fields will be reliably copied.

But folks this is a new and unique kind of bad.  This is massively
untested code bad.  This is inventing new and unique was to get
siginfo wrong bad.  This is don't even think about Posix or what
siginfo means bad.  This is lots of eyeballs all missing the fact
that the code does the wrong thing bad.  This is getting stuck
and keep making the same mistake bad.

I really hope we can find a non userspace breaking fix for this on a
port as new as arm64.

Possible ABI fixes include:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tyler Baicar <tbaicar@codeaurora.org>
Cc: James Morse <james.morse@arm.com>
Cc: Tony Lindgren <tony@atomide.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Olof Johansson <olof@lixom.net>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arm-kernel@lists.infradead.org
Ref: 53631b54c8 ("arm64: Floating point and SIMD")
Ref: 32015c2356 ("arm64: exception: handle Synchronous External Abort")
Ref: 1d18c47c73 ("arm64: MMU fault handling and page table management")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:21:05 -06:00
Sergey Senozhatsky
62635ea8c1 workqueue: avoid hard lockups in show_workqueue_state()
show_workqueue_state() can print out a lot of messages while being in
atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console
device is slow it may end up triggering NMI hard lockup watchdog.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
2018-01-12 11:39:49 -08:00
Linus Torvalds
67549d46d4 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A Kconfig fix, a build fix and a membarrier bug fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  membarrier: Disable preemption when calling smp_call_function_many()
  sched/isolation: Make CONFIG_CPU_ISOLATION=y depend on SMP or COMPILE_TEST
  ia64, sched/cputime: Fix build error if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
2018-01-12 10:23:59 -08:00
Linus Torvalds
02776b9b53 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "No functional effects intended: removes leftovers from recent lockdep
  and refcounts work"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/refcounts: Remove stale comment from the ARCH_HAS_REFCOUNT Kconfig entry
  locking/lockdep: Remove cross-release leftovers
  locking/Documentation: Remove stale crossrelease_fullstack parameter
2018-01-12 10:14:09 -08:00
Christoph Hellwig
84676c1f21 genirq/affinity: assign vectors to all possible CPUs
Currently we assign managed interrupt vectors to all present CPUs.  This
works fine for systems were we only online/offline CPUs.  But in case of
systems that support physical CPU hotplug (or the virtualized version of
it) this means the additional CPUs covered for in the ACPI tables or on
the command line are not catered for.  To fix this we'd either need to
introduce new hotplug CPU states just for this case, or we can start
assining vectors to possible but not present CPUs.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Fixes: 4b855ad371 ("blk-mq: Create hctx for each present CPU")
Cc: linux-kernel@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-12 11:01:38 -07:00
David S. Miller
19d28fbd30 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
BPF alignment tests got a conflict because the registers
are output as Rn_w instead of just Rn in net-next, and
in net a fixup for a testcase prohibits logical operations
on pointers before using them.

Also, we should attempt to patch BPF call args if JIT always on is
enabled.  Instead, if we fail to JIT the subprogs we should pass
an error back up and fail immediately.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-11 22:13:42 -05:00
David S. Miller
8c2e6c904f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-01-11

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Various BPF related improvements and fixes to nfp driver: i) do
   not register XDP RXQ structure to control queues, ii) round up
   program stack size to word size for nfp, iii) restrict MTU changes
   when BPF offload is active, iv) add more fully featured relocation
   support to JIT, v) add support for signed compare instructions to
   the nfp JIT, vi) export and reuse verfier log routine for nfp, and
   many more, from Jakub, Quentin and Nic.

2) Fix a syzkaller reported GPF in BPF's copy_verifier_state() when
   we hit kmalloc failure path, from Alexei.

3) Add two follow-up fixes for the recent XDP RXQ series: i) kvzalloc()
   allocated memory was only kfree()'ed, and ii) fix a memory leak where
   RX queue was not freed in netif_free_rx_queues(), from Jakub.

4) Add a sample for transferring XDP meta data into the skb, here it
   is used for setting skb->mark with the buffer from XDP, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-11 13:59:41 -05:00
Miroslav Benes
8869016d3a livepatch: add locking to force and signal functions
klp_send_signals() and klp_force_transition() do not acquire klp_mutex,
because it seemed to be superfluous. A potential race in
klp_send_signals() was harmless and there was nothing in
klp_force_transition() which needed to be synchronized. That changed
with the addition of klp_forced variable during the review process.

There is a small window now, when klp_complete_transition() does not see
klp_forced set to true while all tasks have been already transitioned to
the target state. module_put() is called and the module can be removed.

Acquire klp_mutex in sysfs callback to prevent it. Do the same for the
signal sending just to be sure. There is no real downside to that.

Fixes: c99a2be790 ("livepatch: force transition to finish")
Fixes: 43347d56c8 ("livepatch: send a fake signal to all blocking tasks")
Reported-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-01-11 17:36:07 +01:00
Miroslav Benes
d0807da78e livepatch: Remove immediate feature
Immediate flag has been used to disable per-task consistency and patch
all tasks immediately. It could be useful if the patch doesn't change any
function or data semantics.

However, it causes problems on its own. The consistency problem is
currently broken with respect to immediate patches.

func            a
patches         1i
                2i
                3

When the patch 3 is applied, only 2i function is checked (by stack
checking facility). There might be a task sleeping in 1i though. Such
task is migrated to 3, because we do not check 1i in
klp_check_stack_func() at all.

Coming atomic replace feature would be easier to implement and more
reliable without immediate.

Thus, remove immediate feature completely and save us from the problems.

Note that force feature has the similar problem. However it is
considered as a last resort. If used, administrator should not apply any
new live patches and should plan for reboot into an updated kernel.

The architectures would now need to provide HAVE_RELIABLE_STACKTRACE to
fully support livepatch.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2018-01-11 10:58:03 +01:00
Daniel Borkmann
bbeb6e4323 bpf, array: fix overflow in max_entries and undefined behavior in index_mask
syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
and thus unprivileged. With the recently added logic in b2157399cc
("bpf: prevent out-of-bounds speculation") we round this up to the next
power of two value for max_entries for unprivileged such that we can
apply proper masking into potentially zeroed out map slots.

However, this will generate an index_mask of 0xffffffff, and therefore
a + 1 will let this overflow into new max_entries of 0. This will pass
allocation, etc, and later on map access we still enforce on the original
attr->max_entries value which was 0xfffffffd, therefore triggering GPF
all over the place. Thus bail out on overflow in such case.

Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
space is undefined. Therefore, do this by hand in a 64 bit variable.

This fixes all the issues triggered by syzkaller's reproducers.

Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-10 14:46:39 -08:00
Daniel Borkmann
7891a87efc bpf: arsh is not supported in 32 bit alu thus reject it
The following snippet was throwing an 'unknown opcode cc' warning
in BPF interpreter:

  0: (18) r0 = 0x0
  2: (7b) *(u64 *)(r10 -16) = r0
  3: (cc) (u32) r0 s>>= (u32) r0
  4: (95) exit

Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
generation, not all of them do and interpreter does neither. We can
leave existing ones and implement it later in bpf-next for the
remaining ones, but reject this properly in verifier for the time
being.

Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-10 14:42:22 -08:00
Colin Ian King
4095034393 bpf: fix spelling mistake: "obusing" -> "abusing"
Trivial fix to spelling mistake in error message text.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-10 14:32:59 -08:00
Roman Gushchin
4f58424da3 cgroup: make cgroup.threads delegatable
Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.

Signed-off-by: Roman Gushchin <guro@fb.com>
Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-01-10 09:42:32 -08:00
David S. Miller
661e4e33a9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-01-09

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Prevent out-of-bounds speculation in BPF maps by masking the
   index after bounds checks in order to fix spectre v1, and
   add an option BPF_JIT_ALWAYS_ON into Kconfig that allows for
   removing the BPF interpreter from the kernel in favor of
   JIT-only mode to make spectre v2 harder, from Alexei.

2) Remove false sharing of map refcount with max_entries which
   was used in spectre v1, from Daniel.

3) Add a missing NULL psock check in sockmap in order to fix
   a race, from John.

4) Fix test_align BPF selftest case since a recent change in
   verifier rejects the bit-wise arithmetic on pointers
   earlier but test_align update was missing, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 11:17:21 -05:00
Quentin Monnet
430e68d10b bpf: export function to write into verifier log buffer
Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
export it, so that other components (in particular, drivers for BPF
offload) can reuse the user buffer log to dump error messages at
verification time.

Renaming `verbose()` was necessary in order to avoid a name so generic
to be exported to the global namespace. However to prevent too much pain
for backports, the calls to `verbose()` in the kernel BPF verifier were
not changed. Instead, use function aliasing to make `verbose` point to
`bpf_verifier_log_write`. Another solution could consist in making a
wrapper around `verbose()`, but since it is a variadic function, I don't
see a clean way without creating two identical wrappers, one for the
verifier and one to export.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10 13:49:36 +01:00
Juri Lelli
07881166a8 sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth
enforcement (similar to what we already do to fair utilization tracking).

Each delta_exec gets scaled considering current frequency and maximum
CPU capacity; which means that the reservation runtime parameter (that
need to be specified profiling the task execution at max frequency on
biggest capacity core) gets thus scaled accordingly.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:35 +01:00
Juri Lelli
7e1a9208f6 sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP
Currently, frequency and cpu capacity scaling is only performed on
CONFIG_SMP systems (as CFS PELT signals are only present for such
systems). However, other scheduling classes want to do freq/cpu scaling,
and for !CONFIG_SMP configurations as well.

arch_scale_freq_capacity() is useful to implement frequency scaling even
on !CONFIG_SMP platforms, so we simply move it outside CONFIG_SMP
ifdeffery.

Even if arch_scale_cpu_capacity() is not useful on !CONFIG_SMP platforms,
we make a default implementation available for such configurations anyway
to simplify scheduler code doing CPU scale invariance.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-8-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:35 +01:00
Juri Lelli
7673c8a4c7 sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to
see where information coming from scheduling domains might help doing
frequency invariance scaling).

Remove it; also in anticipation of moving arch_scale_freq_capacity()
outside CONFIG_SMP.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:34 +01:00
Juri Lelli
0fa7d181f1 sched/cpufreq: Always consider all CPUs when deciding next freq
No assumption can be made upon the rate at which frequency updates get
triggered, as there are scheduling policies (like SCHED_DEADLINE) which
don't trigger them so frequently.

Remove such assumption from the code, by always considering
SCHED_DEADLINE utilization signal as not stale.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-6-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:34 +01:00
Juri Lelli
d18be45dbf sched/cpufreq: Split utilization signals
To be able to treat utilization signals of different scheduling classes
in different ways (e.g., CFS signal might be stale while DEADLINE signal
is never stale by design) we need to split sugov_cpu::util signal in two:
util_cfs and util_dl.

This patch does that by also changing sugov_get_util() parameter list.
After this change, aggregation of the different signals has to be performed
by sugov_get_util() users (so that they can decide what to do with the
different signals).

Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-5-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:34 +01:00
Juri Lelli
794a56ebd9 sched/cpufreq: Change the worker kthread to SCHED_DEADLINE
Worker kthread needs to be able to change frequency for all other
threads.

Make it special, just under STOP class.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Claudio Scordino <claudio@evidence.eu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 12:53:29 +01:00
Juri Lelli
e0367b1267 sched/deadline: Move CPU frequency selection triggering points
Since SCHED_DEADLINE doesn't track utilization signal (but reserves a
fraction of CPU bandwidth to tasks admitted to the system), there is no
point in evaluating frequency changes during each tick event.

Move frequency selection triggering points to where running_bw changes.

Co-authored-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-3-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:32 +01:00
Juri Lelli
d4edd662ac sched/cpufreq: Use the DEADLINE utilization signal
SCHED_DEADLINE tracks active utilization signal with a per dl_rq
variable named running_bw.

Make use of that to drive CPU frequency selection: add up FAIR and
DEADLINE contribution to get the required CPU capacity to handle both
requirements (while RT still selects max frequency).

Co-authored-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alessio.balsini@arm.com
Cc: bristot@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: mathieu.poirier@linaro.org
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: rjw@rjwysocki.net
Cc: rostedt@goodmis.org
Cc: tkjos@android.com
Cc: tommaso.cucinotta@santannapisa.it
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20171204102325.5110-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:32 +01:00
Juri Lelli
34be39305a sched/deadline: Implement "runtime overrun signal" support
This patch adds the possibility of getting the delivery of a SIGXCPU
signal whenever there is a runtime overrun. The request is done through
the sched_flags field within the sched_attr structure.

Forward port of https://lkml.org/lkml/2009/10/16/170

Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/1513077024-25461-1-git-send-email-claudio@evidence.eu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:31 +01:00
Mel Gorman
7332dec055 sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
If waking from an idle CPU due to an interrupt then it's possible that
the waker task will be pulled to wake on the current CPU. Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.

This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed.  From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use but lets only consider an automatic migration
if the CPUs share cache to limit damage due to NUMA migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.

These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.

                          4.15.0-rc3             4.15.0-rc3
                             vanilla            lessmigrate
Hmean     1        854.64 (   0.00%)      865.01 (   1.21%)
Hmean     2       1229.60 (   0.00%)     1274.44 (   3.65%)
Hmean     4       1591.81 (   0.00%)     1628.08 (   2.28%)
Hmean     8       1845.04 (   0.00%)     1831.80 (  -0.72%)
Hmean     16      2038.61 (   0.00%)     2091.44 (   2.59%)
Hmean     32      2327.19 (   0.00%)     2430.29 (   4.43%)
Hmean     64      2570.61 (   0.00%)     2568.54 (  -0.08%)
Hmean     128     2481.89 (   0.00%)     2499.28 (   0.70%)
Stddev    1         14.31 (   0.00%)        5.35 (  62.65%)
Stddev    2         21.29 (   0.00%)       11.09 (  47.92%)
Stddev    4          7.22 (   0.00%)        6.80 (   5.92%)
Stddev    8         26.70 (   0.00%)        9.41 (  64.76%)
Stddev    16        22.40 (   0.00%)       20.01 (  10.70%)
Stddev    32        45.13 (   0.00%)       44.74 (   0.85%)
Stddev    64        93.10 (   0.00%)       93.18 (  -0.09%)
Stddev    128      184.28 (   0.00%)      177.85 (   3.49%)

Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is

                           4.15.0-rc3             4.15.0-rc3
                              vanilla            lessmigrate
Amean      1         21.71 (   0.00%)       21.47 (   1.08%)
Amean      2         30.89 (   0.00%)       29.58 (   4.26%)
Amean      4         47.54 (   0.00%)       46.61 (   1.97%)
Amean      8         82.71 (   0.00%)       82.81 (  -0.12%)
Amean      16       149.45 (   0.00%)      145.01 (   2.97%)
Amean      32       265.49 (   0.00%)      248.43 (   6.42%)
Amean      64       463.23 (   0.00%)      463.55 (  -0.07%)
Amean      128      933.97 (   0.00%)      935.50 (  -0.16%)
Stddev     1          1.58 (   0.00%)        1.54 (   2.26%)
Stddev     2          2.84 (   0.00%)        2.95 (  -4.15%)
Stddev     4          6.78 (   0.00%)        6.85 (  -0.99%)
Stddev     8         16.85 (   0.00%)       16.37 (   2.85%)
Stddev     16        41.59 (   0.00%)       41.04 (   1.32%)
Stddev     32       111.05 (   0.00%)      105.11 (   5.35%)
Stddev     64       285.94 (   0.00%)      288.01 (  -0.72%)
Stddev     128      803.39 (   0.00%)      809.73 (  -0.79%)

It's a small improvement which is not surprising given that migrations that
migrate to a different node as not that common. However, it is noticeable
in the CPU migration statistics which are reduced by 24%.

There was a query for v1 of this patch about NAS so here are the results
for C-class using MPI for parallelisation on the same machine

nas-mpi
                      4.15.0-rc3             4.15.0-rc3
                         vanilla                  noirq
Time cg.C       24.25 (   0.00%)       23.17 (   4.45%)
Time ep.C        8.22 (   0.00%)        8.29 (  -0.85%)
Time ft.C       22.67 (   0.00%)       20.34 (  10.28%)
Time is.C        1.42 (   0.00%)        1.47 (  -3.52%)
Time lu.C       55.62 (   0.00%)       54.81 (   1.46%)
Time mg.C        7.93 (   0.00%)        7.91 (   0.25%)

          4.15.0-rc3  4.15.0-rc3
             vanilla  noirq-v1r1
User         3799.96     3748.34
System        672.10      626.15
Elapsed        91.91       79.49

lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
regressions but in terms of absolute time, the difference is small and
likely within run-to-run variance. System CPU usage is slightly reduced.

schbench from Facebook was also requested. This is a bit of a mixed bag but
it's important to note that this workload should not be heavily impacted
by wakeups from interrupt context.

                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla             noirq-v1r1
Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
Lat 90.00th-qrtle-1        43.00 (   0.00%)       44.00 (  -2.33%)
Lat 95.00th-qrtle-1        44.00 (   0.00%)       46.00 (  -4.55%)
Lat 99.00th-qrtle-1        57.00 (   0.00%)       58.00 (  -1.75%)
Lat 99.50th-qrtle-1        59.00 (   0.00%)       59.00 (   0.00%)
Lat 99.90th-qrtle-1        67.00 (   0.00%)       78.00 ( -16.42%)
Lat 50.00th-qrtle-2        40.00 (   0.00%)       51.00 ( -27.50%)
Lat 75.00th-qrtle-2        45.00 (   0.00%)       56.00 ( -24.44%)
Lat 90.00th-qrtle-2        53.00 (   0.00%)       59.00 ( -11.32%)
Lat 95.00th-qrtle-2        57.00 (   0.00%)       61.00 (  -7.02%)
Lat 99.00th-qrtle-2        67.00 (   0.00%)       71.00 (  -5.97%)
Lat 99.50th-qrtle-2        69.00 (   0.00%)       74.00 (  -7.25%)
Lat 99.90th-qrtle-2        83.00 (   0.00%)       77.00 (   7.23%)
Lat 50.00th-qrtle-4        51.00 (   0.00%)       51.00 (   0.00%)
Lat 75.00th-qrtle-4        57.00 (   0.00%)       56.00 (   1.75%)
Lat 90.00th-qrtle-4        60.00 (   0.00%)       59.00 (   1.67%)
Lat 95.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
Lat 99.00th-qrtle-4        73.00 (   0.00%)       72.00 (   1.37%)
Lat 99.50th-qrtle-4        76.00 (   0.00%)       74.00 (   2.63%)
Lat 99.90th-qrtle-4        85.00 (   0.00%)       78.00 (   8.24%)
Lat 50.00th-qrtle-8        54.00 (   0.00%)       58.00 (  -7.41%)
Lat 75.00th-qrtle-8        59.00 (   0.00%)       62.00 (  -5.08%)
Lat 90.00th-qrtle-8        65.00 (   0.00%)       66.00 (  -1.54%)
Lat 95.00th-qrtle-8        67.00 (   0.00%)       70.00 (  -4.48%)
Lat 99.00th-qrtle-8        78.00 (   0.00%)       79.00 (  -1.28%)
Lat 99.50th-qrtle-8        81.00 (   0.00%)       80.00 (   1.23%)
Lat 99.90th-qrtle-8       116.00 (   0.00%)       83.00 (  28.45%)
Lat 50.00th-qrtle-16       65.00 (   0.00%)       64.00 (   1.54%)
Lat 75.00th-qrtle-16       77.00 (   0.00%)       71.00 (   7.79%)
Lat 90.00th-qrtle-16       83.00 (   0.00%)       82.00 (   1.20%)
Lat 95.00th-qrtle-16       87.00 (   0.00%)       87.00 (   0.00%)
Lat 99.00th-qrtle-16       95.00 (   0.00%)       96.00 (  -1.05%)
Lat 99.50th-qrtle-16       99.00 (   0.00%)      103.00 (  -4.04%)
Lat 99.90th-qrtle-16      104.00 (   0.00%)      122.00 ( -17.31%)
Lat 50.00th-qrtle-32       71.00 (   0.00%)       73.00 (  -2.82%)
Lat 75.00th-qrtle-32       91.00 (   0.00%)       92.00 (  -1.10%)
Lat 90.00th-qrtle-32      108.00 (   0.00%)      107.00 (   0.93%)
Lat 95.00th-qrtle-32      118.00 (   0.00%)      115.00 (   2.54%)
Lat 99.00th-qrtle-32      134.00 (   0.00%)      129.00 (   3.73%)
Lat 99.50th-qrtle-32      138.00 (   0.00%)      133.00 (   3.62%)
Lat 99.90th-qrtle-32      149.00 (   0.00%)      146.00 (   2.01%)
Lat 50.00th-qrtle-39       83.00 (   0.00%)       81.00 (   2.41%)
Lat 75.00th-qrtle-39      105.00 (   0.00%)      102.00 (   2.86%)
Lat 90.00th-qrtle-39      120.00 (   0.00%)      119.00 (   0.83%)
Lat 95.00th-qrtle-39      129.00 (   0.00%)      128.00 (   0.78%)
Lat 99.00th-qrtle-39      153.00 (   0.00%)      149.00 (   2.61%)
Lat 99.50th-qrtle-39      166.00 (   0.00%)      156.00 (   6.02%)
Lat 99.90th-qrtle-39    12304.00 (   0.00%)    12848.00 (  -4.42%)

When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
are small gains in many cases. Otherwise it depends on the quartile used
where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
are probably a co-incidence. For this workload, much depends on what node
the threads get placed on and their relative locality and not wakeups from
interrupt context. A larger component on how it behaves would be automatic
NUMA balancing where a fault incurred to measure locality would be a much
larger contributer to latency than the wakeup path.

This is the results from an almost identical machine that happened to run
the same test.  They only differ in terms of storage which is irrelevant
for this test.

                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla             noirq-v1r1
Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
Lat 90.00th-qrtle-1        44.00 (   0.00%)       43.00 (   2.27%)
Lat 95.00th-qrtle-1        53.00 (   0.00%)       45.00 (  15.09%)
Lat 99.00th-qrtle-1        59.00 (   0.00%)       58.00 (   1.69%)
Lat 99.50th-qrtle-1        60.00 (   0.00%)       59.00 (   1.67%)
Lat 99.90th-qrtle-1        86.00 (   0.00%)       61.00 (  29.07%)
Lat 50.00th-qrtle-2        52.00 (   0.00%)       41.00 (  21.15%)
Lat 75.00th-qrtle-2        57.00 (   0.00%)       46.00 (  19.30%)
Lat 90.00th-qrtle-2        60.00 (   0.00%)       53.00 (  11.67%)
Lat 95.00th-qrtle-2        62.00 (   0.00%)       57.00 (   8.06%)
Lat 99.00th-qrtle-2        73.00 (   0.00%)       68.00 (   6.85%)
Lat 99.50th-qrtle-2        74.00 (   0.00%)       71.00 (   4.05%)
Lat 99.90th-qrtle-2        90.00 (   0.00%)       75.00 (  16.67%)
Lat 50.00th-qrtle-4        57.00 (   0.00%)       52.00 (   8.77%)
Lat 75.00th-qrtle-4        60.00 (   0.00%)       58.00 (   3.33%)
Lat 90.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
Lat 95.00th-qrtle-4        65.00 (   0.00%)       65.00 (   0.00%)
Lat 99.00th-qrtle-4        76.00 (   0.00%)       75.00 (   1.32%)
Lat 99.50th-qrtle-4        77.00 (   0.00%)       77.00 (   0.00%)
Lat 99.90th-qrtle-4        87.00 (   0.00%)       81.00 (   6.90%)
Lat 50.00th-qrtle-8        59.00 (   0.00%)       57.00 (   3.39%)
Lat 75.00th-qrtle-8        63.00 (   0.00%)       62.00 (   1.59%)
Lat 90.00th-qrtle-8        66.00 (   0.00%)       67.00 (  -1.52%)
Lat 95.00th-qrtle-8        68.00 (   0.00%)       70.00 (  -2.94%)
Lat 99.00th-qrtle-8        79.00 (   0.00%)       80.00 (  -1.27%)
Lat 99.50th-qrtle-8        80.00 (   0.00%)       84.00 (  -5.00%)
Lat 99.90th-qrtle-8        84.00 (   0.00%)       90.00 (  -7.14%)
Lat 50.00th-qrtle-16       65.00 (   0.00%)       65.00 (   0.00%)
Lat 75.00th-qrtle-16       77.00 (   0.00%)       75.00 (   2.60%)
Lat 90.00th-qrtle-16       84.00 (   0.00%)       83.00 (   1.19%)
Lat 95.00th-qrtle-16       88.00 (   0.00%)       87.00 (   1.14%)
Lat 99.00th-qrtle-16       97.00 (   0.00%)       96.00 (   1.03%)
Lat 99.50th-qrtle-16      100.00 (   0.00%)      104.00 (  -4.00%)
Lat 99.90th-qrtle-16      110.00 (   0.00%)      126.00 ( -14.55%)
Lat 50.00th-qrtle-32       70.00 (   0.00%)       71.00 (  -1.43%)
Lat 75.00th-qrtle-32       92.00 (   0.00%)       94.00 (  -2.17%)
Lat 90.00th-qrtle-32      110.00 (   0.00%)      110.00 (   0.00%)
Lat 95.00th-qrtle-32      121.00 (   0.00%)      118.00 (   2.48%)
Lat 99.00th-qrtle-32      135.00 (   0.00%)      137.00 (  -1.48%)
Lat 99.50th-qrtle-32      140.00 (   0.00%)      146.00 (  -4.29%)
Lat 99.90th-qrtle-32      150.00 (   0.00%)      160.00 (  -6.67%)
Lat 50.00th-qrtle-39       80.00 (   0.00%)       71.00 (  11.25%)
Lat 75.00th-qrtle-39      102.00 (   0.00%)       91.00 (  10.78%)
Lat 90.00th-qrtle-39      118.00 (   0.00%)      108.00 (   8.47%)
Lat 95.00th-qrtle-39      128.00 (   0.00%)      117.00 (   8.59%)
Lat 99.00th-qrtle-39      149.00 (   0.00%)      133.00 (  10.74%)
Lat 99.50th-qrtle-39      160.00 (   0.00%)      139.00 (  13.12%)
Lat 99.90th-qrtle-39    13808.00 (   0.00%)     4920.00 (  64.37%)

Despite being nearly identical, it showed a variety of major gains so
I'm not convinced that heavy emphasis should be placed on this particular
workload in terms of evaluating this particular patch. Further evidence of
this is the fact that testing on a UMA machine showed small gains/losses
even though the patch should be a no-op on UMA.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:31 +01:00
Joel Fernandes
9783be2c0e sched/fair: Correct obsolete comment about cpufreq_update_util()
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen
from remote CPUs. The comment about local CPUs is thus obsolete. Update it
accordingly.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:30 +01:00
Joel Fernandes
18cec7e0dd sched/fair: Remove impossible condition from find_idlest_group_cpu()
find_idlest_group_cpu() goes through CPUs of a group previous selected by
find_idlest_group(). find_idlest_group() returns NULL if the local group is the
selected one and doesn't execute find_idlest_group_cpu if the group to which
'cpu' belongs to is chosen. So we're always guaranteed to call
find_idlest_group_cpu() with a group to which 'cpu' is non-local.

This makes one of the conditions in find_idlest_group_cpu() an impossible one,
which we can get rid off.

Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Android Kernel <kernel-team@android.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: EAS Dev <eas-dev@lists.linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rohit Jain <rohit.k.jain@oracle.com>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:30 +01:00
Viresh Kumar
5083452f8c sched/cpufreq: Don't pass flags to sugov_set_iowait_boost()
We are already passing sg_cpu as argument to sugov_set_iowait_boost()
helper and the same can be used to retrieve the flags value. Get rid of
the redundant argument.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: tkjos@android.com
Link: http://lkml.kernel.org/r/4ec5562b1a87e146ebab11fb5dde1ca9c763a7fb.1513158452.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:29 +01:00
Viresh Kumar
6257e70478 sched/cpufreq: Initialize sg_cpu->flags to 0
Initializing sg_cpu->flags to SCHED_CPUFREQ_RT has no obvious benefit.
The flags field wouldn't be used until the utilization update handler is
called for the first time, and once that is called we will overwrite
flags anyway.

Initialize it to 0.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: joelaf@google.com
Cc: morten.rasmussen@arm.com
Cc: tkjos@android.com
Link: http://lkml.kernel.org/r/763feda6424ced8486b25a0c52979634e6104478.1513158452.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:29 +01:00
Joel Fernandes
f453ae2200 sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
capacity_spare_wake() in the slow path influences choice of idlest groups,
as we search for groups with maximum spare capacity. In scenarios where
RT pressure is high, a sub optimal group can be chosen and hurt
performance of the task being woken up.

Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake().

Tests results from improvements with this change are below. More tests
were also done by myself and Matt Fleming to ensure no degradation in
different benchmarks.

1) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1] however
from his recent tests he showed my patch can replace his slow path
changes [1] and there's no need to selectively scan/skip CPUs in
find_idlest_group_cpu in the slow path to get the improvement he sees.

barrier.c (open_mp code) as a micro-benchmark. It does a number of
iterations and barrier sync at the end of each for loop.

Here barrier,c is running in along with ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads
Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch    | With patch                |
|        |                  |                           |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+

2) ping+hackbench test on bare-metal sever (by Rohit)
-----------------------------------------------------
Here hackbench is running in threaded mode along
with, running ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

This test is running on 2 socket, 20 core and 40 threads Intel x86
machine:
Number of loops is 10000 and runtime is in seconds (Lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   | Without patch   |  With patch              |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev |  Mean          | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   |  0.828 (+2.77%)| 0.032   |
|2             | 1.083 | 0.203   |  1.087 (-0.37%)| 0.246   |
|4             | 1.601 | 0.051   |  1.611 (-0.62%)| 0.055   |
|8             | 2.837 | 0.060   |  2.827 (+0.35%)| 0.031   |
|16            | 5.139 | 0.133   |  5.107 (+0.63%)| 0.085   |
|25            | 7.569 | 0.142   |  7.503 (+0.88%)| 0.143   |
+--------------+-------+---------+----------------+---------+

[1] https://patchwork.kernel.org/patch/9991635/

Matt Fleming also ran several different hackbench tests and cyclic test
to santiy-check that the patch doesn't harm other usecases.

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Atish Patra <atish.patra@oracle.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Saravana Kannan <skannan@quicinc.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:28 +01:00
Patrick Bellasi
f01415fdbf sched/fair: Use 'unsigned long' for utilization, consistently
Utilization and capacity are tracked as 'unsigned long', however some
functions using them return an 'int' which is ultimately assigned back to
'unsigned long' variables.

Since there is not scope on using a different and signed type,
consolidate the signature of functions returning utilization to always
use the native type.

This change improves code consistency, and it also benefits
code paths where utilizations should be clamped by avoiding
further type conversions or ugly type casts.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:28 +01:00
rodrigosiqueira
31cb1bc0dc sched/core: Rework and clarify prepare_lock_switch()
The prepare_lock_switch() function has an unused parameter, and also the
function name was not descriptive. To improve readability and remove
the extra parameter, do the following changes:

* Move prepare_lock_switch() from kernel/sched/sched.h to
  kernel/sched/core.c, rename it to prepare_task(), and remove the
  unused parameter.

* Split the smp_store_release() out from finish_lock_switch() to a
  function named finish_task.

* Comments ajdustments.

Signed-off-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 11:30:27 +01:00
Ingo Molnar
cb1f34ddcc Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 10:52:40 +01:00
Mathieu Desnoyers
541676078b membarrier: Disable preemption when calling smp_call_function_many()
smp_call_function_many() requires disabling preemption around the call.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <stable@vger.kernel.org> # v4.14+
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Andrew Hunter <ahh@google.com>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maged Michael <maged.michael@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-10 08:50:31 +01:00
Bart Van Assche
4bf236a333 PM / sleep: Make lock/unlock_system_sleep() available to kernel modules
Since pm_mutex is not exported using lock/unlock_system_sleep() from
inside a kernel module causes a "pm_mutex undefined" linker error.
Hence move lock/unlock_system_sleep() into kernel/power/main.c and
export these.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-10 01:07:46 +01:00
Alexei Starovoitov
290af86629 bpf: introduce BPF_JIT_ALWAYS_ON config
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."

To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64

The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden

v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)

v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
  It will be sent when the trees are merged back to net-next

Considered doing:
  int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-09 22:25:26 +01:00
David S. Miller
a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Sergey Senozhatsky
04b8eb7a4c symbol lookup: introduce dereference_symbol_descriptor()
dereference_symbol_descriptor() invokes appropriate ARCH specific
function descriptor dereference callbacks:
- dereference_kernel_function_descriptor() if the pointer is a
  kernel symbol;

- dereference_module_function_descriptor() if the pointer is a
  module symbol.

This is the last step needed to make '%pS/%ps' smart enough to
handle function descriptor dereference on affected ARCHs and
to retire '%pF/%pf'.

To refresh it:
  Some architectures (ia64, ppc64, parisc64) use an indirect pointer
  for C function pointers - the function pointer points to a function
  descriptor and we need to dereference it to get the actual function
  pointer.

  Function descriptors live in .opd elf section and all affected
  ARCHs (ia64, ppc64, parisc64) handle it properly for kernel and
  modules. So we, technically, can decide if the dereference is
  needed by simply looking at the pointer: if it belongs to .opd
  section then we need to dereference it.

  The kernel and modules have their own .opd sections, obviously,
  that's why we need to split dereference_function_descriptor()
  and use separate kernel and module dereference arch callbacks.

Link: http://lkml.kernel.org/r/20171206043649.GB15885@jagdpanzerIV
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Tony Luck <tony.luck@intel.com> #ia64
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Tested-by: Helge Deller <deller@gmx.de> #parisc64
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-09 10:45:38 +01:00
Sergey Senozhatsky
b865ea6430 sections: split dereference_function_descriptor()
There are two format specifiers to print out a pointer in symbolic
format: '%pS/%ps' and '%pF/%pf'. On most architectures, the two
mean exactly the same thing, but some architectures (ia64, ppc64,
parisc64) use an indirect pointer for C function pointers, where
the function pointer points to a function descriptor (which in
turn contains the actual pointer to the code). The '%pF/%pf, when
used appropriately, automatically does the appropriate function
descriptor dereference on such architectures.

The "when used appropriately" part is tricky. Basically this is
a subtle ABI detail, specific to some platforms, that made it to
the API level and people can be unaware of it and miss the whole
"we need to dereference the function" business out. [1] proves
that point (note that it fixes only '%pF' and '%pS', there might
be '%pf' and '%ps' cases as well).

It appears that we can handle everything within the affected
arches and make '%pS/%ps' smart enough to retire '%pF/%pf'.
Function descriptors live in .opd elf section and all affected
arches (ia64, ppc64, parisc64) handle it properly for kernel
and modules. So we, technically, can decide if the dereference
is needed by simply looking at the pointer: if it belongs to
.opd section then we need to dereference it.

The kernel and modules have their own .opd sections, obviously,
that's why we need to split dereference_function_descriptor()
and use separate kernel and module dereference arch callbacks.

This patch does the first step, it
a) adds dereference_kernel_function_descriptor() function.
b) adds a weak alias to dereference_module_function_descriptor()
   function.

So, for the time being, we will have:
1) dereference_function_descriptor()
   A generic function, that simply dereferences the pointer. There is
   bunch of places that call it: kgdbts, init/main.c, extable, etc.

2) dereference_kernel_function_descriptor()
   A function to call on kernel symbols that does kernel .opd section
   address range test.

3) dereference_module_function_descriptor()
   A function to call on modules' symbols that does modules' .opd
   section address range test.

[1] https://marc.info/?l=linux-kernel&m=150472969730573

Link: http://lkml.kernel.org/r/20171109234830.5067-2-sergey.senozhatsky@gmail.com
To: Fenghua Yu <fenghua.yu@intel.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Paul Mackerras <paulus@samba.org>
To: Michael Ellerman <mpe@ellerman.id.au>
To: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Tony Luck <tony.luck@intel.com> #ia64
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Tested-by: Helge Deller <deller@gmx.de> #parisc64
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-09 10:45:37 +01:00
Alexei Starovoitov
b2157399cc bpf: prevent out-of-bounds speculation
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.

If prog_array map is created by unpriv user replace
  bpf_tail_call(ctx, map, index);
with
  if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
  }
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.

v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-09 00:53:49 +01:00
Christoph Hellwig
e697c5b90e memremap: merge find_dev_pagemap into get_dev_pagemap
There is only one caller of the trivial function find_dev_pagemap left,
so just merge it into the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
e8d5134833 memremap: change devm_memremap_pages interface to use struct dev_pagemap
This new interface is similar to how struct device (and many others)
work. The caller initializes a 'struct dev_pagemap' as required
and calls 'devm_memremap_pages'. This allows the pagemap structure to
be embedded in another structure and thus container_of can be used. In
this way application specific members can be stored in a containing
struct.

This will be used by the P2P infrastructure and HMM could probably
be cleaned up to use it as well (instead of having it's own, similar
'hmm_devmem_pages_create' function).

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Logan Gunthorpe
e7744aa25c memremap: drop private struct page_map
'struct page_map' is a private structure of 'struct dev_pagemap' but the
latter replicates all the same fields as the former so there isn't much
value in it. Thus drop it in favour of a completely public struct.

This is a clean up in preperation for a more generally useful
'devm_memeremap_pages' interface.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
7003e3b1f6 memremap: simplify duplicate region handling in devm_memremap_pages
__radix_tree_insert already checks for duplicates and returns -EEXIST in
that case, so remove the duplicate (and racy) duplicates check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
0628b8c650 memremap: remove to_vmem_altmap
All callers are gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
832d7aa051 mm: optimize dev_pagemap reference counting around get_dev_pagemap
Change the calling convention so that get_dev_pagemap always consumes the
previous reference instead of doing this using an explicit earlier call to
put_dev_pagemap in the callers.

The callers will still need to put the final reference after finishing the
loop over the pages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
0822acb86c mm: move get_dev_pagemap out of line
This is a pretty big function, which should be out of line in general,
and a no-op stub if CONFIG_ZONE_DEVICЕ is not set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
a99583e780 mm: pass the vmem_altmap to memmap_init_zone
Pass the vmem_altmap two levels down instead of needing a lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
da024512a1 mm: pass the vmem_altmap to arch_remove_memory and __remove_pages
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig
24e6d5a59a mm: pass the vmem_altmap to arch_add_memory and __add_pages
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Linus Torvalds
29f7e49941 Merge branch 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This contains fixes for the following two non-trivial issues:

   - The task iterator got broken while adding thread mode support for
     v4.14. It was less visible because it only triggers when both
     cgroup1 and cgroup2 hierarchies are in use. The recent versions of
     systemd uses cgroup2 for process management even when cgroup1 is
     used for resource control exposing this issue.

   - cpuset CPU hotplug path could deadlock when racing against exits.

  There also are two patches to replace unlimited strcpy() usages with
  strlcpy()"

* 'for-4.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
  cgroup: Fix deadlock in cpu hotplug path
  cgroup: use strlcpy() instead of strscpy() to avoid spurious warning
  cgroup: avoid copying strings longer than the buffers
2018-01-08 11:13:08 -08:00
Bartosz Golaszewski
6baf9e67c9 irq/work: Improve the flag definitions
IRQ_WORK_FLAGS is defined simply to 3UL. This is confusing as it
says nothing about its purpose. Define IRQ_WORK_FLAGS as a bitwise
OR of IRQ_WORK_PENDING and IRQ_WORK_BUSY and change its name to
IRQ_WORK_CLAIMED.

While we're at it: use the BIT() macro for all flags.

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1515125996-21564-1-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-08 19:43:15 +01:00
Alexei Starovoitov
5896351ea9 bpf: fix verifier GPF in kmalloc failure path
syzbot reported the following panic in the verifier triggered
by kmalloc error injection:

kasan: GPF could be caused by NULL-ptr deref or user memory access
RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline]
RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431
Call Trace:
 pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449
 push_stack kernel/bpf/verifier.c:491 [inline]
 check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline]
 do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731
 bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489
 bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198
 SYSC_bpf kernel/bpf/syscall.c:1807 [inline]
 SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769

when copy_verifier_state() aborts in the middle due to kmalloc failure
some of the frames could have been partially copied while
current free_verifier_state() loop
for (i = 0; i <= state->curframe; i++)
assumed that all frames are non-null.
Simply fix it by adding 'if (!state)' to free_func_state().
Also avoid stressing copy frame logic more if kzalloc fails
in push_stack() free env->cur_state right away.

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com
Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-08 18:19:03 +01:00
Ingo Molnar
527187d285 locking/lockdep: Remove cross-release leftovers
There's two cross-release leftover facilities:

 - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
 - the complete_release_commit() callback (NOP as well)

Remove them.

Cc: David Sterba <dsterba@suse.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-08 17:30:45 +01:00
Jiri Olsa
99e818cc88 perf: Return empty callchain instead of NULL
It simplifies the code a bit, because we dump the callchain
Link: http://lkml.kernel.org/n/tip-uqp7qd6aif47g39glnbu95yl@git.kernel.org
even if it's empty. With 'empty' callchain we can remove
all the NULL-checking code paths.

Original-patch-from: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20180107160356.28203-7-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08 12:35:01 -03:00
Jiri Olsa
8cf7e0e224 perf: Make perf_callchain function static
And move it to core.c, because there's no caller of this function other
than the one in core.c

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180107160356.28203-6-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08 12:33:01 -03:00
Jiri Olsa
313ccb9615 perf: Allocate context task_ctx_data for child event
Currently we use perf_event_context::task_ctx_data to save and restore
the LBR status when the task is scheduled out and in.

We don't allocate it for child contexts, which results in shorter task's
LBR stack, because we don't save the history from previous run and start
over every time we schedule the task in.

I made a test to generate samples with LBR call stack and got higher
numbers on bigger chain depths:

                            before:     after:
  LBR call chain: nr: 1       60561     498127
  LBR call chain: nr: 2           0          0
  LBR call chain: nr: 3      107030       2172
  LBR call chain: nr: 4      466685      62758
  LBR call chain: nr: 5     2307319     878046
  LBR call chain: nr: 6       48713     495218
  LBR call chain: nr: 7        1040       4551
  LBR call chain: nr: 8         481        172
  LBR call chain: nr: 9         878        120
  LBR call chain: nr: 10       2377       6698
  LBR call chain: nr: 11      28830     151487
  LBR call chain: nr: 12      29347     339867
  LBR call chain: nr: 13          4         22
  LBR call chain: nr: 14          3         53

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 4af57ef28c ("perf: Add pmu specific data for perf task context")
Link: http://lkml.kernel.org/r/20180107160356.28203-4-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-01-08 12:26:46 -03:00
Tejun Heo
40c17f75df workqueue: allow WQ_MEM_RECLAIM on early init workqueues
Workqueues can be created early during boot before workqueue subsystem
in fully online - work items are queued waiting for later full
initialization.  However, early init wasn't supported for
WQ_MEM_RECLAIM workqueues causing unnecessary annoyances for a subset
of users.  Expand early init support to include WQ_MEM_RECLAIM
workqueues.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
2018-01-08 05:38:37 -08:00
Tejun Heo
983c751532 workqueue: separate out init_rescuer()
Separate out init_rescuer() from __alloc_workqueue_key() to prepare
for early init support for WQ_MEM_RECLAIM.  This patch doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
2018-01-08 05:38:32 -08:00
Linus Torvalds
75d4276e83 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - untangle sys_close() abuses in xt_bpf

 - deal with register_shrinker() failures in sget()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
  sget(): handle failures of register_shrinker()
  mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
2018-01-06 17:13:21 -08:00
John Fastabend
5731a879d0 bpf: sockmap missing NULL psock check
Add psock NULL check to handle a racing sock event that can get the
sk_callback_lock before this case but after xchg happens causing the
refcnt to hit zero and sock user data (psock) to be null and queued
for garbage collection.

Also add a comment in the code because this is a bit subtle and
not obvious in my opinion.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-07 00:01:46 +01:00
Yonghong Song
16f07c551e bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.

The implementation has similar behavior for
BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-06 23:52:22 +01:00
Ming Lei
263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Al Viro
a4a0683fd5 bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:54:33 -05:00
Al Viro
040ee69226 fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
Descriptor table is a shared object; it's not a place where you can
stick temporary references to files, especially when we don't need
an opened file at all.

Cc: stable@vger.kernel.org # v4.14
Fixes: 98589a0998 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05 11:43:39 -05:00
Sergey Senozhatsky
3b14f08d16 irq debug: do not use print_symbol()
print_symbol() is a very old API that has been obsoleted by %pS format
specifier in a normal printk() call.

Replace print_symbol() with a direct printk("%pS") call and avoid
using continuous lines.

Link: http://lkml.kernel.org/r/20171212073453.21455-1-sergey.senozhatsky@gmail.com
To: Andrew Morton <akpm@linux-foundation.org>
To: Russell King <linux@armlinux.org.uk>
To: Catalin Marinas <catalin.marinas@arm.com>
To: Mark Salter <msalter@redhat.com>
To: Tony Luck <tony.luck@intel.com>
To: David Howells <dhowells@redhat.com>
To: Yoshinori Sato <ysato@users.sourceforge.jp>
To: Guan Xuetao <gxt@mprc.pku.edu.cn>
To: Borislav Petkov <bp@alien8.de>
To: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: Thomas Gleixner <tglx@linutronix.de>
To: Peter Zijlstra <peterz@infradead.org>
To: Vineet Gupta <vgupta@synopsys.com>
To: Fengguang Wu <fengguang.wu@intel.com>
To: David Laight <David.Laight@ACULAB.COM>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-am33-list@redhat.com
Cc: linux-sh@vger.kernel.org
Cc: linux-edac@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: updated commit message]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-05 15:23:59 +01:00
Rainer Fiebig
bdbc98abb3 PM: hibernate: Do not subtract NR_FILE_MAPPED in minimum_image_size()
s2disk/s2both may fail unnecessarily and erratically if NR_FILE_MAPPED
is high - for instance when using VMs with VirtualBox and perhaps VMware
Player. In those situations s2disk becomes unreliable and therefore
unusable.

A typical scenario is: user issues a s2disk and it fails. User issues
a second s2disk immediately after that and it succeeds.  And user
wonders why.

The problem is caused by minimum_image_size() in snapshot.c.  The
value it returns is roughly 100% too high because NR_FILE_MAPPED is
subtracted in its calculation.  Eventually the number of preallocated
image pages is falsely too low.

This doesn't matter as long as NR_FILE_MAPPED-values are in a normal
range or in 32bit-environments as the code allows for allocation of
additional pages from highmem.

But with the high values generated by VirtualBox-VMs (a 2-GB-VM causes
NR_FILE_MAPPED go up by 2 GB) it may lead to failure in 64bit-systems.

Not subtracting NR_FILE_MAPPED in minimum_image_size() solves the
problem.

I've done at least hundreds of successful s2both/s2disk now on an
x86_64 system (with and without VirtualBox) which gives me some
confidence that this is right.  It has turned s2disk/s2both from
unusable into 100% reliable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=97201
Signed-off-by: Rainer Fiebig <jrf@mailbox.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-01-05 14:46:25 +01:00
Cheah Kok Cheong
08b21fbf4b padata: add SPDX identifier
Add SPDX license identifier according to the type of license text found
in the file.

Cc: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Cheah Kok Cheong <thrust73@gmail.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-01-05 18:43:00 +11:00
Andrew Morton
dc8635b78c kernel/exit.c: export abort() to modules
gcc -fisolate-erroneous-paths-dereference can generate calls to abort()
from modular code too.

[arnd@arndb.de: drop duplicate exports of abort()]
  Link: http://lkml.kernel.org/r/20180102103311.706364-1-arnd@arndb.de
Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Brodkin <Alexey.Brodkin@synopsys.com>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04 16:45:09 -08:00
Oleg Nesterov
4d9570158b kernel/acct.c: fix the acct->needcheck check in check_free_space()
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check
is very wrong, we need time_is_after_jiffies() to make sys_acct() work.

Ignoring the overflows, the code should "goto out" if needcheck >
jiffies, while currently it checks "needcheck < jiffies" and thus in the
likely case check_free_space() does nothing until jiffies overflow.

In particular this means that sys_acct() is simply broken, acct_on()
sets acct->needcheck = jiffies and expects that check_free_space()
should set acct->active = 1 after the free-space check, but this won't
happen if jiffies increments in between.

This was broken by commit 32dc730860 ("get rid of timer in
kern/acct.c") in 2011, then another (correct) commit 795a2f22a8
("acct() should honour the limits from the very beginning") made the
problem more visible.

Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com
Fixes: 32dc730860 ("get rid of timer in kern/acct.c")
Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04 16:45:09 -08:00
John Fastabend
5f103c5d4d bpf: only build sockmap with CONFIG_INET
The sockmap infrastructure is only aware of TCP sockets at the
moment. In the future we plan to add UDP. In both cases CONFIG_NET
should be built-in.

So lets only build sockmap if CONFIG_INET is enabled.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-04 19:01:14 +01:00
John Fastabend
c20a71a7a3 bpf: sockmap remove unused function
This was added for some work that was eventually factored out but the
helper call was missed. Remove it now and add it back later if needed.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-04 19:01:14 +01:00
Nick Desaulniers
29f1b2b0fe posix-timers: Prevent UB from shifting negative signed value
Shifting a negative signed number is undefined behavior. Looking at the
macros MAKE_PROCESS_CPUCLOCK and FD_TO_CLOCKID, it seems that the
subexpression:

(~(clockid_t) (pid) << 3)

where clockid_t resolves to a signed int, which once negated, is
undefined behavior to shift the value of if the results thus far are
negative.

It was further suggested to make these macros into inline functions.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-kselftest@vger.kernel.org
Cc: Shuah Khan <shuah@kernel.org>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Link: https://lkml.kernel.org/r/1514517100-18051-1-git-send-email-nick.desaulniers@gmail.com
2018-01-04 14:57:10 +01:00
Sergey Senozhatsky
cca10d58d2 printk: add console_msg_format command line option
0day and kernelCI automatically parse kernel log - basically some sort
of grepping using the pre-defined text patterns - in order to detect
and report regressions/errors. There are several sources they get the
kernel logs from:

a) dmesg or /proc/ksmg

   This is the preferred way. Because `dmesg --raw' (see later Note)
   and /proc/kmsg output contains facility and log level, which greatly
   simplifies grepping for EMERG/ALERT/CRIT/ERR messages.

b) serial consoles

   This option is harder to maintain, because serial console messages
   don't contain facility and log level.

This patch introduces a `console_msg_format=' command line option,
to switch between different message formatting on serial consoles.
For the time being we have just two options - default and syslog.
The "default" option just keeps the existing format. While the
"syslog" option makes serial console messages to appear in syslog
format [syslog() syscall], matching the `dmesg -S --raw' and
`cat /proc/kmsg' output formats:

- facility and log level
- time stamp (depends on printk_time/PRINTK_TIME)
- message

	<%u>[time stamp] text\n

NOTE: while Kevin and Fengguang talk about "dmesg --raw", it's actually
"dmesg -S --raw" that always prints messages in syslog format [per
Petr Mladek]. Running "dmesg --raw" may produce output in non-syslog
format sometimes. console_msg_format=syslog enables syslog format,
thus in documentation we mention "dmesg -S --raw", not "dmesg --raw".

Per Kevin Hilman:

: Right now we can get this info from a "dmesg --raw" after bootup,
: but it would be really nice in certain automation frameworks to
: have a kernel command-line option to enable printing of loglevels
: in default boot log.
:
: This is especially useful when ingesting kernel logs into advanced
: search/analytics frameworks (I'm playing with and ELK stack: Elastic
: Search, Logstash, Kibana).
:
: The other important reason for having this on the command line is that
: for testing linux-next (and other bleeding edge developer branches),
: it's common that we never make it to userspace, so can't even run
: "dmesg --raw" (or equivalent.)  So we really want this on the primary
: boot (serial) console.

Per Fengguang Wu, 0day scripts should quickly benefit from that
feature, because they will be able to switch to a more reliable
parsing, based on messages' facility and log levels [1]:

`#{grep} -a -E -e '^<[0123]>' -e '^kern  :(err   |crit  |alert |emerg )'

instead of doing text pattern matching

`#{grep} -a -F -f /lkp/printk-error-messages #{kmsg_file} |
      grep -a -v -E -f #{LKP_SRC}/etc/oops-pattern |
      grep -a -v -F -f #{LKP_SRC}/etc/kmsg-blacklist`

[1] https://github.com/fengguang/lkp-tests/blob/master/lib/dmesg.rb

Link: http://lkml.kernel.org/r/20171221054149.4398-1-sergey.senozhatsky@gmail.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Kevin Hilman <khilman@baylibre.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-04 14:51:27 +01:00
Eric W. Biederman
0b44bf9a6f signal: Simplify and fix kdb_send_sig
- Rename from kdb_send_sig_info to kdb_send_sig
  As there is no meaningful siginfo sent

- Use SEND_SIG_PRIV instead of generating a siginfo for a kdb
  signal.  The generated siginfo had a bogus rationale and was
  not correct in the face of pid namespaces.  SEND_SIG_PRIV
  is simpler and actually correct.

- As the code grabs siglock just send the signal with siglock
  held instead of dropping siglock and attempting to grab it again.

- Move the sig_valid test into kdb_kill where it can generate
  a good error message.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-01-03 18:01:08 -06:00
Linus Torvalds
d6bbd51587 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull pid allocation bug fix from Eric Biederman:
 "The replacement of the pid hash table and the pid bitmap with an idr
  resulted in an implementation that now fails more often in low memory
  situations. Allowing fuzzers to observe bad behavior from a memory
  allocation failure during pid allocation.

  This is a small change to fix this by making the kernel more robust in
  the case of error. The non-error paths are left alone so the only
  danger is to the already broken error path. I have manually injected
  errors and verified that this new error handling works"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  pid: Handle failure to allocate the first pid in a pid namespace
2018-01-03 11:03:07 -08:00
Ingo Molnar
475c5ee193 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- Updates to use cond_resched() instead of cond_resched_rcu_qs()
  where feasible (currently everywhere except in kernel/rcu and
  in kernel/torture.c).  Also a couple of fixes to avoid sending
  IPIs to offline CPUs.

- Updates to simplify RCU's dyntick-idle handling.

- Updates to remove almost all uses of smp_read_barrier_depends()
  and read_barrier_depends().

- Miscellaneous fixes.

- Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-03 14:14:18 +01:00
Suzuki K Poulose
82975c46da perf: Export perf_event_update_userpage
Export perf_event_update_userpage() so that PMU driver using them,
can be built as modules.

Acked-by: Peter Zilstra <peterz@infradead.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-01-02 16:43:12 +00:00
Linus Torvalds
cea92e843e Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A pile of fixes for long standing issues with the timer wheel and the
  NOHZ code:

   - Prevent timer base confusion accross the nohz switch, which can
     cause unlocked access and data corruption

   - Reinitialize the stale base clock on cpu hotplug to prevent subtle
     side effects including rollovers on 32bit

   - Prevent an interrupt storm when the timer softirq is already
     pending caused by tick_nohz_stop_sched_tick()

   - Move the timer start tracepoint to a place where it actually makes
     sense

   - Add documentation to timerqueue functions as they caused confusion
     several times now"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timerqueue: Document return values of timerqueue_add/del()
  timers: Invoke timer_start_debug() where it makes sense
  nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
  timers: Reinitialize per cpu bases on hotplug
  timers: Use deferrable base independent of base::nohz_active
2017-12-31 12:30:34 -08:00
Linus Torvalds
8d517bdfb5 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fixlet from Thomas Gleixner:
 "A trivial build warning fix for newer compilers"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Move inline keyword at the beginning of declaration
2017-12-31 12:29:02 -08:00
Linus Torvalds
4c470317f9 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Three patches addressing the fallout of the CPU_ISOLATION changes
  especially with NO_HZ_FULL plus documentation of boot parameter
  dependency"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/isolation: Document boot parameters dependency on CONFIG_CPU_ISOLATION=y
  sched/isolation: Enable CONFIG_CPU_ISOLATION=y by default
  sched/isolation: Make CONFIG_NO_HZ_FULL select CONFIG_CPU_ISOLATION
2017-12-31 12:27:19 -08:00
Linus Torvalds
88fa025d30 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A rather large update after the kaisered maintainer finally found time
  to handle regression reports.

   - The larger part addresses a regression caused by the x86 vector
     management rework.

     The reservation based model does not work reliably for MSI
     interrupts, if they cannot be masked (yes, yet another hw
     engineering trainwreck). The reason is that the reservation mode
     assigns a dummy vector when the interrupt is allocated and switches
     to a real vector when the interrupt is requested.

     If the MSI entry cannot be masked then the initialization might
     raise an interrupt before the interrupt is requested, which ends up
     as spurious interrupt and causes device malfunction and worse. The
     fix is to exclude MSI interrupts which do not support masking from
     reservation mode and assign a real vector right away.

   - Extend the extra lockdep class setup for nested interrupts with a
     class for the recently added irq_desc::request_mutex so lockdep can
     differeniate and does not emit false positive warnings.

   - A ratelimit guard for the bad irq printout so in case a bad irq
     comes back immediately the system does not drown in dmesg spam"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
  genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
  x86/vector: Use IRQD_CAN_RESERVE flag
  genirq: Introduce IRQD_CAN_RESERVE flag
  genirq/msi: Handle reactivation only on success
  gpio: brcmstb: Make really use of the new lockdep class
  genirq: Guard handle_bad_irq log messages
  kernel/irq: Extend lockdep class for request mutex
2017-12-31 11:23:11 -08:00
Jakub Kicinski
675fc275a3 bpf: offload: report device information for offloaded programs
Report to the user ifindex and namespace information of offloaded
programs.  If device has disappeared return -ENODEV.  Specify the
namespace using dev/inode combination.

CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-31 16:12:23 +01:00
Jakub Kicinski
ad8ad79f4f bpf: offload: free program id when device disappears
Bound programs are quite useless after their device disappears.
They are simply waiting for reference count to go to zero,
don't list them in BPF_PROG_GET_NEXT_ID by freeing their ID
early.

Note that orphaned offload programs will return -ENODEV on
BPF_OBJ_GET_INFO_BY_FD so user will never see ID 0.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-31 16:12:23 +01:00
Jakub Kicinski
ce3b9db4db bpf: offload: free prog->aux->offload when device disappears
All bpf offload operations should now be under bpf_devs_lock,
it's safe to free and clear the entire offload structure,
not only the netdev pointer.

__bpf_prog_offload_destroy() will no longer be called multiple
times.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-31 16:12:23 +01:00
Jakub Kicinski
cae1927c0b bpf: offload: allow netdev to disappear while verifier is running
To allow verifier instruction callbacks without any extra locking
NETDEV_UNREGISTER notification would wait on a waitqueue for verifier
to finish.  This design decision was made when rtnl lock was providing
all the locking.  Use the read/write lock instead and remove the
workqueue.

Verifier will now call into the offload code, so dev_ops are moved
to offload structure.  Since verifier calls are all under
bpf_prog_is_dev_bound() we no longer need static inline implementations
to please builds with CONFIG_NET=n.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-31 16:12:23 +01:00
Jakub Kicinski
9a18eedb14 bpf: offload: don't use prog->aux->offload as boolean
We currently use aux->offload to indicate that program is bound
to a specific device.  This forces us to keep the offload structure
around even after the device is gone.  Add a bool member to
struct bpf_prog_aux to indicate if offload was requested.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-31 16:12:22 +01:00
Jakub Kicinski
e0d3974ac7 bpf: offload: don't require rtnl for dev list manipulation
We don't need the RTNL lock for all operations on offload state.
We only need to hold it around ndo calls.  The device offload
initialization doesn't require it.  The soon-to-come querying
of the offload info will only need it partially.  We will also
be able to remove the waitqueue in following patches.

Use struct rw_semaphore because map offload will require sleeping
with the semaphore held for read.

Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-31 16:12:22 +01:00
Thomas Gleixner
fd45bb77ad timers: Invoke timer_start_debug() where it makes sense
The timer start debug function is called before the proper timer base is
set. As a consequence the trace data contains the stale CPU and flags
values.

Call the debug function after setting the new base and flags.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Link: https://lkml.kernel.org/r/20171222145337.792907137@linutronix.de
2017-12-29 23:13:10 +01:00
Thomas Gleixner
5d62c183f9 nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
The conditions in irq_exit() to invoke tick_nohz_irq_exit() which
subsequently invokes tick_nohz_stop_sched_tick() are:

  if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu))

If need_resched() is not set, but a timer softirq is pending then this is
an indication that the softirq code punted and delegated the execution to
softirqd. need_resched() is not true because the current interrupted task
takes precedence over softirqd.

Invoking tick_nohz_irq_exit() in this case can cause an endless loop of
timer interrupts because the timer wheel contains an expired timer, but
softirqs are not yet executed. So it returns an immediate expiry request,
which causes the timer to fire immediately again. Lather, rinse and
repeat....

Prevent that by adding a check for a pending timer soft interrupt to the
conditions in tick_nohz_stop_sched_tick() which avoid calling
get_next_timer_interrupt(). That keeps the tick sched timer on the tick and
prevents a repetitive programming of an already expired timer.

Reported-by: Sebastian Siewior <bigeasy@linutronix.d>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272156050.2431@nanos
2017-12-29 23:13:10 +01:00
Thomas Gleixner
26456f87ac timers: Reinitialize per cpu bases on hotplug
The timer wheel bases are not (re)initialized on CPU hotplug. That leaves
them with a potentially stale clk and next_expiry valuem, which can cause
trouble then the CPU is plugged.

Add a prepare callback which forwards the clock, sets next_expiry to far in
the future and reset the control flags to a known state.

Set base->must_forward_clk so the first timer which is queued will try to
forward the clock to current jiffies.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712272152200.2431@nanos
2017-12-29 23:13:09 +01:00
Anna-Maria Gleixner
ced6d5c11d timers: Use deferrable base independent of base::nohz_active
During boot and before base::nohz_active is set in the timer bases, deferrable
timers are enqueued into the standard timer base. This works correctly as
long as base::nohz_active is false.

Once it base::nohz_active is set and a timer which was enqueued before that
is accessed the lock selector code choses the lock of the deferred
base. This causes unlocked access to the standard base and in case the
timer is removed it does not clear the pending flag in the standard base
bitmap which causes get_next_timer_interrupt() to return bogus values.

To prevent that, the deferrable timers must be enqueued in the deferrable
base, even when base::nohz_active is not set. Those deferrable timers also
need to be expired unconditional.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Cc: rt@linutronix.de
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20171222145337.633328378@linutronix.de
2017-12-29 23:13:09 +01:00
David S. Miller
6bb8824732 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/ipv6/ip6_gre.c is a case of parallel adds.

include/trace/events/tcp.h is a little bit more tricky.  The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29 15:42:26 -05:00
Thomas Gleixner
bc976233a8 genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
The new reservation mode for interrupts assigns a dummy vector when the
interrupt is allocated and assigns a real vector when the interrupt is
requested. The reservation mode prevents vector pressure when devices with
a large amount of queues/interrupts are initialized, but only a minimal
subset of those queues/interrupts is actually used.

This mode has an issue with MSI interrupts which cannot be masked. If the
driver is not careful or the hardware emits an interrupt before the device
irq is requestd by the driver then the interrupt ends up on the dummy
vector as a spurious interrupt which can cause malfunction of the device or
in the worst case a lockup of the machine.

Change the logic for the reservation mode so that the early activation of
MSI interrupts checks whether:

 - the device is a PCI/MSI device
 - the reservation mode of the underlying irqdomain is activated
 - PCI/MSI masking is globally enabled
 - the PCI/MSI device uses either MSI-X, which supports masking, or
   MSI with the maskbit supported.

If one of those conditions is false, then clear the reservation mode flag
in the irq data of the interrupt and invoke irq_domain_activate_irq() with
the reserve argument cleared. In the x86 vector code, clear the can_reserve
flag in the vector allocation data so a subsequent free_irq() won't create
the same situation again. The interrupt stays assigned to a real vector
until pci_disable_msi() is invoked and all allocations are undone.

Fixes: 4900be8360 ("x86/vector/msi: Switch to global reservation mode")
Reported-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
2017-12-29 21:13:05 +01:00
Thomas Gleixner
702cb0a028 genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
The 'early' argument of irq_domain_activate_irq() is actually used to
denote reservation mode. To avoid confusion, rename it before abuse
happens.

No functional change.

Fixes: 7249164346 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
2017-12-29 21:13:04 +01:00
Thomas Gleixner
69790ba92b genirq: Introduce IRQD_CAN_RESERVE flag
Add a new flag to mark interrupts which can use reservation mode. This is
going to be used in subsequent patches to disable reservation mode for a
certain class of MSI devices.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
2017-12-29 21:13:04 +01:00
Thomas Gleixner
da5dd9e854 genirq/msi: Handle reactivation only on success
When analyzing the fallout of the x86 vector allocation rework it turned
out that the error handling in msi_domain_alloc_irqs() is broken.

If MSI_FLAG_MUST_REACTIVATE is set for a MSI domain then it clears the
activation flag for a successfully initialized msi descriptor. If a
subsequent initialization fails then the error handling code path does not
deactivate the interrupt because the activation flag got cleared.

Move the clearing of the activation flag outside of the initialization loop
so that an eventual failure can be cleaned up correctly.

Fixes: 22d0b12f35 ("genirq/irqdomain: Add force reactivation flag to irq domains")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
2017-12-29 21:13:04 +01:00
Linus Torvalds
61233580f1 Power management fix for v4.15-rc6
This fixes a schedutil cpufreq governor regression from the 4.14
 cycle that may cause a CPU idleness check to return incorrect results
 in some cases which leads to suboptimal decisions (Joel Fernandes).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJaROaWAAoJEILEb/54YlRxCNMP/3Dr1wAhR26HgVyAVXN3tR8j
 RjYvn/G7WFuczl+nqsMT8A2Gig61BAWVhADh5HJrqoZbYlwOLXCnMJOcQy4FDy2g
 THPQdzf8eT0OHGF/+SZvOWTPX+nL63cN+i+U1bCYyuQ6+N1Kk3vZdMx8fbyJScQ2
 KtszEg1zh2CIzHSDlxqQFOzE8O2GAU+evBQYMrwLpPBUbi9yN9nCZJ8XKarfJwnB
 YgALt0BdLU+n7kFj4ZCx76Mjf3qQL7rV+3YNs49OJU90p72O4aiadv7XxDMv5Fpf
 Ie9xh0gouZLPziJiHwFPb3PqxMVojJPKpqRk73mitCc2T93Kk4oWozGt60k0tahO
 pYBh2OGHycnMI/cn2wpOk/hIdHVz6IUD5Of8LtXAiDhAnmvlj5PFdnHGDgvMwbga
 2Kzbw8n2fKfiX2r9UfJRjrQHaomvdy4kgneMVjgrh0dIu9GL5zK2sntF0pZeyyKH
 XzIBZ9Ue1PLOWIDW1LSPLMq6YKh95FLRTebodiWRbq2CYITCXn37qGMd4BhXwOEx
 CLkSE3vmWbh6Yr6pITlFU0oTMenc0SCHAqh+hHTr4NAYA5rooLspH0swRk5Lf2bz
 OC7K4BkPc0Lyv2MUj5ZKYa3UCgJ9QF/ekL4JW4PsWRZ1Afk8XqiCC5O9Kn4/iYGT
 xzcKW02R1p/hIWTOY2kw
 =TPkr
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "This fixes a schedutil cpufreq governor regression from the 4.14 cycle
  that may cause a CPU idleness check to return incorrect results in
  some cases which leads to suboptimal decisions (Joel Fernandes)"

* tag 'pm-4.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: schedutil: Use idle_calls counter of the remote CPU
2017-12-29 11:54:15 -08:00
Guenter Roeck
11bca0a83f genirq: Guard handle_bad_irq log messages
An interrupt storm on a bad interrupt will cause the kernel
log to be clogged.

[   60.089234] ->handle_irq():  ffffffffbe2f803f,
[   60.090455] 0xffffffffbf2af380
[   60.090510] handle_bad_irq+0x0/0x2e5
[   60.090522] ->irq_data.chip(): ffffffffbf2af380,
[   60.090553]    IRQ_NOPROBE set
[   60.090584] ->handle_irq():  ffffffffbe2f803f,
[   60.090590] handle_bad_irq+0x0/0x2e5
[   60.090596] ->irq_data.chip(): ffffffffbf2af380,
[   60.090602] 0xffffffffbf2af380
[   60.090608] ->action():           (null)
[   60.090779] handle_bad_irq+0x0/0x2e5

This was seen when running an upstream kernel on Acer Chromebook R11.  The
system was unstable as result.

Guard the log message with __printk_ratelimit to reduce the impact.  This
won't prevent the interrupt storm from happening, but at least the system
remains stable.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dtor@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197953
Link: https://lkml.kernel.org/r/1512234784-21038-1-git-send-email-linux@roeck-us.net
2017-12-28 12:28:29 +01:00
Joel Fernandes
466a2b42d6 cpufreq: schedutil: Use idle_calls counter of the remote CPU
Since the recent remote cpufreq callback work, its possible that a cpufreq
update is triggered from a remote CPU. For single policies however, the current
code uses the local CPU when trying to determine if the remote sg_cpu entered
idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
idle_calls counter of the remote CPU.

Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-12-28 12:26:54 +01:00
Andrew Lunn
39c3fd5895 kernel/irq: Extend lockdep class for request mutex
The IRQ code already has support for lockdep class for the lock mutex
in an interrupt descriptor. Extend this to add a second class for the
request mutex in the descriptor. Not having a class is resulting in
false positive splats in some code paths.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: linus.walleij@linaro.org
Cc: grygorii.strashko@ti.com
Cc: f.fainelli@gmail.com
Link: https://lkml.kernel.org/r/1512234664-21555-1-git-send-email-andrew@lunn.ch
2017-12-28 12:26:35 +01:00
David S. Miller
fcffe2edbd Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2017-12-28

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Fix incorrect state pruning related to recognition of zero initialized
   stack slots, where stacksafe exploration would mistakenly return a
   positive pruning verdict too early ignoring other slots, from Gianluca.

2) Various BPF to BPF calls related follow-up fixes. Fix an off-by-one
   in maximum call depth check, and rework maximum stack depth tracking
   logic to fix a bypass of the total stack size check reported by Jann.
   Also fix a bug in arm64 JIT where prog->jited_len was uninitialized.
   Addition of various test cases to BPF selftests, from Alexei.

3) Addition of a BPF selftest to test_verifier that is related to BPF to
   BPF calls which demonstrates a late caller stack size increase and
   thus out of bounds access. Fixed above in 2). Test case from Jann.

4) Addition of correlating BPF helper calls, BPF to BPF calls as well
   as BPF maps to bpftool xlated dump in order to allow for better
   BPF program introspection and debugging, from Daniel.

5) Fixing several bugs in BPF to BPF calls kallsyms handling in order
   to get it actually to work for subprogs, from Daniel.

6) Extending sparc64 JIT support for BPF to BPF calls and fix a couple
   of build errors for libbpf on sparc64, from David.

7) Allow narrower context access for BPF dev cgroup typed programs in
   order to adapt to LLVM code generation. Also adjust memlock rlimit
   in the test_dev_cgroup BPF selftest, from Yonghong.

8) Add netdevsim Kconfig entry to BPF selftests since test_offload.py
   relies on netdevsim device being available, from Jakub.

9) Reduce scope of xdp_do_generic_redirect_map() to being static,
   from Xiongwei.

10) Minor cleanups and spelling fixes in BPF verifier, from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 20:40:32 -05:00
Linus Torvalds
5f520fc318 While doing tests on tracing over the network, I found that the packets
were getting corrupted. In the process I found three bugs. One was the
 culprit, but the other two scared me. After deeper investigation, they
 were not as major as I thought they were, due to a signed compared to
 an unsigned that prevented a negative number from doing actual harm.
 
 The two bigger bugs:
 
  - Mask the ring buffer data page length. There are data flags at the
    high bits of the length field. These were not cleared via the
    length function, and the length could return a negative number.
    (Although the number returned was unsigned, but was assigned to a
    signed number) Luckily, this value was compared to PAGE_SIZE which is
    unsigned and kept it from entering the path that could have caused damage.
 
  - Check the page usage before reusing the ring buffer reader page.
    TCP increments the page ref when passing the page off to the network.
    The page is passed back to the ring buffer for use on free. But
    the page could still be in use by the TCP stack.
 
 Minor bugs:
 
  - Related to the first bug. No need to clear out the unused ring buffer
    data before sending to user space. It is now done by the ring buffer
    code itself.
 
  - Reset pointers after free on error path. There were some cases in
    the error path that pointers were freed but not set to NULL, and could
    have them freed again, having a pointer freed twice.
 -----BEGIN PGP SIGNATURE-----
 
 iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlpD9O8UHHJvc3RlZHRA
 Z29vZG1pcy5vcmcACgkQ1a05Y9njSUnC0Av9EqzJjJXlZuleCSiuh1umx33esgZv
 gOYTOXH9QLdKFHLpwVzeCsrhrLXNhbUfrGMQ0ERcpvVacHCKVwRyzx0nfI5W3rbt
 9sCsNsVR2SCVpzSWOvP9iJM0J/myFdZtYmGLC2BBJerXZFwl9Ciw+1bF7MFprb4v
 6r+49YrYMAR/H/obT3Aoh/XCOz0W0czk9ECGPhuwqAjWoNPwSgpbTdqpR92bJf85
 hGYppIX9d+4Gv4pZ2lfXDKrgiAPvHpp5I/znLDY8cG7GhcBjyXaetBb+XlfHI6D4
 jTS59f13CqcEhyFE5x2qwQBr9TTh043EKviixDud+nI1L7aNhDIBtb6tYrAmGWWh
 Rj1268gFjspi3pYTjI8cHXXCJSdQiAqFesiFLviU1c17PgjbBAnmkcsFSgOPxHqc
 j225jravcXtUqQq5J0qKR6Sn3LObfYJQk6tqpN6gWN76P75QgUms5W4+/NiEI0a3
 0LVjapxHZkDEYNRGmI+d0CvIJ3BWyb781Siw
 =xhPf
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "While doing tests on tracing over the network, I found that the
  packets were getting corrupted.

  In the process I found three bugs.

  One was the culprit, but the other two scared me. After deeper
  investigation, they were not as major as I thought they were, due to a
  signed compared to an unsigned that prevented a negative number from
  doing actual harm.

  The two bigger bugs:

   - Mask the ring buffer data page length. There are data flags at the
     high bits of the length field. These were not cleared via the
     length function, and the length could return a negative number.
     (Although the number returned was unsigned, but was assigned to a
     signed number) Luckily, this value was compared to PAGE_SIZE which
     is unsigned and kept it from entering the path that could have
     caused damage.

   - Check the page usage before reusing the ring buffer reader page.
     TCP increments the page ref when passing the page off to the
     network. The page is passed back to the ring buffer for use on
     free. But the page could still be in use by the TCP stack.

  Minor bugs:

   - Related to the first bug. No need to clear out the unused ring
     buffer data before sending to user space. It is now done by the
     ring buffer code itself.

   - Reset pointers after free on error path. There were some cases in
     the error path that pointers were freed but not set to NULL, and
     could have them freed again, having a pointer freed twice"

* tag 'trace-v4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix possible double free on failure of allocating trace buffer
  tracing: Fix crash when it fails to alloc ring buffer
  ring-buffer: Do no reuse reader page if still in use
  tracing: Remove extra zeroing out of the ring buffer page
  ring-buffer: Mask out the info bits when returning buffer page length
2017-12-27 13:06:57 -08:00
Steven Rostedt (VMware)
4397f04575 tracing: Fix possible double free on failure of allocating trace buffer
Jing Xia and Chunyan Zhang reported that on failing to allocate part of the
tracing buffer, memory is freed, but the pointers that point to them are not
initialized back to NULL, and later paths may try to free the freed memory
again. Jing and Chunyan fixed one of the locations that does this, but
missed a spot.

Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com

Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Reported-by: Jing Xia <jing.xia@spreadtrum.com>
Reported-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27 14:21:27 -05:00
Jing Xia
24f2aaf952 tracing: Fix crash when it fails to alloc ring buffer
Double free of the ring buffer happens when it fails to alloc new
ring buffer instance for max_buffer if TRACER_MAX_TRACE is configured.
The root cause is that the pointer is not set to NULL after the buffer
is freed in allocate_trace_buffers(), and the freeing of the ring
buffer is invoked again later if the pointer is not equal to Null,
as:

instance_mkdir()
    |-allocate_trace_buffers()
        |-allocate_trace_buffer(tr, &tr->trace_buffer...)
	|-allocate_trace_buffer(tr, &tr->max_buffer...)

          // allocate fail(-ENOMEM),first free
          // and the buffer pointer is not set to null
        |-ring_buffer_free(tr->trace_buffer.buffer)

       // out_free_tr
    |-free_trace_buffers()
        |-free_trace_buffer(&tr->trace_buffer);

	      //if trace_buffer is not null, free again
	    |-ring_buffer_free(buf->buffer)
                |-rb_free_cpu_buffer(buffer->buffers[cpu])
                    // ring_buffer_per_cpu is null, and
                    // crash in ring_buffer_per_cpu->pages

Link: http://lkml.kernel.org/r/20171226071253.8968-1-chunyan.zhang@spreadtrum.com

Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Signed-off-by: Jing Xia <jing.xia@spreadtrum.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@spreadtrum.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27 14:21:16 -05:00
Steven Rostedt (VMware)
ae415fa4c5 ring-buffer: Do no reuse reader page if still in use
To free the reader page that is allocated with ring_buffer_alloc_read_page(),
ring_buffer_free_read_page() must be called. For faster performance, this
page can be reused by the ring buffer to avoid having to free and allocate
new pages.

The issue arises when the page is used with a splice pipe into the
networking code. The networking code may up the page counter for the page,
and keep it active while sending it is queued to go to the network. The
incrementing of the page ref does not prevent it from being reused in the
ring buffer, and this can cause the page that is being sent out to the
network to be modified before it is sent by reading new data.

Add a check to the page ref counter, and only reuse the page if it is not
being used anywhere else.

Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27 14:21:09 -05:00
Steven Rostedt (VMware)
6b7e633fe9 tracing: Remove extra zeroing out of the ring buffer page
The ring_buffer_read_page() takes care of zeroing out any extra data in the
page that it returns. There's no need to zero it out again from the
consumer. It was removed from one consumer of this function, but
read_buffers_splice_read() did not remove it, and worse, it contained a
nasty bug because of it.

Cc: stable@vger.kernel.org
Fixes: 2711ca237a ("ring-buffer: Move zeroing out excess in page to ring buffer code")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27 14:20:59 -05:00
Steven Rostedt (VMware)
45d8b80c2a ring-buffer: Mask out the info bits when returning buffer page length
Two info bits were added to the "commit" part of the ring buffer data page
when returned to be consumed. This was to inform the user space readers that
events have been missed, and that the count may be stored at the end of the
page.

What wasn't handled, was the splice code that actually called a function to
return the length of the data in order to zero out the rest of the page
before sending it up to user space. These data bits were returned with the
length making the value negative, and that negative value was not checked.
It was compared to PAGE_SIZE, and only used if the size was less than
PAGE_SIZE. Luckily PAGE_SIZE is unsigned long which made the compare an
unsigned compare, meaning the negative size value did not end up causing a
large portion of memory to be randomly zeroed out.

Cc: stable@vger.kernel.org
Fixes: 66a8cb95ed ("ring-buffer: Add place holder recording of dropped events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-12-27 14:18:10 -05:00
Mathieu Malaterre
76dc6c097d cpu/hotplug: Move inline keyword at the beginning of declaration
Fix non-fatal warnings such as:

kernel/cpu.c:95:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
 static void inline cpuhp_lock_release(bool bringup) { }
 ^~~~~~

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20171226140855.16583-1-malat@debian.org
2017-12-27 19:41:04 +01:00
Alexei Starovoitov
aada9ce644 bpf: fix max call depth check
fix off by one error in max call depth check
and add a test

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-27 18:36:23 +01:00
Alexei Starovoitov
70a87ffea8 bpf: fix maximum stack depth tracking logic
Instead of computing max stack depth for current call chain
during the main verifier pass track stack depth of each
function independently and after do_check() is done do
another pass over all instructions analyzing depth
of all possible call stacks.

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-27 18:36:23 +01:00
Eric W. Biederman
c0ee554906 pid: Handle failure to allocate the first pid in a pid namespace
With the replacement of the pid bitmap and hashtable with an idr in
alloc_pid started occassionally failing when allocating the first pid
in a pid namespace.  Things were not completely reset resulting in
the first allocated pid getting the number 2 (not 1).  Which
further resulted in ns->proc_mnt not getting set and eventually
causing an oops in proc_flush_task.

Oops: 0000 [#1] SMP
CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
RIP: 0010:proc_flush_task+0x8e/0x1b0
RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
FS:  00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
 ? release_task+0xaf/0x680
 release_task+0xd2/0x680
 ? wait_consider_task+0xb82/0xce0
 wait_consider_task+0xbe9/0xce0
 ? do_wait+0xe1/0x330
 do_wait+0x151/0x330
 kernel_wait4+0x8d/0x150
 ? task_stopped_code+0x50/0x50
 SYSC_wait4+0x95/0xa0
 ? rcu_read_lock_sched_held+0x6c/0x80
 ? syscall_trace_enter+0x2d7/0x340
 ? do_syscall_64+0x60/0x210
 do_syscall_64+0x60/0x210
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f7cb82603aa
RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
 ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 <49> 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
CR2: 0000000000000000
---[ end trace 53d67a6481059862 ]---

Improve the quality of the implementation by resetting the place to
start allocating pids on failure to allocate the first pid.

As improving the quality of the implementation is the goal remove the now
unnecesarry disable_pid_allocations call when we fail to mount proc.

Fixes: 95846ecf9d ("pid: replace pid bitmap implementation with IDR API")
Fixes: 8ef047aaae ("pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-12-23 21:00:09 -06:00
Linus Torvalds
caf9a82657 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI preparatory patches from Thomas Gleixner:
 "Todays Advent calendar window contains twentyfour easy to digest
  patches. The original plan was to have twenty three matching the date,
  but a late fixup made that moot.

   - Move the cpu_entry_area mapping out of the fixmap into a separate
     address space. That's necessary because the fixmap becomes too big
     with NRCPUS=8192 and this caused already subtle and hard to
     diagnose failures.

     The top most patch is fresh from today and cures a brain slip of
     that tall grumpy german greybeard, who ignored the intricacies of
     32bit wraparounds.

   - Limit the number of CPUs on 32bit to 64. That's insane big already,
     but at least it's small enough to prevent address space issues with
     the cpu_entry_area map, which have been observed and debugged with
     the fixmap code

   - A few TLB flush fixes in various places plus documentation which of
     the TLB functions should be used for what.

   - Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
     more than sysenter now and keeping the name makes backtraces
     confusing.

   - Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
     which is only invoked on fork().

   - Make vysycall more robust.

   - A few fixes and cleanups of the debug_pagetables code. Check
     PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
     C89 initialization of the address hint array which already was out
     of sync with the index enums.

   - Move the ESPFIX init to a different place to prepare for PTI.

   - Several code moves with no functional change to make PTI
     integration simpler and header files less convoluted.

   - Documentation fixes and clarifications"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
  init: Invoke init_espfix_bsp() from mm_init()
  x86/cpu_entry_area: Move it out of the fixmap
  x86/cpu_entry_area: Move it to a separate unit
  x86/mm: Create asm/invpcid.h
  x86/mm: Put MMU to hardware ASID translation in one place
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Move the CR3 construction functions to tlbflush.h
  x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
  x86/mm: Remove superfluous barriers
  x86/mm: Use __flush_tlb_one() for kernel memory
  x86/microcode: Dont abuse the TLB-flush interface
  x86/uv: Use the right TLB-flush API
  x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
  x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
  x86/mm/64: Improve the memory map documentation
  x86/ldt: Prevent LDT inheritance on exec
  x86/ldt: Rework locking
  arch, mm: Allow arch_dup_mmap() to fail
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  ...
2017-12-23 11:53:04 -08:00
Gianluca Borello
fd05e57bb3 bpf: fix stacksafe exploration when comparing states
Commit cc2b14d510 ("bpf: teach verifier to recognize zero initialized
stack") introduced a very relaxed check when comparing stacks of different
states, effectively returning a positive result in many cases where it
shouldn't.

This can create problems in cases such as this following C pseudocode:

long var;
long *x = bpf_map_lookup(...);
if (!x)
        return;

if (*x != 0xbeef)
        var = 0;
else
        var = 1;

/* This is the key part, calling a helper causes an explored state
 * to be saved with the information that "var" is on the stack as
 * STACK_ZERO, since the helper is first met by the verifier after
 * the "var = 0" assignment. This state will however be wrongly used
 * also for the "var = 1" case, so the verifier assumes "var" is always
 * 0 and will replace the NULL assignment with nops, because the
 * search pruning prevents it from exploring the faulty branch.
 */
bpf_ktime_get_ns();

if (var)
        *(long *)0 = 0xbeef;

Fix the issue by making sure that the stack is fully explored before
returning a positive comparison result.

Also attach a couple tests that highlight the bad behavior. In the first
test, without this fix instructions 16 and 17 are replaced with nops
instead of being rejected by the verifier.

The second test, instead, allows a program to make a potentially illegal
read from the stack.

Fixes: cc2b14d510 ("bpf: teach verifier to recognize zero initialized stack")
Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-23 11:04:58 -08:00
Thomas Gleixner
c10e83f598 arch, mm: Allow arch_dup_mmap() to fail
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-22 20:13:01 +01:00
David S. Miller
fba961ab29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes.  Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.

Include a necessary change by Jakub Kicinski, with log message:

====================
cls_bpf no longer takes care of offload tracking.  Make sure
netdevsim performs necessary checks.  This fixes a warning
caused by TC trying to remove a filter it has not added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22 11:16:31 -05:00
Linus Torvalds
ead68f2161 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller"
 "What's a holiday weekend without some networking bug fixes? [1]

   1) Fix some eBPF JIT bugs wrt. SKB pointers across helper function
      calls, from Daniel Borkmann.

   2) Fix regression from errata limiting change to marvell PHY driver,
      from Zhao Qiang.

   3) Fix u16 overflow in SCTP, from Xin Long.

   4) Fix potential memory leak during bridge newlink, from Nikolay
      Aleksandrov.

   5) Fix BPF selftest build on s390, from Hendrik Brueckner.

   6) Don't append to cfg80211 automatically generated certs file,
      always write new ones from scratch. From Thierry Reding.

   7) Fix sleep in atomic in mac80211 hwsim, from Jia-Ju Bai.

   8) Fix hang on tg3 MTU change with certain chips, from Brian King.

   9) Add stall detection to arc emac driver and reset chip when this
      happens, from Alexander Kochetkov.

  10) Fix MTU limitng in GRE tunnel drivers, from Xin Long.

  11) Fix stmmac timestamping bug due to mis-shifting of field. From
      Fredrik Hallenberg.

  12) Fix metrics match when deleting an ipv4 route. The kernel sets
      some internal metrics bits which the user isn't going to set when
      it makes the delete request. From Phil Sutter.

  13) mvneta driver loop over RX queues limits on "txq_number" :-) Fix
      from Yelena Krivosheev.

  14) Fix double free and memory corruption in get_net_ns_by_id, from
      Eric W. Biederman.

  15) Flush ipv4 FIB tables in the reverse order. Some tables can share
      their actual backing data, in particular this happens for the MAIN
      and LOCAL tables. We have to kill the LOCAL table first, because
      it uses MAIN's backing memory. Fix from Ido Schimmel.

  16) Several eBPF verifier value tracking fixes, from Edward Cree, Jann
      Horn, and Alexei Starovoitov.

  17) Make changes to ipv6 autoflowlabel sysctl really propagate to
      sockets, unless the socket has set the per-socket value
      explicitly. From Shaohua Li.

  18) Fix leaks and double callback invocations of zerocopy SKBs, from
      Willem de Bruijn"

[1] Is this a trick question? "Relaxing"? "Quiet"? "Fine"? - Linus.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
  skbuff: skb_copy_ubufs must release uarg even without user frags
  skbuff: orphan frags before zerocopy clone
  net: reevalulate autoflowlabel setting after sysctl setting
  openvswitch: Fix pop_vlan action for double tagged frames
  ipv6: Honor specified parameters in fibmatch lookup
  bpf: do not allow root to mangle valid pointers
  selftests/bpf: add tests for recent bugfixes
  bpf: fix integer overflows
  bpf: don't prune branches when a scalar is replaced with a pointer
  bpf: force strict alignment checks for stack pointers
  bpf: fix missing error return in check_stack_boundary()
  bpf: fix 32-bit ALU op verification
  bpf: fix incorrect tracking of register size truncation
  bpf: fix incorrect sign extension in check_alu_op()
  bpf/verifier: fix bounds calculation on BPF_RSH
  ipv4: Fix use-after-free when flushing FIB tables
  s390/qeth: fix error handling in checksum cmd callback
  tipc: remove joining group member from congested list
  selftests: net: Adding config fragment CONFIG_NUMA=y
  nfp: bpf: keep track of the offloaded program
  ...
2017-12-21 15:57:30 -08:00
Daniel Borkmann
7105e828c0 bpf: allow for correlation of maps and helpers in dump
Currently a dump of an xlated prog (post verifier stage) doesn't
correlate used helpers as well as maps. The prog info lists
involved map ids, however there's no correlation of where in the
program they are used as of today. Likewise, bpftool does not
correlate helper calls with the target functions.

The latter can be done w/o any kernel changes through kallsyms,
and also has the advantage that this works with inlined helpers
and BPF calls.

Example, via interpreter:

  # tc filter show dev foo ingress
  filter protocol all pref 49152 bpf chain 0
  filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                      direct-action not_in_hw id 1 tag c74773051b364165   <-- prog id:1

  * Output before patch (calls/maps remain unclear):

  # bpftool prog dump xlated id 1             <-- dump prog id:1
   0: (b7) r1 = 2
   1: (63) *(u32 *)(r10 -4) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -4
   4: (18) r1 = 0xffff95c47a8d4800
   6: (85) call unknown#73040
   7: (15) if r0 == 0x0 goto pc+18
   8: (bf) r2 = r10
   9: (07) r2 += -4
  10: (bf) r1 = r0
  11: (85) call unknown#73040
  12: (15) if r0 == 0x0 goto pc+23
  [...]

  * Output after patch:

  # bpftool prog dump xlated id 1
   0: (b7) r1 = 2
   1: (63) *(u32 *)(r10 -4) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -4
   4: (18) r1 = map[id:2]                     <-- map id:2
   6: (85) call bpf_map_lookup_elem#73424     <-- helper call
   7: (15) if r0 == 0x0 goto pc+18
   8: (bf) r2 = r10
   9: (07) r2 += -4
  10: (bf) r1 = r0
  11: (85) call bpf_map_lookup_elem#73424
  12: (15) if r0 == 0x0 goto pc+23
  [...]

  # bpftool map show id 2                     <-- show/dump/etc map id:2
  2: hash_of_maps  flags 0x0
        key 4B  value 4B  max_entries 3  memlock 4096B

Example, JITed, same prog:

  # tc filter show dev foo ingress
  filter protocol all pref 49152 bpf chain 0
  filter protocol all pref 49152 bpf chain 0 handle 0x1 foo.o:[ingress] \
                  direct-action not_in_hw id 3 tag c74773051b364165 jited

  # bpftool prog show id 3
  3: sched_cls  tag c74773051b364165
        loaded_at Dec 19/13:48  uid 0
        xlated 384B  jited 257B  memlock 4096B  map_ids 2

  # bpftool prog dump xlated id 3
   0: (b7) r1 = 2
   1: (63) *(u32 *)(r10 -4) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -4
   4: (18) r1 = map[id:2]                      <-- map id:2
   6: (85) call __htab_map_lookup_elem#77408   <-+ inlined rewrite
   7: (15) if r0 == 0x0 goto pc+2                |
   8: (07) r0 += 56                              |
   9: (79) r0 = *(u64 *)(r0 +0)                <-+
  10: (15) if r0 == 0x0 goto pc+24
  11: (bf) r2 = r10
  12: (07) r2 += -4
  [...]

Example, same prog, but kallsyms disabled (in that case we are
also not allowed to pass any relative offsets, etc, so prog
becomes pointer sanitized on dump):

  # sysctl kernel.kptr_restrict=2
  kernel.kptr_restrict = 2

  # bpftool prog dump xlated id 3
   0: (b7) r1 = 2
   1: (63) *(u32 *)(r10 -4) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -4
   4: (18) r1 = map[id:2]
   6: (85) call bpf_unspec#0
   7: (15) if r0 == 0x0 goto pc+2
  [...]

Example, BPF calls via interpreter:

  # bpftool prog dump xlated id 1
   0: (85) call pc+2#__bpf_prog_run_args32
   1: (b7) r0 = 1
   2: (95) exit
   3: (b7) r0 = 2
   4: (95) exit

Example, BPF calls via JIT:

  # sysctl net.core.bpf_jit_enable=1
  net.core.bpf_jit_enable = 1
  # sysctl net.core.bpf_jit_kallsyms=1
  net.core.bpf_jit_kallsyms = 1

  # bpftool prog dump xlated id 1
   0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
   1: (b7) r0 = 1
   2: (95) exit
   3: (b7) r0 = 2
   4: (95) exit

And finally, an example for tail calls that is now working
as well wrt correlation:

  # bpftool prog dump xlated id 2
  [...]
  10: (b7) r2 = 8
  11: (85) call bpf_trace_printk#-41312
  12: (bf) r1 = r6
  13: (18) r2 = map[id:1]
  15: (b7) r3 = 0
  16: (85) call bpf_tail_call#12
  17: (b7) r1 = 42
  18: (6b) *(u16 *)(r6 +46) = r1
  19: (b7) r0 = 0
  20: (95) exit

  # bpftool map show id 1
  1: prog_array  flags 0x0
        key 4B  value 4B  max_entries 1  memlock 4096B

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-20 18:09:40 -08:00
Daniel Borkmann
4f74d80971 bpf: fix kallsyms handling for subprogs
Right now kallsyms handling is not working with JITed subprogs.
The reason is that when in 1c2a088a66 ("bpf: x64: add JIT support
for multi-function programs") in jit_subprogs() they are passed
to bpf_prog_kallsyms_add(), then their prog type is 0, which BPF
core will think it's a cBPF program as only cBPF programs have a
0 type. Thus, they need to inherit the type from the main prog.

Once that is fixed, they are indeed added to the BPF kallsyms
infra, but their tag is 0. Therefore, since intention is to add
them as bpf_prog_F_<tag>, we need to pass them to bpf_prog_calc_tag()
first. And once this is resolved, there is a use-after-free on
prog cleanup: we remove the kallsyms entry from the main prog,
later walk all subprogs and call bpf_jit_free() on them. However,
the kallsyms linkage was never released on them. Thus, do that
for all subprogs right in __bpf_prog_put() when refcount hits 0.

Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-20 18:09:40 -08:00
Alexei Starovoitov
82abbf8d2f bpf: do not allow root to mangle valid pointers
Do not allow root to convert valid pointers into unknown scalars.
In particular disallow:
 ptr &= reg
 ptr <<= reg
 ptr += ptr
and explicitly allow:
 ptr -= ptr
since pkt_end - pkt == length

1.
This minimizes amount of address leaks root can do.
In the future may need to further tighten the leaks with kptr_restrict.

2.
If program has such pointer math it's likely a user mistake and
when verifier complains about it right away instead of many instructions
later on invalid memory access it's easier for users to fix their progs.

3.
when register holding a pointer cannot change to scalar it allows JITs to
optimize better. Like 32-bit archs could use single register for pointers
instead of a pair required to hold 64-bit scalars.

4.
reduces architecture dependent behavior. Since code:
r1 = r10;
r1 &= 0xff;
if (r1 ...)
will behave differently arm64 vs x64 and offloaded vs native.

A significant chunk of ptr mangling was allowed by
commit f1174f77b5 ("bpf/verifier: rework value tracking")
yet some of it was allowed even earlier.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:26:29 +01:00
Alexei Starovoitov
bb7f0f989c bpf: fix integer overflows
There were various issues related to the limited size of integers used in
the verifier:
 - `off + size` overflow in __check_map_access()
 - `off + reg->off` overflow in check_mem_access()
 - `off + reg->var_off.value` overflow or 32-bit truncation of
   `reg->var_off.value` in check_mem_access()
 - 32-bit truncation in check_stack_boundary()

Make sure that any integer math cannot overflow by not allowing
pointer math with large values.

Also reduce the scope of "scalar op scalar" tracking.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00
Jann Horn
179d1c5602 bpf: don't prune branches when a scalar is replaced with a pointer
This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00
Jann Horn
a5ec6ae161 bpf: force strict alignment checks for stack pointers
Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00
Jann Horn
ea25f914dc bpf: fix missing error return in check_stack_boundary()
Prevent indirect stack accesses at non-constant addresses, which would
permit reading and corrupting spilled pointers.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00
Jann Horn
468f6eafa6 bpf: fix 32-bit ALU op verification
32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00
Jann Horn
0c17d1d2c6 bpf: fix incorrect tracking of register size truncation
Properly handle register truncation to a smaller size.

The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16996 for this issue.

v2:
 - flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
 - add CVE number (Ben Hutchings)

Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00
Jann Horn
95a762e2c8 bpf: fix incorrect sign extension in check_alu_op()
Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16995 for this issue.

v3:
 - add CVE number (Ben Hutchings)

Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-21 02:15:41 +01:00