Commit graph

706608 commits

Author SHA1 Message Date
Mengting Zhang
9789e7e93f perf report: Fix debug messages with --call-graph option
With --call-graph option, perf report can display call chains using
type, min percent threshold, optional print limit and order. And the
default call-graph parameter is 'graph,0.5,caller,function,percent'.

Before this patch, 'perf report --call-graph' shows incorrect debug
messages as below:

  # perf report --call-graph
  Invalid callchain mode: 0.5
  Invalid callchain order: 0.5
  Invalid callchain sort key: 0.5
  Invalid callchain config key: 0.5
  Invalid callchain mode: caller
  Invalid callchain mode: function
  Invalid callchain order: function
  Invalid callchain mode: percent
  Invalid callchain order: percent
  Invalid callchain sort key: percent

That is because in function __parse_callchain_report_opt(),each field of
the call-graph parameter is passed to parse_callchain_{mode,order,
sort_key,value} in turn until it meets the matching value.

For example, the order field "caller" is passed to
parse_callchain_mode() firstly and obviously it doesn't match any mode
field. Therefore parse_callchain_mode() will shows the debug message
"Invalid callchain mode: caller", which could confuse users.

The patch fixes this issue by moving the warning out of the function
parse_callchain_{mode,order,sort_key,value}.

Signed-off-by: Mengting Zhang <zhangmengting@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Krister Johansen <kjlx@templeofstupid.com>
Cc: Li Bin <huawei.libin@huawei.com>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Yao Jin <yao.jin@linux.intel.com>
Link: http://lkml.kernel.org/r/1506154694-39691-1-git-send-email-zhangmengting@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25 12:20:12 -03:00
Shaohua Li
f5c156c4c2 block: fix a crash caused by wrong API
part_stat_show takes a part device not a disk, so we should use
part_to_disk.

Fixes: d62e26b3ffd2("block: pass in queue to inflight accounting")
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Lukas Czerner
332391a993 fs: Fix page cache inconsistency when mixing buffered and AIO DIO
Currently when mixing buffered reads and asynchronous direct writes it
is possible to end up with the situation where we have stale data in the
page cache while the new data is already written to disk. This is
permanent until the affected pages are flushed away. Despite the fact
that mixing buffered and direct IO is ill-advised it does pose a thread
for a data integrity, is unexpected and should be fixed.

Fix this by deferring completion of asynchronous direct writes to a
process context in the case that there are mapped pages to be found in
the inode. Later before the completion in dio_complete() invalidate
the pages in question. This ensures that after the completion the pages
in the written area are either unmapped, or populated with up-to-date
data. Also do the same for the iomap case which uses
iomap_dio_complete() instead.

This has a side effect of deferring the completion to a process context
for every AIO DIO that happens on inode that has pages mapped. However
since the consensus is that this is ill-advised practice the performance
implication should not be a problem.

This was based on proposal from Jeff Moyer, thanks!

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
bb1cc74790 nvmet: implement valid sqhd values in completions
To support sqhd, for initiators that are following the spec and
paying attention to sqhd vs their sqtail values:

- add sqhd to struct nvmet_sq
- initialize sqhd to 0 in nvmet_sq_setup
- rather than propagate the 0's-based qsize value from the connect message
  which requires a +1 in every sqhd update, and as nothing else references
  it, convert to 1's-based value in nvmt_sq/cq_setup() calls.
- validate connect message sqsize being non-zero per spec.
- updated assign sqhd for every completion that goes back.

Also remove handling the NULL sq case in __nvmet_req_complete, as it can't
happen with the current code.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Guilherme G. Piccoli
8edd11c9ad nvme-fabrics: Allow 0 as KATO value
Currently, driver code allows user to set 0 as KATO
(Keep Alive TimeOut), but this is not being respected.
This patch enforces the expected behavior.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
0951338d96 nvme: allow timed-out ios to retry
Currently the nvme_req_needs_retry() applies several checks to see if
a retry is allowed. On of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check, if an io
times out, means there is never a retry allowed for the io. Which means
applications see the io failure.

Remove this check and allow the io to timeout, like it does on other
protocols, and retries to be made.

On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
cd48282cc7 nvme: stop aer posting if controller state not live
If an nvme async_event command completes, in most cases, a new
async event is posted. However, if the controller enters a
resetting or reconnecting state, there is nothing to block the
scheduled work element from posting the async event again. Nor are
there calls from the transport to stop async events when an
association dies.

In the case of FC, where the association is torn down, the aer must
be aborted on the FC link and completes through the normal job
completion path. Thus the terminated async event ends up being
rescheduled even though the controller isn't in a valid state for
the aer, and the reposting gets the transport into a partially torn
down data structure.

It's possible to hit the scenario on rdma, although much less likely
due to an aer completing right as the association is terminated and
as the association teardown reclaims the blk requests via
nvme_cancel_request() so its immediate, not a link-related action
like on FC.

Fix by putting controller state checks in both the async event
completion routine where it schedules the async event and in the
async event work routine before it calls into the transport. It's
effectively a "stop_async_events()" behavior.  The transport, when
it creates a new association with the subsystem will transition
the state back to live and is already restarting the async event
posting.

Signed-off-by: James Smart <james.smart@broadcom.com>
[hch: remove taking a lock over reading the controller state]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Keith Busch
d087747384 nvme-pci: Print invalid SGL only once
The WARN_ONCE macro returns true if the condition is true, not if the
warn was raised, so we're printing the scatter list every time it's
invalid. This is excessive and makes debugging harder, so this patch
prints it just once.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Keith Busch
161b8be2bd nvme-pci: initialize queue memory before interrupts
A spurious interrupt before the nvme driver has initialized the completion
queue may inadvertently cause the driver to believe it has a completion
to process. This may result in a NULL dereference since the nvmeq's tags
are not set at this point.

The patch initializes the host's CQ memory so that a spurious interrupt
isn't mistaken for a real completion.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
deb61742e0 nvmet-fc: fix failing max io queue connections
fc transport is treating NVMET_NR_QUEUES as maximum queue count, e.g.
admin queue plus NVMET_NR_QUEUES-1 io queues.  But NVMET_NR_QUEUES is
the number of io queues, so maximum queue count is really
NVMET_NR_QUEUES+1.

Fix the handling in the target fc transport

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
d9d34c0b23 nvme-fc: use transport-specific sgl format
Sync with NVM Express spec change and FC-NVME 1.18.

FC transport sets SGL type to Transport SGL Data Block Descriptor and
subtype to transport-specific value 0x0A.

Removed the warn-on's on the PRP fields. They are unneeded. They were
to check for values from the upper layer that weren't set right, and
for the most part were fine. But, with Async events, which reuse the
same structure and 2nd time issued the SGL overlay converted them to
the Transport SGL values - the warn-on's were errantly firing.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
d85cf20749 nvme: add transport SGL definitions
Add transport SGL defintions from NVMe TP 4008, required for
the final NVMe-FC standard.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
c98cb3bd88 nvme.h: remove FC transport-specific error values
The NVM express group recinded the reserved range for the transport.
Remove the FC-centric values that had been defined.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
39a550d2d9 qla2xxx: remove use of FC-specific error codes
The qla2xxx driver uses the FC-specific error when it needed to return an
error to the FC-NVME transport.  Convert to use a generic value instead.

Signed-off-by: James Smart <james.smart@broadcom.com>
Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
8e009ce846 lpfc: remove use of FC-specific error codes
The lpfc driver uses the FC-specific error when it needed to return an
error to the FC-NVME transport. Convert to use a generic value instead.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
fc9608e8b4 nvmet-fcloop: remove use of FC-specific error codes
The FC-NVME transport loopback test module used the FC-specific error
codes in cases where it emulated a transport abort case. Instead of
using the FC-specific values, now use a generic value (NVME_SC_INTERNAL).

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
29b3d26ecc nvmet-fc: remove use of FC-specific error codes
The FC-NVME target transport used the FC-specific error codes in
return codes when the transport or lldd failed. Instead of using the
FC-specific values, now use a generic value (NVME_SC_INTERNAL).

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart
56b7103a06 nvme-fc: remove use of FC-specific error codes
The FC-NVME transport used the FC-specific error codes in cases where
it had to fabricate an error to go back up stack. Instead of using the
FC-specific values, now use a generic value (NVME_SC_INTERNAL).

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Omar Sandoval
e5313c141b loop: remove union of use_aio and ref in struct loop_cmd
When the request is completed, lo_complete_rq() checks cmd->use_aio.
However, if this is in fact an aio request, cmd->use_aio will have
already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a
union. On x86_64, there's a hole after the union anyways, so this
doesn't make struct loop_cmd any bigger.

Fixes: 92d773324b ("block/loop: fix use after free")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Waiman Long
5acb3cc2c2 blktrace: Fix potential deadlock between delete & sysfs ops
The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, require
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opended sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Fix typo in patch subject line, and prune a comment detailing how
the code used to work.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Josef Bacik
1dae69bede nbd: ignore non-nbd ioctl's
In testing we noticed that nbd would spew if you ran a fio job against
the raw device itself.  This is because fio calls a block device
specific ioctl, however the block layer will first pass this back to the
driver ioctl handler in case the driver wants to do something special.
Since the device was setup using netlink this caused us to spew every
time fio called this ioctl.  Since we don't have special handling, just
error out for any non-nbd specific ioctl's that come in.  This fixes the
spew.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Christoph Hellwig
f507b54dcc bsg-lib: don't free job in bsg_prepare_job
The job structure is allocated as part of the request, so we should not
free it in the error path of bsg_prepare_job.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Mikulas Patocka
02a4843618 brd: fix overflow in __brd_direct_access
The code in __brd_direct_access multiplies the pgoff variable by page size
and divides it by 512. It can cause overflow on 32-bit architectures. The
overflow happens if we create ramdisk larger than 4G and use it as a
sparse device.

This patch replaces multiplication and division with multiplication by the
number of sectors per page.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 1647b9b959 ("brd: add dax_operations support")
Cc: stable@vger.kernel.org	# 4.12+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Alexandru Moise
2827a418ca genirq: Check __free_irq() return value for NULL
__free_irq() can return a NULL irqaction for example when trying to free
already-free IRQ, but the callsite unconditionally dereferences the
returned pointer.

Fix this by adding a check and return NULL.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20170919200412.GA29985@gmail.com
2017-09-25 16:40:31 +02:00
Peter Zijlstra
c74aef2d06 futex: Fix pi_state->owner serialization
There was a reported suspicion about a race between exit_pi_state_list()
and put_pi_state(). The same report mentioned the comment with
put_pi_state() said it should be called with hb->lock held, and it no
longer is in all places.

As it turns out, the pi_state->owner serialization is indeed broken. As per
the new rules:

  734009e96d ("futex: Change locking rules")

pi_state->owner should be serialized by pi_state->pi_mutex.wait_lock.
For the sites setting pi_state->owner we already hold wait_lock (where
required) but exit_pi_state_list() and put_pi_state() were not and
raced on clearing it.

Fixes: 734009e96d ("futex: Change locking rules")
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: dvhart@infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20170922154806.jd3ffltfk24m4o4y@hirez.programming.kicks-ass.net
2017-09-25 16:37:11 +02:00
Eric Biggers
e007ce9c59 KEYS: use kmemdup() in request_key_auth_new()
kmemdup() is preferred to kmalloc() followed by memcpy().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
4aa68e07d8 KEYS: restrict /proc/keys by credentials at open time
When checking for permission to view keys whilst reading from
/proc/keys, we should use the credentials with which the /proc/keys file
was opened.  This is because, in a classic type of exploit, it can be
possible to bypass checks for the *current* credentials by passing the
file descriptor to a suid program.

Following commit 34dbbcdbf6 ("Make file credentials available to the
seqfile interfaces") we can finally fix it.  So let's do it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
8f674565d4 KEYS: reset parent each time before searching key_user_tree
In key_user_lookup(), if there is no key_user for the given uid, we drop
key_user_lock, allocate a new key_user, and search the tree again.  But
we failed to set 'parent' to NULL at the beginning of the second search.
If the tree were to be empty for the second search, the insertion would
be done with an invalid 'parent', scribbling over freed memory.

Fortunately this can't actually happen currently because the tree always
contains at least the root_key_user.  But it still should be fixed to
make the code more robust.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
37863c43b2 KEYS: prevent KEYCTL_READ on negative key
Because keyctl_read_key() looks up the key with no permissions
requested, it may find a negatively instantiated key.  If the key is
also possessed, we went ahead and called ->read() on the key.  But the
key payload will actually contain the ->reject_error rather than the
normal payload.  Thus, the kernel oopses trying to read the
user_key_payload from memory address (int)-ENOKEY = 0x00000000ffffff82.

Fortunately the payload data is stored inline, so it shouldn't be
possible to abuse this as an arbitrary memory read primitive...

Reproducer:
    keyctl new_session
    keyctl request2 user desc '' @s
    keyctl read $(keyctl show | awk '/user: desc/ {print $1}')

It causes a crash like the following:
     BUG: unable to handle kernel paging request at 00000000ffffff92
     IP: user_read+0x33/0xa0
     PGD 36a54067 P4D 36a54067 PUD 0
     Oops: 0000 [#1] SMP
     CPU: 0 PID: 211 Comm: keyctl Not tainted 4.14.0-rc1 #337
     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
     task: ffff90aa3b74c3c0 task.stack: ffff9878c0478000
     RIP: 0010:user_read+0x33/0xa0
     RSP: 0018:ffff9878c047bee8 EFLAGS: 00010246
     RAX: 0000000000000001 RBX: ffff90aa3d7da340 RCX: 0000000000000017
     RDX: 0000000000000000 RSI: 00000000ffffff82 RDI: ffff90aa3d7da340
     RBP: ffff9878c047bf00 R08: 00000024f95da94f R09: 0000000000000000
     R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
     R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
     FS:  00007f58ece69740(0000) GS:ffff90aa3e200000(0000) knlGS:0000000000000000
     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     CR2: 00000000ffffff92 CR3: 0000000036adc001 CR4: 00000000003606f0
     Call Trace:
      keyctl_read_key+0xac/0xe0
      SyS_keyctl+0x99/0x120
      entry_SYSCALL_64_fastpath+0x1f/0xbe
     RIP: 0033:0x7f58ec787bb9
     RSP: 002b:00007ffc8d401678 EFLAGS: 00000206 ORIG_RAX: 00000000000000fa
     RAX: ffffffffffffffda RBX: 00007ffc8d402800 RCX: 00007f58ec787bb9
     RDX: 0000000000000000 RSI: 00000000174a63ac RDI: 000000000000000b
     RBP: 0000000000000004 R08: 00007ffc8d402809 R09: 0000000000000020
     R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc8d402800
     R13: 00007ffc8d4016e0 R14: 0000000000000000 R15: 0000000000000000
     Code: e5 41 55 49 89 f5 41 54 49 89 d4 53 48 89 fb e8 a4 b4 ad ff 85 c0 74 09 80 3d b9 4c 96 00 00 74 43 48 8b b3 20 01 00 00 4d 85 ed <0f> b7 5e 10 74 29 4d 85 e4 74 24 4c 39 e3 4c 89 e2 4c 89 ef 48
     RIP: user_read+0x33/0xa0 RSP: ffff9878c047bee8
     CR2: 00000000ffffff92

Fixes: 61ea0c0ba9 ("KEYS: Skip key state checks when checking for possession")
Cc: <stable@vger.kernel.org>	[v3.13+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
237bbd29f7 KEYS: prevent creating a different user's keyrings
It was possible for an unprivileged user to create the user and user
session keyrings for another user.  For example:

    sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u
                           keyctl add keyring _uid_ses.4000 "" @u
                           sleep 15' &
    sleep 1
    sudo -u '#4000' keyctl describe @u
    sudo -u '#4000' keyctl describe @us

This is problematic because these "fake" keyrings won't have the right
permissions.  In particular, the user who created them first will own
them and will have full access to them via the possessor permissions,
which can be used to compromise the security of a user's keys:

    -4: alswrv-----v------------  3000     0 keyring: _uid.4000
    -5: alswrv-----v------------  3000     0 keyring: _uid_ses.4000

Fix it by marking user and user session keyrings with a flag
KEY_FLAG_UID_KEYRING.  Then, when searching for a user or user session
keyring by name, skip all keyrings that don't have the flag set.

Fixes: 69664cf16a ("keys: don't generate user and user session keyrings unless they're accessed")
Cc: <stable@vger.kernel.org>	[v2.6.26+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
e645016abc KEYS: fix writing past end of user-supplied buffer in keyring_read()
Userspace can call keyctl_read() on a keyring to get the list of IDs of
keys in the keyring.  But if the user-supplied buffer is too small, the
kernel would write the full list anyway --- which will corrupt whatever
userspace memory happened to be past the end of the buffer.  Fix it by
only filling the space that is available.

Fixes: b2a4df200d ("KEYS: Expand the capacity of a keyring")
Cc: <stable@vger.kernel.org>	[v3.13+]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
7fc0786d95 KEYS: fix key refcount leak in keyctl_read_key()
In keyctl_read_key(), if key_permission() were to return an error code
other than EACCES, we would leak a the reference to the key.  This can't
actually happen currently because key_permission() can only return an
error code other than EACCES if security_key_permission() does, only
SELinux and Smack implement that hook, and neither can return an error
code other than EACCES.  But it should still be fixed, as it is a bug
waiting to happen.

Fixes: 29db919063 ("[PATCH] Keys: Add LSM hooks for key management [try #3]")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
884bee0215 KEYS: fix key refcount leak in keyctl_assume_authority()
In keyctl_assume_authority(), if keyctl_change_reqkey_auth() were to
fail, we would leak the reference to the 'authkey'.  Currently this can
only happen if prepare_creds() fails to allocate memory.  But it still
should be fixed, as it is a more severe bug waiting to happen.

This patch also moves the read of 'authkey->serial' to before the
reference to the authkey is dropped.  Doing the read after dropping the
reference is very fragile because it assumes we still hold another
reference to the key.  (Which we do, in current->cred->request_key_auth,
but there's no reason not to write it in the "obviously correct" way.)

Fixes: d84f4f992c ("CRED: Inaugurate COW credentials")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:57 +01:00
Eric Biggers
f7b48cf08f KEYS: don't revoke uninstantiated key in request_key_auth_new()
If key_instantiate_and_link() were to fail (which fortunately isn't
possible currently), the call to key_revoke(authkey) would crash with a
NULL pointer dereference in request_key_auth_revoke() because the key
has not yet been instantiated.

Fix this by removing the call to key_revoke().  key_put() is sufficient,
as it's not possible for an uninstantiated authkey to have been used for
anything yet.

Fixes: b5f545c880 ("[PATCH] keys: Permit running process to instantiate keys")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:19:56 +01:00
Eric Biggers
44d8143340 KEYS: fix cred refcount leak in request_key_auth_new()
In request_key_auth_new(), if key_alloc() or key_instantiate_and_link()
were to fail, we would leak a reference to the 'struct cred'.  Currently
this can only happen if key_alloc() fails to allocate memory.  But it
still should be fixed, as it is a more severe bug waiting to happen.

Fix it by cleaning things up to use a helper function which frees a
'struct request_key_auth' correctly.

Fixes: d84f4f992c ("CRED: Inaugurate COW credentials")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-09-25 15:03:55 +01:00
Arnaldo Carvalho de Melo
f1e52f14a6 perf evsel: Fix attr.exclude_kernel setting for default cycles:p
Yet another fix for probing the max attr.precise_ip setting: it is not
enough settting attr.exclude_kernel for !root users, as they _can_
profile the kernel if the kernel.perf_event_paranoid sysctl is set to
-1, so check that as well.

Testing it:

As non root:

  $ sysctl kernel.perf_event_paranoid
  kernel.perf_event_paranoid = 2
  $ perf record sleep 1
  $ perf evlist -v
  cycles:uppp: ..., exclude_kernel: 1, ... precise_ip: 3, ...

Now as non-root, but with kernel.perf_event_paranoid set set to the
most permissive value, -1:

  $ sysctl kernel.perf_event_paranoid
  kernel.perf_event_paranoid = -1
  $ perf record sleep 1
  $ perf evlist -v
  cycles:ppp: ..., exclude_kernel: 0, ... precise_ip: 3, ...
  $

I.e. non-root, default kernel.perf_event_paranoid: :uppp modifier = not allowed to sample the kernel,
     non-root, most permissible kernel.perf_event_paranoid: :ppp = allowed to sample the kernel.

In both cases, use the highest available precision: attr.precise_ip = 3.

Reported-and-Tested-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Fixes: d37a369790 ("perf evsel: Fix attr.exclude_kernel setting for default cycles:p")
Link: http://lkml.kernel.org/n/tip-nj2qkf75xsd6pw6hhjzfqqdx@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25 10:39:45 -03:00
Ingo Molnar
549a397652 tools include: Sync kernel ABI headers with tooling headers
Time for a sync with ABI/uapi headers with the upcoming v4.14 kernel.

None of the ABI changes require any source code level changes to our
existing in-kernel tooling code:

  - tools/arch/s390/include/uapi/asm/kvm.h:

      New KVM_S390_VM_TOD_EXT ABI, not used by in-kernel tooling.

  - tools/arch/x86/include/asm/cpufeatures.h:
    tools/arch/x86/include/asm/disabled-features.h:

      New PCID, SME and VGIF x86 CPU feature bits defined.

  - tools/include/asm-generic/hugetlb_encode.h:
    tools/include/uapi/asm-generic/mman-common.h:
    tools/include/uapi/linux/mman.h:

      Two new madvise() flags, plus a hugetlb system call mmap flags
      restructuring/extension changes.

  - tools/include/uapi/drm/drm.h:
    tools/include/uapi/drm/i915_drm.h:

      New drm_syncobj_create flags definitions, new drm_syncobj_wait
      and drm_syncobj_array ABIs. DRM_I915_PERF_* calls and a new
      I915_PARAM_HAS_EXEC_FENCE_ARRAY ABI for the Intel driver.

  - tools/include/uapi/linux/bpf.h:

      New bpf_sock fields (::mark and ::priority), new XDP_REDIRECT
      action, new kvm_ppc_smmu_info fields (::data_keys, instr_keys)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Taeung Song <treeze.taeung@gmail.com>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Yao Jin <yao.jin@linux.intel.com>
Link: http://lkml.kernel.org/r/20170913073823.lxmi4c7ejqlfabjx@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25 10:39:44 -03:00
Arnaldo Carvalho de Melo
89975bd335 perf tools: Get all of tools/{arch,include}/ in the MANIFEST
Now that I'm switching the container builds from using a local volume
pointing to the kernel repository with the perf sources, instead getting
a detached tarball to be able to use a container cluster, some places
broke because I forgot to put some of the required files in
tools/perf/MANIFEST, namely some bitsperlong.h files.

So, to fix it do the same as for tools/build/ and pack the whole
tools/arch/ directory.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-wmenpjfjsobwdnfde30qqncj@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-09-25 10:39:43 -03:00
Babu Moger
428dbf156c arch: change default endian for microblaze
Fix the default for microblaze. Michal Simek mentioned default for
microblaze should be CPU_LITTLE_ENDIAN.

Fixes : commit 206d3642d8 ("arch/microblaze: add choice for endianness
	and update Makefile")

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2017-09-25 15:31:26 +02:00
Thomas Meyer
64c99853ba microblaze: Cocci spatch "vma_pages"
Use vma_pages function on vma object instead of explicit computation.
Found by coccinelle spatch "api/vma_pages.cocci"

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2017-09-25 15:29:39 +02:00
Michal Simek
0add53713b microblaze: Add missing kvm_para.h to Kbuild
Running make allmodconfig;make is throwing compilation error:
  CC      kernel/watchdog.o
In file included from ./include/linux/kvm_para.h:4:0,
                 from kernel/watchdog.c:29:
./include/uapi/linux/kvm_para.h:32:26: fatal error: asm/kvm_para.h: No
such file or directory
 #include <asm/kvm_para.h>
                          ^
compilation terminated.
make[1]: *** [kernel/watchdog.o] Error 1
make: *** [kernel/watchdog.o] Error 2

Reported-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Fixes: 83f0124ad8 ("microblaze: remove asm-generic wrapper headers")
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Tested-by: Michal Hocko <mhocko@suse.com>
2017-09-25 15:25:47 +02:00
Kan Liang
29b46dfb13 perf/x86/intel/uncore: Correct num_boxes for IIO and IRP
There are 6 IIO/IRP boxes for CBDMA, PCIe0-2, MCP 0 and MCP 1
separately. Correct the num_boxes.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ak@linux.intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1505149816-12580-1-git-send-email-kan.liang@intel.com
2017-09-25 12:43:56 +02:00
Kan Liang
450a978935 perf/x86/intel/rapl: Add missing CPU IDs
DENVERTON and GEMINI_LAKE support same RAPL counters as Apollo Lake.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ak@linux.intel.com
Cc: peterz@infradead.org
Cc: piotr.luc@intel.com
Cc: harry.pan@intel.com
Cc: srinivas.pandruvada@linux.intel.com
Link: http://lkml.kernel.org/r/20170908213449.6224-3-kan.liang@intel.com
2017-09-25 09:36:17 +02:00
Kan Liang
1aaccc40a1 perf/x86/msr: Add missing CPU IDs
Goldmont, Glodmont plus and Xeon Phi have MSR_SMI_COUNT as well.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ak@linux.intel.com
Cc: peterz@infradead.org
Cc: piotr.luc@intel.com
Cc: harry.pan@intel.com
Cc: srinivas.pandruvada@linux.intel.com
Link: http://lkml.kernel.org/r/20170908213449.6224-2-kan.liang@intel.com
2017-09-25 09:36:17 +02:00
Kan Liang
b09c146f8f perf/x86/intel/cstate: Add missing CPU IDs
Skylake server uses the same C-state residency events as Sandy Bridge.

Denverton and Gemini lake use the same C-state residency events as
Apollo Lake.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ak@linux.intel.com
Cc: peterz@infradead.org
Cc: piotr.luc@intel.com
Cc: harry.pan@intel.com
Cc: srinivas.pandruvada@linux.intel.com
Link: http://lkml.kernel.org/r/20170908213449.6224-1-kan.liang@intel.com
2017-09-25 09:36:17 +02:00
Ville Syrjälä
5ac751d9e6 x86: Don't cast away the __user in __get_user_asm_u64()
Don't cast away the __user in __get_user_asm_u64() on x86-32.
Prevents sparse getting upset.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20170912164000.13745-1-ville.syrjala@linux.intel.com
2017-09-25 09:36:16 +02:00
Sean Fu
7d7099433d x86/sysfs: Fix off-by-one error in loop termination
An off-by-one error in loop terminantion conditions in
create_setup_data_nodes() will lead to memory leak when
create_setup_data_node() failed.

Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1505090001-1157-1-git-send-email-fxinrong@gmail.com
2017-09-25 09:36:16 +02:00
Laurent Dufour
a3c4fb7c9c x86/mm: Fix fault error path using unsafe vma pointer
commit 7b2d0dbac4 ("x86/mm/pkeys: Pass VMA down in to fault signal
generation code") passes down a vma pointer to the error path, but that is
done once the mmap_sem is released when calling mm_fault_error() from
__do_page_fault().

This is dangerous as the vma structure is no more safe to be used once the
mmap_sem has been released. As only the protection key value is required in
the error processing, we could just pass down this value.

Fix it by passing a pointer to a protection key value down to the fault
signal generation code. The use of a pointer allows to keep the check
generating a warning message in fill_sig_info_pkey() when the vma was not
known. If the pointer is valid, the protection value can be accessed by
deferencing the pointer.

[ tglx: Made *pkey u32 as that's the type which is passed in siginfo ]

Fixes: 7b2d0dbac4 ("x86/mm/pkeys: Pass VMA down in to fault signal generation code")
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1504513935-12742-1-git-send-email-ldufour@linux.vnet.ibm.com
2017-09-25 09:36:15 +02:00
Bhumika Goyal
10430364eb x86/numachip: Add const and __initconst to numachip2_clockevent
Make this const as it is only used during a copy operation and add
__initconst as this usage is during the initialization phase.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: julia.lawall@lip6.fr
Cc: daniel.lezcano@linaro.org
Link: http://lkml.kernel.org/r/1504030631-10812-1-git-send-email-bhumirks@gmail.com
2017-09-25 09:36:15 +02:00
Eric Biggers
d5c8028b47 x86/fpu: Reinitialize FPU registers if restoring FPU state fails
Userspace can change the FPU state of a task using the ptrace() or
rt_sigreturn() system calls.  Because reserved bits in the FPU state can
cause the XRSTOR instruction to fail, the kernel has to carefully
validate that no reserved bits or other invalid values are being set.

Unfortunately, there have been bugs in this validation code.  For
example, we were not checking that the 'xcomp_bv' field in the
xstate_header was 0.  As-is, such bugs are exploitable to read the FPU
registers of other processes on the system.  To do so, an attacker can
create a task, assign to it an invalid FPU state, then spin in a loop
and monitor the values of the FPU registers.  Because the task's FPU
registers are not being restored, sometimes the FPU registers will have
the values from another process.

This is likely to continue to be a problem in the future because the
validation done by the CPU instructions like XRSTOR is not immediately
visible to kernel developers.  Nor will invalid FPU states ever be
encountered during ordinary use --- they will only be seen during
fuzzing or exploits.  There can even be reserved bits outside the
xstate_header which are easy to forget about.  For example, the MXCSR
register contains reserved bits, which were not validated by the
KVM_SET_XSAVE ioctl until commit a575813bfe ("KVM: x86: Fix load
damaged SSEx MXCSR register").

Therefore, mitigate this class of vulnerability by restoring the FPU
registers from init_fpstate if restoring from the task's state fails.

We actually used to do this, but it was (perhaps unwisely) removed by
commit 9ccc27a5d2 ("x86/fpu: Remove error return values from
copy_kernel_to_*regs() functions").  This new patch is also a bit
different.  First, it only clears the registers, not also the bad
in-memory state; this is simpler and makes it easier to make the
mitigation cover all callers of __copy_kernel_to_fpregs().  Second, it
does the register clearing in an exception handler so that no extra
instructions are added to context switches.  In fact, we *remove*
instructions, since previously we were always zeroing the register
containing 'err' even if CONFIG_X86_DEBUG_FPU was disabled.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20170922174156.16780-4-ebiggers3@gmail.com
Link: http://lkml.kernel.org/r/20170923130016.21448-27-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-25 09:26:38 +02:00