Commit graph

649345 commits

Author SHA1 Message Date
Shaohua Li
a12f1ae61c aio: fix lock dep warning
lockdep reports a warnning. file_start_write/file_end_write only
acquire/release the lock for regular files. So checking the files in aio
side too.

[  453.532141] ------------[ cut here ]------------
[  453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
[  453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  453.533011] Modules linked in:
[  453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
[  453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[  453.533011]  ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
[  453.533011]  ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
[  453.533011]  ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
[  453.533011] Call Trace:
[  453.533011]  [<ffffffff8196cffb>] dump_stack+0x67/0x9c
[  453.533011]  [<ffffffff81091ee1>] __warn+0x111/0x130
[  453.533011]  [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
[  453.533011]  [<ffffffff81091f00>] ? __warn+0x130/0x130
[  453.533011]  [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
[  453.533011]  [<ffffffff811205d4>] lock_release+0x434/0x670
[  453.533011]  [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
[  453.533011]  [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
[  453.533011]  [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
[  453.533011]  [<ffffffff813aa6c9>] aio_write+0x229/0x280
[  453.533011]  [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
[  453.533011]  [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
[  453.533011]  [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
[  453.533011]  [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
[  453.533011]  [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
[  453.533011]  [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
[  453.533011]  [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
[  453.533011]  [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
[  453.533011]  [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
[  453.533011]  [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[  453.533011]  [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
[  453.533011]  [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
[  453.533011]  [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
[  453.533011] ---[ end trace b2fbe664d1cc0082 ]---

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-14 19:31:40 -05:00
Linus Torvalds
f0ad17712b dmaengine-4.10-rc4
dmaengine fixes for 4.10-rc4
 
 The fixes this time around are spread over drivers, pretty normal update.
 
  o PCI ID for SKL ioatdma, workaround for SKX and ioat_alloc_chan_resources
    sleepy allocation fix.
  o dw kconfig typo fix
  o null pointer deref for stm32
  o MAINTAINERS Update for at_hdmac
  o pl330 runtime pm fixes
  o omap-dma port window fix
  o rcar-dmac unmap slave resource fix.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYekAeAAoJEHwUBw8lI4NHyWAQAMQxH4dOms2p4PuM65HCwJFW
 BvEsljmvn+XmlqwhaMxrwGcyMPUpVSkQUkMQZYa9hMAH4fwWLIAemxdHYR9WpXL4
 YsVz9pAgDs6G3XbunhMA6mM5vyR18uKkzUaLWtU/z4nw45RWe7NkG9Gw7k2gWKfK
 rNYIOYts4Ue3sPXsYjYn4mdHkWslFcaKuxxRZGpKAmF/V7i7m5xg6WVEykVRK7EL
 0e944zdAwmRkc3dcW+SWmZqbZwp+Sjj0Vn5kCbTEoItUQN40VnxOK25ZN+YAxpxg
 28pljjTfo0LRW0rgNUGulFhg0ssMns4J3gArfqbzRvDWOFCH6MZ1e1CwbLmZLF9U
 rK/escqsKuXFG76EstdZzfjq0H15vfF1backq23ww6WXr9w4TAUwUJR7mwShVP+O
 9WTkfjwlXEnUoixutfsfTr7YSdztCBXB34BLJxNKItl5Mm7N923BTPW06e9J7S61
 Fiv8E85U4WnMe1BLBPNK550Zz6YUAtqfBeYBFWZExury9jSX5tIjeoEbaxDO+FEJ
 fvWcjFoHZ2nQfigyafaoEIkpvjnTgveSs39tPWhTat/pucCb7za/aiA7syR9oGF2
 kRa7++5TKt7q+hr1KOzvK3TNgI8fVItBd0Asi0LejIdJa4OoxAPNa7EuRLNh+vZG
 YQoKllPDYToI/RowjGKW
 =pJ7d
 -----END PGP SIGNATURE-----

Merge tag 'dmaengine-fix-4.10-rc4' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
 "The fixes this time around are spread over drivers, pretty normal
  update:

   - PCI ID for SKL ioatdma, workaround for SKX and
     ioat_alloc_chan_resources sleepy allocation fix

   - dw kconfig typo fix

   - null pointer deref for stm32

   - MAINTAINERS Update for at_hdmac

   - pl330 runtime pm fixes

   - omap-dma port window fix

   - rcar-dmac unmap slave resource fix"

* tag 'dmaengine-fix-4.10-rc4' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: rcar-dmac: unmap slave resource when channel is freed
  dmaengine: omap-dma: Fix the port_window support
  dmaengine: iota: ioat_alloc_chan_resources should not perform sleeping allocations.
  dmaengine: pl330: Fix runtime PM support for terminated transfers
  MAINTAINERS: dmaengine: Update + Hand over the at_hdmac driver to Ludovic
  dmaengine: omap-dma: Fix dynamic lch_map allocation
  dmaengine: ti-dma-crossbar: Add some 'of_node_put()' in error path.
  dmaengine: stm32-dma: Fix null pointer dereference in stm32_dma_tx_status
  dmaengine: stm32-dma: Set correct args number for DMA request from DT
  dmaengine: dw: fix typo in Kconfig
  dmaengine: ioatdma: workaround SKX ioatdma version
  dmaengine: ioatdma: Add Skylake PCI Dev ID
2017-01-14 11:09:24 -08:00
David S. Miller
bb60b8b35a For 4.11, we seem to have more than in the past few releases:
* socket owner support for connections, so when the wifi
    manager (e.g. wpa_supplicant) is killed, connections are
    torn down - wpa_supplicant is critical to managing certain
    operations, and can opt in to this where applicable
  * minstrel & minstrel_ht updates to be more efficient (time and space)
  * set wifi_acked/wifi_acked_valid for skb->destructor use in the
    kernel, which was already available to userspace
  * don't indicate new mesh peers that might be used if there's no
    room to add them
  * multicast-to-unicast support in mac80211, for better medium usage
    (since unicast frames can use *much* higher rates, by ~3 orders of
    magnitude)
  * add API to read channel (frequency) limitations from DT
  * add infrastructure to allow randomizing public action frames for
    MAC address privacy (still requires driver support)
  * many cleanups and small improvements/fixes across the board
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYeKu7AAoJEGt7eEactAAdwjEP/RA4bXFMfkC7qUJ++cLrMMwY
 yCvjb8+ULWL2wbCzpfY37acbGJgot3DNoQJzrO2jMQPqyM9nRlTMg5aF49cI7t62
 gU6daNKJaGBe/0yeG7lTJ4n5UtVCDtN45hGc06Yert+ewb9njiJf+XYrtCWetsIJ
 5bOLYQKPWOz/7UyMH7uJ25zrPFaiA3y7XnXKPEudagG/EwEq9ZuUpSSfLwEAEBPi
 6i/2w4fLj32vXRsQMvQT0sU6mjd+1ub8Is7w5l2F06iWwNYPzdSM0IbU+E+ie2tk
 sE6RA70c4ILrp8KisTAz2lJPa4XEpFkLhI3lzRRy8CVzjyyo/OJen92zvr2R7TVb
 /uZG9qfRQ3UitQmgeKd+wS8PsbRAyWUR/xhNxD2r7zARH2vliwyneU+zEpXLeGA1
 Y4PrN1+Fk45Ye4/4XSbPO4cf1MHX7qinN4rjrpsJKPwoYD/gQ1cZvef4AbaKPvq6
 oCKRVrwNoUuSB8NTcMLPqze3WCfhnJyVUhCZTyzHeW4uG81qrHwrvBvM25vcWGcm
 CcSWFktFIpuGML4FCU3byZfb0NkmJtpCD4n7P98WFPGjvsWIEVCMckqlC8x1F7B7
 BqqjGS2mGA17Xy0uLfmN/JempesQJnZhnAnFERdyX1S1YQuKhLwEu7OsYegnStDL
 Cn1wFw2/qcgeTkJfBICB
 =UToW
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2017-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
For 4.11, we seem to have more than in the past few releases:
 * socket owner support for connections, so when the wifi
   manager (e.g. wpa_supplicant) is killed, connections are
   torn down - wpa_supplicant is critical to managing certain
   operations, and can opt in to this where applicable
 * minstrel & minstrel_ht updates to be more efficient (time and space)
 * set wifi_acked/wifi_acked_valid for skb->destructor use in the
   kernel, which was already available to userspace
 * don't indicate new mesh peers that might be used if there's no
   room to add them
 * multicast-to-unicast support in mac80211, for better medium usage
   (since unicast frames can use *much* higher rates, by ~3 orders of
   magnitude)
 * add API to read channel (frequency) limitations from DT
 * add infrastructure to allow randomizing public action frames for
   MAC address privacy (still requires driver support)
 * many cleanups and small improvements/fixes across the board
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-14 12:02:15 -05:00
Shyam Saini
ca4b5eb88a cxgb4: Remove redundant memset before memcpy
The region set by the call to memset, immediately overwritten by
the subsequent call to memcpy and thus makes the  memset redundant.

Also remove the memset((&info, 0, sizeof(info)) on line 398 because
info is memcpy()'ed to before being used in the loop and it isn't
used outside of the loop.

Signed-off-by: Shyam Saini <mayhs11saini@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-14 12:00:00 -05:00
Peter Jones
0100a3e67a efi/x86: Prune invalid memory map entries and fix boot regression
Some machines, such as the Lenovo ThinkPad W541 with firmware GNET80WW
(2.28), include memory map entries with phys_addr=0x0 and num_pages=0.

These machines fail to boot after the following commit,

  commit 8e80632fb2 ("efi/esrt: Use efi_mem_reserve() and avoid a kmalloc()")

Fix this by removing such bogus entries from the memory map.

Furthermore, currently the log output for this case (with efi=debug)
looks like:

 [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |  |  |  |  |  ] range=[0x0000000000000000-0xffffffffffffffff] (0MB)

This is clearly wrong, and also not as informative as it could be.  This
patch changes it so that if we find obviously invalid memory map
entries, we print an error and skip those entries.  It also detects the
display of the address range calculation overflow, so the new output is:

 [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
 [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[0x0000000000000000-0x0000000000000000] (invalid)

It also detects memory map sizes that would overflow the physical
address, for example phys_addr=0xfffffffffffff000 and
num_pages=0x0200000000000001, and prints:

 [    0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
 [    0.000000] efi: mem45: [Reserved           |   |  |  |  |  |  |  |   |  |  |  |  ] range=[phys_addr=0xfffffffffffff000-0x20ffffffffffffffff] (invalid)

It then removes these entries from the memory map.

Signed-off-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[ardb: refactor for clarity with no functional changes, avoid PAGE_SHIFT]
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
[Matt: Include bugzilla info in commit log]
Cc: <stable@vger.kernel.org> # v4.9+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=191121
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 16:48:53 +01:00
Greg Kroah-Hartman
c7334ce814 Revert "driver core: Add deferred_probe attribute to devices in sysfs"
This reverts commit 6751667a29.

Rob Herring objected to it, and a replacement for it will be added using
debugfs in the future.

Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-14 14:09:03 +01:00
Jiri Olsa
18e7a45af9 perf/x86: Reject non sampling events with precise_ip
As Peter suggested [1] rejecting non sampling PEBS events,
because they dont make any sense and could cause bugs
in the NMI handler [2].

  [1] http://lkml.kernel.org/r/20170103094059.GC3093@worktop
  [2] http://lkml.kernel.org/r/1482931866-6018-3-git-send-email-jolsa@kernel.org

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170103142454.GA26251@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:50 +01:00
Jiri Olsa
475113d937 perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:

  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25

Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.

We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:49 +01:00
Peter Zijlstra
321027c1fe perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.

The problem is exactly that described in commit:

  f63a8daa58 ("perf: Fix event->ctx locking")

... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.

That very same commit failed to recognise sys_perf_event_context() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.

So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).

Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.

Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias <joaodias@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Min Chong <mchong@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f63a8daa58 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 10:56:11 +01:00
Peter Zijlstra
63cae12bce perf/core: Fix sys_perf_event_open() vs. hotplug
There is problem with installing an event in a task that is 'stuck' on
an offline CPU.

Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
blocked task doesn't run and doesn't require a CPU etc.. Only on
wakeup do we ammend the situation and place the task on a available
CPU.

If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online, if the task
waking depends on the event being installed, we're stuck.

While looking into this issue, I also spotted another problem, if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is we observe the old CPU before sending the IPI,
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.

Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':

CPU0            CPU1            CPU2

                (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                switch(t, n);
                                migrate(t, 2);
                                switch(p, t);

                                ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                generic_exec_single()
                  func();
                    spin_lock(ctx->lock);
                    if (task_curr(t)) // false

                    add_event_to_ctx();
                    spin_unlock(ctx->lock);

                                perf_event_context_sched_in();
                                  spin_lock(ctx->lock);
                                  // sees event

So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.

I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:

  C C-peterz

  {
  }

  P0(int *x, int *y)
  {
          int r1;

          WRITE_ONCE(*x, 1);
          smp_mb();
          r1 = READ_ONCE(*y);
  }

  P1(int *y, int *z)
  {
          WRITE_ONCE(*y, 1);
          smp_store_release(z, 1);
  }

  P2(int *x, int *z)
  {
          int r1;
          int r2;

          r1 = smp_load_acquire(z);
	  smp_mb();
          r2 = READ_ONCE(*x);
  }

  exists
  (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Where:
  x is perf_event_ctxp[],
  y is our tasks's CPU, and
  z is our task being placed on the rq of CPU2.

The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().

The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.

This litmus test evaluates into:

  Test C-peterz Allowed
  States 7
  0:r1=0; 2:r1=0; 2:r2=0;
  0:r1=0; 2:r1=0; 2:r2=1;
  0:r1=0; 2:r1=1; 2:r2=1;
  0:r1=1; 2:r1=0; 2:r2=0;
  0:r1=1; 2:r1=0; 2:r2=1;
  0:r1=1; 2:r1=1; 2:r2=0;
  0:r1=1; 2:r1=1; 2:r2=1;
  No
  Witnesses
  Positive: 0 Negative: 7
  Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
  Observation C-peterz Never 0 7
  Hash=e427f41d9146b2a5445101d3e2fcaa34

And the strong and weak model agree.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 10:56:10 +01:00
Tobias Klauser
4538286257 x86/mpx: Use compatible types in comparison to fix sparse error
info->si_addr is of type void __user *, so it should be compared against
something from the same address space.

This fixes the following sparse error:

  arch/x86/mm/mpx.c:296:27: error: incompatible types in comparison expression (different address spaces)

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:32:06 +01:00
Len Brown
695085b4bc x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
The Intel Denverton microserver uses a 25 MHz TSC crystal,
so we can derive its exact [*] TSC frequency
using CPUID and some arithmetic, eg.:

  TSC: 1800 MHz (25000000 Hz * 216 / 3 / 1000000)

[*] 'exact' is only as good as the crystal, which should be +/- 20ppm

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/306899f94804aece6d8fa8b4223ede3b48dbb59c.1484287748.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:30:37 +01:00
Ganesh Goudar
f750e82e7f cxgb4: Fix misleading packet/frame count stats.
Do not count pause frames as part of general TX/RX frame
counters.

Based on the original work of Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:35:36 -05:00
David S. Miller
4b89aa3c16 Merge branch 'bnxt_en-next'
Michael Chan says:

====================
bnxt_en: Misc. updates for net-next.

Miscellaneous updates including firmware spec update, ethtool -p blinking
LED support, RDMA SRIOV config callback, and minor fixes.

v2: Dropped the DCBX RoCE app TLV patch until the ETH_P_IBOE RDMA patch
is merged.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:21:31 -05:00
Michael Chan
2f5938467b bnxt_en: Add the ulp_sriov_cfg hooks for bnxt_re RDMA driver.
Add the ulp_sriov_cfg callbacks when the number of VFs is changing.  This
allows the RDMA driver to provision RDMA resources for the VFs.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:21:31 -05:00
Michael Chan
5ad2cbeed7 bnxt_en: Add support for ethtool -p.
Add LED blinking code to support ethtool -p on the PF.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:21:31 -05:00
Michael Chan
f183886c0d bnxt_en: Update to firmware interface spec to 1.6.1.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:21:31 -05:00
Michael Chan
341138c3e6 bnxt_en: Clear TPA flags when BNXT_FLAG_NO_AGG_RINGS is set.
Commit bdbd1eb59c ("bnxt_en: Handle no aggregation ring gracefully.")
introduced the BNXT_FLAG_NO_AGG_RINGS flag.  For consistency,
bnxt_set_tpa_flags() should also clear TPA flags when there are no
aggregation rings.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:21:31 -05:00
Michael Chan
b742995445 bnxt_en: Fix compiler warnings when CONFIG_RFS_ACCEL is not defined.
CC [M]  drivers/net/ethernet/broadcom/bnxt/bnxt.o
drivers/net/ethernet/broadcom/bnxt/bnxt.c:4947:21: warning: ‘bnxt_get_max_func_rss_ctxs’ defined but not used [-Wunused-function]
 static unsigned int bnxt_get_max_func_rss_ctxs(struct bnxt *bp)
                     ^
  CC [M]  drivers/net/ethernet/broadcom/bnxt/bnxt.o
drivers/net/ethernet/broadcom/bnxt/bnxt.c:4956:21: warning: ‘bnxt_get_max_func_vnics’ defined but not used [-Wunused-function]
 static unsigned int bnxt_get_max_func_vnics(struct bnxt *bp)
                     ^

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 23:21:31 -05:00
David S. Miller
718e14bb29 Merge branch 'tcp-RACK-fast-recovery'
Yuchung Cheng says:

====================
tcp: RACK fast recovery

The patch set enables RACK loss detection (draft-ietf-tcpm-rack-01)
to trigger fast recovery with a reordering timer.

Previously RACK has been running in auxiliary mode where it is
used to detect packet losses once the recovery has triggered by
other algorithms (e.g., FACK). By inspecting packet timestamps,
RACK can start ACK-driven repairs timely. A few similar heuristics
are no longer needed and are either removed or disabled to reduce
the complexity of the Linux TCP loss recovery engine:

  1. FACK (Forward Acknowledgement)
  2. Early Retransmit (RFC5827)
  3. thin_dupack (fast recovery on single DUPACK for thin-streams)
  4. NCR (Non-Congestion Robustness RFC4653) (RFC4653)
  5. Forward Retransmit

After this change, Linux's loss recovery algorithms consist of
  1. Conventional DUPACK threshold approach (RFC6675)
  2. RACK and Tail Loss Probe (draft-ietf-tcpm-rack-01)
  3. RTO plus F-RTO extension (RFC5682)

The patch set has been tested on Google servers extensively and
presented in several IETF meetings. The data suggests that RACK
successfully improves recovery performance:
https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-draft-ietf-tcpm-rack-01.pdf
https://www.ietf.org/proceedings/96/slides/slides-96-tcpm-3.pdf
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:18 -05:00
Yuchung Cheng
94bdc9785a tcp: disable fack by default
This patch disables FACK by default as RACK is the successor of FACK
(inspired by the insights behind FACK).

FACK[1] in Linux works as follows: a packet P is deemed lost,
if packet Q of higher sequence is s/acked and P and Q are distant
by at least dupthresh number of packets in sequence space.

FACK is more aggressive than the IETF recommened recovery for SACK
(RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
 Recovery Algorithm for TCP), because a single SACK may trigger
fast recovery. This obviously won't work well with reordering so
FACK is dynamically disabled upon detecting reordering.

RACK supersedes FACK by using time distance instead of sequence
distance. On reordering, RACK waits for a quarter of RTT receiving
a single SACK before starting recovery. (the timer can be made more
adaptive in the future by measuring reordering distance in time,
but currently RTT/4 seem to work well.) Once the recovery starts,
RACK behaves almost like FACK because it reduces the reodering
window to 1ms, so it fast retransmits quickly. In addition RACK
can detect loss retransmission as it does not care about the packet
sequences (being repeated or not), which is extremely useful when
the connection is going through a traffic policer.

Google server experiments indicate that disabling FACK after enabling
RACK has negligible impact on the overall loss recovery performance
with more reordering events detected.  But we still keep the FACK
implementation for backup if RACK has bugs that needs to be disabled.

[1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
4a7f600944 tcp: remove thin_dupack feature
Thin stream DUPACK is to start fast recovery on only one DUPACK
provided the connection is a thin stream (i.e., low inflight).  But
this older feature is now subsumed with RACK. If a connection
receives only a single DUPACK, RACK would arm a reordering timer
and soon starts fast recovery instead of timeout if no further
ACKs are received.

The socket option (THIN_DUPACK) is kept as a nop for compatibility.
Note that this patch does not change another thin-stream feature
which enables linear RTO. Although it might be good to generalize
that in the future (i.e., linear RTO for the first say 3 retries).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
ac229dca7e tcp: remove RFC4653 NCR
This patch removes the (partial) implementation of the aggressive
limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).

NCR is a mitigation to the problem created by the dynamic
DUPACK threshold.  With the current adaptive DUPACK threshold
(tp->reordering) could cause timeouts by preventing fast recovery.
For example, if the last packet of a cwnd burst was reordered, the
threshold will be set to the size of cwnd. But if next application
burst is smaller than threshold and has drops instead of reorderings,
the sender would not trigger fast recovery but instead resorts to a
timeout recovery.

NCR mitigates this issue by checking the number of DUPACKs against
the current flight size additionally. The techniqueue is similar to
the early retransmit RFC.

With RACK loss detection, this mitigation is not needed, because RACK
does not use DUPACK threshold to detect losses. RACK arms a reordering
timer to fire at most a quarter RTT later to start fast recovery.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
bec41a11dd tcp: remove early retransmit
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
840a3cbe89 tcp: remove forward retransmit feature
Forward retransmit is an esoteric feature in RFC3517 (condition(3)
in the NextSeg()). Basically if a packet is not considered lost by
the current criteria (# of dupacks etc), but the congestion window
has room for more packets, then retransmit this packet.

However it actually conflicts with the rest of recovery design. For
example, when reordering is detected we want to be conservative
in retransmitting packets but forward-retransmit feature would
break that to force more retransmission. Also the implementation is
fairly complicated inside the retransmission logic inducing extra
iterations in the write queue. With RACK losses are being detected
timely and this heuristic is no longer necessary. There this patch
removes the feature.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
89fe18e44f tcp: extend F-RTO to catch more spurious timeouts
Current F-RTO reverts cwnd reset whenever a never-retransmitted
packet was (s)acked. The timeout can be declared spurious because
the packets acknoledged with this ACK was transmitted before the
timeout, so clearly not all the packets are lost to reset the cwnd.

This nice detection does not really depend F-RTO internals. This
patch applies the detection universally. On Google servers this
change detected 20% more spurious timeouts.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
a0370b3f3f tcp: enable RACK loss detection to trigger recovery
This patch changes two things:

1. Start fast recovery with RACK in addition to other heuristics
   (e.g., DUPACK threshold, FACK). Prior to this change RACK
   is enabled to detect losses only after the recovery has
   started by other algorithms.

2. Disable TCP early retransmit. RACK subsumes the early retransmit
   with the new reordering timer feature. A latter patch in this
   series removes the early retransmit code.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
98e36d449c tcp: check undo conditions before detecting losses
Currently RACK would mark loss before the undo operations in TCP
loss recovery. This could incorrectly identify real losses as
spurious. For example a sender first experiences a delay spike and
then eventually some packets were lost due to buffer overrun.
In this case, the sender should perform fast recovery b/c not all
the packets were lost.

But the sender may first trigger a (spurious) RTO and reset
cwnd to 1. The following ACKs may used to mark real losses by
tcp_rack_mark_lost. Then in tcp_process_loss this ACK could trigger
F-RTO undo condition and unmark real losses and revert the cwnd
reduction. If there are no more ACKs coming back, eventually the
sender would timeout again instead of performing fast recovery.

The patch fixes this incorrect process by always performing
the undo checks before detecting losses.

Fixes: 4f41b1c58a ("tcp: use RACK to detect losses")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
1d0833df59 tcp: use sequence to break TS ties for RACK loss detection
The packets inside a jumbo skb (e.g., TSO) share the same skb
timestamp, even though they are sent sequentially on the wire. Since
RACK is based on time, it can not detect some packets inside the
same skb are lost.  However, we can leverage the packet sequence
numbers as extended timestamps to detect losses. Therefore, when
RACK timestamp is identical to skb's timestamp (i.e., one of the
packets of the skb is acked or sacked), we use the sequence numbers
of the acked and unacked packets to break ties.

We can use the same sequence logic to advance RACK xmit time as
well to detect more losses and avoid timeout.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
57dde7f70d tcp: add reordering timer in RACK loss detection
This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accomodate reordering.

It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to

  RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

This translates to expecting a packet (Packet) should take
(RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.

When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.

The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hang the timer. The next jiffy is to lower-bound the timeout
to 2 jiffies when reo_wnd is < 1ms.

When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
deed7be78f tcp: record most recent RTT in RACK loss detection
Record the most recent RTT in RACK. It is often identical to the
"ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
been retransmitted, RACK choses to believe the ACK is for the
(latest) retransmitted packet if the RTT is over minimum RTT.

This requires passing the arrival time of the most recent ACK to
RACK routines. The timestamp is now recorded in the "ack_time"
in tcp_sacktag_state during the ACK processing.

This patch does not change the RACK algorithm itself. It only adds
the RTT variable to prepare the next main patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
e636f8b010 tcp: new helper for RACK to detect loss
Create a new helper tcp_rack_detect_loss to prepare the upcoming
RACK reordering timer patch.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
db8da6bb57 tcp: new helper function for RACK loss detection
Create a new helper tcp_rack_mark_skb_lost to prepare the
upcoming RACK reordering timer support.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Linus Torvalds
e96f8f18c8 Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are all over the place.

  The tracepoint part of the pull fixes a crash and adds a little more
  information to two tracepoints, while the rest are good old fashioned
  fixes"

* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: make tracepoint format strings more compact
  Btrfs: add truncated_len for ordered extent tracepoints
  Btrfs: add 'inode' for extent map tracepoint
  btrfs: fix crash when tracepoint arguments are freed by wq callbacks
  Btrfs: adjust outstanding_extents counter properly when dio write is split
  Btrfs: fix lockdep warning about log_mutex
  Btrfs: use down_read_nested to make lockdep silent
  btrfs: fix locking when we put back a delayed ref that's too new
  btrfs: fix error handling when run_delayed_extent_op fails
  btrfs: return the actual error value from  from btrfs_uuid_tree_iterate
2017-01-13 17:40:22 -08:00
Linus Torvalds
04e396277b Two small fixups for the filesystem changes that went into this merge
window.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYeQymAAoJEEp/3jgCEfOLLVsH/28qRsjVPWr5JuL1SF86//kd
 rAi7QUfbNgXHqbb10a9za9pNuLhHr3kImIfvQ04wYiYQY+IaAapiRXwQev8BsNAa
 yENUc8XwNgydw4FU1ia5PkGOJLDtujtfgjWT2v+gf1HUzLaV6alBzqDwUZBt3xJz
 mlYC82oFkXPa0BFmLUXtT/jJu/ZI8caO4KB34/UKi7LjBQk1ca7E2xVUoDtdQmEm
 ciPE98akU4JiB99aOgGdwemBzkAMHEGQpImTzqHr/tbIUj0MqVAjH9FVOhRCbjMy
 6MSR+U9yUzJkBzefS5enijAoExVc8cD/A0nIaKGVb6qWrIrk51/Opl6iILeVLUo=
 =28cq
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two small fixups for the filesystem changes that went into this merge
  window"

* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
  ceph: fix get_oldest_context()
  ceph: fix mds cluster availability check
2017-01-13 17:38:05 -08:00
Linus Torvalds
af54efa4f5 VFIO fixes for v4.10-rc4
- Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)
  - Export and make use of has_capability() to fix incorrect use of
    ns_capable() for testing task capabilities (Jike Song)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJYeTWfAAoJECObm247sIsi1PEP/0lIkIQWBUlVWC1QA6bJN0Xx
 9c4pA34kLJwCpEtEoPxf6owjgK7kSBgIUUqBaNNDdKZQYttGgA+qiX3HhuvEigKL
 vEq5/TqwL6vv2aIUp/5uPP4NNTJD8RynwkfDI1B8DVQN6E1GM2zozpFUiZbDUxz/
 sgIuby9nuG3WTVLgOVayyMHlPTXG1+l+quRlAhMAseD7LMx7q/71NIjKggSUFRQG
 fkOVVTqfCnLJmIyq/cWbJt2cDgeWQq2/Ik6gje3SiOFtxi8fRdlzONUL+tHM1KgT
 r0htrq+r3B7BxI0CMZuoHIBt1SK443yu39xDzb0iXDSb5W9gwR14uFMuXv1ftfM0
 qkZnvpsXaT6wpKvK2ztmHgUiKJmOTgYrG77Dhz4oz6Mm0Y1mn6bV4yueoF/rQIn0
 GrM1Af/SVLf3Vhxw6i5a1s7kDgpySw8FfucKO5Xv3cOaIgNtlrrjxbKKa9DZ3wd7
 mnjD30XHwxEim8OCgv7CFswPsc5TiqYJTKGbnSJGo67ZCXWxXFHLIab0cn5yMd8G
 Qgw4mLnIv2rkRZOWpgMy4PedCNjZXNuQbW3I90kDb/VlPvRdCqUIsO0Ty10yaNhe
 s8Gwmxphoi3U/J7Y4T/BsfkCZ4Umut9gAt/WsG4kgWj3v0FOmxLgl39lC0cRigR6
 l7HSf0fOg/D9k6EN1xnc
 =yeS9
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio

Pull VFIO fixes from Alex Williamson:

 - Cleanups and bug fixes for the mtty sample driver (Dan Carpenter)

 - Export and make use of has_capability() to fix incorrect use of
   ns_capable() for testing task capabilities (Jike Song)

* tag 'vfio-v4.10-rc4' of git://github.com/awilliam/linux-vfio:
  vfio/type1: Remove pid_namespace.h include
  vfio iommu type1: fix the testing of capability for remote task
  capability: export has_capability
  vfio-mdev: remove some dead code
  vfio-mdev: buffer overflow in ioctl()
  vfio-mdev: return -EFAULT if copy_to_user() fails
2017-01-13 17:35:43 -08:00
Satanand Burla
7410191afc liquidio: use fallback for selecting txq
Remove assignment to ndo_select_queue so that fallback is used for
selecting txq.  Also remove the now-useless function that used to be
assigned to ndo_select_queue.

Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 20:17:35 -05:00
Vivien Didelot
98fc3c6fa5 net: dsa: mv88e6xxx: add EEPROM support to 6390
The Marvell 6352 chip has a 8-bit address/16-bit data EEPROM access.
The Marvell 6390 chip has a 16-bit address/8-bit data EEPROM access.

This patch implements the 8-bit data EEPROM access in the mv88e6xxx
driver and adds its support to chips of the 6390 family.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 20:17:01 -05:00
Linus Torvalds
406732c932 * fix for module unload vs. deferred jump labels (note: there might be
other buggy modules!)
 * two NULL pointer dereferences from syzkaller
 * CVE from syzkaller, very serious on 4.10-rc, "just" kernel memory
   leak on releases
 * CVE from security@kernel.org, somewhat serious on AMD, less so on
   Intel
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJYd7l5AAoJEL/70l94x66DLWYH/0GUg+lK9J/gj0kwqi6BwsOP
 Rrs5Y7XvyNLsy/piBrrHDHvRa+DfAkrU8nepwgygX/yuGmSDV/zmdIb8XA/dvKht
 MN285NFlVjTyznYlU/LH3etx11CHLMNclishiFHQbcnohtvhOe+fvN6RVNdfeRxm
 d9iBPOum15ikc1xDl2z8Op+ZXVjMxkgLkzIXFcDBpJf4BvUx0X+ZHZXIKdizVhgU
 ZMD2ds/MutMB8X1A52qp6kQvT7xE4rp87M0So4qDMTbAto5G4ZmMaWC5MlK2Oxe/
 o+3qnx4vVz4H6uYzg1N4diHiC+buhgtXCLwwkcUOKKUVqJRP9e0Bh7kw8JA52XU=
 =C+tM
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - fix for module unload vs deferred jump labels (note: there might be
   other buggy modules!)

 - two NULL pointer dereferences from syzkaller

 - also syzkaller: fix emulation of fxsave/fxrstor/sgdt/sidt, problem
   made worse during this merge window, "just" kernel memory leak on
   releases

 - fix emulation of "mov ss" - somewhat serious on AMD, less so on Intel

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: fix emulation of "MOV SS, null selector"
  KVM: x86: fix NULL deref in vcpu_scan_ioapic
  KVM: eventfd: fix NULL deref irqbypass consumer
  KVM: x86: Introduce segmented_write_std
  KVM: x86: flush pending lapic jump label updates on module unload
  jump_labels: API for flushing deferred jump label updates
2017-01-13 17:06:24 -08:00
Linus Torvalds
a65c92597d - Fix huge_ptep_set_access_flags() to return "changed" when any of the
ptes in the contiguous range is changed, not just the last one
 
 - Fix the adr_l assembly macro to work in modules under KASLR
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYeRmYAAoJEGvWsS0AyF7x6tYQAJd0Rtb88kwalDYb/kMcutBU
 3xyjvb8mIEKtnMOP1wS4o3YdqD6ke9OMCUm2EAwhAxgkfzwklsDOOOUlWsDijif2
 X3TzYWoKVgoje3oFODXOHMZNLqU6lBmuVN6G4ZdVPsTfvntTLE4cn9q828OgLdtB
 L1H+cRkHMhO9w4a0VxZFsNWtSDs4UugGLUp/cNLA4gXFj4atw8+bgX9o7BsmCb1d
 x+rd3LDWJb+a1YFKhKJkLQO+uQKk3n7d1WQ0DrQeDBgPs4uzMx422WpfmoW+j/dq
 MV/6C8ZYtQczS4BKp8k9apFHq3SC0bZcPLhtXqf/NZZCCLvDKS0iPflDAArYmIHo
 mOnmYhw+SeGc0llp9+tDaReco71HAqzdlpYnhGEePDEc0ZXBBr4/xqAwQoY4tgWa
 uZLSGZuiGqCFovzLb+LMLEtQlFyu48w+Y4Ct6r0M9gmRmU6d8msoEvXkA2IB/q8z
 JGFdFkJ1ZD8MtabRqUzYhuqn7WD+aC5eA3uqImnPjcrqNaYaiSy8Wif6vO+7asz5
 1YWyEaLuL9rITllunTQuK0crgZGjplwhGKYASz/w82AZebBeTl84adK/x7jrJgbn
 BPxQRHg4LqoX7i6tU3KWc/ulbE8EzOeJabCcKN8HnkPvt2akgKh/nlH3NQVLpG0l
 c/ffN90w3+fK7pNQKYnu
 =aUnr
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

 - Fix huge_ptep_set_access_flags() to return "changed" when any of the
   ptes in the contiguous range is changed, not just the last one

 - Fix the adr_l assembly macro to work in modules under KASLR

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: assembler: make adr_l work in modules under KASLR
  arm64: hugetlb: fix the wrong return value for huge_ptep_set_access_flags
2017-01-13 17:00:42 -08:00
Christoph Hellwig
bef13315e9 block: don't try to discard from __blkdev_issue_zeroout
Discard can return -EIO asynchronously if the alignment for the request
isn't suitable for the driver, which makes a proper fallback to other
methods in __blkdev_issue_zeroout impossible.  Thus only issue a sync
discard from blkdev_issue_zeroout an don't try discard at all from
__blkdev_issue_zeroout as a non-invasive workaround.

One more reason why abusing discard for zeroing must die..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:18:16 -07:00
Christoph Hellwig
f80de881d8 sd: remove __data_len hack for WRITE SAME
Now that we have the blk_rq_payload_bytes helper available to determine
the actual I/O size we don't need to mess around with __data_len for
WRITE SAME.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Christoph Hellwig
b131c61d62 nvme: use blk_rq_payload_bytes
The new blk_rq_payload_bytes generalizes the payload length hacks
that nvme_map_len did before.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Christoph Hellwig
fd102b125e scsi: use blk_rq_payload_bytes
Without that we'll pass a wrong payload size in cmd->sdb, which
can lead to hangs with drivers that need the total transfer size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Chris Valean <v-chvale@microsoft.com>
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: f9d03f96 ("block: improve handling of the magic discard payload")
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Christoph Hellwig
2e3258ecfa block: add blk_rq_payload_bytes
Add a helper to calculate the actual data transfer size for special
payload requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-13 15:17:04 -07:00
Linus Torvalds
c79d47f14f SCSI fixes on 20170113
The major fix is the bfa firmware, since the latest 10Gb cards fail
 probing with the current firmware.  The rest is a set of minor fixes:
 one missed Kconfig dependency causing randconfig failures, a missed
 error return on an error leg, a change for how multiqueue waits on a
 blocked device and a don't reset while in reset fix.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJYeOuBAAoJEAVr7HOZEZN4TxEP/j3nlFKrs7XEhlbbZrezyHkA
 RvEWAkpoSXpQiVvuqHcObOX5Jh9I8ndBMRh2MyS2zLV4PA4SzwrprgemaB5rL9+j
 +5IIw7MdHY5SHP0IrvpTae+c1kOnHRfSdsLORqohB9uhqJiSbwhXgV0Q19hRzttA
 qmjD5RBI4sY3a+/CjGQevaM2Y/WQqCp4+J5kvclr3AmEoTjMkAiOsjbSuclFHWX9
 CxNKSMv/6z34ZgqS0FgqCrCAI5kO5UHrjznslcEtgVECIOTrNer3g5e75DhQOcD6
 Mhklxbx1dfG9r3h3oush8Zy9IIJoXZFZLRNmhhX7zcKkuYUaR9LU2COUVCTOEE+s
 SAtp4qdXxCZ8sFyHS+E9ajTYSeTw05fmGEbAu1dqJ8FUxkRnvsWouCiw/TI+ps94
 mJSiS2UGHAYxpLdWFbnKC+2CaCG32ygXINUcZbIItmT2cdA+ERAPAO6tDegjgb0S
 MoeRxAlUGfUehHReCppRV3igwjtw2vtQaY5hFq1+v8n00Ezt6kDNeeTanqobMrHr
 71rOZAos9aT5wa39pmbW18M3pfUB1qeuGn6di4ArYk8f9ILAMI4qykvnDHH3INcn
 BdCIY3SgRo/Zq+87vFWZr+56CcbVqq0GV38FKoRZcHglyOjDw8ibQY6elcsjBtJC
 S1Z5wlkv1T9tfA+8mRX6
 =kjPw
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "The major fix is the bfa firmware, since the latest 10Gb cards fail
  probing with the current firmware.

  The rest is a set of minor fixes: one missed Kconfig dependency
  causing randconfig failures, a missed error return on an error leg, a
  change for how multiqueue waits on a blocked device and a don't reset
  while in reset fix"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: bfa: Increase requested firmware version to 3.2.5.1
  scsi: snic: Return error code on memory allocation failure
  scsi: fnic: Avoid sending reset to firmware when another reset is in progress
  scsi: qedi: fix build, depends on UIO
  scsi: scsi-mq: Wait for .queue_rq() if necessary
2017-01-13 12:38:36 -08:00
Linus Torvalds
6d90b4f99d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
 "Small driver fixups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: elants_i2c - avoid divide by 0 errors on bad touchscreen data
  Input: adxl34x - make it enumerable in ACPI environment
  Input: ALPS - fix TrackStick Y axis handling for SS5 hardware
  Input: synaptics-rmi4 - fix F03 build error when serio is module
  Input: xpad - use correct product id for x360w controllers
  Input: synaptics_i2c - change msleep to usleep_range for small msecs
  Input: i8042 - add Pegatron touchpad to noloop table
  Input: joydev - remove unused linux/miscdevice.h include
2017-01-13 11:49:34 -08:00
Trond Myklebust
c6180a6237 NFSv4: Fix client recovery when server reboots multiple times
If the server reboots multiple times, the client should rely on the
server to tell it that it cannot reclaim state as per section 9.6.3.4
in RFC7530 and section 8.4.2.1 in RFC5661.
Currently, the client is being to conservative, and is assuming that
if the server reboots while state recovery is in progress, then it must
ignore state that was not recovered before the reboot.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-13 13:31:32 -05:00
Shannon Nelson
003c941057 tcp: fix tcp_fastopen unaligned access complaints on sparc
Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.

This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 12:31:24 -05:00
David Lebrun
fa79581ea6 ipv6: sr: fix several BUGs when preemption is enabled
When CONFIG_PREEMPT=y, CONFIG_IPV6=m and CONFIG_SEG6_HMAC=y,
seg6_hmac_init() is called during the initialization of the ipv6 module.
This causes a subsequent call to smp_processor_id() with preemption
enabled, resulting in the following trace.

[   20.451460] BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1
[   20.452556] caller is debug_smp_processor_id+0x17/0x19
[   20.453304] CPU: 0 PID: 1 Comm: systemd Not tainted 4.9.0-rc5-00973-g46738b1 #1
[   20.454406]  ffffc9000062fc18 ffffffff813607b2 0000000000000000 ffffffff81a7f782
[   20.455528]  ffffc9000062fc48 ffffffff813778dc 0000000000000000 00000000001dcf98
[   20.456539]  ffffffffa003bd08 ffffffff81af93e0 ffffc9000062fc58 ffffffff81377905
[   20.456539] Call Trace:
[   20.456539]  [<ffffffff813607b2>] dump_stack+0x63/0x7f
[   20.456539]  [<ffffffff813778dc>] check_preemption_disabled+0xd1/0xe3
[   20.456539]  [<ffffffff81377905>] debug_smp_processor_id+0x17/0x19
[   20.460260]  [<ffffffffa0061f3b>] seg6_hmac_init+0xfa/0x192 [ipv6]
[   20.460260]  [<ffffffffa0061ccc>] seg6_init+0x39/0x6f [ipv6]
[   20.460260]  [<ffffffffa006121a>] inet6_init+0x21a/0x321 [ipv6]
[   20.460260]  [<ffffffffa0061000>] ? 0xffffffffa0061000
[   20.460260]  [<ffffffff81000457>] do_one_initcall+0x8b/0x115
[   20.460260]  [<ffffffff811328a3>] do_init_module+0x53/0x1c4
[   20.460260]  [<ffffffff8110650a>] load_module+0x1153/0x14ec
[   20.460260]  [<ffffffff81106a7b>] SYSC_finit_module+0x8c/0xb9
[   20.460260]  [<ffffffff81106a7b>] ? SYSC_finit_module+0x8c/0xb9
[   20.460260]  [<ffffffff81106abc>] SyS_finit_module+0x9/0xb
[   20.460260]  [<ffffffff810014d1>] do_syscall_64+0x62/0x75
[   20.460260]  [<ffffffff816834f0>] entry_SYSCALL64_slow_path+0x25/0x25

Moreover, dst_cache_* functions also call smp_processor_id(), generating
a similar trace.

This patch uses raw_cpu_ptr() in seg6_hmac_init() rather than this_cpu_ptr()
and disable preemption when using dst_cache_* functions.

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 12:29:55 -05:00