Commit graph

16991 commits

Author SHA1 Message Date
Linus Torvalds
bddffa28dc ACPI and power management fixes and new device IDs for 3.13-rc6
- Fix for a cpufreq regression causing stale sysfs files to be left
   behind during system resume if cpufreq_add_dev() fails for one or
   more CPUs from Viresh Kumar.
 
 - Fix for a bug in cpufreq causing CONFIG_CPU_FREQ_DEFAULT_* to be
   ignored when the intel_pstate driver is used from Jason Baron.
 
 - System suspend fix for a memory leak in pm_vt_switch_unregister()
   that forgot to release objects after removing them from
   pm_vt_switch_list.  From Masami Ichikawa.
 
 - Intel Valley View device ID and energy unit encoding update for the
   (recently added) Intel RAPL (Running Average Power Limit) driver
   from Jacob Pan.
 
 - Intel Bay Trail SoC GPIO and ACPI device IDs for the Low Power
   Subsystem (LPSS) ACPI driver from Paul Drews.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSvh9MAAoJEILEb/54YlRxfJoQAJFzcqXlYsROgPSlYVEG0F80
 Cop0MbYC/gr8XNiLXWGIRVVYHbxMNeAPk5vUPZg4E9L6cOeazwjFtnRB/Av3FpVo
 XReYHpLbJXJ3VaVyJiw0tCHp/Ukw8Ds0VcURi8RdcrQdkmyXPtbfWcrE+7GmuA2z
 jnZOJviws+mTnxdEHtaml2iZMM5jwvUmUeh3iytc8zOC3QR4I7cLkKnYNTrQatqZ
 qYxu5e9VAKuTXBv7BeNHiViakKhoWPx0S3nKofoiOG5hGwg49HGVlJ3pH9CCfIli
 jA1NpXOGyKzYLJv2fJPtgxQ+l7Mb8wu9hGbPJWaUI3MRa9vIxNok6qzYwAQcfCWD
 p4iugfsaatyKbBSBu+mntczCM7wsl2+/gH3gWfDySRpxq8G9At1dduO9GMeD/pDi
 QhYlFl0obR05F6R0hlkk2Pahx+5x5nub7dM2+8Oh+r8k6TlkFRg+BKe21MJGz/45
 BHBmJkNkLpdUNKT2GhaQK6rhc5TSln3eYGjYDRRhRmV6/4US/hI1MtY6HYg2uWbk
 J3xiMcUXAY/0DzC1zDzvwr4Cc+WRdNXbZGKmmhyc+fxEmjZZVGanfmqC0JFV3gni
 32v1krQA8v3KANz2xjvnNwNLQzSypgVOHihUbPN34FGaE7fxp6PWqxwhRD6RyywK
 gtIHNSZ0mKnmIR6oxq1a
 =vQDX
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management fixes and new device IDs from Rafael Wysocki:

 - Fix for a cpufreq regression causing stale sysfs files to be left
   behind during system resume if cpufreq_add_dev() fails for one or
   more CPUs from Viresh Kumar.

 - Fix for a bug in cpufreq causing CONFIG_CPU_FREQ_DEFAULT_* to be
   ignored when the intel_pstate driver is used from Jason Baron.

 - System suspend fix for a memory leak in pm_vt_switch_unregister()
   that forgot to release objects after removing them from
   pm_vt_switch_list.  From Masami Ichikawa.

 - Intel Valley View device ID and energy unit encoding update for the
   (recently added) Intel RAPL (Running Average Power Limit) driver from
   Jacob Pan.

 - Intel Bay Trail SoC GPIO and ACPI device IDs for the Low Power
   Subsystem (LPSS) ACPI driver from Paul Drews.

* tag 'pm+acpi-3.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  powercap / RAPL: add support for ValleyView Soc
  PM / sleep: Fix memory leak in pm_vt_switch_unregister().
  cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
  cpufreq: remove sysfs files for CPUs which failed to come back after resume
  ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs
2013-12-29 13:27:51 -08:00
Rafael J. Wysocki
1a6725359e Merge branches 'pm-cpufreq' and 'pm-sleep' containing PM fixes
* pm-cpufreq:
  cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
  cpufreq: remove sysfs files for CPUs which failed to come back after resume

* pm-sleep:
  PM / sleep: Fix memory leak in pm_vt_switch_unregister().
2013-12-27 00:42:27 +01:00
Linus Torvalds
70e672fa73 Merge branch 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Two fixes.  One fixes a bug in the error path of cgroup_create().  The
  other changes cgrp->id lifetime rule so that the id doesn't get
  recycled before all controller states are destroyed.  This premature
  id recycling made memcg malfunction"

* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: don't recycle cgroup id until all csses' have been destroyed
  cgroup: fix cgroup_create() error handling path
2013-12-24 09:49:20 -08:00
Linus Torvalds
4b69316ede Merge branch 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
 "There's one interseting commit - "libata, freezer: avoid block device
  removal while system is frozen".  It's an ugly hack working around a
  deadlock condition between driver core resume and block layer device
  removal paths through freezer which was made more reproducible by
  writeback being converted to workqueue some releases ago.  The bug has
  nothing to do with libata but it's just an workaround which is easy to
  backport.  After discussion, Rafael and I seem to agree that we don't
  really need kernel freezables - both kthread and workqueue.  There are
  few specific workqueues which constitute PM operations and require
  freezing, which will be converted to use workqueue_set_max_active()
  instead.  All other kernel freezer uses are planned to be removed,
  followed by the removal of kthread and workqueue freezer support,
  hopefully.

  Others are device-specific fixes.  The most notable is the addition of
  NO_NCQ_TRIM which is used to disable queued TRIM commands to Micro
  M500 SSDs which otherwise suffers data corruption"

* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  libata, freezer: avoid block device removal while system is frozen
  libata: implement ATA_HORKAGE_NO_NCQ_TRIM and apply it to Micro M500 SSDs
  libata: disable a disk via libata.force params
  ahci: bail out on ICH6 before using AHCI BAR
  ahci: imx: Explicitly clear IMX6Q_GPR13_SATA_MPLL_CLK_EN
  libata: add ATA_HORKAGE_BROKEN_FPDMA_AA quirk for Seagate Momentus SpinPoint M8
2013-12-24 09:35:58 -08:00
Masami Ichikawa
c606850407 PM / sleep: Fix memory leak in pm_vt_switch_unregister().
kmemleak reported a memory leak as below.

unreferenced object 0xffff880118f14700 (size 32):
  comm "swapper/0", pid 1, jiffies 4294877401 (age 123.283s)
  hex dump (first 32 bytes):
    00 01 10 00 00 00 ad de 00 02 20 00 00 00 ad de  .......... .....
    00 d4 d2 18 01 88 ff ff 01 00 00 00 00 04 00 00  ................
  backtrace:
    [<ffffffff814edb1e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811889dc>] kmem_cache_alloc_trace+0x1ec/0x260
    [<ffffffff810aba66>] pm_vt_switch_required+0x76/0xb0
    [<ffffffff812f39f5>] register_framebuffer+0x195/0x320
    [<ffffffff8130af18>] efifb_probe+0x718/0x780
    [<ffffffff81391495>] platform_drv_probe+0x45/0xb0
    [<ffffffff8138f407>] driver_probe_device+0x87/0x3a0
    [<ffffffff8138f7f3>] __driver_attach+0x93/0xa0
    [<ffffffff8138d413>] bus_for_each_dev+0x63/0xa0
    [<ffffffff8138ee5e>] driver_attach+0x1e/0x20
    [<ffffffff8138ea40>] bus_add_driver+0x180/0x250
    [<ffffffff8138fe74>] driver_register+0x64/0xf0
    [<ffffffff813913ba>] __platform_driver_register+0x4a/0x50
    [<ffffffff8191e028>] efifb_driver_init+0x12/0x14
    [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
    [<ffffffff818e40e0>] kernel_init_freeable+0x17b/0x201

In pm_vt_switch_required(), "entry" variable is allocated via kmalloc().
So, in pm_vt_switch_unregister(), it needs to call kfree() when object
is deleted from list.

Signed-off-by: Masami Ichikawa <masami256@gmail.com>
Reviewed-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-12-22 00:56:35 +01:00
Kirill A. Shutemov
597d795a2a mm: do not allocate page->ptl dynamically, if spinlock_t fits to long
In struct page we have enough space to fit long-size page->ptl there,
but we use dynamically-allocated page->ptl if size(spinlock_t) is larger
than sizeof(int).

It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
sizeof(spinlock_t) == 8, but it easily fits into struct page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-20 12:25:45 -08:00
Linus Torvalds
5263f0a880 This fixes a long standing bug in the ftrace profiler.
The problem is that the profiler only initializes the online
 CPUs, and not possible CPUs. This causes issues if the user takes
 CPUs online or offline while the profiler is running.
 
 If we online a CPU after starting the profiler, we lose all the
 trace information on the CPU going online.
 
 If we offline a CPU after running a test and start a new test, it
 will not clear the old data from that CPU.
 
 This bug causes incorrect data to be reported to the user if they
 online or offline CPUs during the profiling.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSsNBHAAoJEKQekfcNnQGuKP8H/2mol/d7z2vANh7/FeNjTKIN
 VkRzDEwUIwoaJBsL75EDDXBFx7w8jjAsXyoTrqrvMRV4UNcsfm46mohQTPAmK39y
 muqodL1VnVXdKrUmtw/1nL7yDi2KltQH1UwOgvwXGuUFIq5cuCXNQxNK9/1fVVVn
 tIMNz5kEAG3XCwnqP0PgQxWCuA7s+aQR0ijTf4vPf1G3IJujPyG9VhJWcGS3dJTR
 t8TPyatd9D/S+7/r7iZ9hS8nWpaka3qJfhiWqk16SC9LiUXVA8oFOVMoN7n6Co5E
 6r2dNo01WOABlojCxi1t3afUtcV1bUjBnVkiDva5cSc84pQSxe1qRrIpjTmHk00=
 =MSZs
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
 "This fixes a long standing bug in the ftrace profiler.  The problem is
  that the profiler only initializes the online CPUs, and not possible
  CPUs.  This causes issues if the user takes CPUs online or offline
  while the profiler is running.

  If we online a CPU after starting the profiler, we lose all the trace
  information on the CPU going online.

  If we offline a CPU after running a test and start a new test, it will
  not clear the old data from that CPU.

  This bug causes incorrect data to be reported to the user if they
  online or offline CPUs during the profiling"

* tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Initialize the ftrace profiler for each possible cpu
2013-12-20 09:32:30 -08:00
Tejun Heo
85fbd722ad libata, freezer: avoid block device removal while system is frozen
Freezable kthreads and workqueues are fundamentally problematic in
that they effectively introduce a big kernel lock widely used in the
kernel and have already been the culprit of several deadlock
scenarios.  This is the latest occurrence.

During resume, libata rescans all the ports and revalidates all
pre-existing devices.  If it determines that a device has gone
missing, the device is removed from the system which involves
invalidating block device and flushing bdi while holding driver core
layer locks.  Unfortunately, this can race with the rest of device
resume.  Because freezable kthreads and workqueues are thawed after
device resume is complete and block device removal depends on
freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
progress, this can lead to deadlock - block device removal can't
proceed because kthreads are frozen and kthreads can't be thawed
because device resume is blocked behind block device removal.

839a8e8660 ("writeback: replace custom worker pool implementation
with unbound workqueue") made this particular deadlock scenario more
visible but the underlying problem has always been there - the
original forker task and jbd2 are freezable too.  In fact, this is
highly likely just one of many possible deadlock scenarios given that
freezer behaves as a big kernel lock and we don't have any debug
mechanism around it.

I believe the right thing to do is getting rid of freezable kthreads
and workqueues.  This is something fundamentally broken.  For now,
implement a funny workaround in libata - just avoid doing block device
hot[un]plug while the system is frozen.  Kernel engineering at its
finest.  :(

v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
    as a module.

v3: Comment updated and polling interval changed to 10ms as suggested
    by Rafael.

v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
    defined when FREEZER is not configured thus breaking build.
    Reported by kbuild test robot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Cc: kbuild test robot <fengguang.wu@intel.com>
2013-12-19 13:50:32 -05:00
Linus Torvalds
f7556698a3 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "An RT group-scheduling fix and the sched-domains topology setup fix
  from Mel"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities
  sched: Assign correct scheduling domain to 'sd_llc'
2013-12-19 09:11:22 -08:00
Linus Torvalds
58cac3faef Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "An ABI documentation fix, and a mixed-PMU perf-info-corruption fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Document the new transaction sample type
  perf: Disable all pmus on unthrottling and rescheduling
2013-12-19 09:10:46 -08:00
Linus Torvalds
86fbf1617a Merge branch 'akpm' (incoming from Andrew)
Merge patches from Andrew Morton:
 "23 fixes and a MAINTAINERS update"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
  mm/hugetlb: check for pte NULL pointer in __page_check_address()
  fix build with make 3.80
  mm/mempolicy: fix !vma in new_vma_page()
  MAINTAINERS: add Davidlohr as GPT maintainer
  mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
  mm/compaction: respect ignore_skip_hint in update_pageblock_skip
  mm/mempolicy: correct putback method for isolate pages if failed
  mm: add missing dependency in Kconfig
  sh: always link in helper functions extracted from libgcc
  mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  mm: numa: defer TLB flush for THP migration as long as possible
  mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates
  mm: fix TLB flush race between migration, and change_protection_range
  mm: numa: avoid unnecessary disruption of NUMA hinting during migration
  mm: numa: clear numa hinting information on mprotect
  sched: numa: skip inaccessible VMAs
  mm: numa: avoid unnecessary work on the failure path
  mm: numa: ensure anon_vma is locked to prevent parallel THP splits
  mm: numa: do not clear PTE for pte_numa update
  mm: numa: do not clear PMD during PTE update scan
  ...
2013-12-18 19:05:00 -08:00
Rik van Riel
2084140594 mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.

The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.

During that time, a CPU may continue writing to the page.

This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.

When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.

This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).

The basic race looks like this:

CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
			fault on entry
						read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
						lose data

[*] the old page may belong to a new user at this point!

The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.

This should fix both NUMA migration and compaction.

[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Mel Gorman
3c67f47455 sched: numa: skip inaccessible VMAs
Inaccessible VMA should not be trapping NUMA hint faults. Skip them.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:51 -08:00
Vivek Goyal
c97102ba96 kexec: migrate to reboot cpu
Commit 1b3a5d02ee ("reboot: move arch/x86 reboot= handling to generic
kernel") moved reboot= handling to generic code.  In the process it also
removed the code in native_machine_shutdown() which are moving reboot
process to reboot_cpu/cpu0.

I guess that thought must have been that all reboot paths are calling
migrate_to_reboot_cpu(), so we don't need this special handling.  But
kexec reboot path (kernel_kexec()) is not calling
migrate_to_reboot_cpu() so above change broke kexec.  Now reboot can
happen on non-boot cpu and when INIT is sent in second kerneo to bring
up BP, it brings down the machine.

So start calling migrate_to_reboot_cpu() in kexec reboot path to avoid
this problem.

Bisected by WANG Chao.

Reported-by: Matthew Whitehead <mwhitehe@redhat.com>
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Baoquan He <bhe@redhat.com>
Tested-by: WANG Chao <chaowang@redhat.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:50 -08:00
Linus Torvalds
a81bddde96 Merge branch 'keys-devel' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull crypto key patches from David Howells:
 "There are four items:

   - A patch to fix X.509 certificate gathering.  The problem was that I
     was coming up with a different path for signing_key.x509 in the
     build directory if it didn't exist to if it did exist.  This meant
     that the X.509 cert container object file would be rebuilt on the
     second rebuild in a build directory and the kernel would get
     relinked.

   - Unconditionally remove files generated by SYSTEM_TRUSTED_KEYRING=y
     when doing make mrproper.

   - Actually initialise the persistent-keyring semaphore for
     init_user_ns.  I have no idea why this works at all for users in
     the base user namespace unless it's something to do with systemd
     containerising the system.

   - Documentation for module signing"

* 'keys-devel' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  Add Documentation/module-signing.txt file
  KEYS: fix uninitialized persistent_keyring_register_sem
  KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y
  X.509: Fix certificate gathering
2013-12-18 14:09:08 -08:00
Linus Torvalds
dd0508093b Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Three fixes for scheduler crashes, each triggers in relatively rare,
  hardware environment dependent situations"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Rework sched_fair time accounting
  math64: Add mul_u64_u32_shr()
  sched: Remove PREEMPT_NEED_RESCHED from generic code
  sched: Initialize power_orig for overlapping groups
2013-12-17 12:35:54 -08:00
Kirill Tkhai
757dfcaa41 sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities
This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:08:44 +01:00
Mel Gorman
5d4cf996cf sched: Assign correct scheduling domain to 'sd_llc'
Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
dereference on sd_busy but the fix also altered what scheduling domain it
used for the 'sd_llc' percpu variable.

One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks are then running on CPUs that are not cache hot.

This was found through bisection where ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested
implying there may be an additional bug or a common root cause. Here are
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.

	4-core machine
			    3.13.0-rc3            3.13.0-rc3
			       vanilla            fixsd-v3r3
	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)

	8-core machine
	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)

It's eliminated for one machine and reduced for another.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:08:43 +01:00
Alexander Shishkin
443772776c perf: Disable all pmus on unthrottling and rescheduling
Currently, only one PMU in a context gets disabled during unthrottling
and event_sched_{out,in}(), however, events in one context may belong to
different pmus, which results in PMUs being reprogrammed while they are
still enabled.

This means that mixed PMU use [which is rare in itself] resulted in
potentially completely unreliable results: corrupted events, bogus
results, etc.

This patch temporarily disables PMUs that correspond to
each event in the context while these events are being modified.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: http://lkml.kernel.org/r/1387196256-8030-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:04:00 +01:00
Li Zefan
c1a71504e9 cgroup: don't recycle cgroup id until all csses' have been destroyed
Hugh reported this bug:

> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319c ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.

Instead of removing cgroup id right after all the csses have been
offlined, we should do that after csses have been destroyed.

To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.

tj: Updated comment to note planned changes for cgrp->id.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-12-17 08:11:52 -05:00
Miao Xie
c4602c1c81 ftrace: Initialize the ftrace profiler for each possible cpu
Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
  test, we will lose the trace information on that CPU.
  Steps to reproduce:
  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # cd <debugfs>/tracing/
  # echo <some function name> >> set_ftrace_filter
  # echo 1 > function_profile_enabled
  # echo 1 > /sys/devices/system/cpu/cpu1/online
  # run test
- If we offline a CPU before we enable the function profile, we will not clear
  the trace information when we enable the function profile. It will trouble
  the users.
  Steps to reproduce:
  # cd <debugfs>/tracing/
  # echo <some function name> >> set_ftrace_filter
  # echo 1 > function_profile_enabled
  # run test
  # cat trace_stat/function*
  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # echo 0 > function_profile_enabled
  # echo 1 > function_profile_enabled
  # cat trace_stat/function*
  # run test
  # cat trace_stat/function*

So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.

Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com

Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-12-16 10:53:46 -05:00
Linus Torvalds
9199c4caa1 PCI updates for v3.13:
PCI device hotplug
     - Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael J. Wysocki)
 
   Host bridge drivers
     - Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn Helgaas)
     - mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin (Jason Gunthorpe)
 
   Miscellaneous
     - Avoid unnecessary CPU switch when calling .probe() (Alexander Duyck)
     - Revert "workqueue: allow work_on_cpu() to be called recursively" (Bjorn Helgaas)
     - Disable Bus Master only on kexec reboot (Khalid Aziz)
     - Omit PCI ID macro strings to shorten quirk names for LTO (Michal Marek)
 
  MAINTAINERS                  | 33 +++++++++++++++++++++++++++++++++
  drivers/pci/host/pci-mvebu.c |  5 +++++
  drivers/pci/pci-driver.c     | 38 ++++++++++++++++++++++++++++++--------
  drivers/pci/remove.c         |  4 +++-
  include/linux/kexec.h        |  3 +++
  include/linux/pci.h          | 30 +++++++++++++++---------------
  kernel/kexec.c               |  4 ++++
  kernel/workqueue.c           | 32 ++++++++++----------------------
  8 files changed, 103 insertions(+), 46 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSrLDLAAoJEFmIoMA60/r8JO4P/iakGDezwXcd9YTbdaq/NmEK
 31JtBao9rJNSUXpFGaFKGm99A479QNb+Fo1o5NaysyGbcAA2W0HeCkaee13hpw2D
 l3wvzJRoEwqVySOzMwlGsxxhYuvGiZ2WVALNkwBzwyCiYtypNnUtGNCSp/3J6XKp
 2ZUjoagyixdS4oYoR4irZucPBWwzW89Gx4oJ0rBttNVsXjiT1D2OzYDYTuxMyb2E
 ZjXJqTUYzfwiFxqKPrGOabUPD9GX3EWaFmu01FLlsSznPZkk9yJR3/eq6I6utp/E
 WSpP+7v3jnPHjZVky1FdHaP5wqurtNRsiWBQVyNYF+M5WofKA1hGA+niJDEB/pVL
 vE74fJfZzpET1jlZR+7Z5qHPxW2A8Q2ARn74A42DYLQGmJEh7VdARcayV1E4diRH
 DzEnhf33b9bwIC1Q0cESWZ5RH4r2Xbjg0a1qoVQYi5VEJEMWXID3Ofk64Bd9QOz4
 oLZ27V7clurD+gNarch/zgpd9LVIHLFriR2YWPpA0Iwy9EjueH8GsiOS3NqDn2SQ
 sQg5utE4vdnix4VrCvvbufAr3kJngndtuj/s7I3lZqi7nCyS+jeFFvMUd9h3MUqZ
 rs1IN1/qeTBBNi2dtmcKDN2ItahCoBhTI37JRaijZwe+B5m6mTe049it+E0SZPI2
 IqLOYWL4ggT0se/cMXld
 =Edvk
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI device hotplug
    - Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael
      Wysocki)

  Host bridge drivers
    - Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn
      Helgaas)
    - mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
      (Jason Gunthorpe)

  Miscellaneous
    - Avoid unnecessary CPU switch when calling .probe() (Alexander
      Duyck)
    - Revert "workqueue: allow work_on_cpu() to be called recursively"
      (Bjorn Helgaas)
    - Disable Bus Master only on kexec reboot (Khalid Aziz)
    - Omit PCI ID macro strings to shorten quirk names for LTO (Michal
      Marek)"

* tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  MAINTAINERS: Add DesignWare, i.MX6, Armada, R-Car PCI host maintainers
  PCI: Disable Bus Master only on kexec reboot
  PCI: mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
  PCI: Omit PCI ID macro strings to shorten quirk names
  PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()
  Revert "workqueue: allow work_on_cpu() to be called recursively"
  PCI: Avoid unnecessary CPU switch when calling driver .probe() method
2013-12-15 11:45:27 -08:00
Xiao Guangrong
6bd364d829 KEYS: fix uninitialized persistent_keyring_register_sem
We run into this bug:
[ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
[ 2736.063293] Faulting instruction address: 0xc00000000037efb0
[ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
[ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
[ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
[ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
[ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
[ 2736.063399] REGS: c0000001319a3870 TRAP: 0300   Not tainted  (3.10.0-48.el7.ppc64)
[ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28824242  XER: 20000000
[ 2736.063415] SOFTE: 0
[ 2736.063418] CFAR: c00000000000908c
[ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
[ 2736.063425]
GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
[ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
[ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
[ 2736.063508] PACATMSCRATCH [800000000280f032]
[ 2736.063511] Call Trace:
[ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
[ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
[ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
[ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
[ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
[ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
[ 2736.063553] Instruction dump:
[ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
[ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040
[ 2736.063579] ---[ end trace 2708241785538296 ]---

It's caused by uninitialized persistent_keyring_register_sem.

The bug was introduced by commit f36f8c75, two typos are in that commit:
CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and
krb_cache_register_sem should be persistent_keyring_register_sem.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13 15:59:11 +00:00
Kirill Tkhai
f46a3cbbeb KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y
Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13 15:59:11 +00:00
David Howells
d7ec435fdd X.509: Fix certificate gathering
Fix the gathering of certificates from both the source tree and the build tree
to correctly calculate the pathnames of all the certificates.

The problem was that if the default generated cert, signing_key.x509, didn't
exist then it would not have a path attached and if it did, it would have a
path attached.

This means that the contents of kernel/.x509.list would change between the
first compilation in a directory and the second.  After the second it would
remain stable because the signing_key.x509 file exists.

The consequence was that the kernel would get relinked unconditionally on the
second recompilation.  The second recompilation would also show something like
this:

   X.509 certificate list changed
     CERTS   kernel/x509_certificate_list
     - Including cert /home/torvalds/v2.6/linux/signing_key.x509
     AS      kernel/system_certificates.o
     LD      kernel/built-in.o

which is why the relink would happen.


Unfortunately, it isn't a simple matter of just sticking a path on the front
of the filename of the certificate in the build directory as make can't then
work out how to build it.

So the path has to be prepended to the name for sorting and duplicate
elimination and then removed for the make rule if it is in the build tree.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-13 15:28:14 +00:00
Linus Torvalds
5dec682c7f Keyrings fixes 2013-12-10
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIVAwUAUqdgihOxKuMESys7AQIishAAjGG3LnEp12fd7oay5//4SEg31ybPqGj7
 Xotk9mblBW7cWPbLqk7xyIhxpIj2zM14XatEV2TfBNZV0Lcrzr1M5s87ZRUX0ui+
 FHbtfn/M3Or8AvX7W1HZgG2Se7M0Ba6Y2RRXNsEdgqk6KMQuHhPEZ2FZ5vCx6BAD
 vXzxWIKVNeqMXR7z6xkyqmpztnVQs+ZgzE9+c96QKyhBdtBy4spDmfgJS90m0pjN
 HhgONpdfgosknj8yu43rWIQvd3UUO5BVntCeic94Fbgh77ZAEi0dD7ifLz/ebcja
 pfJfPXxYzkCfIrSdAzQF8iYnQ+5rRomvGsMcvtq6mBooah/YmEkmKpKJDTp47wyN
 IEUeJaxM2Qp2jcUSjEd7lY9o1AK4sj+90cKeVRUd5kzZP1iolkQHR+dGiAwqbn6w
 MJBn9fCamvhpsZDhl2G/DICprFDvErd9zAlNubivggJmITnPXNWx+K1RDL4fO4qc
 zLHTGHkxYyACM/7oJfwbH/NyZ1yu53OlE3R2h6TA6ISc7nAed7qzGAZyrSV92Quc
 pItGaQ/zNa0sbe+nufCx9FOWu8sA3x7qnazLxhVtlPde9nlxcpGo5cagqcyc4hyp
 /IGK2JoGkgHvehECY8miJlsu7UqmThIhmy6o4T4X6ErX1M9ifR92WGeXkAi/2mh/
 WciVpPvTrqM=
 =/AdA
 -----END PGP SIGNATURE-----

Merge tag 'keys-devel-20131210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc keyrings fixes from David Howells:
 "These break down into five sets:

   - A patch to error handling in the big_key type for huge payloads.
     If the payload is larger than the "low limit" and the backing store
     allocation fails, then big_key_instantiate() doesn't clear the
     payload pointers in the key, assuming them to have been previously
     cleared - but only one of them is.

     Unfortunately, the garbage collector still calls big_key_destroy()
     when sees one of the pointers with a weird value in it (and not
     NULL) which it then tries to clean up.

   - Three patches to fix the keyring type:

     * A patch to fix the hash function to correctly divide keyrings off
       from keys in the topology of the tree inside the associative
       array.  This is only a problem if searching through nested
       keyrings - and only if the hash function incorrectly puts the a
       keyring outside of the 0 branch of the root node.

     * A patch to fix keyrings' use of the associative array.  The
       __key_link_begin() function initially passes a NULL key pointer
       to assoc_array_insert() on the basis that it's holding a place in
       the tree whilst it does more allocation and stuff.

       This is only a problem when a node contains 16 keys that match at
       that level and we want to add an also matching 17th.  This should
       easily be manufactured with a keyring full of keyrings (without
       chucking any other sort of key into the mix) - except for (a)
       above which makes it on average adding the 65th keyring.

     * A patch to fix searching down through nested keyrings, where any
       keyring in the set has more than 16 keyrings and none of the
       first keyrings we look through has a match (before the tree
       iteration needs to step to a more distal node).

     Test in keyutils test suite:

        http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=8b4ae963ed92523aea18dfbb8cab3f4979e13bd1

   - A patch to fix the big_key type's use of a shmem file as its
     backing store causing audit messages and LSM check failures.  This
     is done by setting S_PRIVATE on the file to avoid LSM checks on the
     file (access to the shmem file goes through the keyctl() interface
     and so is gated by the LSM that way).

     This isn't normally a problem if a key is used by the context that
     generated it - and it's currently only used by libkrb5.

     Test in keyutils test suite:

        http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=d9a53cbab42c293962f2f78f7190253fc73bd32e

   - A patch to add a generated file to .gitignore.

   - A patch to fix the alignment of the system certificate data such
     that it it works on s390.  As I understand it, on the S390 arch,
     symbols must be 2-byte aligned because loading the address discards
     the least-significant bit"

* tag 'keys-devel-20131210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  KEYS: correct alignment of system_certificate_list content in assembly file
  Ignore generated file kernel/x509_certificate_list
  security: shmem: implement kernel private shmem inodes
  KEYS: Fix searching of nested keyrings
  KEYS: Fix multiple key add into associative array
  KEYS: Fix the keyring hash function
  KEYS: Pre-clear struct key on allocation
2013-12-12 10:15:24 -08:00
Linus Torvalds
5cdec2d833 futex: move user address verification up to common code
When debugging the read-only hugepage case, I was confused by the fact
that get_futex_key() did an access_ok() only for the non-shared futex
case, since the user address checking really isn't in any way specific
to the private key handling.

Now, it turns out that the shared key handling does effectively do the
equivalent checks inside get_user_pages_fast() (it doesn't actually
check the address range on x86, but does check the page protections for
being a user page).  So it wasn't actually a bug, but the fact that we
treat the address differently for private and shared futexes threw me
for a loop.

Just move the check up, so that it gets done for both cases.  Also, use
the 'rw' parameter for the type, even if it doesn't actually matter any
more (it's a historical artifact of the old racy i386 "page faults from
kernel space don't check write protections").

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 09:53:51 -08:00
Linus Torvalds
f12d5bfceb futex: fix handling of read-only-mapped hugepages
The hugepage code had the exact same bug that regular pages had in
commit 7485d0d375 ("futexes: Remove rw parameter from
get_futex_key()").

The regular page case was fixed by commit 9ea71503a8 ("futex: Fix
regression with read only mappings"), but the transparent hugepage case
(added in a5b338f2b0: "thp: update futex compound knowledge") case
remained broken.

Found by Dave Jones and his trinity tool.

Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
Cc: stable@kernel.org # v2.6.38+
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12 09:38:42 -08:00
Peter Zijlstra
9dbdb15553 sched/fair: Rework sched_fair time accounting
Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
results in him occasionally seeing time going backwards - which
crashes the scheduler ...

Most of our time accounting can actually handle that except the most
common one; the tick time update of sched_fair.

There is a further problem with that code; previously we assumed that
because we get a tick every TICK_NSEC our time delta could never
exceed 32bits and math was simpler.

However, ever since Frederic managed to get NO_HZ_FULL merged; this is
no longer the case since now a task can run for a long time indeed
without getting a tick. It only takes about ~4.2 seconds to overflow
our u32 in nanoseconds.

This means we not only need to better deal with time going backwards;
but also means we need to be able to deal with large deltas.

This patch reworks the entire code and uses mul_u64_u32_shr() as
proposed by Andy a long while ago.

We express our virtual time scale factor in a u32 multiplier and shift
right and the 32bit mul_u64_u32_shr() implementation reduces to a
single 32x32->64 multiply if the time delta is still short (common
case).

For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.

Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Paul Turner <pjt@google.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-11 15:52:35 +01:00
Peter Zijlstra
8e8339a3a1 sched: Initialize power_orig for overlapping groups
Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
Make sure to always initialize power_orig now that we actually use it.

Ideally build_sched_domains() -> init_sched_groups_power() would also
initialize this; but for some yet unexplained reason some setups seem
to miss updates there.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-11 15:47:59 +01:00
Hendrik Brueckner
62226983da KEYS: correct alignment of system_certificate_list content in assembly file
Apart from data-type specific alignment constraints, there are also
architecture-specific alignment requirements.
For example, on s390 symbols must be on even addresses implying a 2-byte
alignment.  If the system_certificate_list_end symbol is on an odd address
and if this address is loaded, the least-significant bit is ignored.  As a
result, the load_system_certificate_list() fails to load the certificates
because of a wrong certificate length calculation.

To be safe, align system_certificate_list on an 8-byte boundary.  Also improve
the length calculation of the system_certificate_list content.  Introduce a
system_certificate_list_size (8-byte aligned because of unsigned long) variable
that stores the length.  Let the linker calculate this size by introducing
a start and end label for the certificate content.

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-10 18:25:28 +00:00
Rusty Russell
7cfe5b3310 Ignore generated file kernel/x509_certificate_list
$ git status
# On branch pending-rebases
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	kernel/x509_certificate_list
nothing added to commit but untracked files present (use "git add" to track)
$

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Howells <dhowells@redhat.com>
2013-12-10 18:21:34 +00:00
Khalid Aziz
4fc9bbf98f PCI: Disable Bus Master only on kexec reboot
Add a flag to tell the PCI subsystem that kernel is shutting down in
preparation to kexec a kernel.  Add code in PCI subsystem to use this flag
to clear Bus Master bit on PCI devices only in case of kexec reboot.

This fixes a power-off problem on Acer Aspire V5-573G and likely other
machines and avoids any other issues caused by clearing Bus Master bit on
PCI devices in normal shutdown path.  The problem was introduced by
b566a22c23 ("PCI: disable Bus Master on PCI device shutdown").

This patch is based on discussion at
http://marc.info/?l=linux-pci&m=138425645204355&w=2

Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
Reported-by: Chang Liu <cl91tp@gmail.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: stable@vger.kernel.org	# v3.5+
2013-12-07 14:20:28 -07:00
Tejun Heo
266ccd505e cgroup: fix cgroup_create() error handling path
ae7f164a09 ("cgroup: move cgroup->subsys[] assignment to
online_css()") moved cgroup->subsys[] assignements later in
cgroup_create() but didn't update error handling path accordingly
leading to the following oops and leaking later css's after an
online_css() failure.  The oops is from cgroup destruction path being
invoked on the partially constructed cgroup which is not ready to
handle empty slots in cgrp->subsys[] array.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
  PGD a780a067 PUD aadbe067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
  Hardware name:
  task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
  RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
  RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
  RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
  RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
  RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
  R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
  R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
  FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
  Stack:
   ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
   ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
   ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
  Call Trace:
   [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
   [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
   [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
   [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
   [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b

This patch moves reference bumping inside online_css() loop, clears
css_ar[] as css's are brought online successfully, and updates
err_destroy path so that either a css is fully online and destroyed by
cgroup_destroy_locked() or the error path frees it.  This creates a
duplicate css free logic in the error path but it will be cleaned up
soon.

v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
    invoked with a cgroup which doesn't have all css's populated.
    Update cgroup_destroy_locked() so that it skips NULL css's.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Reported-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: stable@vger.kernel.org # v3.12+
2013-12-06 15:08:50 -05:00
Linus Torvalds
843f4f4bb1 A regression showed up that there's a large delay when enabling
all events. This was prevalent when FTRACE_SELFTEST was enabled which
 enables all events several times, and caused the system bootup to
 pause for over a minute.
 
 This was tracked down to an addition of a synchronize_sched() performed
 when system call tracepoints are unregistered.
 
 The synchronize_sched() is needed between the unregistering of the
 system call tracepoint and a deletion of a tracing instance buffer.
 But placing the synchronize_sched() in the unreg of *every* system call
 tracepoint is a bit overboard. A single synchronize_sched() before
 the deletion of the instance is sufficient.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSofYNAAoJEKQekfcNnQGuTUcIAJOx8745KlkFjN4VX+nCWNfP
 xrrpnLBymbGMA9lQ4fXk+kdiuhH8DjRYdKq9fU4T481MFYKkToUIZH6NeaLI5fr1
 0nBPPjVyAlJ+yt9JbOAYa1jEYnAr27ORDHEtdQnqb6OJSky3oh9jCQi+toxmh2qX
 Sv1tIYeAf3K2V/h5xt6uSl9oiZ6KBtwE3f+xkHWNizaU9i2rq2gxd77fSbPNTIps
 wLdsESYziA2UeAm13eh8xXo1uqRbfvx7bPr59cu0+3AqdOoaXFG+JoE/MqmljIO5
 HkyCKJVqP8HD+QieuEB+hw4zpSsl6A6iKSlaiHm8NMnRRocfM6qMKMQXQ28KhxA=
 =sYm/
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "A regression showed up that there's a large delay when enabling all
  events.  This was prevalent when FTRACE_SELFTEST was enabled which
  enables all events several times, and caused the system bootup to
  pause for over a minute.

  This was tracked down to an addition of a synchronize_sched()
  performed when system call tracepoints are unregistered.

  The synchronize_sched() is needed between the unregistering of the
  system call tracepoint and a deletion of a tracing instance buffer.
  But placing the synchronize_sched() in the unreg of *every* system
  call tracepoint is a bit overboard.  A single synchronize_sched()
  before the deletion of the instance is sufficient"

* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Only run synchronize_sched() at instance deletion time
2013-12-06 08:34:16 -08:00
Steven Rostedt
3ccb012392 tracing: Only run synchronize_sched() at instance deletion time
It has been reported that boot up with FTRACE_SELFTEST enabled can take a
very long time. There can be stalls of over a minute.

This was tracked down to the synchronize_sched() called when a system call
event is disabled. As the self tests enable and disable thousands of events,
this makes the synchronize_sched() get called thousands of times.

The synchornize_sched() was added with d562aff93b "tracing: Add support
for SOFT_DISABLE to syscall events" which caused this regression (added
in 3.13-rc1).

The synchronize_sched() is to protect against the events being accessed
when a tracer instance is being deleted. When an instance is being deleted
all the events associated to it are unregistered. The synchronize_sched()
makes sure that no more users are running when it finishes.

Instead of calling synchronize_sched() for all syscall events, we only
need to call it once, after the events are unregistered and before the
instance is deleted. The event_mutex is held during this action to
prevent new users from enabling events.

Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home

Reported-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-12-05 14:22:30 -05:00
Linus Torvalds
1ab231b274 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:

 - timekeeping: Cure a subtle drift issue on GENERIC_TIME_VSYSCALL_OLD

 - nohz: Make CONFIG_NO_HZ=n and nohz=off command line option behave the
   same way.  Fixes a long standing load accounting wreckage.

 - clocksource/ARM: Kconfig update to avoid ARM=n wreckage

 - clocksource/ARM: Fixlets for the AT91 and SH clocksource/clockevents

 - Trivial documentation update and kzalloc conversion from akpms pile

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off
  time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLD
  clocksource: arm_arch_timer: Hide eventstream Kconfig on non-ARM
  clocksource: sh_tmu: Add clk_prepare/unprepare support
  clocksource: sh_tmu: Release clock when sh_tmu_register() fails
  clocksource: sh_mtu2: Add clk_prepare/unprepare support
  clocksource: sh_mtu2: Release clock when sh_mtu2_register() fails
  ARM: at91: rm9200: switch back to clockevents_config_and_register
  tick: Document tick_do_timer_cpu
  timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
  NOHZ: Check for nohz active instead of nohz enabled
2013-12-04 08:52:09 -08:00
Linus Torvalds
a45299e727 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 - Correction of fuzzy and fragile IRQ_RETVAL macro
 - IRQ related resume fix affecting only XEN
 - ARM/GIC fix for chained GIC controllers

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: Gic: fix boot for chained gics
  irq: Enable all irqs unconditionally in irq_resume
  genirq: Correct fuzzy and fragile IRQ_RETVAL() definition
2013-12-02 10:15:39 -08:00
Linus Torvalds
a0b57ca33e Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Various smaller fixlets, all over the place"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/doc: Fix generation of device-drivers
  sched: Expose preempt_schedule_irq()
  sched: Fix a trivial typo in comments
  sched: Remove unused variable in 'struct sched_domain'
  sched: Avoid NULL dereference on sd_busy
  sched: Check sched_domain before computing group power
  MAINTAINERS: Update file patterns in the lockdep and scheduler entries
2013-12-02 10:13:44 -08:00
Linus Torvalds
e321ae4c20 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc kernel and tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools lib traceevent: Fix conversion of pointer to integer of different size
  perf/trace: Properly use u64 to hold event_id
  perf: Remove fragile swevent hlist optimization
  ftrace, perf: Avoid infinite event generation loop
  tools lib traceevent: Fix use of multiple options in processing field
  perf header: Fix possible memory leaks in process_group_desc()
  perf header: Fix bogus group name
  perf tools: Tag thread comm as overriden
2013-12-02 10:13:09 -08:00
Linus Torvalds
7224b31bd5 Merge branch 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
 "This contains one important fix.  The NUMA support added a while back
  broke ordering guarantees on ordered workqueues.  It was enforced by
  having single frontend interface with @max_active == 1 but the NUMA
  support puts multiple interfaces on unbound workqueues on NUMA
  machines thus breaking the ordered guarantee.  This is fixed by
  disabling NUMA support on ordered workqueues.

  The above and a couple other patches were sitting in for-3.12-fixes
  but I forgot to push that out, so they ended up waiting a bit too
  long.  My aplogies.

  Other fixes are minor"

* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues
  workqueue: fix comment typo for __queue_work()
  workqueue: fix ordered workqueues in NUMA setups
  workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY
2013-11-29 09:49:08 -08:00
Linus Torvalds
2855987d13 Merge branch 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Fixes for three issues.

   - cgroup destruction path could swamp system_wq possibly leading to
     deadlock.  This actually seems to happen in the wild with memcg
     because memcg destruction path adds nested dependency on system_wq.

     Resolved by isolating cgroup destruction work items on its
     dedicated workqueue.

   - Possible locking context deadlock through seqcount reported by
     lockdep

   - Memory leak under certain conditions"

* 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix cgroup_subsys_state leak for seq_files
  cpuset: Fix memory allocator deadlock
  cgroup: use a dedicated workqueue for cgroup destruction
2013-11-29 09:47:06 -08:00
Thomas Gleixner
0e576acbc1 nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off
If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.

If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
command line option "nohz=off" or not enabled due to missing hardware
support, then tick_nohz_get_sleep_length() returns 0. That happens
because ts->sleep_length is never set in that case.

Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.

Reported-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-11-29 12:23:03 +01:00
Helge Deller
5ecbe3c3c6 kernel/extable: fix address-checks for core_kernel and init areas
The init_kernel_text() and core_kernel_text() functions should not
include the labels _einittext and _etext when checking if an address is
inside the .text or .init sections.

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-28 09:49:41 -08:00
Tejun Heo
e605b36575 cgroup: fix cgroup_subsys_state leak for seq_files
If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file->f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().

Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release().  The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.

Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.12
2013-11-27 18:16:21 -05:00
Peter Zijlstra
0fc0287c9e cpuset: Fix memory allocator deadlock
Juri hit the below lockdep report:

[    4.303391] ======================================================
[    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
[    4.303395] ------------------------------------------------------
[    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
[    4.303417]
[    4.303417] and this task is already holding:
[    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
[    4.303431] which would create a new lock dependency:
[    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
[    4.303436]

[    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
[    4.303922]    HARDIRQ-ON-W at:
[    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
[    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
[    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[    4.303933]    SOFTIRQ-ON-W at:
[    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
[    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
[    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[    4.303959]    INITIAL USE at:
[    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
[    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
[    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[    4.303972]  }

Which reports that we take mems_allowed_seq with interrupts enabled. A
little digging found that this can only be from
cpuset_change_task_nodemask().

This is an actual deadlock because an interrupt doing an allocation will
hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
forever waiting for the write side to complete.

Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Reported-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-11-27 13:52:47 -05:00
Thomas Gleixner
32e475d76a sched: Expose preempt_schedule_irq()
Tony reported that aa0d532605 ("ia64: Use preempt_schedule_irq")
broke PREEMPT=n builds on ia64.

Ok, wrapped my brain around it. I tripped over the magic asm foo which
has a single need_resched check and schedule point for both sys call
return and interrupt return.

So you need the schedule_preempt_irq() for kernel preemption from
interrupt return while on a normal syscall preemption a schedule would
be sufficient. But using schedule_preempt_irq() is not harmful here in
any way. It just sets the preempt_active bit also in cases where it
would not be required.

Even on preempt=n kernels adding the preempt_active bit is completely
harmless. So instead of having an extra function, moving the existing
one out of the ifdef PREEMPT looks like the sanest thing to do.

It would also allow getting rid of various other sti/schedule/cli asm
magic in other archs.

Reported-and-Tested-by: Tony Luck <tony.luck@gmail.com>
Fixes: aa0d532605 ("ia64: Use preempt_schedule_irq")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[slightly edited Changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311211230030.30673@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-27 11:04:53 +01:00
Linus Torvalds
8ae516aa8b This includes two fixes.
1) is a bug fix that happens when root does the following:
 
   echo function_graph > current_tracer
   modprobe foo
   echo nop > current_tracer
 
 This causes the ftrace internal accounting to get screwed up and
 crashes ftrace, preventing the user from using the function tracer
 after that.
 
 2) if a TRACE_EVENT has a string field, and NULL is given for it.
 
 The internal trace event code does a strlen() and strcpy() on the
 source of field. If it is NULL it causes the system to oops.
 
 This bug has been there since 2.6.31, but no TRACE_EVENT ever passed
 in a NULL to the string field, until now.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSlUw+AAoJEKQekfcNnQGugTYIAJQ7Zfhor2Jrw7XzkcBDpQv9
 kqL/NvjLfyA49BLwba0VJCqJA56dEfW7kaSa7Wx0qAHdHKATLDA8G4c9FdHRAmZf
 WJ4jDbrcJqc7DA2vEn4aUuczwvTMx0H1KJHPMAu9taEno3YocIzCMxkuNYelwAz2
 XUkUGtR7olF85pyVccfZLKnKPtslSwxWoG6WgEqiAap6fIorPlcSXBVYFqLKVTRJ
 P2e847eqxMF5ACLmv3dWiEvTPtWY91abN1zpeJYQNjBtQJmzVlvRcXYE6TwPUIFg
 RtB9n3SrT+0lEWvcxDbQi+4hKHf+JQLkGaYwWCMJihbdF4sh36olzpUOimKVqsk=
 =+9+v
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-v3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This includes two fixes.

  1) is a bug fix that happens when root does the following:

     echo function_graph > current_tracer
     modprobe foo
     echo nop > current_tracer

   This causes the ftrace internal accounting to get screwed up and
   crashes ftrace, preventing the user from using the function tracer
   after that.

  2) if a TRACE_EVENT has a string field, and NULL is given for it.

   The internal trace event code does a strlen() and strcpy() on the
   source of field.  If it is NULL it causes the system to oops.

   This bug has been there since 2.6.31, but no TRACE_EVENT ever passed
   in a NULL to the string field, until now"

* tag 'trace-fixes-v3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix function graph with loading of modules
  tracing: Allow events to have NULL strings
2013-11-26 18:04:21 -08:00
Steven Rostedt (Red Hat)
8a56d7761d ftrace: Fix function graph with loading of modules
Commit 8c4f3c3fa9 "ftrace: Check module functions being traced on reload"
fixed module loading and unloading with respect to function tracing, but
it missed the function graph tracer. If you perform the following

 # cd /sys/kernel/debug/tracing
 # echo function_graph > current_tracer
 # modprobe nfsd
 # echo nop > current_tracer

You'll get the following oops message:

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
 Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
 CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
  0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
  ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
 Call Trace:
  [<ffffffff814fe193>] dump_stack+0x4f/0x7c
  [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b
  [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
  [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c
  [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9
  [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364
  [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b
  [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78
  [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10
  [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a
  [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd
  [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde
  [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e
  [<ffffffff81122aed>] vfs_write+0xab/0xf6
  [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
  [<ffffffff81122dbd>] SyS_write+0x59/0x82
  [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
  [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b
 ---[ end trace 940358030751eafb ]---

The above mentioned commit didn't go far enough. Well, it covered the
function tracer by adding checks in __register_ftrace_function(). The
problem is that the function graph tracer circumvents that (for a slight
efficiency gain when function graph trace is running with a function
tracer. The gain was not worth this).

The problem came with ftrace_startup() which should always be called after
__register_ftrace_function(), if you want this bug to be completely fixed.

Anyway, this solution moves __register_ftrace_function() inside of
ftrace_startup() and removes the need to call them both.

Reported-by: Dave Wysochanski <dwysocha@redhat.com>
Fixes: ed926f9b35 ("ftrace: Use counters to enable functions to trace")
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-11-26 10:36:50 -05:00
Bjorn Helgaas
12997d1a99 Revert "workqueue: allow work_on_cpu() to be called recursively"
This reverts commit c2fda50966.

c2fda50966 removed lockdep annotation from work_on_cpu() to work around
the PCI path that calls work_on_cpu() from within a work_on_cpu() work item
(PF driver .probe() method -> pci_enable_sriov() -> add VFs -> VF driver
.probe method).

961da7fb6b22 ("PCI: Avoid unnecessary CPU switch when calling driver
.probe() method) avoids that recursive work_on_cpu() use in a different
way, so this revert restores the work_on_cpu() lockdep annotation.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
2013-11-25 14:37:22 -07:00