Commit graph

197 commits

Author SHA1 Message Date
Peter Zijlstra
0b9d46fc5e sched: Rename task_running() to task_on_cpu()
There is some ambiguity about task_running() in that it is unrelated
to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing
task_on_cpu().

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net
2022-09-07 21:53:47 +02:00
Ingo Molnar
53aa930dc4 Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() conversion commit
Merge in the BUG_ON() => WARN_ON_ONCE() conversion commit.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-08-30 10:28:15 +02:00
Shang XiaoJing
5531ecffa4 sched: Add update_current_exec_runtime helper
Wrap repeated code in helper function update_current_exec_runtime for
update the exec time of the current.

Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220824082856.15674-1-shangxiaojing@huawei.com
2022-08-27 00:05:35 +02:00
Ingo Molnar
09348d75a6 sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()
There's no good reason to crash a user's system with a BUG_ON(),
chances are high that they'll never even see the crash message on
Xorg, and it won't make it into the syslog either.

By using a WARN_ON_ONCE() we at least give the user a chance to report
any bugs triggered here - instead of getting silent hangs.

None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore
cases where a NULL check is done via a BUG_ON() and we let a NULL
pointer through after a WARN_ON_ONCE().

There's one exception: WARN_ON_ONCE() arguments with side-effects,
such as locking - in this case we use the return value of the
WARN_ON_ONCE(), such as in:

 -       BUG_ON(!lock_task_sighand(p, &flags));
 +       if (WARN_ON_ONCE(!lock_task_sighand(p, &flags)))
 +               return;

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
2022-08-12 11:25:10 +02:00
Dietmar Eggemann
740cf8a760 sched/core: Introduce sched_asym_cpucap_active()
Create an inline helper for conditional code to be only executed on
asymmetric CPU capacity systems. This makes these (currently ~10 and
future) conditions a lot more readable.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com
2022-08-02 12:32:45 +02:00
Nicolas Saenz Julienne
5c66d1b9b3 nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
called sched_update_tick_dependency() preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:

  dequeue_task_rt()
    dequeue_rt_entity()
      dequeue_rt_stack()
        dequeue_top_rt_rq()
	  sub_nr_running()	// decrements rq->nr_running
	    sched_update_tick_dependency()
	      sched_can_stop_tick()	// checks rq->rt.rt_nr_running,
	      ...
        __dequeue_rt_entity()
          dec_rt_tasks()	// decrements rq->rt.rt_nr_running
	  ...

Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.

Fixes: 76d92ac305 ("sched: Migrate sched to use new tick dependency mask model")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
2022-07-21 10:39:38 +02:00
Linus Torvalds
44d35720c9 sysctl changes for v5.19-rc1
For two kernel releases now kernel/sysctl.c has been being cleaned up
 slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and
 all this caused merge conflicts with one susbystem or another.
 
 This tree was put together to help try to avoid conflicts with these cleanups
 going on different trees at time. So nothing exciting on this pull request,
 just cleanups.
 
 I actually had this sysctl-next tree up since v5.18 but I missed sending a
 pull request for it on time during the last merge window. And so these changes
 have been being soaking up on sysctl-next and so linux-next for a while.
 The last change was merged May 4th.
 
 Most of the compile issues were reported by 0day and fixed.
 
 To help avoid a conflict with bpf folks at Daniel Borkmann's request
 I merged bpf-next/pr/bpf-sysctl into sysctl-next to get the effor which
 moves the BPF sysctls from kernel/sysctl.c to BPF core.
 
 Possible merge conflicts and known resolutions as per linux-next:
 
 bfp:
 https://lkml.kernel.org/r/20220414112812.652190b5@canb.auug.org.au
 
 rcu:
 https://lkml.kernel.org/r/20220420153746.4790d532@canb.auug.org.au
 
 powerpc:
 https://lkml.kernel.org/r/20220520154055.7f964b76@canb.auug.org.au
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmKOq8ASHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinDAkQAJVo5YVM9f74UwYp4PQhTpjxJBCjRoZD
 z1u9bp5rMj2ujTC8Fr7VmzKaHrb8+r1C1WvCvZtIzemYNB4lZUrHpVDYfXuXiPRB
 ihPmEjhlPO5PFBx6cVCpI3cu9bEhG00rLc1QXnABx/pXwNPcOTJAGZJVamZvqubk
 chjgZrb7N+adHPfvS55v1+zpwdeKfpp5U3zuu5qlT/nn0GS0HCVzOj5fj4oC4wtJ
 IqfUubo+FX50Ga58yQABWNrjaPD9Crykz5ohVazy3ElQl0hJ4VsK65ct3blqc2vz
 1Bb8kPpWuv6aZ5nr1lCVE8qvF4ZIL33ySvpg5BSdWLQEDrBbSpzvJe9Yn7wgR+eq
 y7fhpO24+zRM82EoDMEvyxX9u1n1RsvoXRtf3ds9BGf63MUxk8a1cgjlU6vuyO2U
 JhDmfM1xzdKvPoY4COOnHzcAiIqzItTqKd09N5y0cahmYstROU8lvp9huhTAHqk1
 SjQMbLIZG7OnX8ZeQcR1EB8sq/IOPZT48ejj0iJmQ8FyMaep71MOQLYyLPAq4lgh
 JHXm8P6QdB57jfJbqAeNSyZoK0qdxOUR/83Zcah7Jjns6vkju1DNatEsaEEI2y2M
 4n7/rkHeZ3TyFHBUX4e9FomKvGLsAalDBRiqsuxLSOPMU8rGrNLAslOAtKwvp90X
 4ht3M2VP098l
 =btwh
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "For two kernel releases now kernel/sysctl.c has been being cleaned up
  slowly, since the tables were grossly long, sprinkled with tons of
  #ifdefs and all this caused merge conflicts with one susbystem or
  another.

  This tree was put together to help try to avoid conflicts with these
  cleanups going on different trees at time. So nothing exciting on this
  pull request, just cleanups.

  Thanks a lot to the Uniontech and Huawei folks for doing some of this
  nasty work"

* tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits)
  sched: Fix build warning without CONFIG_SYSCTL
  reboot: Fix build warning without CONFIG_SYSCTL
  kernel/kexec_core: move kexec_core sysctls into its own file
  sysctl: minor cleanup in new_dir()
  ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n
  fs/proc: Introduce list_for_each_table_entry for proc sysctl
  mm: fix unused variable kernel warning when SYSCTL=n
  latencytop: move sysctl to its own file
  ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y
  ftrace: Fix build warning
  ftrace: move sysctl_ftrace_enabled to ftrace.c
  kernel/do_mount_initrd: move real_root_dev sysctls to its own file
  kernel/delayacct: move delayacct sysctls to its own file
  kernel/acct: move acct sysctls to its own file
  kernel/panic: move panic sysctls to its own file
  kernel/lockdep: move lockdep sysctls to its own file
  mm: move page-writeback sysctls to their own file
  mm: move oom_kill sysctls to their own file
  kernel/reboot: move reboot sysctls to its own file
  sched: Move energy_aware sysctls to topology.c
  ...
2022-05-26 16:57:20 -07:00
Hao Jia
2679a83731 sched/core: Avoid obvious double update_rq_clock warning
When we use raw_spin_rq_lock() to acquire the rq lock and have to
update the rq clock while holding the lock, the kernel may issue
a WARN_DOUBLE_CLOCK warning.

Since we directly use raw_spin_rq_lock() to acquire rq lock instead of
rq_lock(), there is no corresponding change to rq->clock_update_flags.
In particular, we have obtained the rq lock of other CPUs, the
rq->clock_update_flags of this CPU may be RQCF_UPDATED at this time, and
then calling update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning.

So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid
the WARN_DOUBLE_CLOCK warning.

For the sched_rt_period_timer() and migrate_task_rq_dl() cases
we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with
rq_lock()/rq_unlock().

For the {pull,push}_{rt,dl}_task() cases, we add the
double_rq_clock_clear_update() function to clear RQCF_UPDATED of
rq->clock_update_flags, and call double_rq_clock_clear_update()
before double_lock_balance()/double_rq_lock() returns to avoid the
WARN_DOUBLE_CLOCK warning.

Some call trace reports:
Call Trace 1:
 <IRQ>
 sched_rt_period_timer+0x10f/0x3a0
 ? enqueue_top_rt_rq+0x110/0x110
 __hrtimer_run_queues+0x1a9/0x490
 hrtimer_interrupt+0x10b/0x240
 __sysvec_apic_timer_interrupt+0x8a/0x250
 sysvec_apic_timer_interrupt+0x9a/0xd0
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x12/0x20

Call Trace 2:
 <TASK>
 activate_task+0x8b/0x110
 push_rt_task.part.108+0x241/0x2c0
 push_rt_tasks+0x15/0x30
 finish_task_switch+0xaa/0x2e0
 ? __switch_to+0x134/0x420
 __schedule+0x343/0x8e0
 ? hrtimer_start_range_ns+0x101/0x340
 schedule+0x4e/0xb0
 do_nanosleep+0x8e/0x160
 hrtimer_nanosleep+0x89/0x120
 ? hrtimer_init_sleeper+0x90/0x90
 __x64_sys_nanosleep+0x96/0xd0
 do_syscall_64+0x34/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Call Trace 3:
 <TASK>
 deactivate_task+0x93/0xe0
 pull_rt_task+0x33e/0x400
 balance_rt+0x7e/0x90
 __schedule+0x62f/0x8e0
 do_task_dead+0x3f/0x50
 do_exit+0x7b8/0xbb0
 do_group_exit+0x2d/0x90
 get_signal+0x9df/0x9e0
 ? preempt_count_add+0x56/0xa0
 ? __remove_hrtimer+0x35/0x70
 arch_do_signal_or_restart+0x36/0x720
 ? nanosleep_copyout+0x39/0x50
 ? do_nanosleep+0x131/0x160
 ? audit_filter_inodes+0xf5/0x120
 exit_to_user_mode_prepare+0x10f/0x1e0
 syscall_exit_to_user_mode+0x17/0x30
 do_syscall_64+0x40/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Call Trace 4:
 update_rq_clock+0x128/0x1a0
 migrate_task_rq_dl+0xec/0x310
 set_task_cpu+0x84/0x1e4
 try_to_wake_up+0x1d8/0x5c0
 wake_up_process+0x1c/0x30
 hrtimer_wakeup+0x24/0x3c
 __hrtimer_run_queues+0x114/0x270
 hrtimer_interrupt+0xe8/0x244
 arch_timer_handler_phys+0x30/0x50
 handle_percpu_devid_irq+0x88/0x140
 generic_handle_domain_irq+0x40/0x60
 gic_handle_irq+0x48/0xe0
 call_on_irq_stack+0x2c/0x60
 do_interrupt_handler+0x80/0x84

Steps to reproduce:
1. Enable CONFIG_SCHED_DEBUG when compiling the kernel
2. echo 1 > /sys/kernel/debug/clear_warn_once
   echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features
   echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features
3. Run some rt/dl tasks that periodically work and sleep, e.g.
Create 2*n rt or dl (90% running) tasks via rt-app (on a system
with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running
on PREEMPT_RT kernel.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com
2022-05-11 16:27:11 +02:00
Baisong Zhong
28f152cd09 sched/rt: fix build error when CONFIG_SYSCTL is disable
Avoid random build errors which do not select
CONFIG_SYSCTL by depending on it in Kconfig.

This fixes the following warning:

In file included from kernel/sched/build_policy.c:43:
At top level:
kernel/sched/rt.c:3017:12: error: ‘sched_rr_handler’ defined but not used [-Werror=unused-function]
 3017 | static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
      |            ^~~~~~~~~~~~~~~~
kernel/sched/rt.c:2978:12: error: ‘sched_rt_handler’ defined but not used [-Werror=unused-function]
 2978 | static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
      |            ^~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:310: kernel/sched/build_policy.o] Error 1
make[1]: *** [scripts/Makefile.build:638: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
[mcgrof: small build fix, we need sched_rt_can_attach() even
 when CONFIG_SYSCTL is disabled]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
dafd7a9dad sched: Move rr_timeslice sysctls to rt.c
move rr_timeslice sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Zhen Ni
d9ab0e63fa sched: Move rt_period/runtime sysctls to rt.c
move rt_period/runtime sysctls to rt.c and use the new
register_sysctl_init() to register the sysctl interface.

Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-04-06 13:43:43 -07:00
Ingo Molnar
ccacfe56d7 Merge branch 'sched/fast-headers' into sched/core
Merge the scheduler build speedup of the fast-headers tree.

Cumulative scheduler (kernel/sched/) build time speedup on a
Linux distribution's config, which enables all scheduler features,
compared to the vanilla kernel:

      _____________________________________________________________________________
     |
     |  Vanilla kernel (v5.13-rc7):
     |_____________________________________________________________________________
     |
     |  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
     |
     |   126,975,564,374      instructions              #    1.45  insn per cycle           ( +-  0.00% )
     |    87,637,847,671      cycles                    #    3.959 GHz                      ( +-  0.30% )
     |         22,136.96 msec cpu-clock                 #    7.499 CPUs utilized            ( +-  0.29% )
     |
     |            2.9520 +- 0.0169 seconds time elapsed  ( +-  0.57% )
     |_____________________________________________________________________________
     |
     |  Patched kernel:
     |_____________________________________________________________________________
     |
     | Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
     |
     |    50,420,496,914      instructions              #    1.47  insn per cycle           ( +-  0.00% )
     |    34,234,322,038      cycles                    #    3.946 GHz                      ( +-  0.31% )
     |          8,675.81 msec cpu-clock                 #    3.053 CPUs utilized            ( +-  0.45% )
     |
     |            2.8420 +- 0.0181 seconds time elapsed  ( +-  0.64% )
     |_____________________________________________________________________________

    Summary:

      - CPU time used to build the scheduler dropped by -60.9%, a reduction
        from 22.1 clock-seconds to 8.7 clock-seconds.

      - Wall-clock time to build the scheduler dropped by -3.9%, a reduction
        from 2.95 seconds to 2.84 seconds.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15 09:05:05 +01:00
Dietmar Eggemann
821aecd09e sched/deadline,rt: Remove unused parameter from pick_next_[rt|dl]_entity()
The `struct rq *rq` parameter isn't used. Remove it.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220302183433.333029-7-dietmar.eggemann@arm.com
2022-03-08 16:08:40 +01:00
Dietmar Eggemann
71d29747b0 sched/deadline,rt: Remove unused functions for !CONFIG_SMP
The need_pull_[rt|dl]_task() and pull_[rt|dl]_task() functions are not
used on a !CONFIG_SMP system. Remove them.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220302183433.333029-6-dietmar.eggemann@arm.com
2022-03-08 16:08:39 +01:00
Valentin Schneider
49bef33e4b sched/rt: Plug rt_mutex_setprio() vs push_rt_task() race
John reported that push_rt_task() can end up invoking
find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS
one), which causes mayhem down convert_prio().

This can happen when current gets demoted to e.g. CFS when releasing an
rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before
getting the chance to reschedule. Exactly who triggers this work isn't
entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task()
if there are no RT tasks on the local RQ, which means the local CPU can't
be in the rto_mask.

My current suspected sequence is something along the lines of the below,
with the demoted task being current.

  mark_wakeup_next_waiter()
    rt_mutex_adjust_prio()
      rt_mutex_setprio() // deboost originally-CFS task
	check_class_changed()
	  switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running
	  switched_to_fair() // Sets need_resched
      __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above
      raw_spin_rq_unlock(rq)

       // need_resched is set, so task_woken_rt() can't
       // invoke push_rt_tasks(). Best I can come up with is
       // local CPU has rt_nr_migratory >= 2 after the demotion, so stays
       // in the rto_mask, and then:

       <some other CPU running rto_push_irq_work_func() queues rto_push_work on this CPU>
	 push_rt_task()
	   // breakage follows here as rq->curr is CFS

Move an existing check to check rq->curr vs the next pushable task's
priority before getting anywhere near find_lowest_rq(). While at it, add an
explicit sched_class of rq->curr check prior to invoking
find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless
of next_task's migratability.

Fixes: a7c81556ec ("sched: Fix migrate_disable() vs rt/dl balancing")
Reported-by: John Keeping <john@metanate.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: John Keeping <john@metanate.com>
Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com
2022-03-01 16:18:38 +01:00
Ingo Molnar
f96eca4320 sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there
Similarly to kernel/sched/build_utility.c, collect all 'scheduling policy' related
source code files into kernel/sched/build_policy.c:

    kernel/sched/idle.c

    kernel/sched/rt.c

    kernel/sched/cpudeadline.c
    kernel/sched/pelt.c

    kernel/sched/cputime.c
    kernel/sched/deadline.c

With the exception of fair.c, which we continue to build as a separate file
for build efficiency and parallelism reasons.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
2022-02-23 10:58:33 +01:00
Li Hua
9b58e976b3 sched/rt: Try to restart rt period timer when rt runtime exceeded
When rt_runtime is modified from -1 to a valid control value, it may
cause the task to be throttled all the time. Operations like the following
will trigger the bug. E.g:

  1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us
  2. Run a FIFO task named A that executes while(1)
  3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

When rt_runtime is -1, The rt period timer will not be activated when task
A enqueued. And then the task will be throttled after setting rt_runtime to
950,000. The task will always be throttled because the rt period timer is
not activated.

Fixes: d0b27fa778 ("sched: rt-group: synchonised bandwidth period")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com
2021-12-07 15:14:10 +01:00
Mathias Krause
b027789e5e sched/fair: Prevent dead task groups from regaining cfs_rq's
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we've
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.

Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.

When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.

To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:

    CPU1:                                      CPU2:
      :                                        timer IRQ:
      :                                          do_sched_cfs_period_timer():
      :                                            :
      :                                            distribute_cfs_runtime():
      :                                              rcu_read_lock();
      :                                              :
      :                                              unthrottle_cfs_rq():
    sched_offline_group():                             :
      :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
      list_del_rcu(&tg->list);                           :
 (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
      :                                                    :
 (2)  list_del_rcu(&tg->siblings);                         :
      :                                                    tg_unthrottle_up():
      unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
        :                                                    :
        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
        :                                                    :
        :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
 (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
      :                                                      :
      :                                                    :
      :                                                  :
      :                                                :
      :                                              :
 (4)  :                                              rcu_read_unlock();

CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.

To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.

On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.

This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.

  [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
  [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-11-11 13:09:33 +01:00
Yafang Shao
57a5c2dafc sched/rt: Support schedstats for RT sched class
We want to measure the latency of RT tasks in our production
environment with schedstats facility, but currently schedstats is only
supported for fair sched class. This patch enable it for RT sched class
as well.

After we make the struct sched_statistics and the helpers of it
independent of fair sched class, we can easily use the schedstats
facility for RT sched class.

The schedstat usage in RT sched class is similar with fair sched class,
for example,
                fair                        RT
enqueue         update_stats_enqueue_fair   update_stats_enqueue_rt
dequeue         update_stats_dequeue_fair   update_stats_dequeue_rt
put_prev_task   update_stats_wait_start     update_stats_wait_start_rt
set_next_task   update_stats_wait_end       update_stats_wait_end_rt

The user can get the schedstats information in the same way in fair sched
class. For example,
       fair                            RT
       /proc/[pid]/sched               /proc/[pid]/sched

schedstats is not supported for RT group.

The output of a RT task's schedstats as follows,
$ cat /proc/10349/sched
...
sum_sleep_runtime                            :           972.434535
sum_block_runtime                            :           960.433522
wait_start                                   :        188510.871584
sleep_start                                  :             0.000000
block_start                                  :             0.000000
sleep_max                                    :            12.001013
block_max                                    :           952.660622
exec_max                                     :             0.049629
slice_max                                    :             0.000000
wait_max                                     :             0.018538
wait_sum                                     :             0.424340
wait_count                                   :                   49
iowait_sum                                   :           956.495640
iowait_count                                 :                   24
nr_migrations_cold                           :                    0
nr_failed_migrations_affine                  :                    0
nr_failed_migrations_running                 :                    0
nr_failed_migrations_hot                     :                    0
nr_forced_migrations                         :                    0
nr_wakeups                                   :                   49
nr_wakeups_sync                              :                    0
nr_wakeups_migrate                           :                    0
nr_wakeups_local                             :                   49
nr_wakeups_remote                            :                    0
nr_wakeups_affine                            :                    0
nr_wakeups_affine_attempts                   :                    0
nr_wakeups_passive                           :                    0
nr_wakeups_idle                              :                    0
...

The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace RT tasks as well. The output of these tracepoints for a
RT tasks as follows,

- runtime
          stress-10352   [004] d.h.  1035.382286: sched_stat_runtime: comm=stress pid=10352 runtime=995769 [ns] vruntime=0 [ns]
          [vruntime=0 means it is a RT task]

- wait
          <idle>-0       [004] dN..  1227.688544: sched_stat_wait: comm=stress pid=10352 delay=46849882 [ns]

- blocked
     kworker/4:1-465     [004] dN..  1585.676371: sched_stat_blocked: comm=stress pid=17194 delay=189963 [ns]

- iowait
     kworker/4:1-465     [004] dN..  1585.675330: sched_stat_iowait: comm=stress pid=17189 delay=182848 [ns]

- sleep
           sleep-18194   [023] dN..  1780.891840: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001160770 [ns]
           sleep-18196   [023] dN..  1781.893208: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001161970 [ns]
           sleep-18197   [023] dN..  1782.894544: sched_stat_sleep: comm=sleep.sh pid=17767 delay=1001128840 [ns]
           [ In sleep.sh, it sleeps 1 sec each time. ]

[lkp@intel.com: reported build failure in earlier version]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-7-laoar.shao@gmail.com
2021-10-05 15:51:51 +02:00
Yafang Shao
ed7b564cfd sched/rt: Support sched_stat_runtime tracepoint for RT sched class
The runtime of a RT task has already been there, so we only need to
add a tracepoint.

One difference between fair task and RT task is that there is no vruntime
in RT task. To reuse the sched_stat_runtime tracepoint, '0' is passed as
vruntime for RT task.

The output of this tracepoint for RT task as follows,
          stress-9748    [039] d.h.   113.519352: sched_stat_runtime: comm=stress pid=9748 runtime=997573 [ns] vruntime=0 [ns]
          stress-9748    [039] d.h.   113.520352: sched_stat_runtime: comm=stress pid=9748 runtime=997627 [ns] vruntime=0 [ns]
          stress-9748    [039] d.h.   113.521352: sched_stat_runtime: comm=stress pid=9748 runtime=998203 [ns] vruntime=0 [ns]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-6-laoar.shao@gmail.com
2021-10-05 15:51:49 +02:00
Yafang Shao
ceeadb83ae sched: Make struct sched_statistics independent of fair sched class
If we want to use the schedstats facility to trace other sched classes, we
should make it independent of fair sched class. The struct sched_statistics
is the schedular statistics of a task_struct or a task_group. So we can
move it into struct task_struct and struct task_group to achieve the goal.

After the patch, schestats are orgnized as follows,

    struct task_struct {
       ...
       struct sched_entity se;
       struct sched_rt_entity rt;
       struct sched_dl_entity dl;
       ...
       struct sched_statistics stats;
       ...
   };

Regarding the task group, schedstats is only supported for fair group
sched, and a new struct sched_entity_stats is introduced, suggested by
Peter -

    struct sched_entity_stats {
        struct sched_entity     se;
        struct sched_statistics stats;
    } __no_randomize_layout;

Then with the se in a task_group, we can easily get the stats.

The sched_statistics members may be frequently modified when schedstats is
enabled, in order to avoid impacting on random data which may in the same
cacheline with them, the struct sched_statistics is defined as cacheline
aligned.

As this patch changes the core struct of scheduler, so I verified the
performance it may impact on the scheduler with 'perf bench sched
pipe', suggested by Mel. Below is the result, in which all the values
are in usecs/op.
                                  Before               After
      kernel.sched_schedstats=0  5.2~5.4               5.2~5.4
      kernel.sched_schedstats=1  5.3~5.5               5.3~5.5
[These data is a little difference with the earlier version, that is
 because my old test machine is destroyed so I have to use a new
 different test machine.]

Almost no impact on the sched performance.

No functional change.

[lkp@intel.com: reported build failure in earlier version]

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
2021-10-05 15:51:45 +02:00
Vincent Donnefort
fecfcbc288 sched/rt: Fix RT utilization tracking during policy change
RT keeps track of the utilization on a per-rq basis with the structure
avg_rt. This utilization is updated during task_tick_rt(),
put_prev_task_rt() and set_next_task_rt(). However, when the current
running task changes its policy, set_next_task_rt() which would usually
take care of updating the utilization when the rq starts running RT tasks,
will not see a such change, leaving the avg_rt structure outdated. When
that very same task will be dequeued later, put_prev_task_rt() will then
update the utilization, based on a wrong last_update_time, leading to a
huge spike in the RT utilization signal.

The signal would eventually recover from this issue after few ms. Even if
no RT tasks are run, avg_rt is also updated in __update_blocked_others().
But as the CPU capacity depends partly on the avg_rt, this issue has
nonetheless a significant impact on the scheduler.

Fix this issue by ensuring a load update when a running task changes
its policy to RT.

Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/1624271872-211872-2-git-send-email-vincent.donnefort@arm.com
2021-06-22 16:41:59 +02:00
Peter Zijlstra
21f56ffe44 sched: Introduce sched_class::pick_task()
Because sched_class::pick_next_task() also implies
sched_class::set_next_task() (and possibly put_prev_task() and
newidle_balance) it is not state invariant. This makes it unsuitable
for remote task selection.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Vineeth: folded fixes]
Signed-off-by: Vineeth Remanan Pillai <viremana@linux.microsoft.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.437092775@infradead.org
2021-05-12 11:43:28 +02:00
Peter Zijlstra
5cb9eaa3d2 sched: Wrap rq::lock access
In preparation of playing games with rq->lock, abstract the thing
using an accessor.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
2021-05-12 11:43:26 +02:00
Ingo Molnar
3b03706fa6 sched: Fix various typos
Fix ~42 single-word typos in scheduler code comments.

We have accumulated a few fun ones over the years. :-)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org
2021-03-22 00:11:52 +01:00
Hui Su
65bcf072e2 sched: Use task_current() instead of 'rq->curr == p'
Use the task_current() function where appropriate.

No functional change.

Signed-off-by: Hui Su <sh_def@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20201030173223.GA52339@rlk
2021-01-14 11:20:11 +01:00
Valentin Schneider
3aef1551e9 sched: Remove select_task_rq()'s sd_flag parameter
Only select_task_rq_fair() uses that parameter to do an actual domain
search, other classes only care about what kind of wakeup is happening
(fork, exec, or "regular") and thus just translate the flag into a wakeup
type.

WF_TTWU and WF_EXEC have just been added, use these along with WF_FORK to
encode the wakeup types we care about. For select_task_rq_fair(), we can
simply use the shiny new WF_flag : SD_flag mapping.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201102184514.2733-3-valentin.schneider@arm.com
2020-11-10 18:39:06 +01:00
Peter Zijlstra
12fa97c64d Merge branch 'sched/migrate-disable' 2020-11-10 18:39:04 +01:00
Peter Zijlstra
a7c81556ec sched: Fix migrate_disable() vs rt/dl balancing
In order to minimize the interference of migrate_disable() on lower
priority tasks, which can be deprived of runtime due to being stuck
below a higher priority task. Teach the RT/DL balancers to push away
these higher priority tasks when a lower priority task gets selected
to run on a freshly demoted CPU (pull).

This adds migration interference to the higher priority task, but
restores bandwidth to system that would otherwise be irrevocably lost.
Without this it would be possible to have all tasks on the system
stuck on a single CPU, each task preempted in a migrate_disable()
section with a single high priority task running.

This way we can still approximate running the M highest priority tasks
on the system.

Migrating the top task away is (ofcourse) still subject to
migrate_disable() too, which means the lower task is subject to an
interference equivalent to the worst case migrate_disable() section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
2020-11-10 18:39:01 +01:00
Peter Zijlstra
95158a89dd sched,rt: Use the full cpumask for balancing
We want migrate_disable() tasks to get PULLs in order for them to PUSH
away the higher priority task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.310519774@infradead.org
2020-11-10 18:39:00 +01:00
Peter Zijlstra
14e292f8d4 sched,rt: Use cpumask_any*_distribute()
Replace a bunch of cpumask_any*() instances with
cpumask_any*_distribute(), by injecting this little bit of random in
cpu selection, we reduce the chance two competing balance operations
working off the same lowest_mask pick the same CPU.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.190759694@infradead.org
2020-11-10 18:39:00 +01:00
Peter Zijlstra
120455c514 sched: Fix hotplug vs CPU bandwidth control
Since we now migrate tasks away before DYING, we should also move
bandwidth unthrottle, otherwise we can gain tasks from unthrottle
after we expect all tasks to be gone already.

Also; it looks like the RT balancers don't respect cpu_active() and
instead rely on rq->online in part, complete this. This too requires
we do set_rq_offline() earlier to match the cpu_active() semantics.
(The bigger patch is to convert RT to cpu_active() entirely)

Since set_rq_online() is called from sched_cpu_activate(), place
set_rq_offline() in sched_cpu_deactivate().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.639538965@infradead.org
2020-11-10 18:38:59 +01:00
Peter Zijlstra
43c31ac0e6 sched: Remove relyance on STRUCT_ALIGNMENT
Florian reported that all of kernel/sched/ is rebuild when
CONFIG_BLK_DEV_INITRD is changed, which, while not a bug is
unexpected. This is due to us including vmlinux.lds.h.

Jakub explained that the problem is that we put the alignment
requirement on the type instead of on a variable. Type alignment is a
minimum, the compiler is free to pick any larger alignment for a
specific instance of the type (eg. the variable).

So force the type alignment on all individual variable definitions and
remove the undesired dependency on vmlinux.lds.h.

Fixes: 85c2ce9104 ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 11:00:32 +01:00
Peter Zijlstra
934fc3314b sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1
This makes the mapping continuous and frees up 100 for other usage.

Prev mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                               99        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29 11:00:30 +01:00
Joe Perches
33def8498f treewide: Convert macro and uses of __section(foo) to __section("foo")
Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 14:51:49 -07:00
Steven Rostedt (VMware)
a87e749e8f sched: Remove struct sched_class::next field
Now that the sched_class descriptors are defined in order via the linker
script vmlinux.lds.h, there's no reason to have a "next" pointer to the
previous priroity structure. The order of the sturctures can be aligned as
an array, and used to index and find the next sched_class descriptor.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191219214558.845353593@goodmis.org
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
590d697963 sched: Force the address order of each sched class descriptor
In order to make a micro optimization in pick_next_task(), the order of the
sched class descriptor address must be in the same order as their priority
to each other. That is:

 &idle_sched_class < &fair_sched_class < &rt_sched_class <
 &dl_sched_class < &stop_sched_class

In order to guarantee this order of the sched class descriptors, add each
one into their own data section and force the order in the linker script.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/157675913272.349305.8936736338884044103.stgit@localhost.localdomain
2020-06-25 13:45:43 +02:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Huaixin Chang
d505b8af58 sched: Defend cfs and rt bandwidth quota against overflow
When users write some huge number into cpu.cfs_quota_us or
cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of
schedulable checks.

to_ratio() could be altered to avoid unnecessary internal overflow, but
min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
prevent overflow.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-05-19 20:34:14 +02:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Qais Yousef
d94a9df490 sched/rt: Remove unnecessary push for unfit tasks
In task_woken_rt() and switched_to_rto() we try trigger push-pull if the
task is unfit.

But the logic is found lacking because if the task was the only one
running on the CPU, then rt_rq is not in overloaded state and won't
trigger a push.

The necessity of this logic was under a debate as well, a summary of
the discussion can be found in the following thread:

  https://lore.kernel.org/lkml/20200226160247.iqvdakiqbakk2llz@e107158-lin.cambridge.arm.com/

Remove the logic for now until a better approach is agreed upon.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-6-qais.yousef@arm.com
2020-03-06 12:57:29 +01:00
Qais Yousef
98ca645f82 sched/rt: Allow pulling unfitting task
When implemented RT Capacity Awareness; the logic was done such that if
a task was running on a fitting CPU, then it was sticky and we would try
our best to keep it there.

But as Steve suggested, to adhere to the strict priority rules of RT
class; allow pulling an RT task to unfitting CPU to ensure it gets a
chance to run ASAP.

LINK: https://lore.kernel.org/lkml/20200203111451.0d1da58f@oasis.local.home/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-5-qais.yousef@arm.com
2020-03-06 12:57:28 +01:00
Qais Yousef
a1bd02e1f2 sched/rt: Optimize cpupri_find() on non-heterogenous systems
By introducing a new cpupri_find_fitness() function that takes the
fitness_fn as an argument and only called when asym_system static key is
enabled.

cpupri_find() is now a wrapper function that calls cpupri_find_fitness()
passing NULL as a fitness_fn, hence disabling the logic that handles
fitness by default.

LINK: https://lore.kernel.org/lkml/c0772fca-0a4b-c88d-fdf2-5715fcf8447b@arm.com/
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-4-qais.yousef@arm.com
2020-03-06 12:57:27 +01:00
Qais Yousef
b28bc1e002 sched/rt: Re-instate old behavior in select_task_rq_rt()
When RT Capacity Aware support was added, the logic in select_task_rq_rt
was modified to force a search for a fitting CPU if the task currently
doesn't run on one.

But if the search failed, and the search was only triggered to fulfill
the fitness request; we could end up selecting a new CPU unnecessarily.

Fix this and re-instate the original behavior by ensuring we bail out
in that case.

This behavior change only affected asymmetric systems that are using
util_clamp to implement capacity aware. None asymmetric systems weren't
affected.

LINK: https://lore.kernel.org/lkml/20200218041620.GD28029@codeaurora.org/
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-3-qais.yousef@arm.com
2020-03-06 12:57:27 +01:00
Konstantin Khlebnikov
b4fb015eef sched/rt: Optimize checking group RT scheduler constraints
Group RT scheduler contains protection against setting zero runtime for
cgroup with RT tasks. Right now function tg_set_rt_bandwidth() iterates
over all CPU cgroups and calls tg_has_rt_tasks() for any cgroup which
runtime is zero (not only for changed one). Default RT runtime is zero,
thus tg_has_rt_tasks() will is called for almost at CPU cgroups.

This protection already is slightly racy: runtime limit could be changed
between cpu_cgroup_can_attach() and cpu_cgroup_attach() because changing
cgroup attribute does not lock cgroup_mutex while attach does not lock
rt_constraints_mutex. Changing task scheduler class also races with
changing rt runtime: check in __sched_setscheduler() isn't protected.

Function tg_has_rt_tasks() iterates over all threads in the system.
This gives NR_CGROUPS * NR_TASKS operations under single tasklist_lock
locked for read tg_set_rt_bandwidth(). Any concurrent attempt of locking
tasklist_lock for write (for example fork) will stuck with disabled irqs.

This patch makes two optimizations:
1) Remove locking tasklist_lock and iterate only tasks in cgroup
2) Call tg_has_rt_tasks() iff rt runtime changes from non-zero to zero

All changed code is under CONFIG_RT_GROUP_SCHED.

Testcase:

 # mkdir /sys/fs/cgroup/cpu/test{1..10000}
 # echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us

At the same time without patch fork time will be >100ms:

 # perf trace -e clone --duration 100 stress-ng --fork 1

Also remote ping will show timings >100ms caused by irq latency.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz
2020-01-28 21:37:09 +01:00
Qais Yousef
804d402fb6 sched/rt: Make RT capacity-aware
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.

In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.

The definition of task.requirement is dependent on the scheduling class.

For CFS, utilization is used to select a CPU that has >= capacity value
than the cfs_task.util.

	capacity_orig_of(cpu) >= cfs_task.util

DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:

https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/

	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime

For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.

ATM the uclamp value are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what implemented here.

	capacity_orig_of(cpu) >= rt_task.uclamp_min

Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which make for a better more
predictable behavior for the default case.

Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.

On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.

[ mingo: Minor edits to comments. ]

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:10 +01:00
Peter Zijlstra
a0e813f26e sched/core: Further clarify sched_class::set_next_task()
It turns out there really is something special to the first
set_next_task() invocation. In specific the 'change' pattern really
should not cause balance callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: f95d4eaee6 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:21 +01:00
Peter Zijlstra
98c2f700ed sched/core: Simplify sched_class::pick_next_task()
Now that the indirect class call never uses the last two arguments of
pick_next_task(), remove them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:20 +01:00
Peter Zijlstra
6e2df0581f sched: Fix pick_next_task() vs 'change' pattern race
Commit 67692435c4 ("sched: Rework pick_next_task() slow-path")
inadvertly introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

Furthermore, it turns out we lost the RT-pull when we put the last DL
task.

Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().

Fixes: 67692435c4 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
2019-11-08 22:34:14 +01:00
Linus Torvalds
7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00