sched_ext: Initial pull request for v6.12

This is the initial pull request of sched_ext. The v7 patchset
(https://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org) is applied
on top of tip/sched/core + bpf/master as of Jun 18th.

  tip/sched/core 793a62823d1c ("sched/core: Drop spinlocks on contention iff kernel is preemptible")
  bpf/master     f6afdaf72a ("Merge branch 'bpf-support-resilient-split-btf'")

Since then, the following pulls were made:

- v6.11-rc1 is pulled to keep up with the mainline.

- tip/sched/core was pulled several times:

  - 7b9f6c864a, 0df340ceae, 5ac998574f, 0b1777f0fa: To resolve conflicts. See
    each commit for details on conflicts and their resolutions.

  - d7b01aef9d: To receive fd03c5b858 ("sched: Rework pick_next_task()") and
    related commits. @prev is added to sched_class->put_prev_task() and
    put_prev_task() is reordered after ->pick_task(), which makes
    sched_class->switch_class() unnecessary. The follow-up commits update
    sched_ext accordingly and drop sched_class->switch_class().

- bpf/master was pulled to receive baebe9aaba ("bpf: allow passing struct
  bpf_iter_<type> as kfunc arguments") and related changes in preparation for
  the DSQ iterator patchset.

To obtain the net sched_ext changes, diff against:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git for-6.12-base

which is the merge of:

  tip/sched/core bc9057da1a ("sched/cpufreq: Use NSEC_PER_MSEC for deadline task")
  bpf/master     2ad6d23f46 ("selftests/bpf: Do not update vmlinux.h unnecessarily")

Since the v7 patchset, the following changes were made:

- cpuperf support which was a part of the v6 patchset was posted separately
  and then applied after reviews.

- cgroup support which was a part of the v6 patchset was posted separately,
  iterated and then applied.

- Improve integration with sched core.

- Double locking usage in migration paths dropped. Depend on
  TASK_ON_RQ_MIGRATING synchronization instead.

- The BPF scheduler couldn't directly dispatch to the local DSQ of another
  CPU using a SCX_DSQ_LOCAL_ON verdict. This caused difficulties around
  handling non-wakeup enqueues. Updated so that SCX_DSQ_LOCAL_ON can be used
  in the enqueue path too.

- DSQ iterator which was a part of the v6 patchset was posted separately. The
  iterator itself was applied after a couple revisions. The associated
  selective consumption kfunc can use further improvements and is still being
  worked on.

- scx_bpf_dispatch[_vtime]_from_dsq() added to increase flexibility. A task
  can now be transferred between two DSQs from almost any context. This
  involved significant refactoring of migration code.

- Various fixes and improvements.

As the branch is based on top of tip/sched/core + bpf/master, please merge
after both are applied.

-----BEGIN PGP SIGNATURE-----

iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZuOSuA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGVZyAQDBU3WPkYKB8gl6a6YQ+/PzBXorOK7mioS9A2iJ
vBR3FgEAg1vtcss1S+2juWmVq7ItiFNWCqtXzUr/bVmL9CqqDwA=
=bOOC
-----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext support from Tejun Heo:
 "This implements a new scheduler class called ‘ext_sched_class’, or
  sched_ext, which allows scheduling policies to be implemented as BPF
  programs.

  The goals of this are:

   - Ease of experimentation and exploration: Enabling rapid iteration
     of new scheduling policies.

   - Customization: Building application-specific schedulers which
     implement policies that are not applicable to general-purpose
     schedulers.

   - Rapid scheduler deployments: Non-disruptive swap outs of
     scheduling policies in production environments"

See individual commits for more documentation, but also the cover letter
for the latest series:

Link: https://lore.kernel.org/all/20240618212056.2833381-1-tj@kernel.org/

* tag 'sched_ext-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (110 commits)
  sched: Move update_other_load_avgs() to kernel/sched/pelt.c
  sched_ext: Don't trigger ops.quiescent/runnable() on migrations
  sched_ext: Synchronize bypass state changes with rq lock
  scx_qmap: Implement highpri boosting
  sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
  sched_ext: Compact struct bpf_iter_scx_dsq_kern
  sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
  sched_ext: Move consume_local_task() upward
  sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
  sched_ext: Reorder args for consume_local/remote_task()
  sched_ext: Restructure dispatch_to_local_dsq()
  sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
  sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
  sched_ext: Refactor consume_remote_task()
  sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
  sched_ext: Add missing static to scx_dump_data
  sched_ext: Add missing static to scx_has_op[]
  sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
  sched_ext: Add a cgroup scheduler which uses flattened hierarchy
  sched_ext: Add cgroup support
  ...
2024-11-01 08:58:07 +00:00 · 2024-09-21 09:44:57 -07:00 · 2024-09-21 09:44:57 -07:00 · 88264981f2
commit 88264981f2
parent 440b652328 902d67a2d4
99 changed files with 16028 additions and 125 deletions
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@ -21,6 +21,7 @@ Scheduler
    sched-nice-design
    sched-rt-group
    sched-stats
    sched-ext
    sched-debug
    text_files
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@ -0,0 +1,316 @@
 ==========================
 Extensible Scheduler Class
 ==========================
 sched_ext is a scheduler class whose behavior can be defined by a set of BPF
 programs - the BPF scheduler.
 * sched_ext exports a full scheduling interface so that any scheduling
  algorithm can be implemented on top.
 * The BPF scheduler can group CPUs however it sees fit and schedule them
  together, as tasks aren't tied to specific CPUs at the time of wakeup.
 * The BPF scheduler can be turned on and off dynamically anytime.
 * The system integrity is maintained no matter what the BPF scheduler does.
  The default scheduling behavior is restored anytime an error is detected,
  a runnable task stalls, or on invoking the SysRq key sequence
  :kbd:`SysRq-S`.
 * When the BPF scheduler triggers an error, debug information is dumped to
  aid debugging. The debug dump is passed to and printed out by the
  scheduler binary. The debug dump can also be accessed through the
  `sched_ext_dump` tracepoint. The SysRq key sequence :kbd:`SysRq-D`
  triggers a debug dump. This doesn't terminate the BPF scheduler and can
  only be read through the tracepoint.
 Switching to and from sched_ext
 ===============================
 ``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
 ``tools/sched_ext`` contains the example schedulers. The following config
 options should be enabled to use sched_ext:
 .. code-block:: none
    CONFIG_BPF=y
    CONFIG_SCHED_CLASS_EXT=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_BPF_JIT=y
    CONFIG_DEBUG_INFO_BTF=y
    CONFIG_BPF_JIT_ALWAYS_ON=y
    CONFIG_BPF_JIT_DEFAULT_ON=y
    CONFIG_PAHOLE_HAS_SPLIT_BTF=y
    CONFIG_PAHOLE_HAS_BTF_TAG=y
 sched_ext is used only when the BPF scheduler is loaded and running.
 If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
 treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
 loaded.
 When the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is not set
 in ``ops->flags``, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE``, and
 ``SCHED_EXT`` tasks are scheduled by sched_ext.
 However, when the BPF scheduler is loaded and ``SCX_OPS_SWITCH_PARTIAL`` is
 set in ``ops->flags``, only tasks with the ``SCHED_EXT`` policy are scheduled
 by sched_ext, while tasks with ``SCHED_NORMAL``, ``SCHED_BATCH`` and
 ``SCHED_IDLE`` policies are scheduled by CFS.
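The switching rules above can be condensed into a single predicate. The following is a minimal userspace sketch, not kernel code: ``policy_runs_on_scx()`` and the ``POLICY_*`` names are hypothetical and exist only for illustration, while the numeric policy values follow ``include/uapi/linux/sched.h`` (``SCHED_EXT`` is 7 as added by this commit).

```c
#include <stdbool.h>

/* Policy numbers as defined in include/uapi/linux/sched.h. */
enum {
	POLICY_NORMAL = 0,	/* SCHED_NORMAL */
	POLICY_BATCH  = 3,	/* SCHED_BATCH */
	POLICY_IDLE   = 5,	/* SCHED_IDLE */
	POLICY_EXT    = 7,	/* SCHED_EXT */
};

/*
 * Hypothetical helper, not a kernel function: returns true if a task with
 * @policy would be scheduled by sched_ext under the documented rules.
 */
static bool policy_runs_on_scx(int policy, bool bpf_loaded, bool switch_partial)
{
	if (!bpf_loaded)
		return false;	/* no BPF scheduler: SCHED_EXT is treated as SCHED_NORMAL */
	if (policy == POLICY_EXT)
		return true;	/* SCHED_EXT tasks follow the BPF scheduler in both modes */
	if (switch_partial)
		return false;	/* SCX_OPS_SWITCH_PARTIAL: other policies stay on CFS */
	return policy == POLICY_NORMAL || policy == POLICY_BATCH ||
	       policy == POLICY_IDLE;
}
```

Real-time and deadline policies are not listed in the documentation above and therefore evaluate to false in this sketch.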
 Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
 detection of any internal error including stalled runnable tasks aborts the
 BPF scheduler and reverts all tasks back to CFS.
 .. code-block:: none
    # make -j16 -C tools/sched_ext
    # tools/sched_ext/scx_simple
    local=0 global=3
    local=5 global=24
    local=9 global=44
    local=13 global=56
    local=17 global=72
    ^CEXIT: BPF scheduler unregistered
 The current status of the BPF scheduler can be determined as follows:
 .. code-block:: none
    # cat /sys/kernel/sched_ext/state
    enabled
    # cat /sys/kernel/sched_ext/root/ops
    simple
 ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
 detailed information:
 .. code-block:: none
    # tools/sched_ext/scx_show_state.py
    ops           : simple
    enabled       : 1
    switching_all : 1
    switched_all  : 1
    enable_state  : enabled (2)
    bypass_depth  : 0
    nr_rejected   : 0
 If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
 be determined as follows:
 .. code-block:: none
    # grep ext /proc/self/sched
    ext.enabled                                  :                    1
 The Basics
 ==========
 Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
 programs that implement ``struct sched_ext_ops``. The only mandatory field
 is ``ops.name`` which must be a valid BPF object name. All operations are
 optional. The following modified excerpt is from
 ``tools/sched_ext/scx_simple.bpf.c`` showing a minimal global FIFO scheduler.
 .. code-block:: c
    /*
     * Decide which CPU a task should be migrated to before being
     * enqueued (either at wakeup, fork time, or exec time). If an
     * idle core is found by the default ops.select_cpu() implementation,
     * then dispatch the task directly to SCX_DSQ_LOCAL and skip the
     * ops.enqueue() callback.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.select_cpu implementation. The behavior of the scheduler
     * would be exactly the same if the implementation just didn't define the
     * simple_select_cpu() struct_ops prog.
     */
    s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            s32 cpu;
            /* Need to initialize or the BPF verifier will reject the program */
            bool direct = false;
            cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &direct);
            if (direct)
                    scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
            return cpu;
    }
    /*
     * Do a direct dispatch of a task to the global DSQ. This ops.enqueue()
     * callback will only be invoked if we failed to find a core to dispatch
     * to in ops.select_cpu() above.
     *
     * Note that this implementation has exactly the same behavior as the
     * default ops.enqueue implementation, which just dispatches the task
     * to SCX_DSQ_GLOBAL. The behavior of the scheduler would be exactly the
     * same if the implementation just didn't define the simple_enqueue
     * struct_ops prog.
     */
    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }
    s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
    {
            /*
             * By default, all SCHED_EXT, SCHED_OTHER, SCHED_IDLE, and
             * SCHED_BATCH tasks should use sched_ext.
             */
            return 0;
    }
    void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
    {
            exit_type = ei->type;
    }
    SEC(".struct_ops")
    struct sched_ext_ops simple_ops = {
            .select_cpu             = (void *)simple_select_cpu,
            .enqueue                = (void *)simple_enqueue,
            .init                   = (void *)simple_init,
            .exit                   = (void *)simple_exit,
            .name                   = "simple",
    };
 Dispatch Queues
 ---------------
 To match the impedance between the scheduler core and the BPF scheduler,
 sched_ext uses DSQs (dispatch queues) which can operate as both a FIFO and a
 priority queue. By default, there is one global FIFO (``SCX_DSQ_GLOBAL``),
 and one local DSQ per CPU (``SCX_DSQ_LOCAL``). The BPF scheduler can manage
 an arbitrary number of DSQs using ``scx_bpf_create_dsq()`` and
 ``scx_bpf_destroy_dsq()``.
 A CPU always executes a task from its local DSQ. A task is "dispatched" to a
 DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
 local DSQ.
 When a CPU is looking for the next task to run, if the local DSQ is not
 empty, the first task is picked. Otherwise, the CPU tries to consume the
 global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
 is invoked.
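The pick order described above (local DSQ, then the global DSQ, then ``ops.dispatch()``) can be modelled in plain userspace C. This is only a toy sketch under simplifying assumptions: tasks are integers, a DSQ is a small ring buffer, and every ``toy_*`` name is hypothetical rather than part of sched_ext.

```c
#include <stddef.h>

#define TOY_DSQ_CAP 8

/* Toy FIFO standing in for a DSQ; task IDs are non-negative ints. */
struct toy_dsq {
	int tasks[TOY_DSQ_CAP];
	size_t head;
	size_t nr;
};

/* "Dispatch": append a task to the tail of a DSQ. Returns -1 if full. */
static int toy_push(struct toy_dsq *q, int task)
{
	if (q->nr == TOY_DSQ_CAP)
		return -1;
	q->tasks[(q->head + q->nr) % TOY_DSQ_CAP] = task;
	q->nr++;
	return 0;
}

/* Remove and return the first task, or -1 if the DSQ is empty. */
static int toy_pop(struct toy_dsq *q)
{
	int task;

	if (!q->nr)
		return -1;
	task = q->tasks[q->head];
	q->head = (q->head + 1) % TOY_DSQ_CAP;
	q->nr--;
	return task;
}

/* "Consume": move the first task of a non-local DSQ to the local DSQ. */
static int toy_consume(struct toy_dsq *local, struct toy_dsq *src)
{
	if (!src->nr || local->nr == TOY_DSQ_CAP)
		return 0;
	toy_push(local, toy_pop(src));
	return 1;
}

/*
 * Pick the next task for a CPU: local DSQ first, then the global DSQ, then
 * the dispatch callback. Returns -1 if nothing became runnable.
 */
static int toy_pick_next(struct toy_dsq *local, struct toy_dsq *global,
			 void (*dispatch)(struct toy_dsq *local))
{
	if (!local->nr && !toy_consume(local, global) && dispatch)
		dispatch(local);
	return toy_pop(local);
}
```

The callback only runs when both built-in queues are empty, which mirrors why a scheduler using only the built-in DSQs does not need ``ops.dispatch()``.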
 Scheduling Cycle
 ----------------
 The following briefly shows how a waking task is scheduled and executed.
 1. When a task is waking up, ``ops.select_cpu()`` is the first operation
   invoked. This serves two purposes. First, CPU selection optimization
   hint. Second, waking up the selected CPU if idle.
   The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
   binding. The actual decision is made at the last step of scheduling.
   However, there is a small performance gain if the CPU
   ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
   A side-effect of selecting a CPU is waking it up from idle. While a BPF
   scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
   using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
   A task can be immediately dispatched to a DSQ from ``ops.select_cpu()`` by
   calling ``scx_bpf_dispatch()``. If the task is dispatched to
   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be dispatched to the
   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
   Additionally, dispatching directly from ``ops.select_cpu()`` will cause the
   ``ops.enqueue()`` callback to be skipped.
   Note that the scheduler core will ignore an invalid CPU selection, for
   example, if it's outside the allowed cpumask of the task.
 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
   task was dispatched directly from ``ops.select_cpu()``). ``ops.enqueue()``
   can make one of the following decisions:
   * Immediately dispatch the task to either the global or local DSQ by
     calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
     ``SCX_DSQ_LOCAL``, respectively.
   * Immediately dispatch the task to a custom DSQ by calling
     ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
   * Queue the task on the BPF side.
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
   empty, it then looks at the global DSQ. If there still isn't a task to
   run, ``ops.dispatch()`` is invoked which can use the following two
   functions to populate the local DSQ.
   * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
     be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
     ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
     currently can't be called with BPF locks held, this is being worked on
     and will be supported. ``scx_bpf_dispatch()`` schedules dispatching
     rather than performing them immediately. There can be up to
     ``ops.dispatch_max_batch`` pending tasks.
   * ``scx_bpf_consume()`` transfers a task from the specified non-local DSQ
     to the dispatching DSQ. This function cannot be called with any BPF
     locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
     before trying to consume the specified DSQ.
 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
   the CPU runs the first one. If empty, the following steps are taken:
   * Try to consume the global DSQ. If successful, run the task.
   * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
   * If the previous task is an SCX task and still runnable, keep executing
     it (see ``SCX_OPS_ENQ_LAST``).
   * Go idle.
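The fallback order in step 4 amounts to a short priority chain. The sketch below restates it as a pure decision function; ``toy_after_dispatch()`` and the ``TOY_*`` actions are hypothetical names for illustration, and the boolean inputs stand in for state that the kernel derives itself.

```c
#include <stdbool.h>

/* Possible outcomes after ops.dispatch() returns (illustrative names). */
enum toy_action {
	TOY_RUN_LOCAL,		/* run the first task in the local DSQ */
	TOY_CONSUME_GLOBAL,	/* consume the global DSQ and run that task */
	TOY_RETRY_DISPATCH,	/* ops.dispatch() dispatched something: retry step 3 */
	TOY_KEEP_PREV,		/* previous SCX task is still runnable: keep it */
	TOY_GO_IDLE,		/* nothing to run */
};

/* Evaluate the step-4 checks in the documented order. */
static enum toy_action toy_after_dispatch(bool local_has_task,
					  bool global_has_task,
					  bool dispatched_any,
					  bool prev_scx_runnable)
{
	if (local_has_task)
		return TOY_RUN_LOCAL;
	if (global_has_task)
		return TOY_CONSUME_GLOBAL;
	if (dispatched_any)
		return TOY_RETRY_DISPATCH;
	if (prev_scx_runnable)
		return TOY_KEEP_PREV;
	return TOY_GO_IDLE;
}
```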
 Note that the BPF scheduler can always choose to dispatch tasks immediately
 in ``ops.enqueue()`` as illustrated in the above simple example. If only the
 built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
 a task is never queued on the BPF scheduler and both the local and global
 DSQs are consumed automatically.
 ``scx_bpf_dispatch()`` queues the task on the FIFO of the target DSQ. Use
 ``scx_bpf_dispatch_vtime()`` for the priority queue. Internal DSQs such as
 ``SCX_DSQ_LOCAL`` and ``SCX_DSQ_GLOBAL`` do not support priority-queue
 dispatching, and must be dispatched to with ``scx_bpf_dispatch()``.  See the
 function documentation and usage in ``tools/sched_ext/scx_simple.bpf.c`` for
 more information.
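The DSQ targets named in the scheduling cycle (custom IDs smaller than 2^63, ``SCX_DSQ_LOCAL_ON | cpu``) follow from the 64-bit ID layout documented in ``include/linux/sched/ext.h``: bit 63 marks a built-in DSQ, bit 62 marks ``LOCAL_ON``, and the low 32 bits carry the CPU number. The userspace sketch below mirrors those constants; the ``dsq_*`` helper functions are hypothetical and only illustrate the encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror enum scx_dsq_id_flags in include/linux/sched/ext.h. */
#define DSQ_FLAG_BUILTIN	(1ULL << 63)
#define DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define DSQ_GLOBAL		(DSQ_FLAG_BUILTIN | 1)
#define DSQ_LOCAL		(DSQ_FLAG_BUILTIN | 2)
#define DSQ_LOCAL_ON		(DSQ_FLAG_BUILTIN | DSQ_FLAG_LOCAL_ON)
#define DSQ_LOCAL_CPU_MASK	0xffffffffULL

/* Built-in DSQs have bit 63 set; ops-created DSQ IDs are below 2^63. */
static bool dsq_is_builtin(uint64_t id)
{
	return (id & DSQ_FLAG_BUILTIN) != 0;
}

/* LOCAL_ON IDs have both the built-in and the LOCAL_ON bits set. */
static bool dsq_is_local_on(uint64_t id)
{
	return (id & DSQ_LOCAL_ON) == DSQ_LOCAL_ON;
}

/* Build the ID addressing the local DSQ of @cpu. */
static uint64_t dsq_local_on(uint32_t cpu)
{
	return DSQ_LOCAL_ON | cpu;
}

/* Extract the CPU number from a LOCAL_ON ID, or -1 for any other ID. */
static int32_t dsq_local_on_cpu(uint64_t id)
{
	if (!dsq_is_local_on(id))
		return -1;
	return (int32_t)(id & DSQ_LOCAL_CPU_MASK);
}
```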
 Where to Look
 =============
 * ``include/linux/sched/ext.h`` defines the core data structures, ops table
  and constants.
 * ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
  The functions prefixed with ``scx_bpf_`` can be called from the BPF
  scheduler.
 * ``tools/sched_ext/`` hosts example BPF scheduler implementations.
  * ``scx_simple[.bpf].c``: Minimal global FIFO scheduler example using a
    custom DSQ.
  * ``scx_qmap[.bpf].c``: A multi-level FIFO scheduler supporting five
    levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
 ABI Instability
 ===============
 The APIs provided by sched_ext to BPF scheduler programs have no stability
 guarantees. This includes the ops table callbacks and constants defined in
 ``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
 ``kernel/sched/ext.c``.
 While we will attempt to provide a relatively stable API surface when
 possible, they are subject to change without warning between kernel
 versions.
--- a/MAINTAINERS
+++ b/MAINTAINERS
@ -20511,6 +20511,19 @@ F:	include/linux/wait.h
 F:	include/uapi/linux/sched.h
 F:	kernel/sched/
 SCHEDULER - SCHED_EXT
 R:	Tejun Heo <tj@kernel.org>
 R:	David Vernet <void@manifault.com>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
 W:	https://github.com/sched-ext/scx
 T:	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
 F:	include/linux/sched/ext.h
 F:	kernel/sched/ext.h
 F:	kernel/sched/ext.c
 F:	tools/sched_ext/
 F:	tools/testing/selftests/sched_ext
 SCIOSENSE ENS160 MULTI-GAS SENSOR DRIVER
 M:	Gustavo Silva <gustavograzs@gmail.com>
 S:	Maintained
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@ -531,6 +531,7 @@ static const struct sysrq_key_op *sysrq_key_table[62] = {
 	NULL,				/* P */
 	NULL,				/* Q */
 	&sysrq_replay_logs_op,		/* R */
 	/* S: May be registered by sched_ext for resetting */
 	NULL,				/* S */
 	NULL,				/* T */
 	NULL,				/* U */
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@ -133,6 +133,7 @@
 	*(__dl_sched_class)			\
 	*(__rt_sched_class)			\
 	*(__fair_sched_class)			\
 	*(__ext_sched_class)			\
 	*(__idle_sched_class)			\
 	__sched_class_lowest = .;
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@ -29,8 +29,6 @@
 struct kernel_clone_args;
 #ifdef CONFIG_CGROUPS
 /*
 * All weight knobs on the default hierarchy should use the following min,
 * default and max values.  The default value is the logarithmic center of
@ -40,6 +38,8 @@ struct kernel_clone_args;
 #define CGROUP_WEIGHT_DFL		100
 #define CGROUP_WEIGHT_MAX		10000
 #ifdef CONFIG_CGROUPS
 enum {
 	CSS_TASK_ITER_PROCS    = (1U << 0),  /* walk only threadgroup leaders */
 	CSS_TASK_ITER_THREADED = (1U << 1),  /* walk all threaded css_sets in the domain */
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@ -82,6 +82,8 @@ struct task_group;
 struct task_struct;
 struct user_event_mm;
 #include <linux/sched/ext.h>
 /*
 * Task state bitmask. NOTE! These bits are also
 * encoded in fs/proc/array.c: get_task_state().
@ -830,6 +832,9 @@ struct task_struct {
 	struct sched_rt_entity		rt;
 	struct sched_dl_entity		dl;
 	struct sched_dl_entity		*dl_server;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	struct sched_ext_entity		scx;
 #endif
 	const struct sched_class	*sched_class;
 #ifdef CONFIG_SCHED_CORE
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@ -0,0 +1,215 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #ifndef _LINUX_SCHED_EXT_H
 #define _LINUX_SCHED_EXT_H
 #ifdef CONFIG_SCHED_CLASS_EXT
 #include <linux/llist.h>
 #include <linux/rhashtable-types.h>
 enum scx_public_consts {
 	SCX_OPS_NAME_LEN	= 128,
 	SCX_SLICE_DFL		= 20 * 1000000,	/* 20ms */
 	SCX_SLICE_INF		= U64_MAX,	/* infinite, implies nohz */
 };
 /*
 * DSQ (dispatch queue) IDs are 64bit of the format:
 *
 *   Bits: [63] [62 ..  0]
 *         [ B] [   ID   ]
 *
 *    B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs
 *   ID: 63 bit ID
 *
 * Built-in IDs:
 *
 *   Bits: [63] [62] [61..32] [31 ..  0]
 *         [ 1] [ L] [   R  ] [    V   ]
 *
 *    1: 1 for built-in DSQs.
 *    L: 1 for LOCAL_ON DSQ IDs, 0 for others
 *    V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value.
 */
 enum scx_dsq_id_flags {
 	SCX_DSQ_FLAG_BUILTIN	= 1LLU << 63,
 	SCX_DSQ_FLAG_LOCAL_ON	= 1LLU << 62,
 	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
 	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
 	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
 	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
 	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
 };
 /*
 * A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered
 * queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to
 * buffer between the scheduler core and the BPF scheduler. See the
 * documentation for more details.
 */
 struct scx_dispatch_q {
 	raw_spinlock_t		lock;
 	struct list_head	list;	/* tasks in dispatch order */
 	struct rb_root		priq;	/* used to order by p->scx.dsq_vtime */
 	u32			nr;
 	u32			seq;	/* used by BPF iter */
 	u64			id;
 	struct rhash_head	hash_node;
 	struct llist_node	free_node;
 	struct rcu_head		rcu;
 };
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
 	SCX_TASK_STATE_BITS	= 2,
 	SCX_TASK_STATE_MASK	= ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,
 	SCX_TASK_CURSOR		= 1 << 31, /* iteration cursor, not a task */
 };
 /* scx_entity.flags & SCX_TASK_STATE_MASK */
 enum scx_task_state {
 	SCX_TASK_NONE,		/* ops.init_task() not called yet */
 	SCX_TASK_INIT,		/* ops.init_task() succeeded, but task can be cancelled */
 	SCX_TASK_READY,		/* fully initialized, but not in sched_ext */
 	SCX_TASK_ENABLED,	/* fully initialized and in sched_ext */
 	SCX_TASK_NR_STATES,
 };
 /* scx_entity.dsq_flags */
 enum scx_ent_dsq_flags {
 	SCX_TASK_DSQ_ON_PRIQ	= 1 << 0, /* task is queued on the priority queue of a dsq */
 };
 /*
 * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
 * everywhere and the following bits track which kfunc sets are currently
 * allowed for %current. This simple per-task tracking works because SCX ops
 * nest in a limited way. BPF will likely implement a way to allow and disallow
 * kfuncs depending on the calling context which will replace this manual
 * mechanism. See scx_kf_allow().
 */
 enum scx_kf_mask {
 	SCX_KF_UNLOCKED		= 0,	  /* sleepable and not rq locked */
 	/* ENQUEUE and DISPATCH may be nested inside CPU_RELEASE */
 	SCX_KF_CPU_RELEASE	= 1 << 0, /* ops.cpu_release() */
 	/* ops.dequeue (in REST) may be nested inside DISPATCH */
 	SCX_KF_DISPATCH		= 1 << 1, /* ops.dispatch() */
 	SCX_KF_ENQUEUE		= 1 << 2, /* ops.enqueue() and ops.select_cpu() */
 	SCX_KF_SELECT_CPU	= 1 << 3, /* ops.select_cpu() */
 	SCX_KF_REST		= 1 << 4, /* other rq-locked operations */
 	__SCX_KF_RQ_LOCKED	= SCX_KF_CPU_RELEASE | SCX_KF_DISPATCH |
 				  SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
 	__SCX_KF_TERMINAL	= SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU | SCX_KF_REST,
 };
 enum scx_dsq_lnode_flags {
 	SCX_DSQ_LNODE_ITER_CURSOR = 1 << 0,
 	/* high 16 bits can be for iter cursor flags */
 	__SCX_DSQ_LNODE_PRIV_SHIFT = 16,
 };
 struct scx_dsq_list_node {
 	struct list_head	node;
 	u32			flags;
 	u32			priv;		/* can be used by iter cursor */
 };
 /*
 * The following is embedded in task_struct and contains all fields necessary
 * for a task to be scheduled by SCX.
 */
 struct sched_ext_entity {
 	struct scx_dispatch_q	*dsq;
 	struct scx_dsq_list_node dsq_list;	/* dispatch order */
 	struct rb_node		dsq_priq;	/* p->scx.dsq_vtime order */
 	u32			dsq_seq;
 	u32			dsq_flags;	/* protected by DSQ lock */
 	u32			flags;		/* protected by rq lock */
 	u32			weight;
 	s32			sticky_cpu;
 	s32			holding_cpu;
 	u32			kf_mask;	/* see scx_kf_mask above */
 	struct task_struct	*kf_tasks[2];	/* see SCX_CALL_OP_TASK() */
 	atomic_long_t		ops_state;
 	struct list_head	runnable_node;	/* rq->scx.runnable_list */
 	unsigned long		runnable_at;
 #ifdef CONFIG_SCHED_CORE
 	u64			core_sched_at;	/* see scx_prio_less() */
 #endif
 	u64			ddsp_dsq_id;
 	u64			ddsp_enq_flags;
 	/* BPF scheduler modifiable fields */
 	/*
 	 * Runtime budget in nsecs. This is usually set through
 	 * scx_bpf_dispatch() but can also be modified directly by the BPF
 	 * scheduler. Automatically decreased by SCX as the task executes. On
 	 * depletion, a scheduling event is triggered.
 	 *
 	 * This value is cleared to zero if the task is preempted by
 	 * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the
 	 * task ran. Use p->se.sum_exec_runtime instead.
 	 */
 	u64			slice;
 	/*
 	 * Used to order tasks when dispatching to the vtime-ordered priority
 	 * queue of a dsq. This is usually set through scx_bpf_dispatch_vtime()
 	 * but can also be modified directly by the BPF scheduler. Modifying it
 	 * while a task is queued on a dsq may mangle the ordering and is not
 	 * recommended.
 	 */
 	u64			dsq_vtime;
 	/*
 	 * If set, reject future sched_setscheduler(2) calls updating the policy
 	 * to %SCHED_EXT with -%EACCES.
 	 *
 	 * Can be set from ops.init_task() while the BPF scheduler is being
 	 * loaded (!scx_init_task_args->fork). If set and the task's policy is
 	 * already %SCHED_EXT, the task's policy is rejected and forcefully
 	 * reverted to %SCHED_NORMAL. The number of such events are reported
 	 * through /sys/kernel/debug/sched_ext::nr_rejected. Setting this flag
 	 * during fork is not allowed.
 	 */
 	bool			disallow;	/* reject switching into SCX */
 	/* cold fields */
 #ifdef CONFIG_EXT_GROUP_SCHED
 	struct cgroup		*cgrp_moving_from;
 #endif
 	/* must be the last field, see init_scx_entity() */
 	struct list_head	tasks_node;
 };
 void sched_ext_free(struct task_struct *p);
 void print_scx_info(const char *log_lvl, struct task_struct *p);
 #else	/* !CONFIG_SCHED_CLASS_EXT */
 static inline void sched_ext_free(struct task_struct *p) {}
 static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
 #endif	/* CONFIG_SCHED_CLASS_EXT */
 #endif	/* _LINUX_SCHED_EXT_H */
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@ -63,7 +63,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
-extern void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
+extern int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs);
 extern void sched_cancel_fork(struct task_struct *p);
 extern void sched_post_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
@ -119,6 +120,11 @@ static inline struct task_struct *get_task_struct(struct task_struct *t)
 	return t;
 }
 static inline struct task_struct *tryget_task_struct(struct task_struct *t)
 {
 	return refcount_inc_not_zero(&t->usage) ? t : NULL;
 }
 extern void __put_task_struct(struct task_struct *t);
 extern void __put_task_struct_rcu_cb(struct rcu_head *rhp);
--- a/include/trace/events/sched_ext.h
+++ b/include/trace/events/sched_ext.h
@ -0,0 +1,32 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM sched_ext
 #if !defined(_TRACE_SCHED_EXT_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_SCHED_EXT_H
 #include <linux/tracepoint.h>
 TRACE_EVENT(sched_ext_dump,
 	TP_PROTO(const char *line),
 	TP_ARGS(line),
 	TP_STRUCT__entry(
 		__string(line, line)
 	),
 	TP_fast_assign(
 		__assign_str(line);
 	),
 	TP_printk("%s",
 		__get_str(line)
 	)
 );
 #endif /* _TRACE_SCHED_EXT_H */
 /* This part must be outside protection */
 #include <trace/define_trace.h>
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@ -118,6 +118,7 @@ struct clone_args {
 /* SCHED_ISO: reserved but not implemented yet */
 #define SCHED_IDLE		5
 #define SCHED_DEADLINE		6
 #define SCHED_EXT		7
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK     0x40000000
--- a/init/Kconfig
+++ b/init/Kconfig
@ -1025,9 +1025,13 @@ menuconfig CGROUP_SCHED
 	  tasks.
 if CGROUP_SCHED
 config GROUP_SCHED_WEIGHT
 	def_bool n
 config FAIR_GROUP_SCHED
 	bool "Group scheduling for SCHED_OTHER"
 	depends on CGROUP_SCHED
 	select GROUP_SCHED_WEIGHT
 	default CGROUP_SCHED
 config CFS_BANDWIDTH
@ -1052,6 +1056,12 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.rst for more information.
 config EXT_GROUP_SCHED
 	bool
 	depends on SCHED_CLASS_EXT && CGROUP_SCHED
 	select GROUP_SCHED_WEIGHT
 	default y
 endif #CGROUP_SCHED
 config SCHED_MM_CID
--- a/init/init_task.c
+++ b/init/init_task.c
@ -6,6 +6,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/sched/rt.h>
 #include <linux/sched/task.h>
 #include <linux/sched/ext.h>
 #include <linux/init.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
@ -98,6 +99,17 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 #endif
 #ifdef CONFIG_CGROUP_SCHED
 	.sched_task_group = &root_task_group,
 #endif
 #ifdef CONFIG_SCHED_CLASS_EXT
 	.scx		= {
 		.dsq_list.node	= LIST_HEAD_INIT(init_task.scx.dsq_list.node),
 		.sticky_cpu	= -1,
 		.holding_cpu	= -1,
 		.runnable_node	= LIST_HEAD_INIT(init_task.scx.runnable_node),
 		.runnable_at	= INITIAL_JIFFIES,
 		.ddsp_dsq_id	= SCX_DSQ_INVALID,
 		.slice		= SCX_SLICE_DFL,
 	},
 #endif
 	.ptraced	= LIST_HEAD_INIT(init_task.ptraced),
 	.ptrace_entry	= LIST_HEAD_INIT(init_task.ptrace_entry),
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@ -133,4 +133,29 @@ config SCHED_CORE
 	  which is the likely usage by Linux distributions, there should
 	  be no measurable impact on performance.
 config SCHED_CLASS_EXT
 	bool "Extensible Scheduling Class"
 	depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF
 	select STACKTRACE if STACKTRACE_SUPPORT
 	help
 	  This option enables a new scheduler class sched_ext (SCX), which
 	  allows scheduling policies to be implemented as BPF programs to
 	  achieve the following:
 	  - Ease of experimentation and exploration: Enabling rapid
 	    iteration of new scheduling policies.
 	  - Customization: Building application-specific schedulers which
 	    implement policies that are not applicable to general-purpose
 	    schedulers.
 	  - Rapid scheduler deployments: Non-disruptive swap outs of
 	    scheduling policies in production environments.
 	  sched_ext leverages BPF struct_ops feature to define a structure
 	  which exports function callbacks and flags to BPF programs that
 	  wish to implement scheduling policies. The struct_ops structure
 	  exported by sched_ext is struct sched_ext_ops, and is conceptually
 	  similar to struct sched_class.
 	  For more information:
 	    Documentation/scheduler/sched-ext.rst
 	    https://github.com/sched-ext/scx
--- a/kernel/fork.c
+++ b/kernel/fork.c
@ -23,6 +23,7 @@
 #include <linux/sched/task.h>
 #include <linux/sched/task_stack.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/ext.h>
 #include <linux/seq_file.h>
 #include <linux/rtmutex.h>
 #include <linux/init.h>
@ -969,6 +970,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk, true);
@ -2346,7 +2348,7 @@ __latent_entropy struct task_struct *copy_process(
 	retval = perf_event_init_task(p, clone_flags);
 	if (retval)
-		goto bad_fork_cleanup_policy;
+		goto bad_fork_sched_cancel_fork;
 	retval = audit_alloc(p);
 	if (retval)
 		goto bad_fork_cleanup_perf;
@ -2479,7 +2481,9 @@ __latent_entropy struct task_struct *copy_process(
 	 * cgroup specific, it unconditionally needs to place the task on a
 	 * runqueue.
 	 */
-	sched_cgroup_fork(p, args);
+	retval = sched_cgroup_fork(p, args);
 	if (retval)
 		goto bad_fork_cancel_cgroup;
 	/*
 	 * From this point on we must avoid any synchronous user-space
@ -2525,13 +2529,13 @@ __latent_entropy struct task_struct *copy_process(
 	/* Don't start children in a dying pid namespace */
 	if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
 		retval = -ENOMEM;
-		goto bad_fork_cancel_cgroup;
+		goto bad_fork_core_free;
 	}
 	/* Let kill terminate clone/fork in the middle */
 	if (fatal_signal_pending(current)) {
 		retval = -EINTR;
-		goto bad_fork_cancel_cgroup;
+		goto bad_fork_core_free;
 	}
 	/* No more failure paths after this point. */
@ -2605,10 +2609,11 @@ __latent_entropy struct task_struct *copy_process(
 	return p;
-bad_fork_cancel_cgroup:
+bad_fork_core_free:
 	sched_core_free(p);
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 bad_fork_cancel_cgroup:
 	cgroup_cancel_fork(p, args);
 bad_fork_put_pidfd:
 	if (clone_flags & CLONE_PIDFD) {
@ -2647,6 +2652,8 @@ __latent_entropy struct task_struct *copy_process(
 	audit_free(p);
 bad_fork_cleanup_perf:
 	perf_event_free_task(p);
 bad_fork_sched_cancel_fork:
 	sched_cancel_fork(p);
 bad_fork_cleanup_policy:
 	lockdep_free_task(p);
 #ifdef CONFIG_NUMA
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@ -16,18 +16,25 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/hotplug.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/posix-timers.h>
 #include <linux/sched/rt.h>
 #include <linux/cpuidle.h>
 #include <linux/jiffies.h>
 #include <linux/kobject.h>
 #include <linux/livepatch.h>
 #include <linux/pm.h>
 #include <linux/psi.h>
 #include <linux/rhashtable.h>
 #include <linux/seq_buf.h>
 #include <linux/seqlock_api.h>
 #include <linux/slab.h>
 #include <linux/suspend.h>
 #include <linux/tsacct_kern.h>
 #include <linux/vtime.h>
 #include <linux/sysrq.h>
 #include <linux/percpu-rwsem.h>
 #include <uapi/linux/sched/types.h>
@ -52,4 +59,8 @@
 #include "cputime.c"
 #include "deadline.c"
 #ifdef CONFIG_SCHED_CLASS_EXT
 # include "ext.c"
 #endif
 #include "syscalls.c"
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@ -172,7 +172,10 @@ static inline int __task_prio(const struct task_struct *p)
 	if (p->sched_class == &idle_sched_class)
 		return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
-	return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */
+	if (task_on_scx(p))
 		return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */
 	return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */
 }
 /*
@ -217,6 +220,11 @@ static inline bool prio_less(const struct task_struct *a,
 	if (pa == MAX_RT_PRIO + MAX_NICE)	/* fair */
 		return cfs_prio_less(a, b, in_fi);
 #ifdef CONFIG_SCHED_CLASS_EXT
 	if (pa == MAX_RT_PRIO + MAX_NICE + 1)	/* ext */
 		return scx_prio_less(a, b, in_fi);
 #endif
 	return false;
 }
@ -1280,11 +1288,14 @@ bool sched_can_stop_tick(struct rq *rq)
 		return true;
 	/*
-	 * If there are no DL,RR/FIFO tasks, there must only be CFS tasks left;
+	 * If there are no DL,RR/FIFO tasks, there must only be CFS or SCX tasks
-	 * if there's more than one we need the tick for involuntary
+	 * left. For CFS, if there's more than one we need the tick for
-	 * preemption.
+	 * involuntary preemption. For SCX, ask.
 	 */
-	if (rq->nr_running > 1)
+	if (scx_enabled() && !scx_can_stop_tick(rq))
 		return false;
 	if (rq->cfs.nr_running > 1)
 		return false;
 	/*
@ -1366,8 +1377,8 @@ void set_load_weight(struct task_struct *p, bool update_load)
 	 * SCHED_OTHER tasks have to update their load when changing their
 	 * weight
 	 */
-	if (update_load && p->sched_class == &fair_sched_class)
+	if (update_load && p->sched_class->reweight_task)
-		reweight_task(p, &lw);
+		p->sched_class->reweight_task(task_rq(p), p, &lw);
 	else
 		p->se.load = lw;
 }
@ -2086,6 +2097,17 @@ inline int task_curr(const struct task_struct *p)
 	return cpu_curr(task_cpu(p)) == p;
 }
 /*
 * ->switching_to() is called with the pi_lock and rq_lock held and must not
 * mess with locking.
 */
 void check_class_changing(struct rq *rq, struct task_struct *p,
 			  const struct sched_class *prev_class)
 {
 	if (prev_class != p->sched_class && p->sched_class->switching_to)
 		p->sched_class->switching_to(rq, p);
 }
 /*
 * switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
 * use the balance_callback list if you want balancing.
@ -2356,7 +2378,7 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 {
 	/* When not in the task's cpumask, no point in looking further. */
-	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
+	if (!task_allowed_on_cpu(p, cpu))
 		return false;
 	/* migrate_disabled() must be allowed to finish. */
@ -2365,7 +2387,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	/* Non kernel threads are not allowed during either online or offline. */
 	if (!(p->flags & PF_KTHREAD))
-		return cpu_active(cpu) && task_cpu_possible(cpu, p);
+		return cpu_active(cpu);
 	/* KTHREAD_IS_PER_CPU is always allowed. */
 	if (kthread_is_per_cpu(p))
@ -3842,6 +3864,15 @@ bool cpus_share_resources(int this_cpu, int that_cpu)
 static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 {
 	/*
 	 * The BPF scheduler may depend on select_task_rq() being invoked during
 	 * wakeups. In addition, @p may end up executing on a different CPU
 	 * regardless of what happens in the wakeup path making the ttwu_queue
 	 * optimization less meaningful. Skip if on SCX.
 	 */
 	if (task_on_scx(p))
 		return false;
 	/*
 	 * Do not complicate things with the async wake_list while the CPU is
 	 * in hotplug state.
@ -4416,6 +4447,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->rt.on_rq		= 0;
 	p->rt.on_list		= 0;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	init_scx_entity(&p->scx);
 #endif
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
@ -4658,10 +4693,18 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	if (dl_prio(p->prio))
 		return -EAGAIN;
-	else if (rt_prio(p->prio))
+
 	scx_pre_fork(p);
 	if (rt_prio(p->prio)) {
 		p->sched_class = &rt_sched_class;
-	else
+#ifdef CONFIG_SCHED_CLASS_EXT
 	} else if (task_should_scx(p)) {
 		p->sched_class = &ext_sched_class;
 #endif
 	} else {
 		p->sched_class = &fair_sched_class;
 	}
 	init_entity_runnable_average(&p->se);
@ -4681,7 +4724,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	return 0;
 }
-void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
+int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 {
 	unsigned long flags;
@ -4708,11 +4751,19 @@ void sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
 	if (p->sched_class->task_fork)
 		p->sched_class->task_fork(p);
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 	return scx_fork(p);
 }
 void sched_cancel_fork(struct task_struct *p)
 {
 	scx_cancel_fork(p);
 }
 void sched_post_fork(struct task_struct *p)
 {
 	uclamp_post_fork(p);
 	scx_post_fork(p);
 }
 unsigned long to_ratio(u64 period, u64 runtime)
@ -5545,6 +5596,7 @@ void sched_tick(void)
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
 	task_tick_mm_cid(rq, curr);
 	scx_tick(rq);
 	rq_unlock(rq, &rf);
@ -5557,8 +5609,10 @@ void sched_tick(void)
 		wq_worker_tick(curr);
 #ifdef CONFIG_SMP
-	rq->idle_balance = idle_cpu(cpu);
+	if (!scx_switched_all()) {
-	sched_balance_trigger(rq);
+		rq->idle_balance = idle_cpu(cpu);
 		sched_balance_trigger(rq);
 	}
 #endif
 }
@ -5848,8 +5902,19 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
 static void prev_balance(struct rq *rq, struct task_struct *prev,
 			 struct rq_flags *rf)
 {
-#ifdef CONFIG_SMP
+	const struct sched_class *start_class = prev->sched_class;
 	const struct sched_class *class;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	/*
 	 * SCX requires a balance() call before every pick_next_task() including
 	 * when waking up from SCHED_IDLE. If @start_class is below SCX, start
 	 * from SCX instead.
 	 */
 	if (scx_enabled() && sched_class_above(&ext_sched_class, start_class))
 		start_class = &ext_sched_class;
 #endif
 	/*
 	 * We must do the balancing pass before put_prev_task(), such
 	 * that when we release the rq->lock the task is in the same
@ -5858,11 +5923,10 @@ static void prev_balance(struct rq *rq, struct task_struct *prev,
 	 * We can terminate the balance pass as soon as we know there is
 	 * a runnable task of @class priority or higher.
 	 */
-	for_class_range(class, prev->sched_class, &idle_sched_class) {
+	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance(rq, prev, rf))
+		if (class->balance && class->balance(rq, prev, rf))
 			break;
 	}
-#endif
 }
 /*
@ -5876,6 +5940,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	rq->dl_server = NULL;
 	if (scx_enabled())
 		goto restart;
 	/*
 	 * Optimization: we know that if all tasks are in the fair class we can
 	 * call that function directly, but only if the @prev task wasn't of a
@ -5901,7 +5968,7 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 restart:
 	prev_balance(rq, prev, rf);
-	for_each_class(class) {
+	for_each_active_class(class) {
 		if (class->pick_next_task) {
 			p = class->pick_next_task(rq, prev);
 			if (p)
@ -5944,7 +6011,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
 	rq->dl_server = NULL;
-	for_each_class(class) {
+	for_each_active_class(class) {
 		p = class->pick_task(rq);
 		if (p)
 			return p;
@ -6948,6 +7015,10 @@ void __setscheduler_prio(struct task_struct *p, int prio)
 		p->sched_class = &dl_sched_class;
 	else if (rt_prio(prio))
 		p->sched_class = &rt_sched_class;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	else if (task_should_scx(p))
 		p->sched_class = &ext_sched_class;
 #endif
 	else
 		p->sched_class = &fair_sched_class;
@ -7093,6 +7164,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	}
 	__setscheduler_prio(p, prio);
 	check_class_changing(rq, p, prev_class);
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
@ -7505,6 +7577,7 @@ void sched_show_task(struct task_struct *p)
 	print_worker_info(KERN_INFO, p);
 	print_stop_info(KERN_INFO, p);
 	print_scx_info(KERN_INFO, p);
 	show_stack(p, NULL, KERN_INFO);
 	put_task_stack(p);
 }
@ -8033,6 +8106,8 @@ int sched_cpu_activate(unsigned int cpu)
 		cpuset_cpu_active();
 	}
 	scx_rq_activate(rq);
 	/*
 	 * Put the rq online, if not already. This happens:
 	 *
@ -8082,6 +8157,8 @@ int sched_cpu_deactivate(unsigned int cpu)
 	sched_set_rq_offline(rq, cpu);
 	scx_rq_deactivate(rq);
 	/*
 	 * When going down, decrement the number of cores with SMT present.
 	 */
@ -8266,11 +8343,15 @@ void __init sched_init(void)
 	int i;
 	/* Make sure the linker didn't screw up */
-	BUG_ON(&idle_sched_class != &fair_sched_class + 1 ||
-	       &fair_sched_class != &rt_sched_class + 1 ||
-	       &rt_sched_class   != &dl_sched_class + 1);
 #ifdef CONFIG_SMP
-	BUG_ON(&dl_sched_class != &stop_sched_class + 1);
+	BUG_ON(!sched_class_above(&stop_sched_class, &dl_sched_class));
 #endif
 	BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
 	BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
 	BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
 #ifdef CONFIG_SCHED_CLASS_EXT
 	BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
 	BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
 #endif
 	wait_bit_init();
@ -8294,6 +8375,9 @@ void __init sched_init(void)
 		root_task_group.shares = ROOT_TASK_GROUP_LOAD;
 		init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_EXT_GROUP_SCHED
 		root_task_group.scx_weight = CGROUP_WEIGHT_DFL;
 #endif /* CONFIG_EXT_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
 		ptr += nr_cpu_ids * sizeof(void **);
@ -8445,6 +8529,7 @@ void __init sched_init(void)
 	balance_push_set(smp_processor_id(), false);
 #endif
 	init_sched_fair_class();
 	init_sched_ext_class();
 	psi_init();
@ -8730,6 +8815,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 	scx_group_set_weight(tg, CGROUP_WEIGHT_DFL);
 	alloc_uclamp_sched_group(tg, parent);
 	return tg;
@ -8857,6 +8943,7 @@ void sched_move_task(struct task_struct *tsk)
 		put_prev_task(rq, tsk);
 	sched_change_group(tsk, group);
 	scx_move_task(tsk);
 	if (queued)
 		enqueue_task(rq, tsk, queue_flags);
@ -8871,11 +8958,6 @@ void sched_move_task(struct task_struct *tsk)
 	}
 }
-static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
-{
-	return css ? container_of(css, struct task_group, css) : NULL;
-}
 static struct cgroup_subsys_state *
 cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
@ -8899,6 +8981,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 	struct task_group *parent = css_tg(css->parent);
 	int ret;
 	ret = scx_tg_online(tg);
 	if (ret)
 		return ret;
 	if (parent)
 		sched_online_group(tg, parent);
@ -8913,6 +9000,13 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
 	return 0;
 }
 static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 	scx_tg_offline(tg);
 }
 static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
@ -8930,9 +9024,9 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 	sched_unregister_group(tg);
 }
-#ifdef CONFIG_RT_GROUP_SCHED
 static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 {
+#ifdef CONFIG_RT_GROUP_SCHED
 	struct task_struct *task;
 	struct cgroup_subsys_state *css;
@ -8940,9 +9034,9 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
 		if (!sched_rt_can_attach(css_tg(css), task))
 			return -EINVAL;
 	}
-	return 0;
-}
 #endif
+	return scx_cgroup_can_attach(tset);
+}
 static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 {
@ -8951,6 +9045,13 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_for_each(task, css, tset)
 		sched_move_task(task);
 	scx_cgroup_finish_attach();
 }
 static void cpu_cgroup_cancel_attach(struct cgroup_taskset *tset)
 {
 	scx_cgroup_cancel_attach(tset);
 }
 #ifdef CONFIG_UCLAMP_TASK_GROUP
@ -9127,22 +9228,36 @@ static int cpu_uclamp_max_show(struct seq_file *sf, void *v)
 }
 #endif /* CONFIG_UCLAMP_TASK_GROUP */
 #ifdef CONFIG_GROUP_SCHED_WEIGHT
 static unsigned long tg_weight(struct task_group *tg)
 {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	return scale_load_down(tg->shares);
 #else
 	return sched_weight_from_cgroup(tg->scx_weight);
 #endif
 }
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
 				struct cftype *cftype, u64 shareval)
 {
 	int ret;
 	if (shareval > scale_load_down(ULONG_MAX))
 		shareval = MAX_SHARES;
-	return sched_group_set_shares(css_tg(css), scale_load(shareval));
+	ret = sched_group_set_shares(css_tg(css), scale_load(shareval));
 	if (!ret)
 		scx_group_set_weight(css_tg(css),
 				     sched_weight_to_cgroup(shareval));
 	return ret;
 }
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
-	struct task_group *tg = css_tg(css);
-	return (u64) scale_load_down(tg->shares);
+	return tg_weight(css_tg(css));
 }
 #endif /* CONFIG_GROUP_SCHED_WEIGHT */
 #ifdef CONFIG_CFS_BANDWIDTH
 static DEFINE_MUTEX(cfs_constraints_mutex);
@ -9488,7 +9603,6 @@ static int cpu_cfs_local_stat_show(struct seq_file *sf, void *v)
 	return 0;
 }
 #endif /* CONFIG_CFS_BANDWIDTH */
-#endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 static int cpu_rt_runtime_write(struct cgroup_subsys_state *css,
@ -9516,7 +9630,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_GROUP_SCHED_WEIGHT
 static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
@ -9526,12 +9640,17 @@ static s64 cpu_idle_read_s64(struct cgroup_subsys_state *css,
 static int cpu_idle_write_s64(struct cgroup_subsys_state *css,
 				struct cftype *cft, s64 idle)
 {
-	return sched_group_set_idle(css_tg(css), idle);
+	int ret;
 	ret = sched_group_set_idle(css_tg(css), idle);
 	if (!ret)
 		scx_group_set_idle(css_tg(css), idle);
 	return ret;
 }
 #endif
 static struct cftype cpu_legacy_files[] = {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_GROUP_SCHED_WEIGHT
 	{
 		.name = "shares",
 		.read_u64 = cpu_shares_read_u64,
@ -9641,38 +9760,35 @@ static int cpu_local_stat_show(struct seq_file *sf,
 	return 0;
 }
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_GROUP_SCHED_WEIGHT
 static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
 			       struct cftype *cft)
 {
-	struct task_group *tg = css_tg(css);
-	u64 weight = scale_load_down(tg->shares);
-	return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024);
+	return sched_weight_to_cgroup(tg_weight(css_tg(css)));
 }
 static int cpu_weight_write_u64(struct cgroup_subsys_state *css,
-				struct cftype *cft, u64 weight)
+				struct cftype *cft, u64 cgrp_weight)
 {
-	/*
-	 * cgroup weight knobs should use the common MIN, DFL and MAX
-	 * values which are 1, 100 and 10000 respectively.  While it loses
-	 * a bit of range on both ends, it maps pretty well onto the shares
-	 * value used by scheduler and the round-trip conversions preserve
-	 * the original value over the entire range.
-	 */
-	if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX)
+	unsigned long weight;
+	int ret;
+
+	if (cgrp_weight < CGROUP_WEIGHT_MIN || cgrp_weight > CGROUP_WEIGHT_MAX)
 		return -ERANGE;
-	weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL);
+	weight = sched_weight_from_cgroup(cgrp_weight);
-	return sched_group_set_shares(css_tg(css), scale_load(weight));
+	ret = sched_group_set_shares(css_tg(css), scale_load(weight));
+	if (!ret)
+		scx_group_set_weight(css_tg(css), cgrp_weight);
+	return ret;
 }
 static s64 cpu_weight_nice_read_s64(struct cgroup_subsys_state *css,
 				    struct cftype *cft)
 {
-	unsigned long weight = scale_load_down(css_tg(css)->shares);
+	unsigned long weight = tg_weight(css_tg(css));
 	int last_delta = INT_MAX;
 	int prio, delta;
@ -9691,7 +9807,7 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 				     struct cftype *cft, s64 nice)
 {
 	unsigned long weight;
-	int idx;
+	int idx, ret;
 	if (nice < MIN_NICE || nice > MAX_NICE)
 		return -ERANGE;
@ -9700,9 +9816,13 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css,
 	idx = array_index_nospec(idx, 40);
 	weight = sched_prio_to_weight[idx];
-	return sched_group_set_shares(css_tg(css), scale_load(weight));
+	ret = sched_group_set_shares(css_tg(css), scale_load(weight));
 	if (!ret)
 		scx_group_set_weight(css_tg(css),
 				     sched_weight_to_cgroup(weight));
 	return ret;
 }
-#endif
+#endif /* CONFIG_GROUP_SCHED_WEIGHT */
 static void __maybe_unused cpu_period_quota_print(struct seq_file *sf,
 						  long period, long quota)
@ -9762,7 +9882,7 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of,
 #endif
 static struct cftype cpu_files[] = {
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_GROUP_SCHED_WEIGHT
 	{
 		.name = "weight",
 		.flags = CFTYPE_NOT_ON_ROOT,
@ -9816,14 +9936,14 @@ static struct cftype cpu_files[] = {
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_online	= cpu_cgroup_css_online,
 	.css_offline	= cpu_cgroup_css_offline,
 	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
 	.css_extra_stat_show = cpu_extra_stat_show,
 	.css_local_stat_show = cpu_local_stat_show,
-#ifdef CONFIG_RT_GROUP_SCHED
 	.can_attach	= cpu_cgroup_can_attach,
-#endif
 	.attach		= cpu_cgroup_attach,
 	.cancel_attach	= cpu_cgroup_cancel_attach,
 	.legacy_cftypes	= cpu_legacy_files,
 	.dfl_cftypes	= cpu_files,
 	.early_init	= true,
@ -10413,3 +10533,38 @@ void sched_mm_cid_fork(struct task_struct *t)
 	t->mm_cid_active = 1;
 }
 #endif
 #ifdef CONFIG_SCHED_CLASS_EXT
 void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
 			    struct sched_enq_and_set_ctx *ctx)
 {
 	struct rq *rq = task_rq(p);
 	lockdep_assert_rq_held(rq);
 	*ctx = (struct sched_enq_and_set_ctx){
 		.p = p,
 		.queue_flags = queue_flags,
 		.queued = task_on_rq_queued(p),
 		.running = task_current(rq, p),
 	};
 	update_rq_clock(rq);
 	if (ctx->queued)
 		dequeue_task(rq, p, queue_flags | DEQUEUE_NOCLOCK);
 	if (ctx->running)
 		put_prev_task(rq, p);
 }
 void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
 {
 	struct rq *rq = task_rq(ctx->p);
 	lockdep_assert_rq_held(rq);
 	if (ctx->queued)
 		enqueue_task(rq, ctx->p, ctx->queue_flags | ENQUEUE_NOCLOCK);
 	if (ctx->running)
 		set_next_task(rq, ctx->p);
 }
 #endif	/* CONFIG_SCHED_CLASS_EXT */
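The cgroup weight handlers in the kernel/sched/core.c changes above convert between the cgroup range (1 to 10000, default 100) and the scheduler's weight scale (default 1024), and the removed comment claims that round trips preserve the original value. The following is a minimal userspace sketch of that arithmetic, not kernel code. It assumes that sched_weight_from_cgroup() and sched_weight_to_cgroup(), whose bodies are not shown in this section, keep the DIV_ROUND_CLOSEST_ULL() formulas visible in the removed lines. The function names weight_from_cgroup(), weight_to_cgroup(), div_round_closest() and round_trip_preserved() are hypothetical.

```c
#include <assert.h>

/* Range stated in the patch comment: MIN 1, DFL 100, MAX 10000. */
#define CGROUP_WEIGHT_MIN 1ULL
#define CGROUP_WEIGHT_DFL 100ULL
#define CGROUP_WEIGHT_MAX 10000ULL

/* Userspace stand-in for DIV_ROUND_CLOSEST_ULL() on unsigned operands. */
unsigned long long div_round_closest(unsigned long long x, unsigned long long d)
{
	return (x + d / 2) / d;
}

/* cgroup weight -> scheduler weight; the default 100 maps to 1024. */
unsigned long long weight_from_cgroup(unsigned long long cgrp_weight)
{
	assert(cgrp_weight >= CGROUP_WEIGHT_MIN && cgrp_weight <= CGROUP_WEIGHT_MAX);
	return div_round_closest(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
}

/* scheduler weight -> cgroup weight (inverse scaling). */
unsigned long long weight_to_cgroup(unsigned long long weight)
{
	return div_round_closest(weight * CGROUP_WEIGHT_DFL, 1024);
}

/* Returns 1 if every valid cgroup weight survives a round trip, else 0. */
int round_trip_preserved(void)
{
	unsigned long long w;

	for (w = CGROUP_WEIGHT_MIN; w <= CGROUP_WEIGHT_MAX; w++)
		if (weight_to_cgroup(weight_from_cgroup(w)) != w)
			return 0;
	return 1;
}
```

The round trip holds because the forward rounding error is at most 0.5 on the scheduler scale, which shrinks to under 0.05 after dividing by 10.24 on the way back. That is why scx_group_set_weight() can be fed either the original cgroup value or a value converted back from shares.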
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@ -197,8 +197,10 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
 	if (!scx_switched_all())
 		util += cpu_util_cfs_boost(sg_cpu->cpu);
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
 	sg_cpu->bw_min = min;
@ -325,16 +327,35 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
 }
 #ifdef CONFIG_NO_HZ_COMMON
-static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
 {
-	unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
+	unsigned long idle_calls;
-	bool ret = idle_calls == sg_cpu->saved_idle_calls;
+	bool ret;
 	/*
	 * The heuristics in this function are for the fair class. For SCX, the
 	 * performance target comes directly from the BPF scheduler. Let's just
 	 * follow it.
 	 */
 	if (scx_switched_all())
 		return false;
 	/* if capped by uclamp_max, always update to be in compliance */
 	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
 		return false;
 	/*
 	 * Maintain the frequency if the CPU has not been idle recently, as
 	 * reduction is likely to be premature.
 	 */
 	idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
 	ret = idle_calls == sg_cpu->saved_idle_calls;
 	sg_cpu->saved_idle_calls = idle_calls;
 	return ret;
 }
 #else
-static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+static inline bool sugov_hold_freq(struct sugov_cpu *sg_cpu) { return false; }
 #endif /* CONFIG_NO_HZ_COMMON */
 /*
@ -382,14 +403,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
 		return;
 	next_f = get_next_freq(sg_policy, sg_cpu->util, max_cap);
-	/*
-	 * Do not reduce the frequency if the CPU has not been idle
-	 * recently, as the reduction is likely to be premature then.
-	 *
-	 * Except when the rq is capped by uclamp_max.
-	 */
-	if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
-	    sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+
+	if (sugov_hold_freq(sg_cpu) && next_f < sg_policy->next_freq &&
 	    !sg_policy->need_freq_update) {
 		next_f = sg_policy->next_freq;
@ -436,14 +451,7 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	if (!sugov_update_single_common(sg_cpu, time, max_cap, flags))
 		return;
-	/*
-	 * Do not reduce the target performance level if the CPU has not been
-	 * idle recently, as the reduction is likely to be premature then.
-	 *
-	 * Except when the rq is capped by uclamp_max.
-	 */
-	if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
-	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+	if (sugov_hold_freq(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
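The schedutil changes above do two things: the base utilization now starts from the BPF scheduler's performance target and only adds the fair-class contribution when some tasks are still on the fair class, and the "hold the frequency" heuristic is bypassed when every task is on SCX or when the rq is capped by uclamp_max. The following is a simplified userspace model of that decision flow, not the kernel implementation. The struct demo_cpu and the demo_* functions are hypothetical; their fields stand in for scx_cpuperf_target(), cpu_util_cfs_boost(), scx_switched_all(), uclamp_rq_is_capped() and tick_nohz_get_idle_calls_cpu().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the inputs that the patched functions read. */
struct demo_cpu {
	unsigned long scx_target;	/* models scx_cpuperf_target(cpu) */
	unsigned long cfs_util;		/* models cpu_util_cfs_boost(cpu) */
	bool scx_switched_all;		/* models scx_switched_all() */
	bool uclamp_capped;		/* models uclamp_rq_is_capped() */
	unsigned long idle_calls;	/* models tick_nohz_get_idle_calls_cpu() */
	unsigned long saved_idle_calls;
};

/* Mirrors the start of sugov_get_util(): SCX target plus optional CFS share. */
unsigned long demo_base_util(const struct demo_cpu *c)
{
	unsigned long util;

	assert(c);
	util = c->scx_target;
	if (!c->scx_switched_all)
		util += c->cfs_util;
	return util;
}

/* Mirrors sugov_hold_freq(): true means "do not lower the frequency yet". */
bool demo_hold_freq(struct demo_cpu *c)
{
	bool ret;

	assert(c);
	/* With all tasks on SCX, follow the BPF scheduler's target directly. */
	if (c->scx_switched_all)
		return false;
	/* When capped by uclamp_max, always update to stay in compliance. */
	if (c->uclamp_capped)
		return false;
	/* No new idle entries since the last check: the CPU has been busy. */
	ret = c->idle_calls == c->saved_idle_calls;
	c->saved_idle_calls = c->idle_calls;
	return ret;
}
```

Folding the uclamp_max check into the helper is what lets both callers shrink to a single condition in the hunks above.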
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@ -1264,6 +1264,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P(dl.runtime);
 		P(dl.deadline);
 	}
 #ifdef CONFIG_SCHED_CLASS_EXT
 	__PS("ext.enabled", task_on_scx(p));
 #endif
 #undef PN_SCHEDSTAT
 #undef P_SCHEDSTAT
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@ -0,0 +1,91 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #ifdef CONFIG_SCHED_CLASS_EXT
 void scx_tick(struct rq *rq);
 void init_scx_entity(struct sched_ext_entity *scx);
 void scx_pre_fork(struct task_struct *p);
 int scx_fork(struct task_struct *p);
 void scx_post_fork(struct task_struct *p);
 void scx_cancel_fork(struct task_struct *p);
 bool scx_can_stop_tick(struct rq *rq);
 void scx_rq_activate(struct rq *rq);
 void scx_rq_deactivate(struct rq *rq);
 int scx_check_setscheduler(struct task_struct *p, int policy);
 bool task_should_scx(struct task_struct *p);
 void init_sched_ext_class(void);
 static inline u32 scx_cpuperf_target(s32 cpu)
 {
 	if (scx_enabled())
 		return cpu_rq(cpu)->scx.cpuperf_target;
 	else
 		return 0;
 }
 static inline bool task_on_scx(const struct task_struct *p)
 {
 	return scx_enabled() && p->sched_class == &ext_sched_class;
 }
 #ifdef CONFIG_SCHED_CORE
 bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
 		   bool in_fi);
 #endif
 #else	/* CONFIG_SCHED_CLASS_EXT */
 static inline void scx_tick(struct rq *rq) {}
 static inline void scx_pre_fork(struct task_struct *p) {}
 static inline int scx_fork(struct task_struct *p) { return 0; }
 static inline void scx_post_fork(struct task_struct *p) {}
 static inline void scx_cancel_fork(struct task_struct *p) {}
 static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
 static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
 static inline void scx_rq_activate(struct rq *rq) {}
 static inline void scx_rq_deactivate(struct rq *rq) {}
 static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; }
 static inline bool task_on_scx(const struct task_struct *p) { return false; }
 static inline void init_sched_ext_class(void) {}
 #endif	/* CONFIG_SCHED_CLASS_EXT */
 #if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
 void __scx_update_idle(struct rq *rq, bool idle);
 static inline void scx_update_idle(struct rq *rq, bool idle)
 {
 	if (scx_enabled())
 		__scx_update_idle(rq, idle);
 }
 #else
 static inline void scx_update_idle(struct rq *rq, bool idle) {}
 #endif
 #ifdef CONFIG_CGROUP_SCHED
 #ifdef CONFIG_EXT_GROUP_SCHED
 int scx_tg_online(struct task_group *tg);
 void scx_tg_offline(struct task_group *tg);
 int scx_cgroup_can_attach(struct cgroup_taskset *tset);
 void scx_move_task(struct task_struct *p);
 void scx_cgroup_finish_attach(void);
 void scx_cgroup_cancel_attach(struct cgroup_taskset *tset);
 void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight);
 void scx_group_set_idle(struct task_group *tg, bool idle);
 #else	/* CONFIG_EXT_GROUP_SCHED */
 static inline int scx_tg_online(struct task_group *tg) { return 0; }
 static inline void scx_tg_offline(struct task_group *tg) {}
 static inline int scx_cgroup_can_attach(struct cgroup_taskset *tset) { return 0; }
 static inline void scx_move_task(struct task_struct *p) {}
 static inline void scx_cgroup_finish_attach(void) {}
 static inline void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) {}
 static inline void scx_group_set_weight(struct task_group *tg, unsigned long cgrp_weight) {}
 static inline void scx_group_set_idle(struct task_group *tg, bool idle) {}
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 #endif	/* CONFIG_CGROUP_SCHED */
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@ -3924,7 +3924,8 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	}
 }
-void reweight_task(struct task_struct *p, const struct load_weight *lw)
+static void reweight_task_fair(struct rq *rq, struct task_struct *p,
 			       const struct load_weight *lw)
 {
 	struct sched_entity *se = &p->se;
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
@ -8807,7 +8808,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	/*
 	 * BATCH and IDLE tasks do not preempt others.
 	 */
-	if (unlikely(p->policy != SCHED_NORMAL))
+	if (unlikely(!normal_policy(p->policy)))
 		return;
 	cfs_rq = cfs_rq_of(se);
@ -9716,29 +9717,18 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
 static bool __update_blocked_others(struct rq *rq, bool *done)
 {
-	const struct sched_class *curr_class;
-	u64 now = rq_clock_pelt(rq);
-	unsigned long hw_pressure;
-	bool decayed;
+	bool updated;
 	/*
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class = rq->curr->sched_class;
-	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
-	/* hw_pressure doesn't care about invariance */
-	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
-		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
-		  update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
-		  update_irq_load_avg(rq, 0);
+	updated = update_other_load_avgs(rq);
 	if (others_have_blocked(rq))
 		*done = false;
-	return decayed;
+	return updated;
 }
 #ifdef CONFIG_FAIR_GROUP_SCHED
@ -13612,6 +13602,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.task_tick		= task_tick_fair,
 	.task_fork		= task_fork_fair,
 	.reweight_task		= reweight_task_fair,
 	.prio_changed		= prio_changed_fair,
 	.switched_from		= switched_from_fair,
 	.switched_to		= switched_to_fair,
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@ -453,11 +453,13 @@ static void wakeup_preempt_idle(struct rq *rq, struct task_struct *p, int flags)
 static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next)
 {
 	dl_server_update_idle_time(rq, prev);
 	scx_update_idle(rq, false);
 }
 static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
 {
 	update_idle_core(rq);
 	scx_update_idle(rq, true);
 	schedstat_inc(rq->sched_goidle);
 	next->se.exec_start = rq_clock_task(rq);
 }
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@ -467,3 +467,23 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 	return ret;
 }
 #endif
 /*
 * Load avg and utilization metrics need to be updated periodically and before
 * consumption. This function updates the metrics for all subsystems except for
 * the fair class. @rq must be locked and have its clock updated.
 */
 bool update_other_load_avgs(struct rq *rq)
 {
 	u64 now = rq_clock_pelt(rq);
 	const struct sched_class *curr_class = rq->curr->sched_class;
 	unsigned long hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
 	lockdep_assert_rq_held(rq);
 	/* hw_pressure doesn't care about invariance */
 	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
 		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
 		update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
 		update_irq_load_avg(rq, 0);
 }
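The return expression in update_other_load_avgs() joins the individual updates with bitwise `|` rather than logical `||`. One practical effect is that every update runs even after an earlier one has already reported a change. The following is a minimal userspace sketch of that difference; the `fake_update_*()` functions and the call counter are hypothetical stand-ins and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-class updates; not kernel code. */
static int nr_calls;

int fake_update_rt(void)  { nr_calls++; return 1; }	/* reports a change */
int fake_update_dl(void)  { nr_calls++; return 0; }
int fake_update_irq(void) { nr_calls++; return 0; }

/* Bitwise OR evaluates every operand, so all three updates always run. */
bool update_all_bitwise(void)
{
	nr_calls = 0;
	return fake_update_rt() | fake_update_dl() | fake_update_irq();
}

/* Logical OR short-circuits: once one operand is true, the rest are skipped. */
bool update_all_logical(void)
{
	nr_calls = 0;
	return fake_update_rt() || fake_update_dl() || fake_update_irq();
}

/* Number of stand-in updates executed by the most recent call above. */
int calls_made(void)
{
	assert(nr_calls >= 0);
	return nr_calls;
}
```

With `|`, the order in which the operands are evaluated is unspecified in C, but none of them is skipped.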
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@ -6,6 +6,7 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
 int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 bool update_other_load_avgs(struct rq *rq);
 #ifdef CONFIG_SCHED_HW_PRESSURE
 int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@ -193,9 +193,18 @@ static inline int idle_policy(int policy)
 	return policy == SCHED_IDLE;
 }
 static inline int normal_policy(int policy)
 {
 #ifdef CONFIG_SCHED_CLASS_EXT
 	if (policy == SCHED_EXT)
 		return true;
 #endif
 	return policy == SCHED_NORMAL;
 }
 static inline int fair_policy(int policy)
 {
-	return policy == SCHED_NORMAL || policy == SCHED_BATCH;
+	return normal_policy(policy) || policy == SCHED_BATCH;
 }
 static inline int rt_policy(int policy)
@ -245,6 +254,24 @@ static inline void update_avg(u64 *avg, u64 sample)
 #define shr_bound(val, shift)							\
 	(val >> min_t(typeof(shift), shift, BITS_PER_TYPE(typeof(val)) - 1))
 /*
 * cgroup weight knobs should use the common MIN, DFL and MAX values which are
 * 1, 100 and 10000 respectively. While it loses a bit of range on both ends, it
 * maps pretty well onto the shares value used by scheduler and the round-trip
 * conversions preserve the original value over the entire range.
 */
 static inline unsigned long sched_weight_from_cgroup(unsigned long cgrp_weight)
 {
 	return DIV_ROUND_CLOSEST_ULL(cgrp_weight * 1024, CGROUP_WEIGHT_DFL);
 }
 static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
 {
 	return clamp_t(unsigned long,
 		       DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024),
 		       CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
 }
 /*
 * !! For sched_setattr_nocheck() (kernel) only !!
 *
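The comment on sched_weight_from_cgroup() states that round-trip conversions preserve the original value over the entire range. The following userspace sketch checks that claim for every cgroup weight from 1 to 10000. The helper names are hypothetical, and `div_round_closest()` is an assumed stand-in for the kernel's DIV_ROUND_CLOSEST_ULL() for unsigned operands.

```c
#include <assert.h>

/* Values stated in the patch comment: cgroup weight MIN, DFL and MAX. */
#define CGROUP_WEIGHT_MIN	1UL
#define CGROUP_WEIGHT_DFL	100UL
#define CGROUP_WEIGHT_MAX	10000UL

/* Assumed userspace stand-in for DIV_ROUND_CLOSEST_ULL(), unsigned only. */
unsigned long div_round_closest(unsigned long long x, unsigned long d)
{
	assert(d != 0);
	return (unsigned long)((x + d / 2) / d);
}

/* cgroup weight -> scheduler weight, scaled so that 100 maps to 1024. */
unsigned long weight_from_cgroup(unsigned long cgrp_weight)
{
	return div_round_closest((unsigned long long)cgrp_weight * 1024,
				 CGROUP_WEIGHT_DFL);
}

/* scheduler weight -> cgroup weight, clamped to [MIN, MAX]. */
unsigned long weight_to_cgroup(unsigned long weight)
{
	unsigned long w = div_round_closest((unsigned long long)weight *
					    CGROUP_WEIGHT_DFL, 1024);

	if (w < CGROUP_WEIGHT_MIN)
		return CGROUP_WEIGHT_MIN;
	if (w > CGROUP_WEIGHT_MAX)
		return CGROUP_WEIGHT_MAX;
	return w;
}

/* Returns 1 if every cgroup weight in [MIN, MAX] survives the round trip. */
int round_trip_is_lossless(void)
{
	unsigned long w;

	for (w = CGROUP_WEIGHT_MIN; w <= CGROUP_WEIGHT_MAX; w++)
		if (weight_to_cgroup(weight_from_cgroup(w)) != w)
			return 0;
	return 1;
}
```

The round trip holds because the forward conversion multiplies by 10.24, so its rounding error of at most 0.5 shrinks to under 0.05 on the way back.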
@ -432,6 +459,11 @@ struct task_group {
 	struct rt_bandwidth	rt_bandwidth;
 #endif
 #ifdef CONFIG_EXT_GROUP_SCHED
 	u32			scx_flags;	/* SCX_TG_* */
 	u32			scx_weight;
 #endif
 	struct rcu_head		rcu;
 	struct list_head	list;
@ -456,7 +488,7 @@ struct task_group {
 };
-#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_GROUP_SCHED_WEIGHT
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 /*
@ -487,6 +519,11 @@ static inline int walk_tg_tree(tg_visitor down, tg_visitor up, void *data)
 	return walk_tg_tree_from(&root_task_group, down, up, data);
 }
 static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
 {
 	return css ? container_of(css, struct task_group, css) : NULL;
 }
 extern int tg_nop(struct task_group *tg, void *data);
 #ifdef CONFIG_FAIR_GROUP_SCHED
@ -543,6 +580,8 @@ extern void set_task_rq_fair(struct sched_entity *se,
 static inline void set_task_rq_fair(struct sched_entity *se,
 			     struct cfs_rq *prev, struct cfs_rq *next) { }
 #endif /* CONFIG_SMP */
 #else /* !CONFIG_FAIR_GROUP_SCHED */
 static inline int sched_group_set_shares(struct task_group *tg, unsigned long shares) { return 0; }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 #else /* CONFIG_CGROUP_SCHED */
@ -596,6 +635,11 @@ do {									\
 # define u64_u32_load(var)		u64_u32_load_copy(var, var##_copy)
 # define u64_u32_store(var, val)	u64_u32_store_copy(var, var##_copy, val)
 struct balance_callback {
 	struct balance_callback *next;
 	void (*func)(struct rq *rq);
 };
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
@ -695,6 +739,44 @@ struct cfs_rq {
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
 #ifdef CONFIG_SCHED_CLASS_EXT
 /* scx_rq->flags, protected by the rq lock */
 enum scx_rq_flags {
 	/*
 	 * A hotplugged CPU starts scheduling before rq_online_scx(). Track
 	 * ops.cpu_on/offline() state so that ops.enqueue/dispatch() are called
 	 * only while the BPF scheduler considers the CPU to be online.
 	 */
 	SCX_RQ_ONLINE		= 1 << 0,
 	SCX_RQ_CAN_STOP_TICK	= 1 << 1,
 	SCX_RQ_BAL_KEEP		= 1 << 2, /* balance decided to keep current */
 	SCX_RQ_BYPASSING	= 1 << 3,
 	SCX_RQ_IN_WAKEUP	= 1 << 16,
 	SCX_RQ_IN_BALANCE	= 1 << 17,
 };
 struct scx_rq {
 	struct scx_dispatch_q	local_dsq;
 	struct list_head	runnable_list;		/* runnable tasks on this rq */
 	struct list_head	ddsp_deferred_locals;	/* deferred ddsps from enq */
 	unsigned long		ops_qseq;
 	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
 	u32			nr_running;
 	u32			flags;
 	u32			cpuperf_target;		/* [0, SCHED_CAPACITY_SCALE] */
 	bool			cpu_released;
 	cpumask_var_t		cpus_to_kick;
 	cpumask_var_t		cpus_to_kick_if_idle;
 	cpumask_var_t		cpus_to_preempt;
 	cpumask_var_t		cpus_to_wait;
 	unsigned long		pnt_seq;
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
 static inline int rt_bandwidth_enabled(void)
 {
 	return sysctl_sched_rt_runtime >= 0;
@ -1001,11 +1083,6 @@ struct uclamp_rq {
 DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 struct balance_callback {
 	struct balance_callback *next;
 	void (*func)(struct rq *rq);
 };
 /*
 * This is the main, per-CPU runqueue data structure.
 *
@ -1048,6 +1125,9 @@ struct rq {
 	struct cfs_rq		cfs;
 	struct rt_rq		rt;
 	struct dl_rq		dl;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	struct scx_rq		scx;
 #endif
 	struct sched_dl_entity	fair_server;
@ -2302,6 +2382,7 @@ struct sched_class {
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	struct task_struct *(*pick_task)(struct rq *rq);
 	/*
 	 * Optional! When implemented pick_next_task() should be equivalent to:
@ -2318,7 +2399,6 @@ struct sched_class {
 	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
@ -2342,8 +2422,11 @@ struct sched_class {
 	 * cannot assume the switched_from/switched_to pair is serialized by
 	 * rq->lock. They are however serialized by p->pi_lock.
 	 */
 	void (*switching_to) (struct rq *this_rq, struct task_struct *task);
 	void (*switched_from)(struct rq *this_rq, struct task_struct *task);
 	void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
 	void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
 			      const struct load_weight *lw);
 	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
 			      int oldprio);
@ -2416,19 +2499,54 @@ const struct sched_class name##_sched_class \
 extern struct sched_class __sched_class_highest[];
 extern struct sched_class __sched_class_lowest[];
 extern const struct sched_class stop_sched_class;
 extern const struct sched_class dl_sched_class;
 extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
 #ifdef CONFIG_SCHED_CLASS_EXT
 extern const struct sched_class ext_sched_class;
 DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);	/* SCX BPF scheduler loaded */
 DECLARE_STATIC_KEY_FALSE(__scx_switched_all);	/* all fair class tasks on SCX */
 #define scx_enabled()		static_branch_unlikely(&__scx_ops_enabled)
 #define scx_switched_all()	static_branch_unlikely(&__scx_switched_all)
 #else /* !CONFIG_SCHED_CLASS_EXT */
 #define scx_enabled()		false
 #define scx_switched_all()	false
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 /*
 * Iterate only active classes. SCX can take over all fair tasks or be
 * completely disabled. If the former, skip fair. If the latter, skip SCX.
 */
 static inline const struct sched_class *next_active_class(const struct sched_class *class)
 {
 	class++;
 #ifdef CONFIG_SCHED_CLASS_EXT
 	if (scx_switched_all() && class == &fair_sched_class)
 		class++;
 	if (!scx_enabled() && class == &ext_sched_class)
 		class++;
 #endif
 	return class;
 }
 #define for_class_range(class, _from, _to) \
 	for (class = (_from); class < (_to); class++)
 #define for_each_class(class) \
 	for_class_range(class, __sched_class_highest, __sched_class_lowest)
-#define sched_class_above(_a, _b)	((_a) < (_b))
+#define for_active_class_range(class, _from, _to)				\
+	for (class = (_from); class != (_to); class = next_active_class(class))
-extern const struct sched_class stop_sched_class;
-extern const struct sched_class dl_sched_class;
-extern const struct sched_class rt_sched_class;
-extern const struct sched_class fair_sched_class;
-extern const struct sched_class idle_sched_class;
+#define for_each_active_class(class)						\
+	for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
+
+#define sched_class_above(_a, _b)	((_a) < (_b))
 static inline bool sched_stop_runnable(struct rq *rq)
 {
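The skip logic in next_active_class() can be modelled in userspace to see which classes for_each_active_class() would visit in each SCX state. This is a toy sketch only: the enum, the function names and the assumption that the ext class sits directly after fair in the class array are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy class indices, highest priority first. Assumed order, not kernel code. */
enum toy_class { TOY_STOP, TOY_DL, TOY_RT, TOY_FAIR, TOY_EXT, TOY_IDLE, TOY_NR };

/* Mirrors the skip logic of next_active_class() on the toy indices. */
int toy_next_active(int class, bool scx_enabled, bool scx_switched_all)
{
	assert(class >= 0 && class < TOY_NR);
	class++;
	if (scx_switched_all && class == TOY_FAIR)
		class++;		/* SCX took over all fair tasks */
	if (!scx_enabled && class == TOY_EXT)
		class++;		/* no BPF scheduler loaded */
	return class;
}

/* Walks all active classes and returns a bitmask of the ones visited. */
unsigned int toy_visited_mask(bool scx_enabled, bool scx_switched_all)
{
	unsigned int mask = 0;
	int class;

	for (class = TOY_STOP; class != TOY_NR;
	     class = toy_next_active(class, scx_enabled, scx_switched_all))
		mask |= 1U << class;
	return mask;
}
```

The loop condition uses `!=` rather than `<`, matching for_active_class_range(), because the iterator may advance by more than one slot per step.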
@ -2467,6 +2585,19 @@ extern void sched_balance_trigger(struct rq *rq);
 extern int __set_cpus_allowed_ptr(struct task_struct *p, struct affinity_context *ctx);
 extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_context *ctx);
 static inline bool task_allowed_on_cpu(struct task_struct *p, int cpu)
 {
 	/* When not in the task's cpumask, no point in looking further. */
 	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 		return false;
 	/* Can @cpu run a user thread? */
 	if (!(p->flags & PF_KTHREAD) && !task_cpu_possible(cpu, p))
 		return false;
 	return true;
 }
 static inline cpumask_t *alloc_user_cpus_ptr(int node)
 {
 	/*
@ -2500,6 +2631,11 @@ extern int push_cpu_stop(void *arg);
 #else /* !CONFIG_SMP: */
 static inline bool task_allowed_on_cpu(struct task_struct *p, int cpu)
 {
 	return true;
 }
 static inline int __set_cpus_allowed_ptr(struct task_struct *p,
 					 struct affinity_context *ctx)
 {
@ -2553,8 +2689,6 @@ extern void init_sched_dl_class(void);
 extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 extern void reweight_task(struct task_struct *p, const struct load_weight *lw);
 extern void resched_curr(struct rq *rq);
 extern void resched_cpu(int cpu);
@ -3154,6 +3288,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 	return READ_ONCE(rq->avg_rt.util_avg);
 }
 #else /* !CONFIG_SMP */
 static inline bool update_other_load_avgs(struct rq *rq) { return false; }
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_UCLAMP_TASK
@ -3664,6 +3800,8 @@ extern void set_load_weight(struct task_struct *p, bool update_load);
 extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
 extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
 extern void check_class_changing(struct rq *rq, struct task_struct *p,
 				 const struct sched_class *prev_class);
 extern void check_class_changed(struct rq *rq, struct task_struct *p,
 				const struct sched_class *prev_class,
 				int oldprio);
@ -3684,4 +3822,24 @@ static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
 #endif
 #ifdef CONFIG_SCHED_CLASS_EXT
 /*
 * Used by SCX in the enable/disable paths to move tasks between sched_classes
 * and establish invariants.
 */
 struct sched_enq_and_set_ctx {
 	struct task_struct	*p;
 	int			queue_flags;
 	bool			queued;
 	bool			running;
 };
 void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
 			    struct sched_enq_and_set_ctx *ctx);
 void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
 #endif /* CONFIG_SCHED_CLASS_EXT */
 #include "ext.h"
 #endif /* _KERNEL_SCHED_SCHED_H */
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@ -612,6 +612,10 @@ int __sched_setscheduler(struct task_struct *p,
 		goto unlock;
 	}
 	retval = scx_check_setscheduler(p, policy);
 	if (retval)
 		goto unlock;
 	/*
 	 * If not changing anything there's no need to proceed further,
 	 * but store a possible modification of reset_on_fork.
@ -716,6 +720,7 @@ int __sched_setscheduler(struct task_struct *p,
 		__setscheduler_prio(p, newprio);
 	}
 	__setscheduler_uclamp(p, attr);
 	check_class_changing(rq, p, prev_class);
 	if (queued) {
 		/*
@ -1526,6 +1531,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
 	case SCHED_EXT:
 		ret = 0;
 		break;
 	}
@ -1553,6 +1559,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
 	case SCHED_NORMAL:
 	case SCHED_BATCH:
 	case SCHED_IDLE:
 	case SCHED_EXT:
 		ret = 0;
 	}
 	return ret;
--- a/lib/dump_stack.c
+++ b/lib/dump_stack.c
@ -73,6 +73,7 @@ void dump_stack_print_info(const char *log_lvl)
 	print_worker_info(log_lvl, current);
 	print_stop_info(log_lvl, current);
 	print_scx_info(log_lvl, current);
 }
 /**
--- a/tools/Makefile
+++ b/tools/Makefile
@ -28,6 +28,7 @@ help:
 	@echo '  pci                    - PCI tools'
 	@echo '  perf                   - Linux performance measurement and analysis tool'
 	@echo '  selftests              - various kernel selftests'
 	@echo '  sched_ext              - sched_ext example schedulers'
 	@echo '  bootconfig             - boot config tool'
 	@echo '  spi                    - spi tools'
 	@echo '  tmon                   - thermal monitoring and tuning tool'
@ -91,6 +92,9 @@ perf: FORCE
 	$(Q)mkdir -p $(PERF_O) .
 	$(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir=
 sched_ext: FORCE
 	$(call descend,sched_ext)
 selftests: FORCE
 	$(call descend,testing/$@)
@ -184,6 +188,9 @@ perf_clean:
 	$(Q)mkdir -p $(PERF_O) .
 	$(Q)$(MAKE) --no-print-directory -C perf O=$(PERF_O) subdir= clean
 sched_ext_clean:
 	$(call descend,sched_ext,clean)
 selftests_clean:
 	$(call descend,testing/$(@:_clean=),clean)
@ -213,6 +220,7 @@ clean: acpi_clean counter_clean cpupower_clean hv_clean firewire_clean \
 		mm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \
 		freefall_clean build_clean libbpf_clean libsubcmd_clean \
 		gpio_clean objtool_clean leds_clean wmi_clean pci_clean firmware_clean debugging_clean \
-		intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean
+		intel-speed-select_clean tracing_clean thermal_clean thermometer_clean thermal-engine_clean \
 		sched_ext_clean
 .PHONY: FORCE
--- a/tools/sched_ext/.gitignore
+++ b/tools/sched_ext/.gitignore
@ -0,0 +1,2 @@
 tools/
 build/
--- a/tools/sched_ext/Makefile
+++ b/tools/sched_ext/Makefile
@ -0,0 +1,246 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 include ../build/Build.include
 include ../scripts/Makefile.arch
 include ../scripts/Makefile.include
 all: all_targets
 ifneq ($(LLVM),)
 ifneq ($(filter %/,$(LLVM)),)
 LLVM_PREFIX := $(LLVM)
 else ifneq ($(filter -%,$(LLVM)),)
 LLVM_SUFFIX := $(LLVM)
 endif
 CLANG_TARGET_FLAGS_arm          := arm-linux-gnueabi
 CLANG_TARGET_FLAGS_arm64        := aarch64-linux-gnu
 CLANG_TARGET_FLAGS_hexagon      := hexagon-linux-musl
 CLANG_TARGET_FLAGS_m68k         := m68k-linux-gnu
 CLANG_TARGET_FLAGS_mips         := mipsel-linux-gnu
 CLANG_TARGET_FLAGS_powerpc      := powerpc64le-linux-gnu
 CLANG_TARGET_FLAGS_riscv        := riscv64-linux-gnu
 CLANG_TARGET_FLAGS_s390         := s390x-linux-gnu
 CLANG_TARGET_FLAGS_x86          := x86_64-linux-gnu
 CLANG_TARGET_FLAGS              := $(CLANG_TARGET_FLAGS_$(ARCH))
 ifeq ($(CROSS_COMPILE),)
 ifeq ($(CLANG_TARGET_FLAGS),)
 $(error Specify CROSS_COMPILE or add '--target=' option to lib.mk)
 else
 CLANG_FLAGS     += --target=$(CLANG_TARGET_FLAGS)
 endif # CLANG_TARGET_FLAGS
 else
 CLANG_FLAGS     += --target=$(notdir $(CROSS_COMPILE:%-=%))
 endif # CROSS_COMPILE
 CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
 else
 CC := $(CROSS_COMPILE)gcc
 endif # LLVM
 CURDIR := $(abspath .)
 TOOLSDIR := $(abspath ..)
 LIBDIR := $(TOOLSDIR)/lib
 BPFDIR := $(LIBDIR)/bpf
 TOOLSINCDIR := $(TOOLSDIR)/include
 BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
 APIDIR := $(TOOLSINCDIR)/uapi
 GENDIR := $(abspath ../../include/generated)
 GENHDR := $(GENDIR)/autoconf.h
 ifeq ($(O),)
 OUTPUT_DIR := $(CURDIR)/build
 else
 OUTPUT_DIR := $(O)/build
 endif # O
 OBJ_DIR := $(OUTPUT_DIR)/obj
 INCLUDE_DIR := $(OUTPUT_DIR)/include
 BPFOBJ_DIR := $(OBJ_DIR)/libbpf
 SCXOBJ_DIR := $(OBJ_DIR)/sched_ext
 BINDIR := $(OUTPUT_DIR)/bin
 BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
 ifneq ($(CROSS_COMPILE),)
 HOST_BUILD_DIR		:= $(OBJ_DIR)/host
 HOST_OUTPUT_DIR	:= host-tools
 HOST_INCLUDE_DIR	:= $(HOST_OUTPUT_DIR)/include
 else
 HOST_BUILD_DIR		:= $(OBJ_DIR)
 HOST_OUTPUT_DIR	:= $(OUTPUT_DIR)
 HOST_INCLUDE_DIR	:= $(INCLUDE_DIR)
 endif
 HOST_BPFOBJ := $(HOST_BUILD_DIR)/libbpf/libbpf.a
 RESOLVE_BTFIDS := $(HOST_BUILD_DIR)/resolve_btfids/resolve_btfids
 DEFAULT_BPFTOOL := $(HOST_OUTPUT_DIR)/sbin/bpftool
 VMLINUX_BTF_PATHS ?= $(if $(O),$(O)/vmlinux)					\
 		     $(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux)		\
 		     ../../vmlinux						\
 		     /sys/kernel/btf/vmlinux					\
 		     /boot/vmlinux-$(shell uname -r)
 VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
 ifeq ($(VMLINUX_BTF),)
 $(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
 endif
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 ifneq ($(wildcard $(GENHDR)),)
  GENFLAGS := -DHAVE_GENHDR
 endif
 CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS)			\
 	  -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR)				\
 	  -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include
 # Silence some warnings when compiled with clang
 ifneq ($(LLVM),)
 CFLAGS += -Wno-unused-command-line-argument
 endif
 LDFLAGS = -lelf -lz -lpthread
 IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null |				\
 			grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
 # Get Clang's default includes on this system, as opposed to those seen by
 # '-target bpf'. This fixes "missing" files on some architectures/distros,
 # such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
 #
 # Use '-idirafter': Don't interfere with include mechanics except where the
 # build would have failed anyways.
 define get_sys_includes
 $(shell $(1) -v -E - </dev/null 2>&1 \
 	| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
 $(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
 endef
 BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH)					\
 	     $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian)		\
 	     -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat			\
 	     -I$(INCLUDE_DIR) -I$(APIDIR)					\
 	     -I../../include							\
 	     $(call get_sys_includes,$(CLANG))					\
 	     -Wall -Wno-compare-distinct-pointer-types				\
 	     -O2 -mcpu=v3
 # sort removes libbpf duplicates when not cross-building
 MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(HOST_BUILD_DIR)/libbpf			\
 	       $(HOST_BUILD_DIR)/bpftool $(HOST_BUILD_DIR)/resolve_btfids	\
 	       $(INCLUDE_DIR) $(SCXOBJ_DIR) $(BINDIR))
 $(MAKE_DIRS):
 	$(call msg,MKDIR,,$@)
 	$(Q)mkdir -p $@
 $(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile)			\
 	   $(APIDIR)/linux/bpf.h						\
 	   | $(OBJ_DIR)/libbpf
 	$(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/	\
 		    EXTRA_CFLAGS='-g -O0 -fPIC'					\
 		    DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
 $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)	\
 		    $(HOST_BPFOBJ) | $(HOST_BUILD_DIR)/bpftool
 	$(Q)$(MAKE) $(submake_extras)  -C $(BPFTOOLDIR)				\
 		    ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD)		\
 		    EXTRA_CFLAGS='-g -O0'					\
 		    OUTPUT=$(HOST_BUILD_DIR)/bpftool/				\
 		    LIBBPF_OUTPUT=$(HOST_BUILD_DIR)/libbpf/			\
 		    LIBBPF_DESTDIR=$(HOST_OUTPUT_DIR)/				\
 		    prefix= DESTDIR=$(HOST_OUTPUT_DIR)/ install-bin
 $(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
 ifeq ($(VMLINUX_H),)
 	$(call msg,GEN,,$@)
 	$(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
 else
 	$(call msg,CP,,$@)
 	$(Q)cp "$(VMLINUX_H)" $@
 endif
 $(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h include/scx/*.h		\
 		       | $(BPFOBJ) $(SCXOBJ_DIR)
 	$(call msg,CLNG-BPF,,$(notdir $@))
 	$(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
 $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL)
 	$(eval sched=$(notdir $@))
 	$(call msg,GEN-SKEL,,$(sched))
 	$(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
 	$(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
 	$(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
 	$(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
 	$(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
 	$(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
 SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
 c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg
 $(addprefix $(BINDIR)/,$(c-sched-targets)): \
 	$(BINDIR)/%: \
 		$(filter-out %.bpf.c,%.c) \
 		$(INCLUDE_DIR)/%.bpf.skel.h \
 		$(SCX_COMMON_DEPS)
 	$(eval sched=$(notdir $@))
 	$(CC) $(CFLAGS) -c $(sched).c -o $(SCXOBJ_DIR)/$(sched).o
 	$(CC) -o $@ $(SCXOBJ_DIR)/$(sched).o $(HOST_BPFOBJ) $(LDFLAGS)
 $(c-sched-targets): %: $(BINDIR)/%
 install: all
 	$(Q)mkdir -p $(DESTDIR)/usr/local/bin/
 	$(Q)cp $(BINDIR)/* $(DESTDIR)/usr/local/bin/
 clean:
 	rm -rf $(OUTPUT_DIR) $(HOST_OUTPUT_DIR)
 	rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
 	rm -f $(c-sched-targets)
 help:
 	@echo   'Building targets'
 	@echo   '================'
 	@echo   ''
 	@echo   '  all		  - Compile all schedulers'
 	@echo   ''
 	@echo   'Alternatively, you may compile individual schedulers:'
 	@echo   ''
 	@printf '  %s\n' $(c-sched-targets)
 	@echo   ''
 	@echo   'For any scheduler build target, you may specify an alternative'
 	@echo   'build output path with the O= environment variable. For example:'
 	@echo   ''
 	@echo   '   O=/tmp/sched_ext make all'
 	@echo   ''
 	@echo   'will compile all schedulers, and emit the build artifacts to'
 	@echo   '/tmp/sched_ext/build.'
 	@echo   ''
 	@echo   ''
 	@echo   'Installing targets'
 	@echo   '=================='
 	@echo   ''
 	@echo   '  install	  - Compile and install all schedulers to /usr/local/bin.'
 	@echo   '		    You may specify the DESTDIR= environment variable'
 	@echo   '		    to indicate a prefix for /usr/local/bin. For example:'
 	@echo   ''
 	@echo   '                     DESTDIR=/tmp/sched_ext make install'
 	@echo   ''
 	@echo   '		    will build the schedulers in CWD/build, and'
 	@echo   '		    install the schedulers to /tmp/sched_ext/usr/local/bin.'
 	@echo   ''
 	@echo   ''
 	@echo   'Cleaning targets'
 	@echo   '================'
 	@echo   ''
 	@echo   '  clean		  - Remove all generated files'
 all_targets: $(c-sched-targets)
 .PHONY: all all_targets $(c-sched-targets) clean help
 # delete failed targets
 .DELETE_ON_ERROR:
 # keep intermediate (.bpf.skel.h, .bpf.o, etc) targets
 .SECONDARY:
--- a/tools/sched_ext/README.md
+++ b/tools/sched_ext/README.md
@ -0,0 +1,270 @@
 SCHED_EXT EXAMPLE SCHEDULERS
 ============================
 # Introduction
 This directory contains a number of example sched_ext schedulers. These
 schedulers are meant to provide examples of different types of schedulers
 that can be built using sched_ext, and illustrate how various features of
 sched_ext can be used.
 Some of the examples are performant, production-ready schedulers. That is, for
 the correct workload and with the correct tuning, they may be deployed in a
 production environment with acceptable or possibly even improved performance.
 Others are just examples that, in practice, would not provide acceptable
 performance (though they could be improved to get there).
 This README will describe these example schedulers, including describing the
 types of workloads or scenarios they're designed to accommodate, and whether or
 not they're production ready. For more details on any of these schedulers,
 please see the header comment in their .bpf.c file.
 # Compiling the examples
 There are a few toolchain dependencies for compiling the example schedulers.
 ## Toolchain dependencies
 1. clang >= 16.0.0
 The schedulers are BPF programs, and therefore must be compiled with clang. gcc
 is actively working on adding a BPF backend compiler as well, but it is still
 missing some features such as BTF type tags which are necessary for using
 kptrs.
 2. pahole >= 1.25
 You may need pahole in order to generate BTF from DWARF.
 3. rust >= 1.70.0
 Rust schedulers use features present in the rust toolchain >= 1.70.0. You
 should be able to use the stable build from rustup, but if that doesn't
 work, try using the rustup nightly build.
 There are other requirements as well, such as make, but these are the main /
 non-trivial ones.
 ## Compiling the kernel
 In order to run a sched_ext scheduler, you'll have to run a kernel compiled
 with the patches in this repository, and with a minimum set of necessary
 Kconfig options:
 ```
 CONFIG_BPF=y
 CONFIG_SCHED_CLASS_EXT=y
 CONFIG_BPF_SYSCALL=y
 CONFIG_BPF_JIT=y
 CONFIG_DEBUG_INFO_BTF=y
 ```
 It's also recommended that you include the following Kconfig options:
 ```
 CONFIG_BPF_JIT_ALWAYS_ON=y
 CONFIG_BPF_JIT_DEFAULT_ON=y
 CONFIG_PAHOLE_HAS_SPLIT_BTF=y
 CONFIG_PAHOLE_HAS_BTF_TAG=y
 ```
 There is a `Kconfig` file in this directory whose contents you can append to
 your local `.config` file, as long as there are no conflicts with any existing
 options in the file.
 ## Getting a vmlinux.h file
 You may notice that most of the example schedulers include a "vmlinux.h" file.
 This is a large, auto-generated header file that contains all of the types
 defined in some vmlinux binary that was compiled with
 [BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
 options specified above).
 The header file is created using `bpftool`, by passing it a vmlinux binary
 compiled with BTF as follows:
 ```bash
 $ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
 ```
 `bpftool` analyzes all of the BTF encodings in the binary, and produces a
 header file that can be included by BPF programs to access those types.  For
 example, using vmlinux.h allows a scheduler to access fields defined directly
 in vmlinux as follows:
 ```c
 #include "vmlinux.h"
 // vmlinux.h is also implicitly included by scx_common.bpf.h.
 #include "scx_common.bpf.h"
 /*
 * vmlinux.h provides definitions for struct task_struct and
 * struct scx_enable_args.
 */
 void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
 		    struct scx_enable_args *args)
 {
 	bpf_printk("Task %s enabled in example scheduler", p->comm);
 }
 // vmlinux.h provides the definition for struct sched_ext_ops.
 SEC(".struct_ops.link")
 struct sched_ext_ops example_ops = {
 	.enable	= (void *)example_enable,
 	.name	= "example",
 };
 ```
 The scheduler build system will generate this vmlinux.h file as part of the
 scheduler build pipeline. It looks for a vmlinux file in the following
 order, using the first one that exists:
 1. If the O= environment variable is defined, at `$O/vmlinux`
 2. If the KBUILD_OUTPUT= environment variable is defined, at
   `$KBUILD_OUTPUT/vmlinux`
 3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
   compiling the schedulers)
 4. `/sys/kernel/btf/vmlinux`
 5. `/boot/vmlinux-$(uname -r)`
 In other words, if you have compiled a kernel in your local repo, its vmlinux
 file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
 the kernel you're currently running on. This means that if you're running on a
 kernel with sched_ext support, you may not need to compile a local kernel at
 all.
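The lookup described above amounts to "the first candidate that exists wins". As a rough illustration, the same rule can be sketched in C. `first_existing_path()` is a hypothetical helper written for this README and is not part of the build system, which performs the lookup in the Makefile.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Return the first candidate that exists on disk, or NULL if none does.
 * The list must be NULL-terminated. Hypothetical helper, not part of the build.
 */
const char *first_existing_path(const char *const *candidates)
{
	assert(candidates != NULL);
	for (; *candidates; candidates++)
		if (access(*candidates, F_OK) == 0)	/* existence check only */
			return *candidates;
	return NULL;
}
```

Called with the candidates in the order listed above, this would return a locally built vmlinux when one is present and fall back to the running kernel's BTF otherwise.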
 ### Aside on CO-RE
 One of the cooler features of BPF is that it supports
 [CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
 Everywhere). This feature allows you to reference fields inside of structs with
 types defined internal to the kernel, and not have to recompile if you load the
 BPF program on a different kernel with the field at a different offset. In our
 example above, we print out a task name with `p->comm`. CO-RE would perform
 relocations for that access when the program is loaded to ensure that it's
 referencing the correct offset for the currently running kernel.
 ## Compiling the schedulers
 Once you have your toolchain set up, and a vmlinux that can be used to generate
 a full vmlinux.h file, you can compile the schedulers using `make`:
 ```bash
 $ make -j$(nproc)
 ```
 # Example schedulers
 This directory contains the following example schedulers. These schedulers are
 for testing and demonstrating different aspects of sched_ext. While some may be
 useful in limited scenarios, they are not intended to be practical.
 For more scheduler implementations, tools and documentation, visit
 https://github.com/sched-ext/scx.
 ## scx_simple
 A simple scheduler that provides an example of a minimal sched_ext scheduler.
 scx_simple can be run in either global weighted vtime mode, or FIFO mode.
 Though very simple, in limited scenarios, this scheduler can perform reasonably
 well on single-socket systems with a unified L3 cache.
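To give a feel for what "global weighted vtime" means, here is a minimal userspace sketch, not the scheduler's actual code. It assumes that a task's virtual time advances by the time it ran scaled inversely by its weight, with 100 as the assumed default weight, and that the task with the smallest virtual time runs next. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default weight: such a task is charged exactly the time it ran. */
#define TOY_WEIGHT_DFL 100

/* Advance a task's vtime; heavier tasks accumulate vtime more slowly. */
uint64_t charge_vtime(uint64_t vtime, uint64_t used_ns, uint32_t weight)
{
	assert(weight > 0);
	return vtime + used_ns * TOY_WEIGHT_DFL / weight;
}

/* Pick the runnable task with the smallest vtime; returns its index. */
size_t pick_min_vtime(const uint64_t *vtimes, size_t nr)
{
	size_t i, best = 0;

	assert(nr > 0);
	for (i = 1; i < nr; i++)
		if (vtimes[i] < vtimes[best])
			best = i;
	return best;
}
```

Under these assumptions, a task with weight 200 is charged half as much vtime as a default-weight task for the same runtime, so it is picked roughly twice as often. In FIFO mode no such accounting would be needed, since tasks simply run in arrival order.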
 ## scx_qmap
 Another simple, yet slightly more complex scheduler that provides an example of
 a basic weighted FIFO queuing policy. It also provides examples of some common
 useful BPF features, such as sleepable per-task storage allocation in the
 `ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
 enqueue tasks. It also illustrates how core-sched support could be implemented.
 ## scx_central
 A "central" scheduler where scheduling decisions are made from a single CPU.
 This scheduler illustrates how scheduling decisions can be dispatched from a
 single CPU, allowing other cores to run with infinite slices, without timer
 ticks, and without having to incur the overhead of making scheduling decisions.
 The approach demonstrated by this scheduler may be useful for any workload that
 benefits from minimizing scheduling overhead and timer ticks. An example of
 where this could be particularly useful is running VMs, where running with
 infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
 vmexits.
 ## scx_flatcg
 A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
 weight-based cgroup CPU control by flattening the cgroup hierarchy into a single
 layer, by compounding the active weight share at each level. The effect of this
 is a much more performant CPU controller, which does not need to descend down
 cgroup trees in order to properly compute a cgroup's share.
 Similar to scx_simple, in limited scenarios, this scheduler can perform
 reasonably well on single-socket systems with a unified L3 cache and show
 significantly lowered hierarchical scheduling overhead.
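The idea of compounding the weight share at each level can be sketched as follows. This is an illustrative userspace model only: the struct layout and function name are hypothetical, and floating point is used for readability, which may differ from how the scheduler itself represents shares.

```c
#include <assert.h>
#include <stddef.h>

/* One level on the path from the root down to a cgroup (hypothetical). */
struct toy_level {
	unsigned int weight;		/* this cgroup's weight */
	unsigned int sibling_sum;	/* active weights at this level, own included */
};

/* Multiply weight / sibling_sum over every level to get one flat share. */
double flattened_share(const struct toy_level *path, size_t depth)
{
	double share = 1.0;
	size_t i;

	for (i = 0; i < depth; i++) {
		assert(path[i].weight > 0);
		assert(path[i].sibling_sum >= path[i].weight);
		share *= (double)path[i].weight / path[i].sibling_sum;
	}
	return share;
}
```

For example, a cgroup holding half of the active weight at the top level and half again one level down ends up with a flat share of 0.25. Once that single number is known, scheduling no longer needs to walk the tree.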
 # Troubleshooting
 There are a number of common issues that you may run into when building the
 schedulers. We'll go over some of the common ones here.
 ## Build Failures
 ### Old version of clang
 ```
 error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^~~~~~~~~~~~~~~~~~~~
 1 error generated.
 ```
 This means you built the kernel or the schedulers with an older version of
 clang than what's supported (i.e. older than 16.0.0). To remediate this:
 1. `which clang` to make sure you're using a sufficiently new version of clang.
 2. `make fullclean` in the root path of the repository.
 3. Rebuild the kernel, and then your example schedulers.
 The schedulers are also cleaned if you invoke `make mrproper` in the root
 directory of the tree.
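 To see why the static assertion shown above can fail, the sketch below models
 a toolchain that keeps only the low 32 bits of a 64-bit enum value. The
 `DEMO_DSQ_FLAG_BUILTIN` macro and the `truncate_to_32()` helper are
 illustrative names, and placing the flag in bit 63 is an assumption made for
 the example rather than something stated in this document.
 ```c
 #include <assert.h>
 #include <stdint.h>
 #include <stdio.h>
 /* Assumption for illustration: the flag occupies bit 63. */
 #define DEMO_DSQ_FLAG_BUILTIN (1ULL << 63)
 /* Models a toolchain that drops the upper 32 bits of a 64-bit enum value. */
 static uint64_t truncate_to_32(uint64_t v)
 {
 	return (uint32_t)v;
 }
 int main(void)
 {
 	/* With full 64-bit enum support the flag is non-zero, so this holds. */
 	_Static_assert(DEMO_DSQ_FLAG_BUILTIN != 0, "flag must keep its high bit");
 	/* After truncation the flag collapses to zero, which is the condition
 	 * the build-time check is designed to catch. */
 	printf("full=0x%llx truncated=0x%llx\n",
 	       (unsigned long long)DEMO_DSQ_FLAG_BUILTIN,
 	       (unsigned long long)truncate_to_32(DEMO_DSQ_FLAG_BUILTIN));
 	return 0;
 }
 ```
 The program prints `full=0x8000000000000000 truncated=0x0`. A flag that
 evaluates to zero would silently disable whatever it guards, so failing the
 build early is the safer outcome.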
 ### Stale kernel build / incomplete vmlinux.h file
 As described above, you'll need a `vmlinux.h` file that was generated from a
 vmlinux built with BTF, and with sched_ext support enabled. If you don't,
 you'll see errors such as the following which indicate that a type being
 referenced in a scheduler is unknown:
 ```
 /path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'
 const struct scx_exit_info *ei)
 ^
 ```
 To resolve this, follow the steps above in
 [Getting a vmlinux.h file](#getting-a-vmlinuxh-file) to ensure your
 schedulers are using a vmlinux.h file that includes the requisite types.
 ## Misc
 ### llvm: [OFF]
 You may see the following output when building the schedulers:
 ```
 Auto-detecting system features:
 ...                         clang-bpf-co-re: [ on  ]
 ...                                    llvm: [ OFF ]
 ...                                  libcap: [ on  ]
 ...                                  libbfd: [ on  ]
 ```
 Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.
--- a/tools/sched_ext/include/bpf-compat/gnu/stubs.h
+++ b/tools/sched_ext/include/bpf-compat/gnu/stubs.h
@ -0,0 +1,11 @@
 /*
 * Dummy gnu/stubs.h. clang can end up including /usr/include/gnu/stubs.h when
 * compiling BPF files although its content doesn't play any role. The file in
 * turn includes stubs-64.h or stubs-32.h depending on whether __x86_64__ is
 * defined. When compiling a BPF source, __x86_64__ isn't set and thus
 * stubs-32.h is selected. However, the file is not there if the system doesn't
 * have 32bit glibc devel package installed leading to a build failure.
 *
 * The problem is worked around by making this file available in the include
 * search paths before the system one when building BPF.
 */
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@ -0,0 +1,412 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #ifndef __SCX_COMMON_BPF_H
 #define __SCX_COMMON_BPF_H
 #include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
 #include <asm-generic/errno.h>
 #include "user_exit_info.h"
 #define PF_WQ_WORKER			0x00000020	/* I'm a workqueue worker */
 #define PF_KTHREAD			0x00200000	/* I am a kernel thread */
 #define PF_EXITING			0x00000004
 #define CLOCK_MONOTONIC			1
 /*
 * Earlier versions of clang/pahole lost upper 32bits in 64bit enums which can
 * lead to really confusing misbehaviors. Let's trigger a build failure.
 */
 static inline void ___vmlinux_h_sanity_check___(void)
 {
 	_Static_assert(SCX_DSQ_FLAG_BUILTIN,
 		       "bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole");
 }
 s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
 s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym;
 void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym;
 void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym;
 u32 scx_bpf_dispatch_nr_slots(void) __ksym;
 void scx_bpf_dispatch_cancel(void) __ksym;
 bool scx_bpf_consume(u64 dsq_id) __ksym;
 void scx_bpf_dispatch_from_dsq_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) __ksym;
 void scx_bpf_dispatch_from_dsq_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym;
 bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
 bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
 u32 scx_bpf_reenqueue_local(void) __ksym;
 void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
 s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
 void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
 int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id, u64 flags) __ksym __weak;
 struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it) __ksym __weak;
 void bpf_iter_scx_dsq_destroy(struct bpf_iter_scx_dsq *it) __ksym __weak;
 void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
 void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
 void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
 u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
 u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
 void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
 u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
 void scx_bpf_put_cpumask(const struct cpumask *cpumask) __ksym __weak;
 const struct cpumask *scx_bpf_get_idle_cpumask(void) __ksym;
 const struct cpumask *scx_bpf_get_idle_smtmask(void) __ksym;
 void scx_bpf_put_idle_cpumask(const struct cpumask *cpumask) __ksym;
 bool scx_bpf_test_and_clear_cpu_idle(s32 cpu) __ksym;
 s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 bool scx_bpf_task_running(const struct task_struct *p) __ksym;
 s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
 struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
 struct cgroup *scx_bpf_task_cgroup(struct task_struct *p) __ksym;
 /*
 * Use the following as @it__iter when calling
 * scx_bpf_dispatch[_vtime]_from_dsq() from within bpf_for_each() loops.
 */
 #define BPF_FOR_EACH_ITER	(&___it)
 static inline __attribute__((format(printf, 1, 2)))
 void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {}
 /*
 * Helper macro for initializing the fmt and variadic argument inputs to both
 * bstr exit kfuncs. Callers to this function should use ___fmt and ___param to
 * refer to the initialized list of inputs to the bstr kfunc.
 */
 #define scx_bpf_bstr_preamble(fmt, args...)					\
 	static char ___fmt[] = fmt;						\
 	/*									\
 	 * Note that ___param[] must have at least one				\
 	 * element to keep the verifier happy.					\
 	 */									\
 	unsigned long long ___param[___bpf_narg(args) ?: 1] = {};		\
 										\
 	_Pragma("GCC diagnostic push")						\
 	_Pragma("GCC diagnostic ignored \"-Wint-conversion\"")			\
 	___bpf_fill(___param, args);						\
 	_Pragma("GCC diagnostic pop")						\
 /*
 * scx_bpf_exit() wraps the scx_bpf_exit_bstr() kfunc with variadic arguments
 * instead of an array of u64. Using this macro will cause the scheduler to
 * exit cleanly with the specified exit code being passed to user space.
 */
 #define scx_bpf_exit(code, fmt, args...)					\
 ({										\
 	scx_bpf_bstr_preamble(fmt, args)					\
 	scx_bpf_exit_bstr(code, ___fmt, ___param, sizeof(___param));		\
 	___scx_bpf_bstr_format_checker(fmt, ##args);				\
 })
 /*
 * scx_bpf_error() wraps the scx_bpf_error_bstr() kfunc with variadic arguments
 * instead of an array of u64. Invoking this macro will cause the scheduler to
 * exit in an erroneous state, with diagnostic information being passed to the
 * user.
 */
 #define scx_bpf_error(fmt, args...)						\
 ({										\
 	scx_bpf_bstr_preamble(fmt, args)					\
 	scx_bpf_error_bstr(___fmt, ___param, sizeof(___param));			\
 	___scx_bpf_bstr_format_checker(fmt, ##args);				\
 })
 /*
 * scx_bpf_dump() wraps the scx_bpf_dump_bstr() kfunc with variadic arguments
 * instead of an array of u64. To be used from ops.dump() and friends.
 */
 #define scx_bpf_dump(fmt, args...)						\
 ({										\
 	scx_bpf_bstr_preamble(fmt, args)					\
 	scx_bpf_dump_bstr(___fmt, ___param, sizeof(___param));			\
 	___scx_bpf_bstr_format_checker(fmt, ##args);				\
 })
 #define BPF_STRUCT_OPS(name, args...)						\
 SEC("struct_ops/"#name)								\
 BPF_PROG(name, ##args)
 #define BPF_STRUCT_OPS_SLEEPABLE(name, args...)					\
 SEC("struct_ops.s/"#name)							\
 BPF_PROG(name, ##args)
 /**
 * RESIZABLE_ARRAY - Generates annotations for an array that may be resized
 * @elfsec: the data section of the BPF program in which to place the array
 * @arr: the name of the array
 *
 * libbpf has an API for setting map value sizes. Since data sections (i.e.
 * bss, data, rodata) themselves are maps, a data section can be resized. If
 * a data section has an array as its last element, the BTF info for that
 * array will be adjusted so that length of the array is extended to meet the
 * new length of the data section. This macro annotates an array to have an
 * element count of one with the assumption that this array can be resized
 * within the userspace program. It also annotates the section specifier so
 * this array exists in a custom sub data section which can be resized
 * independently.
 *
 * See RESIZE_ARRAY() for the userspace convenience macro for resizing an
 * array declared with RESIZABLE_ARRAY().
 */
 #define RESIZABLE_ARRAY(elfsec, arr) arr[1] SEC("."#elfsec"."#arr)
 /**
 * MEMBER_VPTR - Obtain the verified pointer to a struct or array member
 * @base: struct or array to index
 * @member: dereferenced member (e.g. .field, [idx0][idx1], .field[idx0] ...)
 *
 * The verifier often gets confused by the instruction sequence the compiler
 * generates for indexing struct fields or arrays. This macro forces the
 * compiler to generate a code sequence which first calculates the byte offset,
 * checks it against the struct or array size and adds that byte offset to
 * generate the pointer to the member to help the verifier.
 *
 * Ideally, we want to abort if the calculated offset is out-of-bounds. However,
 * BPF currently doesn't support abort, so evaluate to %NULL instead. The caller
 * must check for %NULL and take appropriate action to appease the verifier. To
 * avoid confusing the verifier, it's best to check for %NULL and dereference
 * immediately.
 *
 *	vptr = MEMBER_VPTR(my_array, [i][j]);
 *	if (!vptr)
 *		return error;
 *	*vptr = new_value;
 *
 * sizeof(@base) should encompass the memory area to be accessed and thus can't
 * be a pointer to the area. Use `MEMBER_VPTR(*ptr, .member)` instead of
 * `MEMBER_VPTR(ptr, ->member)`.
 */
 #define MEMBER_VPTR(base, member) (typeof((base) member) *)			\
 ({										\
 	u64 __base = (u64)&(base);						\
 	u64 __addr = (u64)&((base) member) - __base;				\
 	_Static_assert(sizeof(base) >= sizeof((base) member),			\
 		       "@base is smaller than @member, is @base a pointer?");	\
 	asm volatile (								\
 		"if %0 <= %[max] goto +2\n"					\
 		"%0 = 0\n"							\
 		"goto +1\n"							\
 		"%0 += %1\n"							\
 		: "+r"(__addr)							\
 		: "r"(__base),							\
 		  [max]"i"(sizeof(base) - sizeof((base) member)));		\
 	__addr;									\
 })
 /**
 * ARRAY_ELEM_PTR - Obtain the verified pointer to an array element
 * @arr: array to index into
 * @i: array index
 * @n: number of elements in array
 *
 * Similar to MEMBER_VPTR() but is intended for use with arrays where the
 * element count needs to be explicit.
 * It can be used in cases where a global array is defined with an initial
 * size but is intended to be resized before loading the BPF program.
 * Without this version of the macro, MEMBER_VPTR() will use the compile time
 * size of the array to compute the max, which will result in rejection by
 * the verifier.
 */
 #define ARRAY_ELEM_PTR(arr, i, n) (typeof(arr[i]) *)				\
 ({										\
 	u64 __base = (u64)arr;							\
 	u64 __addr = (u64)&(arr[i]) - __base;					\
 	asm volatile (								\
 		"if %0 <= %[max] goto +2\n"					\
 		"%0 = 0\n"							\
 		"goto +1\n"							\
 		"%0 += %1\n"							\
 		: "+r"(__addr)							\
 		: "r"(__base),							\
 		  [max]"r"(sizeof(arr[0]) * ((n) - 1)));			\
 	__addr;									\
 })
 /*
 * BPF declarations and helpers
 */
 /* list and rbtree */
 #define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node)))
 #define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
 void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym;
 void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
 #define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))
 #define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)
 void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
 void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node) __ksym;
 struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
 struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
 struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
 				      struct bpf_rb_node *node) __ksym;
 int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
 			bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
 			void *meta, __u64 off) __ksym;
 #define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)
 struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
 void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym;
 #define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL)
 /* task */
 struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
 struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
 void bpf_task_release(struct task_struct *p) __ksym;
 /* cgroup */
 struct cgroup *bpf_cgroup_ancestor(struct cgroup *cgrp, int level) __ksym;
 void bpf_cgroup_release(struct cgroup *cgrp) __ksym;
 struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
 /* css iteration */
 struct bpf_iter_css;
 struct cgroup_subsys_state;
 extern int bpf_iter_css_new(struct bpf_iter_css *it,
 			    struct cgroup_subsys_state *start,
 			    unsigned int flags) __weak __ksym;
 extern struct cgroup_subsys_state *
 bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
 extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
 /* cpumask */
 struct bpf_cpumask *bpf_cpumask_create(void) __ksym;
 struct bpf_cpumask *bpf_cpumask_acquire(struct bpf_cpumask *cpumask) __ksym;
 void bpf_cpumask_release(struct bpf_cpumask *cpumask) __ksym;
 u32 bpf_cpumask_first(const struct cpumask *cpumask) __ksym;
 u32 bpf_cpumask_first_zero(const struct cpumask *cpumask) __ksym;
 void bpf_cpumask_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
 void bpf_cpumask_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
 bool bpf_cpumask_test_cpu(u32 cpu, const struct cpumask *cpumask) __ksym;
 bool bpf_cpumask_test_and_set_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
 bool bpf_cpumask_test_and_clear_cpu(u32 cpu, struct bpf_cpumask *cpumask) __ksym;
 void bpf_cpumask_setall(struct bpf_cpumask *cpumask) __ksym;
 void bpf_cpumask_clear(struct bpf_cpumask *cpumask) __ksym;
 bool bpf_cpumask_and(struct bpf_cpumask *dst, const struct cpumask *src1,
 		     const struct cpumask *src2) __ksym;
 void bpf_cpumask_or(struct bpf_cpumask *dst, const struct cpumask *src1,
 		    const struct cpumask *src2) __ksym;
 void bpf_cpumask_xor(struct bpf_cpumask *dst, const struct cpumask *src1,
 		     const struct cpumask *src2) __ksym;
 bool bpf_cpumask_equal(const struct cpumask *src1, const struct cpumask *src2) __ksym;
 bool bpf_cpumask_intersects(const struct cpumask *src1, const struct cpumask *src2) __ksym;
 bool bpf_cpumask_subset(const struct cpumask *src1, const struct cpumask *src2) __ksym;
 bool bpf_cpumask_empty(const struct cpumask *cpumask) __ksym;
 bool bpf_cpumask_full(const struct cpumask *cpumask) __ksym;
 void bpf_cpumask_copy(struct bpf_cpumask *dst, const struct cpumask *src) __ksym;
 u32 bpf_cpumask_any_distribute(const struct cpumask *cpumask) __ksym;
 u32 bpf_cpumask_any_and_distribute(const struct cpumask *src1,
 				   const struct cpumask *src2) __ksym;
 /* rcu */
 void bpf_rcu_read_lock(void) __ksym;
 void bpf_rcu_read_unlock(void) __ksym;
 /*
 * Other helpers
 */
 /* useful compiler attributes */
 #define likely(x) __builtin_expect(!!(x), 1)
 #define unlikely(x) __builtin_expect(!!(x), 0)
 #define __maybe_unused __attribute__((__unused__))
 /*
 * READ/WRITE_ONCE() are from kernel (include/asm-generic/rwonce.h). They
 * prevent compiler from caching, redoing or reordering reads or writes.
 */
 typedef __u8  __attribute__((__may_alias__))  __u8_alias_t;
 typedef __u16 __attribute__((__may_alias__)) __u16_alias_t;
 typedef __u32 __attribute__((__may_alias__)) __u32_alias_t;
 typedef __u64 __attribute__((__may_alias__)) __u64_alias_t;
 static __always_inline void __read_once_size(const volatile void *p, void *res, int size)
 {
 	switch (size) {
 	case 1: *(__u8_alias_t  *) res = *(volatile __u8_alias_t  *) p; break;
 	case 2: *(__u16_alias_t *) res = *(volatile __u16_alias_t *) p; break;
 	case 4: *(__u32_alias_t *) res = *(volatile __u32_alias_t *) p; break;
 	case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
 	default:
 		barrier();
 		__builtin_memcpy((void *)res, (const void *)p, size);
 		barrier();
 	}
 }
 static __always_inline void __write_once_size(volatile void *p, void *res, int size)
 {
 	switch (size) {
 	case 1: *(volatile  __u8_alias_t *) p = *(__u8_alias_t  *) res; break;
 	case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break;
 	case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break;
 	case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break;
 	default:
 		barrier();
 		__builtin_memcpy((void *)p, (const void *)res, size);
 		barrier();
 	}
 }
 #define READ_ONCE(x)					\
 ({							\
 	union { typeof(x) __val; char __c[1]; } __u =	\
 		{ .__c = { 0 } };			\
 	__read_once_size(&(x), __u.__c, sizeof(x));	\
 	__u.__val;					\
 })
 #define WRITE_ONCE(x, val)				\
 ({							\
 	union { typeof(x) __val; char __c[1]; } __u =	\
 		{ .__val = (val) }; 			\
 	__write_once_size(&(x), __u.__c, sizeof(x));	\
 	__u.__val;					\
 })
 /*
 * log2_u32 - Compute the base 2 logarithm of a 32-bit exponential value.
 * @v: The value for which we're computing the base 2 logarithm.
 */
 static inline u32 log2_u32(u32 v)
 {
        u32 r;
        u32 shift;
        r = (v > 0xFFFF) << 4; v >>= r;
        shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
        shift = (v > 0xF) << 2; v >>= shift; r |= shift;
        shift = (v > 0x3) << 1; v >>= shift; r |= shift;
        r |= (v >> 1);
        return r;
 }
 /*
 * log2_u64 - Compute the base 2 logarithm of a 64-bit exponential value.
 * @v: The value for which we're computing the base 2 logarithm.
 */
 static inline u32 log2_u64(u64 v)
 {
        u32 hi = v >> 32;
        if (hi)
                return log2_u32(hi) + 32 + 1;
        else
                return log2_u32(v) + 1;
 }
 #include "compat.bpf.h"
 #endif	/* __SCX_COMMON_BPF_H */
--- a/tools/sched_ext/include/scx/common.h
+++ b/tools/sched_ext/include/scx/common.h
@ -0,0 +1,75 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 */
 #ifndef __SCHED_EXT_COMMON_H
 #define __SCHED_EXT_COMMON_H
 #ifdef __KERNEL__
 #error "Should not be included by BPF programs"
 #endif
 #include <stdarg.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <errno.h>
 typedef uint8_t u8;
 typedef uint16_t u16;
 typedef uint32_t u32;
 typedef uint64_t u64;
 typedef int8_t s8;
 typedef int16_t s16;
 typedef int32_t s32;
 typedef int64_t s64;
 #define SCX_BUG(__fmt, ...)							\
 	do {									\
 		fprintf(stderr, "[SCX_BUG] %s:%d", __FILE__, __LINE__);		\
 		if (errno)							\
 			fprintf(stderr, " (%s)\n", strerror(errno));		\
 		else								\
 			fprintf(stderr, "\n");					\
 		fprintf(stderr, __fmt __VA_OPT__(,) __VA_ARGS__);		\
 		fprintf(stderr, "\n");						\
 										\
 		exit(EXIT_FAILURE);						\
 	} while (0)
 #define SCX_BUG_ON(__cond, __fmt, ...)					\
 	do {								\
 		if (__cond)						\
 			SCX_BUG((__fmt) __VA_OPT__(,) __VA_ARGS__);	\
 	} while (0)
 /**
 * RESIZE_ARRAY - Convenience macro for resizing a BPF array
 * @__skel: the skeleton containing the array
 * @elfsec: the data section of the BPF program in which the array exists
 * @arr: the name of the array
 * @n: the desired array element count
 *
 * For BPF arrays declared with RESIZABLE_ARRAY(), this macro performs two
 * operations. It resizes the map which corresponds to the custom data
 * section that contains the target array. As a side effect, the BTF info for
 * the array is adjusted so that the array length is sized to cover the new
 * data section size. The second operation is reassigning the skeleton pointer
 * for that custom data section so that it points to the newly memory mapped
 * region.
 */
 #define RESIZE_ARRAY(__skel, elfsec, arr, n)						\
 	do {										\
 		size_t __sz;								\
 		bpf_map__set_value_size((__skel)->maps.elfsec##_##arr,			\
 				sizeof((__skel)->elfsec##_##arr->arr[0]) * (n));	\
 		(__skel)->elfsec##_##arr =						\
 			bpf_map__initial_value((__skel)->maps.elfsec##_##arr, &__sz);	\
 	} while (0)
 #include "user_exit_info.h"
 #include "compat.h"
 #endif	/* __SCHED_EXT_COMMON_H */
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@ -0,0 +1,28 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #ifndef __SCX_COMPAT_BPF_H
 #define __SCX_COMPAT_BPF_H
 #define __COMPAT_ENUM_OR_ZERO(__type, __ent)					\
 ({										\
 	__type __ret = 0;							\
 	if (bpf_core_enum_value_exists(__type, __ent))				\
 		__ret = __ent;							\
 	__ret;									\
 })
 /*
 * Define sched_ext_ops. This may be expanded to define multiple variants for
 * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
 */
 #define SCX_OPS_DEFINE(__name, ...)						\
 	SEC(".struct_ops.link")							\
 	struct sched_ext_ops __name = {						\
 		__VA_ARGS__,							\
 	};
 #endif	/* __SCX_COMPAT_BPF_H */
--- a/tools/sched_ext/include/scx/compat.h
+++ b/tools/sched_ext/include/scx/compat.h
@ -0,0 +1,186 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #ifndef __SCX_COMPAT_H
 #define __SCX_COMPAT_H
 #include <bpf/btf.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 struct btf *__COMPAT_vmlinux_btf __attribute__((weak));
 static inline void __COMPAT_load_vmlinux_btf(void)
 {
 	if (!__COMPAT_vmlinux_btf) {
 		__COMPAT_vmlinux_btf = btf__load_vmlinux_btf();
 		SCX_BUG_ON(!__COMPAT_vmlinux_btf, "btf__load_vmlinux_btf()");
 	}
 }
 static inline bool __COMPAT_read_enum(const char *type, const char *name, u64 *v)
 {
 	const struct btf_type *t;
 	const char *n;
 	s32 tid;
 	int i;
 	__COMPAT_load_vmlinux_btf();
 	tid = btf__find_by_name(__COMPAT_vmlinux_btf, type);
 	if (tid < 0)
 		return false;
 	t = btf__type_by_id(__COMPAT_vmlinux_btf, tid);
 	SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid);
 	if (btf_is_enum(t)) {
 		struct btf_enum *e = btf_enum(t);
 		for (i = 0; i < BTF_INFO_VLEN(t->info); i++) {
 			n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off);
 			SCX_BUG_ON(!n, "btf__name_by_offset()");
 			if (!strcmp(n, name)) {
 				*v = e[i].val;
 				return true;
 			}
 		}
 	} else if (btf_is_enum64(t)) {
 		struct btf_enum64 *e = btf_enum64(t);
 		for (i = 0; i < BTF_INFO_VLEN(t->info); i++) {
 			n = btf__name_by_offset(__COMPAT_vmlinux_btf, e[i].name_off);
 			SCX_BUG_ON(!n, "btf__name_by_offset()");
 			if (!strcmp(n, name)) {
 				*v = btf_enum64_value(&e[i]);
 				return true;
 			}
 		}
 	}
 	return false;
 }
 #define __COMPAT_ENUM_OR_ZERO(__type, __ent)					\
 ({										\
 	u64 __val = 0;								\
 	__COMPAT_read_enum(__type, __ent, &__val);				\
 	__val;									\
 })
 static inline bool __COMPAT_has_ksym(const char *ksym)
 {
 	__COMPAT_load_vmlinux_btf();
 	return btf__find_by_name(__COMPAT_vmlinux_btf, ksym) >= 0;
 }
 static inline bool __COMPAT_struct_has_field(const char *type, const char *field)
 {
 	const struct btf_type *t;
 	const struct btf_member *m;
 	const char *n;
 	s32 tid;
 	int i;
 	__COMPAT_load_vmlinux_btf();
 	tid = btf__find_by_name_kind(__COMPAT_vmlinux_btf, type, BTF_KIND_STRUCT);
 	if (tid < 0)
 		return false;
 	t = btf__type_by_id(__COMPAT_vmlinux_btf, tid);
 	SCX_BUG_ON(!t, "btf__type_by_id(%d)", tid);
 	m = btf_members(t);
 	for (i = 0; i < BTF_INFO_VLEN(t->info); i++) {
 		n = btf__name_by_offset(__COMPAT_vmlinux_btf, m[i].name_off);
 		SCX_BUG_ON(!n, "btf__name_by_offset()");
 		if (!strcmp(n, field))
 			return true;
 	}
 	return false;
 }
 #define SCX_OPS_SWITCH_PARTIAL							\
 	__COMPAT_ENUM_OR_ZERO("scx_ops_flags", "SCX_OPS_SWITCH_PARTIAL")
 static inline long scx_hotplug_seq(void)
 {
 	int fd;
 	char buf[32];
 	ssize_t len;
 	long val;
 	fd = open("/sys/kernel/sched_ext/hotplug_seq", O_RDONLY);
 	if (fd < 0)
 		return -ENOENT;
 	len = read(fd, buf, sizeof(buf) - 1);
 	SCX_BUG_ON(len <= 0, "read failed (%ld)", len);
 	buf[len] = 0;
 	close(fd);
 	val = strtoul(buf, NULL, 10);
 	SCX_BUG_ON(val < 0, "invalid num hotplug events: %lu", val);
 	return val;
 }
 /*
 * struct sched_ext_ops can change over time. If compat.bpf.h::SCX_OPS_DEFINE()
 * is used to define ops and compat.h::SCX_OPS_LOAD/ATTACH() are used to load
 * and attach it, backward compatibility is automatically maintained where
 * reasonable.
 *
 * ec7e3b0463e1 ("implement-ops") in https://github.com/sched-ext/sched_ext is
 * the current minimum required kernel version.
 */
 #define SCX_OPS_OPEN(__ops_name, __scx_name) ({					\
 	struct __scx_name *__skel;						\
 										\
 	SCX_BUG_ON(!__COMPAT_struct_has_field("sched_ext_ops", "dump"),		\
 		   "sched_ext_ops.dump() missing, kernel too old?");		\
 										\
 	__skel = __scx_name##__open();						\
 	SCX_BUG_ON(!__skel, "Could not open " #__scx_name);			\
 	__skel->struct_ops.__ops_name->hotplug_seq = scx_hotplug_seq();		\
 	__skel; 								\
 })
 #define SCX_OPS_LOAD(__skel, __ops_name, __scx_name, __uei_name) ({		\
 	UEI_SET_SIZE(__skel, __ops_name, __uei_name);				\
 	SCX_BUG_ON(__scx_name##__load((__skel)), "Failed to load skel");	\
 })
 /*
 * New versions of bpftool emit additional link placeholders for BPF maps and
 * set up the BPF skeleton in such a way that libbpf will auto-attach BPF maps,
 * assuming libbpf is recent enough (v1.5+). Old libbpf will do nothing with
 * those links and won't attempt to auto-attach maps.
 *
 * To maintain compatibility with older libbpf while avoiding trying to attach
 * twice, disable the autoattach feature on newer libbpf.
 */
 #if LIBBPF_MAJOR_VERSION > 1 ||							\
 	(LIBBPF_MAJOR_VERSION == 1 && LIBBPF_MINOR_VERSION >= 5)
 #define __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name)			\
 	bpf_map__set_autoattach((__skel)->maps.__ops_name, false)
 #else
 #define __SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name) do {} while (0)
 #endif
 #define SCX_OPS_ATTACH(__skel, __ops_name, __scx_name) ({			\
 	struct bpf_link *__link;						\
 	__SCX_OPS_DISABLE_AUTOATTACH(__skel, __ops_name);			\
 	SCX_BUG_ON(__scx_name##__attach((__skel)), "Failed to attach skel");	\
 	__link = bpf_map__attach_struct_ops((__skel)->maps.__ops_name);		\
 	SCX_BUG_ON(!__link, "Failed to attach struct_ops");			\
 	__link;									\
 })
 #endif	/* __SCX_COMPAT_H */
--- a/tools/sched_ext/include/scx/user_exit_info.h
+++ b/tools/sched_ext/include/scx/user_exit_info.h
@ -0,0 +1,111 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Define struct user_exit_info which is shared between BPF and userspace parts
 * to communicate exit status and other information.
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #ifndef __USER_EXIT_INFO_H
 #define __USER_EXIT_INFO_H
 enum uei_sizes {
 	UEI_REASON_LEN		= 128,
 	UEI_MSG_LEN		= 1024,
 	UEI_DUMP_DFL_LEN	= 32768,
 };
 struct user_exit_info {
 	int		kind;
 	s64		exit_code;
 	char		reason[UEI_REASON_LEN];
 	char		msg[UEI_MSG_LEN];
 };
 #ifdef __bpf__
 #include "vmlinux.h"
 #include <bpf/bpf_core_read.h>
 #define UEI_DEFINE(__name)							\
 	char RESIZABLE_ARRAY(data, __name##_dump);				\
 	const volatile u32 __name##_dump_len;					\
 	struct user_exit_info __name SEC(".data")
 #define UEI_RECORD(__uei_name, __ei) ({						\
 	bpf_probe_read_kernel_str(__uei_name.reason,				\
 				  sizeof(__uei_name.reason), (__ei)->reason);	\
 	bpf_probe_read_kernel_str(__uei_name.msg,				\
 				  sizeof(__uei_name.msg), (__ei)->msg);		\
 	bpf_probe_read_kernel_str(__uei_name##_dump,				\
 				  __uei_name##_dump_len, (__ei)->dump);		\
 	if (bpf_core_field_exists((__ei)->exit_code))				\
 		__uei_name.exit_code = (__ei)->exit_code;			\
 	/* use __sync to force memory barrier */				\
 	__sync_val_compare_and_swap(&__uei_name.kind, __uei_name.kind,		\
 				    (__ei)->kind);				\
 })
 #else	/* !__bpf__ */
 #include <stdio.h>
 #include <stdbool.h>
 /* no need to call the following explicitly if SCX_OPS_LOAD() is used */
 #define UEI_SET_SIZE(__skel, __ops_name, __uei_name) ({					\
 	u32 __len = (__skel)->struct_ops.__ops_name->exit_dump_len ?: UEI_DUMP_DFL_LEN;	\
 	(__skel)->rodata->__uei_name##_dump_len = __len;				\
 	RESIZE_ARRAY((__skel), data, __uei_name##_dump, __len);				\
 })
 #define UEI_EXITED(__skel, __uei_name) ({					\
 	/* use __sync to force memory barrier */				\
 	__sync_val_compare_and_swap(&(__skel)->data->__uei_name.kind, -1, -1);	\
 })
 #define UEI_REPORT(__skel, __uei_name) ({					\
 	struct user_exit_info *__uei = &(__skel)->data->__uei_name;		\
 	char *__uei_dump = (__skel)->data_##__uei_name##_dump->__uei_name##_dump; \
 	if (__uei_dump[0] != '\0') {						\
 		fputs("\nDEBUG DUMP\n", stderr);				\
 		fputs("================================================================================\n\n", stderr); \
 		fputs(__uei_dump, stderr);					\
 		fputs("\n================================================================================\n\n", stderr); \
 	}									\
 	fprintf(stderr, "EXIT: %s", __uei->reason);				\
 	if (__uei->msg[0] != '\0')						\
 		fprintf(stderr, " (%s)", __uei->msg);				\
 	fputs("\n", stderr);							\
 	__uei->exit_code;							\
 })
 /*
 * We can't import vmlinux.h while compiling user C code. Let's duplicate
 * scx_exit_code definition.
 */
 enum scx_exit_code {
 	/* Reasons */
 	SCX_ECODE_RSN_HOTPLUG		= 1LLU << 32,
 	/* Actions */
 	SCX_ECODE_ACT_RESTART		= 1LLU << 48,
 };
 enum uei_ecode_mask {
 	UEI_ECODE_USER_MASK		= ((1LLU << 32) - 1),
 	UEI_ECODE_SYS_RSN_MASK		= ((1LLU << 16) - 1) << 32,
 	UEI_ECODE_SYS_ACT_MASK		= ((1LLU << 16) - 1) << 48,
 };
 /*
 * These macros interpret the ecode returned from UEI_REPORT().
 */
 #define UEI_ECODE_USER(__ecode)		((__ecode) & UEI_ECODE_USER_MASK)
 #define UEI_ECODE_SYS_RSN(__ecode)	((__ecode) & UEI_ECODE_SYS_RSN_MASK)
 #define UEI_ECODE_SYS_ACT(__ecode)	((__ecode) & UEI_ECODE_SYS_ACT_MASK)
 #define UEI_ECODE_RESTART(__ecode)	(UEI_ECODE_SYS_ACT((__ecode)) == SCX_ECODE_ACT_RESTART)
 #endif	/* __bpf__ */
 #endif	/* __USER_EXIT_INFO_H */
--- a/tools/sched_ext/scx_central.bpf.c
+++ b/tools/sched_ext/scx_central.bpf.c
@ -0,0 +1,361 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A central FIFO sched_ext scheduler which demonstrates the following:
 *
 * a. Making all scheduling decisions from one CPU:
 *
 *    The central CPU is the only one making scheduling decisions. All other
 *    CPUs kick the central CPU when they run out of tasks to run.
 *
 *    There is one global BPF queue and the central CPU schedules all CPUs by
 *    dispatching from the global queue to each CPU's local dsq from dispatch().
 *    This isn't the most straightforward. e.g. It'd be easier to bounce
 *    through per-CPU BPF queues. The current design is chosen to maximally
 *    utilize and verify various SCX mechanisms such as LOCAL_ON dispatching.
 *
 * b. Tickless operation
 *
 *    All tasks are dispatched with the infinite slice which allows stopping the
 *    ticks on CONFIG_NO_HZ_FULL kernels running with the proper nohz_full
 *    parameter. The tickless operation can be observed through
 *    /proc/interrupts.
 *
 *    Periodic switching is enforced by a periodic timer checking all CPUs and
 *    preempting them as necessary. The timer is pinned to the central CPU
 *    with BPF_F_TIMER_CPU_PIN on kernels that support it (>= 6.7); on older
 *    kernels it runs unpinned.
 *
 * c. Preemption
 *
 *    Kthreads are unconditionally queued to the head of a matching local dsq
 *    and dispatched with SCX_ENQ_PREEMPT. This ensures that a kthread is always
 *    prioritized over user threads, which is required for ensuring forward
 *    progress as e.g. the periodic timer may run on a ksoftirqd and if the
 *    ksoftirqd gets starved by a user thread, there may not be anything else to
 *    vacate that user thread.
 *
 *    SCX_KICK_PREEMPT is used to trigger scheduling and make CPUs move on to
 *    their next tasks.
 *
 * This scheduler is designed to maximize usage of various SCX mechanisms. A
 * more practical implementation would likely put the scheduling loop outside
 * the central CPU's dispatch() path and add some form of priority mechanism.
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 enum {
 	FALLBACK_DSQ_ID		= 0,
 	MS_TO_NS		= 1000LLU * 1000,
 	TIMER_INTERVAL_NS	= 1 * MS_TO_NS,
 };
 const volatile s32 central_cpu;
 const volatile u32 nr_cpu_ids = 1;	/* !0 for veristat, set during init */
 const volatile u64 slice_ns = SCX_SLICE_DFL;
 bool timer_pinned = true;
 u64 nr_total, nr_locals, nr_queued, nr_lost_pids;
 u64 nr_timers, nr_dispatches, nr_mismatches, nr_retries;
 u64 nr_overflows;
 UEI_DEFINE(uei);
 struct {
 	__uint(type, BPF_MAP_TYPE_QUEUE);
 	__uint(max_entries, 4096);
 	__type(value, s32);
 } central_q SEC(".maps");
 /* can't use percpu map due to bad lookups */
 bool RESIZABLE_ARRAY(data, cpu_gimme_task);
 u64 RESIZABLE_ARRAY(data, cpu_started_at);
 struct central_timer {
 	struct bpf_timer timer;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
 	__uint(max_entries, 1);
 	__type(key, u32);
 	__type(value, struct central_timer);
 } central_timer SEC(".maps");
 static bool vtime_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
 s32 BPF_STRUCT_OPS(central_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	/*
 	 * Steer wakeups to the central CPU as much as possible to avoid
 	 * disturbing other CPUs. It's safe to blindly return the central cpu as
 	 * select_cpu() is a hint and if @p can't be on it, the kernel will
 	 * automatically pick a fallback CPU.
 	 */
 	return central_cpu;
 }
 void BPF_STRUCT_OPS(central_enqueue, struct task_struct *p, u64 enq_flags)
 {
 	s32 pid = p->pid;
 	__sync_fetch_and_add(&nr_total, 1);
 	/*
 	 * Push per-cpu kthreads at the head of local dsq's and preempt the
 	 * corresponding CPU. This ensures that e.g. ksoftirqd isn't blocked
 	 * behind other threads which is necessary for forward progress
 	 * guarantee as we depend on the BPF timer which may run from ksoftirqd.
 	 */
 	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
 		__sync_fetch_and_add(&nr_locals, 1);
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF,
 				 enq_flags | SCX_ENQ_PREEMPT);
 		return;
 	}
 	if (bpf_map_push_elem(&central_q, &pid, 0)) {
 		__sync_fetch_and_add(&nr_overflows, 1);
 		scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags);
 		return;
 	}
 	__sync_fetch_and_add(&nr_queued, 1);
 	if (!scx_bpf_task_running(p))
 		scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
 }
 static bool dispatch_to_cpu(s32 cpu)
 {
 	struct task_struct *p;
 	s32 pid;
 	bpf_repeat(BPF_MAX_LOOPS) {
 		if (bpf_map_pop_elem(&central_q, &pid))
 			break;
 		__sync_fetch_and_sub(&nr_queued, 1);
 		p = bpf_task_from_pid(pid);
 		if (!p) {
 			__sync_fetch_and_add(&nr_lost_pids, 1);
 			continue;
 		}
 		/*
 		 * If we can't run the task at the top, do the dumb thing and
 		 * bounce it to the fallback dsq.
 		 */
 		if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
 			__sync_fetch_and_add(&nr_mismatches, 1);
 			scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, 0);
 			bpf_task_release(p);
 			/*
 			 * We might run out of dispatch buffer slots if we continue dispatching
 			 * to the fallback DSQ, without dispatching to the local DSQ of the
 			 * target CPU. In such a case, break the loop now as the next
 			 * dispatch operation would fail.
 			 */
 			if (!scx_bpf_dispatch_nr_slots())
 				break;
 			continue;
 		}
 		/* dispatch to local and mark that @cpu doesn't need more */
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_INF, 0);
 		if (cpu != central_cpu)
 			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
 		bpf_task_release(p);
 		return true;
 	}
 	return false;
 }
 void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
 {
 	if (cpu == central_cpu) {
 		/* dispatch for all other CPUs first */
 		__sync_fetch_and_add(&nr_dispatches, 1);
 		bpf_for(cpu, 0, nr_cpu_ids) {
 			bool *gimme;
 			if (!scx_bpf_dispatch_nr_slots())
 				break;
 			/* central's gimme is never set */
 			gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids);
 			if (!gimme || !*gimme)
 				continue;
 			if (dispatch_to_cpu(cpu))
 				*gimme = false;
 		}
 		/*
 		 * Retry if we ran out of dispatch buffer slots as we might have
 		 * skipped some CPUs and also need to dispatch for self. The ext
 		 * core automatically retries if the local dsq is empty but we
 		 * can't rely on that as we're dispatching for other CPUs too.
 		 * Kick self explicitly to retry.
 		 */
 		if (!scx_bpf_dispatch_nr_slots()) {
 			__sync_fetch_and_add(&nr_retries, 1);
 			scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
 			return;
 		}
 		/* look for a task to run on the central CPU */
 		if (scx_bpf_consume(FALLBACK_DSQ_ID))
 			return;
 		dispatch_to_cpu(central_cpu);
 	} else {
 		bool *gimme;
 		if (scx_bpf_consume(FALLBACK_DSQ_ID))
 			return;
 		gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids);
 		if (gimme)
 			*gimme = true;
 		/*
 		 * Force dispatch on the scheduling CPU so that it finds a task
 		 * to run for us.
 		 */
 		scx_bpf_kick_cpu(central_cpu, SCX_KICK_PREEMPT);
 	}
 }
 void BPF_STRUCT_OPS(central_running, struct task_struct *p)
 {
 	s32 cpu = scx_bpf_task_cpu(p);
 	u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
 	if (started_at)
 		*started_at = bpf_ktime_get_ns() ?: 1;	/* 0 indicates idle */
 }
 void BPF_STRUCT_OPS(central_stopping, struct task_struct *p, bool runnable)
 {
 	s32 cpu = scx_bpf_task_cpu(p);
 	u64 *started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
 	if (started_at)
 		*started_at = 0;
 }
 static int central_timerfn(void *map, int *key, struct bpf_timer *timer)
 {
 	u64 now = bpf_ktime_get_ns();
 	u64 nr_to_kick = nr_queued;
 	s32 i, curr_cpu;
 	curr_cpu = bpf_get_smp_processor_id();
 	if (timer_pinned && (curr_cpu != central_cpu)) {
 		scx_bpf_error("Central timer ran on CPU %d, not central CPU %d",
 			      curr_cpu, central_cpu);
 		return 0;
 	}
 	bpf_for(i, 0, nr_cpu_ids) {
 		s32 cpu = (nr_timers + i) % nr_cpu_ids;
 		u64 *started_at;
 		if (cpu == central_cpu)
 			continue;
 		/* kick iff the current one exhausted its slice */
 		started_at = ARRAY_ELEM_PTR(cpu_started_at, cpu, nr_cpu_ids);
 		if (started_at && *started_at &&
 		    vtime_before(now, *started_at + slice_ns))
 			continue;
 		/* and there's something pending */
 		if (scx_bpf_dsq_nr_queued(FALLBACK_DSQ_ID) ||
 		    scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | cpu))
 			;
 		else if (nr_to_kick)
 			nr_to_kick--;
 		else
 			continue;
 		scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
 	}
 	bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN);
 	__sync_fetch_and_add(&nr_timers, 1);
 	return 0;
 }
 int BPF_STRUCT_OPS_SLEEPABLE(central_init)
 {
 	u32 key = 0;
 	struct bpf_timer *timer;
 	int ret;
 	ret = scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
 	if (ret)
 		return ret;
 	timer = bpf_map_lookup_elem(&central_timer, &key);
 	if (!timer)
 		return -ESRCH;
 	if (bpf_get_smp_processor_id() != central_cpu) {
 		scx_bpf_error("init from non-central CPU");
 		return -EINVAL;
 	}
 	bpf_timer_init(timer, &central_timer, CLOCK_MONOTONIC);
 	bpf_timer_set_callback(timer, central_timerfn);
 	ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, BPF_F_TIMER_CPU_PIN);
 	/*
 	 * BPF_F_TIMER_CPU_PIN is pretty new (>=6.7). If we're running in a
 	 * kernel which doesn't have it, bpf_timer_start() will return -EINVAL.
 	 * Retry without the PIN. This would be the perfect use case for
 	 * bpf_core_enum_value_exists() but the enum type doesn't have a name
 	 * and can't be used with bpf_core_enum_value_exists(). Oh well...
 	 */
 	if (ret == -EINVAL) {
 		timer_pinned = false;
 		ret = bpf_timer_start(timer, TIMER_INTERVAL_NS, 0);
 	}
 	if (ret)
 		scx_bpf_error("bpf_timer_start failed (%d)", ret);
 	return ret;
 }
 void BPF_STRUCT_OPS(central_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SCX_OPS_DEFINE(central_ops,
 	       /*
 		* We are offloading all scheduling decisions to the central CPU
 		* and thus being the last task on a given CPU doesn't mean
 		* anything special. Enqueue the last tasks like any other tasks.
 		*/
 	       .flags			= SCX_OPS_ENQ_LAST,
 	       .select_cpu		= (void *)central_select_cpu,
 	       .enqueue			= (void *)central_enqueue,
 	       .dispatch		= (void *)central_dispatch,
 	       .running			= (void *)central_running,
 	       .stopping		= (void *)central_stopping,
 	       .init			= (void *)central_init,
 	       .exit			= (void *)central_exit,
 	       .name			= "central");
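central_timerfn() in scx_central.bpf.c decides whether a CPU's slice has run out with vtime_before(), which compares two u64 timestamps through a signed difference so that the result stays correct when the counter wraps. The following is a minimal user-space sketch of that comparison, not part of the patch. `slice_expired()` is a hypothetical helper that mirrors the timer's check; it ignores the special case where a start time of 0 marks an idle CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same comparison as vtime_before() in the BPF source: true if a precedes b. */
static bool vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Hypothetical helper mirroring the timer check: a task that started at
 * @started_at has used up @slice_ns once @now is no longer before the
 * deadline @started_at + @slice_ns. The addition may wrap; that is fine
 * because vtime_before() only looks at the difference.
 */
static bool slice_expired(uint64_t now, uint64_t started_at, uint64_t slice_ns)
{
	/* The signed-difference trick only works for spans below 2^63. */
	assert(slice_ns < (1ULL << 63));
	return !vtime_before(now, started_at + slice_ns);
}
```

A plain `now < started_at + slice_ns` would give the wrong answer whenever the deadline wraps past zero, which is why the scheduler routes the comparison through the signed difference.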
--- a/tools/sched_ext/scx_central.c
+++ b/tools/sched_ext/scx_central.c
@@ -0,0 +1,135 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #define _GNU_SOURCE
 #include <sched.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <inttypes.h>
 #include <signal.h>
 #include <libgen.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "scx_central.bpf.skel.h"
 const char help_fmt[] =
 "A central FIFO sched_ext scheduler.\n"
 "\n"
 "See the top-level comment in .bpf.c for more details.\n"
 "\n"
 "Usage: %s [-s SLICE_US] [-c CPU]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -c CPU        Override the central CPU (default: 0)\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 static bool verbose;
 static volatile int exit_req;
 static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
 {
 	if (level == LIBBPF_DEBUG && !verbose)
 		return 0;
 	return vfprintf(stderr, format, args);
 }
 static void sigint_handler(int dummy)
 {
 	exit_req = 1;
 }
 int main(int argc, char **argv)
 {
 	struct scx_central *skel;
 	struct bpf_link *link;
 	__u64 seq = 0, ecode;
 	__s32 opt;
 	cpu_set_t *cpuset;
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
 restart:
 	skel = SCX_OPS_OPEN(central_ops, scx_central);
 	skel->rodata->central_cpu = 0;
 	skel->rodata->nr_cpu_ids = libbpf_num_possible_cpus();
 	while ((opt = getopt(argc, argv, "s:c:vh")) != -1) {
 		switch (opt) {
 		case 's':
 			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
 			break;
 		case 'c':
 			skel->rodata->central_cpu = strtoul(optarg, NULL, 0);
 			break;
 		case 'v':
 			verbose = true;
 			break;
 		default:
 			fprintf(stderr, help_fmt, basename(argv[0]));
 			return opt != 'h';
 		}
 	}
 	/* Resize arrays so their element count is equal to cpu count. */
 	RESIZE_ARRAY(skel, data, cpu_gimme_task, skel->rodata->nr_cpu_ids);
 	RESIZE_ARRAY(skel, data, cpu_started_at, skel->rodata->nr_cpu_ids);
 	SCX_OPS_LOAD(skel, central_ops, scx_central, uei);
 	/*
 	 * Affinitize the loading thread to the central CPU, as:
 	 * - That's where the BPF timer is first invoked in the BPF program.
 	 * - We probably don't want this user space component to take up a core
 	 *   from a task that would benefit from avoiding preemption on one of
 	 *   the tickless cores.
 	 *
 	 * On kernels without BPF_F_TIMER_CPU_PIN, the timer can't be pinned and it's
 	 * not guaranteed that it will always be invoked on the central CPU. In
 	 * practice, this suffices the majority of the time.
 	 */
 	cpuset = CPU_ALLOC(skel->rodata->nr_cpu_ids);
 	SCX_BUG_ON(!cpuset, "Failed to allocate cpuset");
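 	/* cpuset was sized by CPU_ALLOC(), so use the _S macros and its real size */
 	CPU_ZERO_S(CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids), cpuset);
 	CPU_SET_S(skel->rodata->central_cpu,
 		  CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids), cpuset);
 	SCX_BUG_ON(sched_setaffinity(0, CPU_ALLOC_SIZE(skel->rodata->nr_cpu_ids),
 				     cpuset),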
 		   "Failed to affinitize to central CPU %d (max %d)",
 		   skel->rodata->central_cpu, skel->rodata->nr_cpu_ids - 1);
 	CPU_FREE(cpuset);
 	link = SCX_OPS_ATTACH(skel, central_ops, scx_central);
 	if (!skel->data->timer_pinned)
 		printf("WARNING : BPF_F_TIMER_CPU_PIN not available, timer not pinned to central\n");
 	while (!exit_req && !UEI_EXITED(skel, uei)) {
 		printf("[SEQ %llu]\n", seq++);
 		printf("total   :%10" PRIu64 "    local:%10" PRIu64 "   queued:%10" PRIu64 "  lost:%10" PRIu64 "\n",
 		       skel->bss->nr_total,
 		       skel->bss->nr_locals,
 		       skel->bss->nr_queued,
 		       skel->bss->nr_lost_pids);
 		printf("timer   :%10" PRIu64 " dispatch:%10" PRIu64 " mismatch:%10" PRIu64 " retry:%10" PRIu64 "\n",
 		       skel->bss->nr_timers,
 		       skel->bss->nr_dispatches,
 		       skel->bss->nr_mismatches,
 		       skel->bss->nr_retries);
 		printf("overflow:%10" PRIu64 "\n",
 		       skel->bss->nr_overflows);
 		fflush(stdout);
 		sleep(1);
 	}
 	bpf_link__destroy(link);
 	ecode = UEI_REPORT(skel, uei);
 	scx_central__destroy(skel);
 	if (UEI_ECODE_RESTART(ecode))
 		goto restart;
 	return 0;
 }
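main() in scx_central.c pins the loading thread to the central CPU through a CPU set obtained from CPU_ALLOC(). For such dynamically sized sets, glibc provides the sized macro variants (CPU_ZERO_S(), CPU_SET_S() and friends), which take the byte size reported by CPU_ALLOC_SIZE(). The following is a minimal sketch of that pattern, not part of the patch; `alloc_single_cpu_set()` is a hypothetical helper introduced only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical helper: allocate a CPU set able to hold @nr_cpus CPUs and set
 * only @cpu in it. The byte size of the set is returned through @sizep so the
 * caller can hand it to the other sized macros and to sched_setaffinity().
 * Returns NULL on allocation failure. The caller frees the set with CPU_FREE().
 */
static cpu_set_t *alloc_single_cpu_set(int nr_cpus, int cpu, size_t *sizep)
{
	cpu_set_t *set;

	assert(cpu >= 0 && cpu < nr_cpus);

	set = CPU_ALLOC(nr_cpus);
	if (!set)
		return NULL;

	/* sizeof(set) would only be the pointer size; ask glibc instead. */
	*sizep = CPU_ALLOC_SIZE(nr_cpus);
	CPU_ZERO_S(*sizep, set);
	CPU_SET_S(cpu, *sizep, set);
	return set;
}
```

A caller would then pass the returned set and size to `sched_setaffinity(0, size, set)` and release the set with `CPU_FREE(set)` afterwards.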
--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -0,0 +1,949 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A demo sched_ext flattened cgroup hierarchy scheduler. It implements
 * hierarchical weight-based cgroup CPU control by flattening the cgroup
 * hierarchy into a single layer by compounding the active weight share at each
 * level. Consider the following hierarchy with weights in parentheses:
 *
 * R + A (100) + B (100)
 *   |         \ C (100)
 *   \ D (200)
 *
 * Ignoring the root and threaded cgroups, only B, C and D can contain tasks.
 * Let's say all three have runnable tasks. The total share that each of these
 * three cgroups is entitled to can be calculated by compounding its share at
 * each level.
 *
 * For example, B is competing against C and in that competition its share is
 * 100/(100+100) == 1/2. At its parent level, A is competing against D and A's
 * share in that competition is 100/(200+100) == 1/3. B's eventual share in the
 * system can be calculated by multiplying the two shares, 1/2 * 1/3 == 1/6. C's
 * eventual share is the same at 1/6. D is only competing at the top level and
 * its share is 200/(100+200) == 2/3.
 *
 * So, instead of hierarchically scheduling level-by-level, we can consider it
 * as B, C and D competing against each other with respective shares of 1/6,
 * 1/6 and 2/3 and keep updating the eventual shares as the cgroups' runnable
 * states change.
 *
 * This flattening of hierarchy can bring a substantial performance gain when
 * the cgroup hierarchy is nested multiple levels. In a simple benchmark using
 * wrk[8] on apache serving a CGI script calculating sha1sum of a small file, it
 * outperforms CFS by ~3% with CPU controller disabled and by ~10% with two
 * apache instances competing with 2:1 weight ratio nested four levels deep.
 *
 * However, the gain comes at the cost of not being able to properly handle
 * thundering herd of cgroups. For example, if many cgroups which are nested
 * behind a low priority parent cgroup wake up around the same time, they may be
 * able to consume more CPU cycles than they are entitled to. In many use cases,
 * this isn't a real concern especially given the performance gain. Also, there
 * are ways to mitigate the problem further by e.g. introducing an extra
 * scheduling layer on cgroup delegation boundaries.
 *
 * The scheduler first picks the cgroup to run and then schedules the tasks
 * within by using nested weighted vtime scheduling by default. The
 * cgroup-internal scheduling can be switched to FIFO with the -f option.
 */
 #include <scx/common.bpf.h>
 #include "scx_flatcg.h"
 /*
 * Maximum amount of retries to find a valid cgroup.
 */
 #define CGROUP_MAX_RETRIES 1024
 char _license[] SEC("license") = "GPL";
 const volatile u32 nr_cpus = 32;	/* !0 for veristat, set during init */
 const volatile u64 cgrp_slice_ns = SCX_SLICE_DFL;
 const volatile bool fifo_sched;
 u64 cvtime_now;
 UEI_DEFINE(uei);
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 	__type(key, u32);
 	__type(value, u64);
 	__uint(max_entries, FCG_NR_STATS);
 } stats SEC(".maps");
 static void stat_inc(enum fcg_stat_idx idx)
 {
 	u32 idx_v = idx;
 	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx_v);
 	if (cnt_p)
 		(*cnt_p)++;
 }
 struct fcg_cpu_ctx {
 	u64			cur_cgid;
 	u64			cur_at;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 	__type(key, u32);
 	__type(value, struct fcg_cpu_ctx);
 	__uint(max_entries, 1);
 } cpu_ctx SEC(".maps");
 struct {
 	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
 	__type(key, int);
 	__type(value, struct fcg_cgrp_ctx);
 } cgrp_ctx SEC(".maps");
 struct cgv_node {
 	struct bpf_rb_node	rb_node;
 	__u64			cvtime;
 	__u64			cgid;
 };
 private(CGV_TREE) struct bpf_spin_lock cgv_tree_lock;
 private(CGV_TREE) struct bpf_rb_root cgv_tree __contains(cgv_node, rb_node);
 struct cgv_node_stash {
 	struct cgv_node __kptr *node;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_HASH);
 	__uint(max_entries, 16384);
 	__type(key, __u64);
 	__type(value, struct cgv_node_stash);
 } cgv_node_stash SEC(".maps");
 struct fcg_task_ctx {
 	u64		bypassed_at;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
 	__type(key, int);
 	__type(value, struct fcg_task_ctx);
 } task_ctx SEC(".maps");
 /* gets inc'd on weight tree changes to expire the cached hweights */
 u64 hweight_gen = 1;
 static u64 div_round_up(u64 dividend, u64 divisor)
 {
 	return (dividend + divisor - 1) / divisor;
 }
 static bool vtime_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
 static bool cgv_node_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
 {
 	struct cgv_node *cgc_a, *cgc_b;
 	cgc_a = container_of(a, struct cgv_node, rb_node);
 	cgc_b = container_of(b, struct cgv_node, rb_node);
 	return cgc_a->cvtime < cgc_b->cvtime;
 }
 static struct fcg_cpu_ctx *find_cpu_ctx(void)
 {
 	struct fcg_cpu_ctx *cpuc;
 	u32 idx = 0;
 	cpuc = bpf_map_lookup_elem(&cpu_ctx, &idx);
 	if (!cpuc) {
 		scx_bpf_error("cpu_ctx lookup failed");
 		return NULL;
 	}
 	return cpuc;
 }
 static struct fcg_cgrp_ctx *find_cgrp_ctx(struct cgroup *cgrp)
 {
 	struct fcg_cgrp_ctx *cgc;
 	cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0);
 	if (!cgc) {
 		scx_bpf_error("cgrp_ctx lookup failed for cgid %llu", cgrp->kn->id);
 		return NULL;
 	}
 	return cgc;
 }
 static struct fcg_cgrp_ctx *find_ancestor_cgrp_ctx(struct cgroup *cgrp, int level)
 {
 	struct fcg_cgrp_ctx *cgc;
 	cgrp = bpf_cgroup_ancestor(cgrp, level);
 	if (!cgrp) {
 		scx_bpf_error("ancestor cgroup lookup failed");
 		return NULL;
 	}
 	cgc = find_cgrp_ctx(cgrp);
 	if (!cgc)
 		scx_bpf_error("ancestor cgrp_ctx lookup failed");
 	bpf_cgroup_release(cgrp);
 	return cgc;
 }
 static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)
 {
 	int level;
 	if (!cgc->nr_active) {
 		stat_inc(FCG_STAT_HWT_SKIP);
 		return;
 	}
 	if (cgc->hweight_gen == hweight_gen) {
 		stat_inc(FCG_STAT_HWT_CACHE);
 		return;
 	}
 	stat_inc(FCG_STAT_HWT_UPDATES);
 	bpf_for(level, 0, cgrp->level + 1) {
 		struct fcg_cgrp_ctx *cgc;
 		bool is_active;
 		cgc = find_ancestor_cgrp_ctx(cgrp, level);
 		if (!cgc)
 			break;
 		if (!level) {
 			cgc->hweight = FCG_HWEIGHT_ONE;
 			cgc->hweight_gen = hweight_gen;
 		} else {
 			struct fcg_cgrp_ctx *pcgc;
 			pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1);
 			if (!pcgc)
 				break;
 			/*
 			 * We can be opportunistic here and not grab the
 			 * cgv_tree_lock and deal with the occasional races.
 			 * However, hweight updates are already cached and
 			 * relatively low-frequency. Let's just do the
 			 * straightforward thing.
 			 */
 			bpf_spin_lock(&cgv_tree_lock);
 			is_active = cgc->nr_active;
 			if (is_active) {
 				cgc->hweight_gen = pcgc->hweight_gen;
 				cgc->hweight =
 					div_round_up(pcgc->hweight * cgc->weight,
 						     pcgc->child_weight_sum);
 			}
 			bpf_spin_unlock(&cgv_tree_lock);
 			if (!is_active) {
 				stat_inc(FCG_STAT_HWT_RACE);
 				break;
 			}
 		}
 	}
 }
 static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc)
 {
 	u64 delta, cvtime, max_budget;
 	/*
 	 * A node which is on the rbtree can't be pointed to from elsewhere yet
 	 * and thus can't be updated and repositioned. Instead, we collect the
 	 * vtime deltas separately and apply them asynchronously here.
 	 */
 	delta = cgc->cvtime_delta;
 	__sync_fetch_and_sub(&cgc->cvtime_delta, delta);
 	cvtime = cgv_node->cvtime + delta;
 	/*
 	 * Allow a cgroup to carry the maximum budget proportional to its
 	 * hweight such that a full-hweight cgroup can immediately take up half
 	 * of the CPUs at the most while staying at the front of the rbtree.
 	 */
 	max_budget = (cgrp_slice_ns * nr_cpus * cgc->hweight) /
 		(2 * FCG_HWEIGHT_ONE);
 	if (vtime_before(cvtime, cvtime_now - max_budget))
 		cvtime = cvtime_now - max_budget;
 	cgv_node->cvtime = cvtime;
 }
 static void cgrp_enqueued(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc)
 {
 	struct cgv_node_stash *stash;
 	struct cgv_node *cgv_node;
 	u64 cgid = cgrp->kn->id;
 	/* paired with cmpxchg in try_pick_next_cgroup() */
 	if (__sync_val_compare_and_swap(&cgc->queued, 0, 1)) {
 		stat_inc(FCG_STAT_ENQ_SKIP);
 		return;
 	}
 	stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid);
 	if (!stash) {
 		scx_bpf_error("cgv_node lookup failed for cgid %llu", cgid);
 		return;
 	}
 	/* NULL if the node is already on the rbtree */
 	cgv_node = bpf_kptr_xchg(&stash->node, NULL);
 	if (!cgv_node) {
 		stat_inc(FCG_STAT_ENQ_RACE);
 		return;
 	}
 	bpf_spin_lock(&cgv_tree_lock);
 	cgrp_cap_budget(cgv_node, cgc);
 	bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
 	bpf_spin_unlock(&cgv_tree_lock);
 }
 static void set_bypassed_at(struct task_struct *p, struct fcg_task_ctx *taskc)
 {
 	/*
 	 * Tell fcg_stopping() that this bypassed the regular scheduling path
 	 * and should be force charged to the cgroup. 0 is used to indicate that
 	 * the task isn't bypassing, so if the current runtime is 0, go back by
 	 * one nanosecond.
 	 */
 	taskc->bypassed_at = p->se.sum_exec_runtime ?: (u64)-1;
 }
 s32 BPF_STRUCT_OPS(fcg_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
 {
 	struct fcg_task_ctx *taskc;
 	bool is_idle = false;
 	s32 cpu;
 	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
 	taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
 	if (!taskc) {
 		scx_bpf_error("task_ctx lookup failed");
 		return cpu;
 	}
 	/*
 	 * If select_cpu_dfl() is recommending local enqueue, the target CPU is
 	 * idle. Follow it and charge the cgroup later in fcg_stopping() after
 	 * the fact.
 	 */
 	if (is_idle) {
 		set_bypassed_at(p, taskc);
 		stat_inc(FCG_STAT_LOCAL);
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
 	}
 	return cpu;
 }
 void BPF_STRUCT_OPS(fcg_enqueue, struct task_struct *p, u64 enq_flags)
 {
 	struct fcg_task_ctx *taskc;
 	struct cgroup *cgrp;
 	struct fcg_cgrp_ctx *cgc;
 	taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
 	if (!taskc) {
 		scx_bpf_error("task_ctx lookup failed");
 		return;
 	}
 	/*
 	 * Use the direct dispatching and force charging to deal with tasks with
 	 * custom affinities so that we don't have to worry about per-cgroup
 	 * dq's containing tasks that can't be executed from some CPUs.
 	 */
 	if (p->nr_cpus_allowed != nr_cpus) {
 		set_bypassed_at(p, taskc);
 		/*
 		 * The global dq is deprioritized as we don't want to let tasks
 		 * boost themselves by constraining their cpumasks. The
 		 * deprioritization is rather severe, so let's not apply that to
 		 * per-cpu kernel threads. This is ham-fisted. We probably wanna
 		 * implement per-cgroup fallback dq's instead so that we have
 		 * more control over when tasks with custom cpumask get issued.
 		 */
 		if (p->nr_cpus_allowed == 1 && (p->flags & PF_KTHREAD)) {
 			stat_inc(FCG_STAT_LOCAL);
 			scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
 		} else {
 			stat_inc(FCG_STAT_GLOBAL);
 			scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
 		}
 		return;
 	}
 	cgrp = scx_bpf_task_cgroup(p);
 	cgc = find_cgrp_ctx(cgrp);
 	if (!cgc)
 		goto out_release;
 	if (fifo_sched) {
 		scx_bpf_dispatch(p, cgrp->kn->id, SCX_SLICE_DFL, enq_flags);
 	} else {
 		u64 tvtime = p->scx.dsq_vtime;
 		/*
 		 * Limit the amount of budget that an idling task can accumulate
 		 * to one slice.
 		 */
 		if (vtime_before(tvtime, cgc->tvtime_now - SCX_SLICE_DFL))
 			tvtime = cgc->tvtime_now - SCX_SLICE_DFL;
 		scx_bpf_dispatch_vtime(p, cgrp->kn->id, SCX_SLICE_DFL,
 				       tvtime, enq_flags);
 	}
 	cgrp_enqueued(cgrp, cgc);
 out_release:
 	bpf_cgroup_release(cgrp);
 }
 /*
 * Walk the cgroup tree to update the active weight sums as tasks wake up and
 * sleep. The weight sums are used as the base when calculating the proportion a
 * given cgroup or task is entitled to at each level.
 */
 static void update_active_weight_sums(struct cgroup *cgrp, bool runnable)
 {
 	struct fcg_cgrp_ctx *cgc;
 	bool updated = false;
 	int idx;
 	cgc = find_cgrp_ctx(cgrp);
 	if (!cgc)
 		return;
 	/*
 	 * In most cases, a hot cgroup would have multiple threads going to
 	 * sleep and waking up while the whole cgroup stays active. In leaf
 	 * cgroups, ->nr_runnable which is updated with __sync operations gates
 	 * ->nr_active updates, so that we don't have to grab the cgv_tree_lock
 	 * repeatedly for a busy cgroup which is staying active.
 	 */
 	if (runnable) {
 		if (__sync_fetch_and_add(&cgc->nr_runnable, 1))
 			return;
 		stat_inc(FCG_STAT_ACT);
 	} else {
 		if (__sync_sub_and_fetch(&cgc->nr_runnable, 1))
 			return;
 		stat_inc(FCG_STAT_DEACT);
 	}
 	/*
 	 * If @cgrp is becoming runnable, its hweight should be refreshed after
 	 * it's added to the weight tree so that enqueue has the up-to-date
 	 * value. If @cgrp is becoming quiescent, the hweight should be
 	 * refreshed before it's removed from the weight tree so that the usage
 	 * charging which happens afterwards has access to the latest value.
 	 */
 	if (!runnable)
 		cgrp_refresh_hweight(cgrp, cgc);
 	/* propagate upwards */
 	bpf_for(idx, 0, cgrp->level) {
 		int level = cgrp->level - idx;
 		struct fcg_cgrp_ctx *cgc, *pcgc = NULL;
 		bool propagate = false;
 		cgc = find_ancestor_cgrp_ctx(cgrp, level);
 		if (!cgc)
 			break;
 		if (level) {
 			pcgc = find_ancestor_cgrp_ctx(cgrp, level - 1);
 			if (!pcgc)
 				break;
 		}
 		/*
 		 * We need the propagation protected by a lock to synchronize
 		 * against weight changes. There's no reason to drop the lock at
 		 * each level but bpf_spin_lock() doesn't want any function
 		 * calls while locked.
 		 */
 		bpf_spin_lock(&cgv_tree_lock);
 		if (runnable) {
 			if (!cgc->nr_active++) {
 				updated = true;
 				if (pcgc) {
 					propagate = true;
 					pcgc->child_weight_sum += cgc->weight;
 				}
 			}
 		} else {
 			if (!--cgc->nr_active) {
 				updated = true;
 				if (pcgc) {
 					propagate = true;
 					pcgc->child_weight_sum -= cgc->weight;
 				}
 			}
 		}
 		bpf_spin_unlock(&cgv_tree_lock);
 		if (!propagate)
 			break;
 	}
 	if (updated)
 		__sync_fetch_and_add(&hweight_gen, 1);
 	if (runnable)
 		cgrp_refresh_hweight(cgrp, cgc);
 }
 void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags)
 {
 	struct cgroup *cgrp;
 	cgrp = scx_bpf_task_cgroup(p);
 	update_active_weight_sums(cgrp, true);
 	bpf_cgroup_release(cgrp);
 }
 void BPF_STRUCT_OPS(fcg_running, struct task_struct *p)
 {
 	struct cgroup *cgrp;
 	struct fcg_cgrp_ctx *cgc;
 	if (fifo_sched)
 		return;
 	cgrp = scx_bpf_task_cgroup(p);
 	cgc = find_cgrp_ctx(cgrp);
 	if (cgc) {
 		/*
 		 * @cgc->tvtime_now always progresses forward as tasks start
 		 * executing. The test and update can be performed concurrently
 		 * from multiple CPUs and thus racy. Any error should be
 		 * contained and temporary. Let's just live with it.
 		 */
 		if (vtime_before(cgc->tvtime_now, p->scx.dsq_vtime))
 			cgc->tvtime_now = p->scx.dsq_vtime;
 	}
 	bpf_cgroup_release(cgrp);
 }
 void BPF_STRUCT_OPS(fcg_stopping, struct task_struct *p, bool runnable)
 {
 	struct fcg_task_ctx *taskc;
 	struct cgroup *cgrp;
 	struct fcg_cgrp_ctx *cgc;
 	/*
 	 * Scale the execution time by the inverse of the weight and charge.
 	 *
 	 * Note that the default yield implementation yields by setting
 	 * @p->scx.slice to zero and the following would treat the yielding task
 	 * as if it has consumed all its slice. If this penalizes yielding tasks
 	 * too much, determine the execution time by taking explicit timestamps
 	 * instead of depending on @p->scx.slice.
 	 */
 	if (!fifo_sched)
 		p->scx.dsq_vtime +=
 			(SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
 	taskc = bpf_task_storage_get(&task_ctx, p, 0, 0);
 	if (!taskc) {
 		scx_bpf_error("task_ctx lookup failed");
 		return;
 	}
 	if (!taskc->bypassed_at)
 		return;
 	cgrp = scx_bpf_task_cgroup(p);
 	cgc = find_cgrp_ctx(cgrp);
 	if (cgc) {
 		__sync_fetch_and_add(&cgc->cvtime_delta,
 				     p->se.sum_exec_runtime - taskc->bypassed_at);
 		taskc->bypassed_at = 0;
 	}
 	bpf_cgroup_release(cgrp);
 }
 void BPF_STRUCT_OPS(fcg_quiescent, struct task_struct *p, u64 deq_flags)
 {
 	struct cgroup *cgrp;
 	cgrp = scx_bpf_task_cgroup(p);
 	update_active_weight_sums(cgrp, false);
 	bpf_cgroup_release(cgrp);
 }
 void BPF_STRUCT_OPS(fcg_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
 {
 	struct fcg_cgrp_ctx *cgc, *pcgc = NULL;
 	cgc = find_cgrp_ctx(cgrp);
 	if (!cgc)
 		return;
 	if (cgrp->level) {
 		pcgc = find_ancestor_cgrp_ctx(cgrp, cgrp->level - 1);
 		if (!pcgc)
 			return;
 	}
 	bpf_spin_lock(&cgv_tree_lock);
 	if (pcgc && cgc->nr_active)
 		pcgc->child_weight_sum += (s64)weight - cgc->weight;
 	cgc->weight = weight;
 	bpf_spin_unlock(&cgv_tree_lock);
 }
 static bool try_pick_next_cgroup(u64 *cgidp)
 {
 	struct bpf_rb_node *rb_node;
 	struct cgv_node_stash *stash;
 	struct cgv_node *cgv_node;
 	struct fcg_cgrp_ctx *cgc;
 	struct cgroup *cgrp;
 	u64 cgid;
 	/* pop the front cgroup and wind cvtime_now accordingly */
 	bpf_spin_lock(&cgv_tree_lock);
 	rb_node = bpf_rbtree_first(&cgv_tree);
 	if (!rb_node) {
 		bpf_spin_unlock(&cgv_tree_lock);
 		stat_inc(FCG_STAT_PNC_NO_CGRP);
 		*cgidp = 0;
 		return true;
 	}
 	rb_node = bpf_rbtree_remove(&cgv_tree, rb_node);
 	bpf_spin_unlock(&cgv_tree_lock);
 	if (!rb_node) {
 		/*
 		 * This should never happen. bpf_rbtree_first() was called
 		 * above while the tree lock was held, so the node should
 		 * always be present.
 		 */
 		scx_bpf_error("node could not be removed");
 		return true;
 	}
 	cgv_node = container_of(rb_node, struct cgv_node, rb_node);
 	cgid = cgv_node->cgid;
 	if (vtime_before(cvtime_now, cgv_node->cvtime))
 		cvtime_now = cgv_node->cvtime;
 	/*
 	 * If lookup fails, the cgroup's gone. Free and move on. See
 	 * fcg_cgroup_exit().
 	 */
 	cgrp = bpf_cgroup_from_id(cgid);
 	if (!cgrp) {
 		stat_inc(FCG_STAT_PNC_GONE);
 		goto out_free;
 	}
 	cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0);
 	if (!cgc) {
 		bpf_cgroup_release(cgrp);
 		stat_inc(FCG_STAT_PNC_GONE);
 		goto out_free;
 	}
 	if (!scx_bpf_consume(cgid)) {
 		bpf_cgroup_release(cgrp);
 		stat_inc(FCG_STAT_PNC_EMPTY);
 		goto out_stash;
 	}
 	/*
 	 * Successfully consumed from the cgroup. This will be our current
 	 * cgroup for the new slice. Refresh its hweight.
 	 */
 	cgrp_refresh_hweight(cgrp, cgc);
 	bpf_cgroup_release(cgrp);
 	/*
 	 * As the cgroup may have more tasks, add it back to the rbtree. Note
 	 * that here we charge the full slice upfront and then adjust later
 	 * according to the actual consumption. This prevents a lowpri
 	 * thundering herd from saturating the machine.
 	 */
 	bpf_spin_lock(&cgv_tree_lock);
 	cgv_node->cvtime += cgrp_slice_ns * FCG_HWEIGHT_ONE / (cgc->hweight ?: 1);
 	cgrp_cap_budget(cgv_node, cgc);
 	bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
 	bpf_spin_unlock(&cgv_tree_lock);
 	*cgidp = cgid;
 	stat_inc(FCG_STAT_PNC_NEXT);
 	return true;
 out_stash:
 	stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid);
 	if (!stash) {
 		stat_inc(FCG_STAT_PNC_GONE);
 		goto out_free;
 	}
 	/*
 	 * Paired with cmpxchg in cgrp_enqueued(). If they see the following
 	 * transition, they'll enqueue the cgroup. If they are earlier, we'll
 	 * see their task in the dq below and requeue the cgroup.
 	 */
 	__sync_val_compare_and_swap(&cgc->queued, 1, 0);
 	if (scx_bpf_dsq_nr_queued(cgid)) {
 		bpf_spin_lock(&cgv_tree_lock);
 		bpf_rbtree_add(&cgv_tree, &cgv_node->rb_node, cgv_node_less);
 		bpf_spin_unlock(&cgv_tree_lock);
 		stat_inc(FCG_STAT_PNC_RACE);
 	} else {
 		cgv_node = bpf_kptr_xchg(&stash->node, cgv_node);
 		if (cgv_node) {
 			scx_bpf_error("unexpected !NULL cgv_node stash");
 			goto out_free;
 		}
 	}
 	return false;
 out_free:
 	bpf_obj_drop(cgv_node);
 	return false;
 }
 void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev)
 {
 	struct fcg_cpu_ctx *cpuc;
 	struct fcg_cgrp_ctx *cgc;
 	struct cgroup *cgrp;
 	u64 now = bpf_ktime_get_ns();
 	bool picked_next = false;
 	cpuc = find_cpu_ctx();
 	if (!cpuc)
 		return;
 	if (!cpuc->cur_cgid)
 		goto pick_next_cgroup;
 	if (vtime_before(now, cpuc->cur_at + cgrp_slice_ns)) {
 		if (scx_bpf_consume(cpuc->cur_cgid)) {
 			stat_inc(FCG_STAT_CNS_KEEP);
 			return;
 		}
 		stat_inc(FCG_STAT_CNS_EMPTY);
 	} else {
 		stat_inc(FCG_STAT_CNS_EXPIRE);
 	}
 	/*
 	 * The current cgroup is expiring. It was already charged a full slice.
 	 * Calculate the actual usage and accumulate the delta.
 	 */
 	cgrp = bpf_cgroup_from_id(cpuc->cur_cgid);
 	if (!cgrp) {
 		stat_inc(FCG_STAT_CNS_GONE);
 		goto pick_next_cgroup;
 	}
 	cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0, 0);
 	if (cgc) {
 		/*
 		 * We want to update the vtime delta and then look for the next
 		 * cgroup to execute but the latter needs to be done in a loop
 		 * and we can't keep the lock held. Oh well...
 		 */
 		bpf_spin_lock(&cgv_tree_lock);
 		__sync_fetch_and_add(&cgc->cvtime_delta,
 				     (cpuc->cur_at + cgrp_slice_ns - now) *
 				     FCG_HWEIGHT_ONE / (cgc->hweight ?: 1));
 		bpf_spin_unlock(&cgv_tree_lock);
 	} else {
 		stat_inc(FCG_STAT_CNS_GONE);
 	}
 	bpf_cgroup_release(cgrp);
 pick_next_cgroup:
 	cpuc->cur_at = now;
 	if (scx_bpf_consume(SCX_DSQ_GLOBAL)) {
 		cpuc->cur_cgid = 0;
 		return;
 	}
 	bpf_repeat(CGROUP_MAX_RETRIES) {
 		if (try_pick_next_cgroup(&cpuc->cur_cgid)) {
 			picked_next = true;
 			break;
 		}
 	}
 	/*
 	 * This only happens if try_pick_next_cgroup() races against the enqueue
 	 * path more than CGROUP_MAX_RETRIES times, which is extremely unlikely
 	 * and likely indicates an underlying bug. There shouldn't be any stall
 	 * risk as the race is against enqueue.
 	 */
 	if (!picked_next)
 		stat_inc(FCG_STAT_PNC_FAIL);
 }
 s32 BPF_STRUCT_OPS(fcg_init_task, struct task_struct *p,
 		   struct scx_init_task_args *args)
 {
 	struct fcg_task_ctx *taskc;
 	struct fcg_cgrp_ctx *cgc;
 	/*
 	 * @p is new. Let's ensure that its task_ctx is available. We can sleep
 	 * in this function and the following will automatically use GFP_KERNEL.
 	 */
 	taskc = bpf_task_storage_get(&task_ctx, p, 0,
 				     BPF_LOCAL_STORAGE_GET_F_CREATE);
 	if (!taskc)
 		return -ENOMEM;
 	taskc->bypassed_at = 0;
 	if (!(cgc = find_cgrp_ctx(args->cgroup)))
 		return -ENOENT;
 	p->scx.dsq_vtime = cgc->tvtime_now;
 	return 0;
 }
 int BPF_STRUCT_OPS_SLEEPABLE(fcg_cgroup_init, struct cgroup *cgrp,
 			     struct scx_cgroup_init_args *args)
 {
 	struct fcg_cgrp_ctx *cgc;
 	struct cgv_node *cgv_node;
 	struct cgv_node_stash empty_stash = {}, *stash;
 	u64 cgid = cgrp->kn->id;
 	int ret;
 	/*
 	 * Technically incorrect as cgroup ID is full 64bit while dq ID is
 	 * 63bit. Should not be a problem in practice and easy to spot in the
 	 * unlikely case that it breaks.
 	 */
 	ret = scx_bpf_create_dsq(cgid, -1);
 	if (ret)
 		return ret;
 	cgc = bpf_cgrp_storage_get(&cgrp_ctx, cgrp, 0,
 				   BPF_LOCAL_STORAGE_GET_F_CREATE);
 	if (!cgc) {
 		ret = -ENOMEM;
 		goto err_destroy_dsq;
 	}
 	cgc->weight = args->weight;
 	cgc->hweight = FCG_HWEIGHT_ONE;
 	ret = bpf_map_update_elem(&cgv_node_stash, &cgid, &empty_stash,
 				  BPF_NOEXIST);
 	if (ret) {
 		if (ret != -ENOMEM)
 			scx_bpf_error("unexpected stash creation error (%d)",
 				      ret);
 		goto err_destroy_dsq;
 	}
 	stash = bpf_map_lookup_elem(&cgv_node_stash, &cgid);
 	if (!stash) {
 		scx_bpf_error("unexpected cgv_node stash lookup failure");
 		ret = -ENOENT;
 		goto err_destroy_dsq;
 	}
 	cgv_node = bpf_obj_new(struct cgv_node);
 	if (!cgv_node) {
 		ret = -ENOMEM;
 		goto err_del_cgv_node;
 	}
 	cgv_node->cgid = cgid;
 	cgv_node->cvtime = cvtime_now;
 	cgv_node = bpf_kptr_xchg(&stash->node, cgv_node);
 	if (cgv_node) {
 		scx_bpf_error("unexpected !NULL cgv_node stash");
 		ret = -EBUSY;
 		goto err_drop;
 	}
 	return 0;
 err_drop:
 	bpf_obj_drop(cgv_node);
 err_del_cgv_node:
 	bpf_map_delete_elem(&cgv_node_stash, &cgid);
 err_destroy_dsq:
 	scx_bpf_destroy_dsq(cgid);
 	return ret;
 }
 void BPF_STRUCT_OPS(fcg_cgroup_exit, struct cgroup *cgrp)
 {
 	u64 cgid = cgrp->kn->id;
 	/*
 	 * For now, there's no way to find and remove the cgv_node if it's on
 	 * the cgv_tree. Let's drain them in the dispatch path as they get
 	 * popped off the front of the tree.
 	 */
 	bpf_map_delete_elem(&cgv_node_stash, &cgid);
 	scx_bpf_destroy_dsq(cgid);
 }
 void BPF_STRUCT_OPS(fcg_cgroup_move, struct task_struct *p,
 		    struct cgroup *from, struct cgroup *to)
 {
 	struct fcg_cgrp_ctx *from_cgc, *to_cgc;
 	s64 vtime_delta;
 	/* find_cgrp_ctx() triggers scx_ops_error() on lookup failures */
 	if (!(from_cgc = find_cgrp_ctx(from)) || !(to_cgc = find_cgrp_ctx(to)))
 		return;
 	vtime_delta = p->scx.dsq_vtime - from_cgc->tvtime_now;
 	p->scx.dsq_vtime = to_cgc->tvtime_now + vtime_delta;
 }
 void BPF_STRUCT_OPS(fcg_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SCX_OPS_DEFINE(flatcg_ops,
 	       .select_cpu		= (void *)fcg_select_cpu,
 	       .enqueue			= (void *)fcg_enqueue,
 	       .dispatch		= (void *)fcg_dispatch,
 	       .runnable		= (void *)fcg_runnable,
 	       .running			= (void *)fcg_running,
 	       .stopping		= (void *)fcg_stopping,
 	       .quiescent		= (void *)fcg_quiescent,
 	       .init_task		= (void *)fcg_init_task,
 	       .cgroup_set_weight	= (void *)fcg_cgroup_set_weight,
 	       .cgroup_init		= (void *)fcg_cgroup_init,
 	       .cgroup_exit		= (void *)fcg_cgroup_exit,
 	       .cgroup_move		= (void *)fcg_cgroup_move,
 	       .exit			= (void *)fcg_exit,
 	       .flags			= SCX_OPS_HAS_CGROUP_WEIGHT | SCX_OPS_ENQ_EXITING,
 	       .name			= "flatcg");
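The charge-then-adjust accounting in try_pick_next_cgroup() and fcg_dispatch() above can be sketched as plain userspace arithmetic. This is only an illustration under stated assumptions: the helper names (scale_by_hweight(), upfront_charge(), unused_delta()) are hypothetical and not part of the patch, and HWEIGHT_ONE merely mirrors FCG_HWEIGHT_ONE. How the accumulated delta is later folded back into cvtime is outside this excerpt and is not modeled here.

```c
#include <stdint.h>

#define HWEIGHT_ONE	(1ULL << 16)	/* mirrors FCG_HWEIGHT_ONE */

/* Scale @ns of wall-clock time by the inverse of @hweight (fixed point). */
static uint64_t scale_by_hweight(uint64_t ns, uint32_t hweight)
{
	/* same guard as "cgc->hweight ?: 1" in the BPF code */
	return ns * HWEIGHT_ONE / (hweight ? hweight : 1);
}

/* Amount added to the cgroup's cvtime upfront when it is picked. */
static uint64_t upfront_charge(uint64_t slice_ns, uint32_t hweight)
{
	return scale_by_hweight(slice_ns, hweight);
}

/* Delta computed when the cgroup stops at @now, before its slice ends. */
static uint64_t unused_delta(uint64_t cur_at, uint64_t slice_ns, uint64_t now,
			     uint32_t hweight)
{
	return scale_by_hweight(cur_at + slice_ns - now, hweight);
}
```

With a hypothetical 20ms slice, a cgroup at full hweight is charged 20ms of vtime, while one at half hweight is charged 40ms. If the latter stops after 5ms, the computed delta is 30ms of vtime.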
--- a/tools/sched_ext/scx_flatcg.c
+++ b/tools/sched_ext/scx_flatcg.c
@@ -0,0 +1,233 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 */
 #include <stdio.h>
 #include <signal.h>
 #include <unistd.h>
 #include <libgen.h>
 #include <limits.h>
 #include <inttypes.h>
 #include <fcntl.h>
 #include <time.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "scx_flatcg.h"
 #include "scx_flatcg.bpf.skel.h"
 #ifndef FILEID_KERNFS
 #define FILEID_KERNFS		0xfe
 #endif
 const char help_fmt[] =
 "A flattened cgroup hierarchy sched_ext scheduler.\n"
 "\n"
 "See the top-level comment in .bpf.c for more details.\n"
 "\n"
 "Usage: %s [-s SLICE_US] [-i INTERVAL] [-f] [-v]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -i INTERVAL   Report interval\n"
 "  -f            Use FIFO scheduling instead of weighted vtime scheduling\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 static bool verbose;
 static volatile int exit_req;
 static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
 {
 	if (level == LIBBPF_DEBUG && !verbose)
 		return 0;
 	return vfprintf(stderr, format, args);
 }
 static void sigint_handler(int dummy)
 {
 	exit_req = 1;
 }
 static float read_cpu_util(__u64 *last_sum, __u64 *last_idle)
 {
 	FILE *fp;
 	char buf[4096];
 	char *line, *cur = NULL, *tok;
 	__u64 sum = 0, idle = 0;
 	__u64 delta_sum, delta_idle;
 	int idx;
 	fp = fopen("/proc/stat", "r");
 	if (!fp) {
 		perror("fopen(\"/proc/stat\")");
 		return 0.0;
 	}
 	if (!fgets(buf, sizeof(buf), fp)) {
 		perror("fgets(\"/proc/stat\")");
 		fclose(fp);
 		return 0.0;
 	}
 	fclose(fp);
 	line = buf;
 	for (idx = 0; (tok = strtok_r(line, " \n", &cur)); idx++) {
 		char *endp = NULL;
 		__u64 v;
 		if (idx == 0) {
 			line = NULL;
 			continue;
 		}
 		v = strtoull(tok, &endp, 0);
 		if (!endp || *endp != '\0') {
 			fprintf(stderr, "failed to parse %dth field of /proc/stat (\"%s\")\n",
 				idx, tok);
 			continue;
 		}
 		sum += v;
 		if (idx == 4)
 			idle = v;
 	}
 	delta_sum = sum - *last_sum;
 	delta_idle = idle - *last_idle;
 	*last_sum = sum;
 	*last_idle = idle;
 	return delta_sum ? (float)(delta_sum - delta_idle) / delta_sum : 0.0;
 }
 static void fcg_read_stats(struct scx_flatcg *skel, __u64 *stats)
 {
 	__u64 cnts[FCG_NR_STATS][skel->rodata->nr_cpus];
 	__u32 idx;
 	memset(stats, 0, sizeof(stats[0]) * FCG_NR_STATS);
 	for (idx = 0; idx < FCG_NR_STATS; idx++) {
 		int ret, cpu;
 		ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
 					  &idx, cnts[idx]);
 		if (ret < 0)
 			continue;
 		for (cpu = 0; cpu < skel->rodata->nr_cpus; cpu++)
 			stats[idx] += cnts[idx][cpu];
 	}
 }
 int main(int argc, char **argv)
 {
 	struct scx_flatcg *skel;
 	struct bpf_link *link;
 	struct timespec intv_ts = { .tv_sec = 2, .tv_nsec = 0 };
 	bool dump_cgrps = false;
 	__u64 last_cpu_sum = 0, last_cpu_idle = 0;
 	__u64 last_stats[FCG_NR_STATS] = {};
 	unsigned long seq = 0;
 	__s32 opt;
 	__u64 ecode;
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
 restart:
 	skel = SCX_OPS_OPEN(flatcg_ops, scx_flatcg);
 	skel->rodata->nr_cpus = libbpf_num_possible_cpus();
 	while ((opt = getopt(argc, argv, "s:i:dfvh")) != -1) {
 		double v;
 		switch (opt) {
 		case 's':
 			v = strtod(optarg, NULL);
 			skel->rodata->cgrp_slice_ns = v * 1000;
 			break;
 		case 'i':
 			v = strtod(optarg, NULL);
 			intv_ts.tv_sec = v;
 			intv_ts.tv_nsec = (v - (float)intv_ts.tv_sec) * 1000000000;
 			break;
 		case 'd':
 			dump_cgrps = true;
 			break;
 		case 'f':
 			skel->rodata->fifo_sched = true;
 			break;
 		case 'v':
 			verbose = true;
 			break;
 		case 'h':
 		default:
 			fprintf(stderr, help_fmt, basename(argv[0]));
 			return opt != 'h';
 		}
 	}
 	printf("slice=%.1lfms intv=%.1lfs dump_cgrps=%d",
 	       (double)skel->rodata->cgrp_slice_ns / 1000000.0,
 	       (double)intv_ts.tv_sec + (double)intv_ts.tv_nsec / 1000000000.0,
 	       dump_cgrps);
 	SCX_OPS_LOAD(skel, flatcg_ops, scx_flatcg, uei);
 	link = SCX_OPS_ATTACH(skel, flatcg_ops, scx_flatcg);
 	while (!exit_req && !UEI_EXITED(skel, uei)) {
 		__u64 acc_stats[FCG_NR_STATS];
 		__u64 stats[FCG_NR_STATS];
 		float cpu_util;
 		int i;
 		cpu_util = read_cpu_util(&last_cpu_sum, &last_cpu_idle);
 		fcg_read_stats(skel, acc_stats);
 		for (i = 0; i < FCG_NR_STATS; i++)
 			stats[i] = acc_stats[i] - last_stats[i];
 		memcpy(last_stats, acc_stats, sizeof(acc_stats));
 		printf("\n[SEQ %6lu cpu=%5.1lf hweight_gen=%" PRIu64 "]\n",
 		       seq++, cpu_util * 100.0, skel->data->hweight_gen);
 		printf("       act:%6llu  deact:%6llu global:%6llu local:%6llu\n",
 		       stats[FCG_STAT_ACT],
 		       stats[FCG_STAT_DEACT],
 		       stats[FCG_STAT_GLOBAL],
 		       stats[FCG_STAT_LOCAL]);
 		printf("HWT  cache:%6llu update:%6llu   skip:%6llu  race:%6llu\n",
 		       stats[FCG_STAT_HWT_CACHE],
 		       stats[FCG_STAT_HWT_UPDATES],
 		       stats[FCG_STAT_HWT_SKIP],
 		       stats[FCG_STAT_HWT_RACE]);
 		printf("ENQ   skip:%6llu   race:%6llu\n",
 		       stats[FCG_STAT_ENQ_SKIP],
 		       stats[FCG_STAT_ENQ_RACE]);
 		printf("CNS   keep:%6llu expire:%6llu  empty:%6llu  gone:%6llu\n",
 		       stats[FCG_STAT_CNS_KEEP],
 		       stats[FCG_STAT_CNS_EXPIRE],
 		       stats[FCG_STAT_CNS_EMPTY],
 		       stats[FCG_STAT_CNS_GONE]);
 		printf("PNC   next:%6llu  empty:%6llu nocgrp:%6llu  gone:%6llu race:%6llu fail:%6llu\n",
 		       stats[FCG_STAT_PNC_NEXT],
 		       stats[FCG_STAT_PNC_EMPTY],
 		       stats[FCG_STAT_PNC_NO_CGRP],
 		       stats[FCG_STAT_PNC_GONE],
 		       stats[FCG_STAT_PNC_RACE],
 		       stats[FCG_STAT_PNC_FAIL]);
 		printf("BAD remove:%6llu\n",
 		       acc_stats[FCG_STAT_BAD_REMOVAL]);
 		fflush(stdout);
 		nanosleep(&intv_ts, NULL);
 	}
 	bpf_link__destroy(link);
 	ecode = UEI_REPORT(skel, uei);
 	scx_flatcg__destroy(skel);
 	if (UEI_ECODE_RESTART(ecode))
 		goto restart;
 	return 0;
 }
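read_cpu_util() above derives utilization from two samples of the aggregate "cpu" line of /proc/stat. A minimal standalone sketch of that arithmetic follows. parse_cpu_line() and busy_fraction() are hypothetical names, not part of the patch, and the parser walks the string with strtoull() instead of strtok_r(), so treat it as one possible reading rather than the tool's actual code.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum all numeric fields of a "cpu ..." line; the 4th number is idle time. */
static int parse_cpu_line(const char *line, uint64_t *sum, uint64_t *idle)
{
	const char *p = line;
	int idx = 0;

	*sum = 0;
	*idle = 0;

	/* skip the leading label, e.g. "cpu" */
	while (*p && !isspace((unsigned char)*p))
		p++;

	for (;;) {
		char *endp;
		unsigned long long v;

		while (isspace((unsigned char)*p))
			p++;
		if (!*p)
			break;
		v = strtoull(p, &endp, 10);
		if (endp == p)
			return -1;	/* field is not a number */
		idx++;
		*sum += v;
		if (idx == 4)
			*idle = v;
		p = endp;
	}
	return idx >= 4 ? 0 : -1;
}

/* Fraction of non-idle time between two (sum, idle) samples. */
static double busy_fraction(uint64_t sum0, uint64_t idle0,
			    uint64_t sum1, uint64_t idle1)
{
	uint64_t dsum = sum1 - sum0, didle = idle1 - idle0;

	return dsum ? (double)(dsum - didle) / dsum : 0.0;
}
```

Keeping the previous sum and idle values between calls, as read_cpu_util() does through its two pointer arguments, yields utilization over the reporting interval rather than since boot.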
--- a/tools/sched_ext/scx_flatcg.h
+++ b/tools/sched_ext/scx_flatcg.h
@@ -0,0 +1,51 @@
 #ifndef __SCX_EXAMPLE_FLATCG_H
 #define __SCX_EXAMPLE_FLATCG_H
 enum {
 	FCG_HWEIGHT_ONE		= 1LLU << 16,
 };
 enum fcg_stat_idx {
 	FCG_STAT_ACT,
 	FCG_STAT_DEACT,
 	FCG_STAT_LOCAL,
 	FCG_STAT_GLOBAL,
 	FCG_STAT_HWT_UPDATES,
 	FCG_STAT_HWT_CACHE,
 	FCG_STAT_HWT_SKIP,
 	FCG_STAT_HWT_RACE,
 	FCG_STAT_ENQ_SKIP,
 	FCG_STAT_ENQ_RACE,
 	FCG_STAT_CNS_KEEP,
 	FCG_STAT_CNS_EXPIRE,
 	FCG_STAT_CNS_EMPTY,
 	FCG_STAT_CNS_GONE,
 	FCG_STAT_PNC_NO_CGRP,
 	FCG_STAT_PNC_NEXT,
 	FCG_STAT_PNC_EMPTY,
 	FCG_STAT_PNC_GONE,
 	FCG_STAT_PNC_RACE,
 	FCG_STAT_PNC_FAIL,
 	FCG_STAT_BAD_REMOVAL,
 	FCG_NR_STATS,
 };
 struct fcg_cgrp_ctx {
 	u32			nr_active;
 	u32			nr_runnable;
 	u32			queued;
 	u32			weight;
 	u32			hweight;
 	u64			child_weight_sum;
 	u64			hweight_gen;
 	s64			cvtime_delta;
 	u64			tvtime_now;
 };
 #endif /* __SCX_EXAMPLE_FLATCG_H */
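The counters indexed by enum fcg_stat_idx are kept per CPU and are cumulative, so the userspace side first folds them into one total per index and then reports the difference from the previous sample. A rough sketch of that two-step pattern, assuming a flat array in place of the BPF per-CPU map lookup and using hypothetical names (NR_STATS, sum_percpu(), interval_delta()) that are not part of the patch:

```c
#include <stdint.h>
#include <string.h>

enum { NR_STATS = 3 };		/* stand-in for FCG_NR_STATS */

/*
 * Fold per-CPU counters into one running total per stat index. @cnts is
 * laid out as NR_STATS rows of @nr_cpus values each.
 */
static void sum_percpu(const uint64_t *cnts, int nr_cpus, uint64_t *totals)
{
	memset(totals, 0, sizeof(totals[0]) * NR_STATS);
	for (int idx = 0; idx < NR_STATS; idx++)
		for (int cpu = 0; cpu < nr_cpus; cpu++)
			totals[idx] += cnts[idx * nr_cpus + cpu];
}

/* Per-interval activity is the difference between two running totals. */
static void interval_delta(const uint64_t *acc, uint64_t *last, uint64_t *delta)
{
	for (int idx = 0; idx < NR_STATS; idx++)
		delta[idx] = acc[idx] - last[idx];
	/* remember the current totals for the next interval */
	memcpy(last, acc, sizeof(acc[0]) * NR_STATS);
}
```

Printing a cumulative value instead of a delta, as the "BAD remove" line in main() does with acc_stats, is a reasonable choice for events that should never occur at all.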
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -0,0 +1,827 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A simple five-level FIFO queue scheduler.
 *
 * There are five FIFOs implemented using BPF_MAP_TYPE_QUEUE. A task gets
 * assigned to one depending on its compound weight. Each CPU round robins
 * through the FIFOs and dispatches more from FIFOs with higher indices - 1 from
 * queue0, 2 from queue1, 4 from queue2 and so on.
 *
 * This scheduler demonstrates:
 *
 * - BPF-side queueing using PIDs.
 * - Sleepable per-task storage allocation using ops.init_task().
 * - Using ops.cpu_release() to handle a higher priority scheduling class taking
 *   the CPU away.
 * - Core-sched support.
 *
 * This scheduler is primarily for demonstration and testing of sched_ext
 * features and unlikely to be useful for actual workloads.
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 enum consts {
 	ONE_SEC_IN_NS		= 1000000000,
 	SHARED_DSQ		= 0,
 	HIGHPRI_DSQ		= 1,
 	HIGHPRI_WEIGHT		= 8668,		/* this is what -20 maps to */
 };
 char _license[] SEC("license") = "GPL";
 const volatile u64 slice_ns = SCX_SLICE_DFL;
 const volatile u32 stall_user_nth;
 const volatile u32 stall_kernel_nth;
 const volatile u32 dsp_inf_loop_after;
 const volatile u32 dsp_batch;
 const volatile bool highpri_boosting;
 const volatile bool print_shared_dsq;
 const volatile s32 disallow_tgid;
 const volatile bool suppress_dump;
 u64 nr_highpri_queued;
 u32 test_error_cnt;
 UEI_DEFINE(uei);
 struct qmap {
 	__uint(type, BPF_MAP_TYPE_QUEUE);
 	__uint(max_entries, 4096);
 	__type(value, u32);
 } queue0 SEC(".maps"),
  queue1 SEC(".maps"),
  queue2 SEC(".maps"),
  queue3 SEC(".maps"),
  queue4 SEC(".maps");
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
 	__uint(max_entries, 5);
 	__type(key, int);
 	__array(values, struct qmap);
 } queue_arr SEC(".maps") = {
 	.values = {
 		[0] = &queue0,
 		[1] = &queue1,
 		[2] = &queue2,
 		[3] = &queue3,
 		[4] = &queue4,
 	},
 };
 /*
 * If enabled, the CPU performance target is set based on the queue index
 * according to the following table.
 */
 static const u32 qidx_to_cpuperf_target[] = {
 	[0] = SCX_CPUPERF_ONE * 0 / 4,
 	[1] = SCX_CPUPERF_ONE * 1 / 4,
 	[2] = SCX_CPUPERF_ONE * 2 / 4,
 	[3] = SCX_CPUPERF_ONE * 3 / 4,
 	[4] = SCX_CPUPERF_ONE * 4 / 4,
 };
 /*
 * Per-queue sequence numbers to implement core-sched ordering.
 *
 * Tail seq is assigned to each queued task and incremented. Head seq tracks the
 * sequence number of the latest dispatched task. The distance between a
 * task's seq and the associated queue's head seq is called the queue distance
 * and used when comparing two tasks for ordering. See qmap_core_sched_before().
 */
 static u64 core_sched_head_seqs[5];
 static u64 core_sched_tail_seqs[5];
 /* Per-task scheduling context */
 struct task_ctx {
 	bool	force_local;	/* Dispatch directly to local_dsq */
 	bool	highpri;
 	u64	core_sched_seq;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
 	__type(key, int);
 	__type(value, struct task_ctx);
 } task_ctx_stor SEC(".maps");
 struct cpu_ctx {
 	u64	dsp_idx;	/* dispatch index */
 	u64	dsp_cnt;	/* remaining count */
 	u32	avg_weight;
 	u32	cpuperf_target;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 	__uint(max_entries, 1);
 	__type(key, u32);
 	__type(value, struct cpu_ctx);
 } cpu_ctx_stor SEC(".maps");
 /* Statistics */
 u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
 u64 nr_core_sched_execed;
 u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
 u32 cpuperf_min, cpuperf_avg, cpuperf_max;
 u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
 {
 	s32 cpu;
 	if (p->nr_cpus_allowed == 1 ||
 	    scx_bpf_test_and_clear_cpu_idle(prev_cpu))
 		return prev_cpu;
 	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
 	if (cpu >= 0)
 		return cpu;
 	return -1;
 }
 static struct task_ctx *lookup_task_ctx(struct task_struct *p)
 {
 	struct task_ctx *tctx;
 	if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) {
 		scx_bpf_error("task_ctx lookup failed");
 		return NULL;
 	}
 	return tctx;
 }
 s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	struct task_ctx *tctx;
 	s32 cpu;
 	if (!(tctx = lookup_task_ctx(p)))
 		return -ESRCH;
 	cpu = pick_direct_dispatch_cpu(p, prev_cpu);
 	if (cpu >= 0) {
 		tctx->force_local = true;
 		return cpu;
 	} else {
 		return prev_cpu;
 	}
 }
 static int weight_to_idx(u32 weight)
 {
 	/* Coarsely map the compound weight to a FIFO. */
 	if (weight <= 25)
 		return 0;
 	else if (weight <= 50)
 		return 1;
 	else if (weight < 200)
 		return 2;
 	else if (weight < 400)
 		return 3;
 	else
 		return 4;
 }
 void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 {
 	static u32 user_cnt, kernel_cnt;
 	struct task_ctx *tctx;
 	u32 pid = p->pid;
 	int idx = weight_to_idx(p->scx.weight);
 	void *ring;
 	s32 cpu;
 	if (p->flags & PF_KTHREAD) {
 		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
 			return;
 	} else {
 		if (stall_user_nth && !(++user_cnt % stall_user_nth))
 			return;
 	}
 	if (test_error_cnt && !--test_error_cnt)
 		scx_bpf_error("test triggering error");
 	if (!(tctx = lookup_task_ctx(p)))
 		return;
 	/*
 	 * All enqueued tasks must have their core_sched_seq updated for correct
 	 * core-sched ordering. Also, take a look at the end of qmap_dispatch().
 	 */
 	tctx->core_sched_seq = core_sched_tail_seqs[idx]++;
 	/*
 	 * If qmap_select_cpu() is telling us to, enqueue locally.
 	 */
 	if (tctx->force_local) {
 		tctx->force_local = false;
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
 		return;
 	}
 	/* if !WAKEUP, select_cpu() wasn't called, try direct dispatch */
 	if (!(enq_flags & SCX_ENQ_WAKEUP) &&
 	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
 		__sync_fetch_and_add(&nr_ddsp_from_enq, 1);
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
 		return;
 	}
 	/*
 	 * If the task was re-enqueued due to the CPU being preempted by a
 	 * higher priority scheduling class, just re-enqueue the task directly
 	 * on the global DSQ. As we want another CPU to pick it up, find and
 	 * kick an idle CPU.
 	 */
 	if (enq_flags & SCX_ENQ_REENQ) {
 		s32 cpu;
 		scx_bpf_dispatch(p, SHARED_DSQ, 0, enq_flags);
 		cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
 		if (cpu >= 0)
 			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
 		return;
 	}
 	ring = bpf_map_lookup_elem(&queue_arr, &idx);
 	if (!ring) {
 		scx_bpf_error("failed to find ring %d", idx);
 		return;
 	}
 	/* Queue on the selected FIFO. If the FIFO overflows, punt to global. */
 	if (bpf_map_push_elem(ring, &pid, 0)) {
 		scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, enq_flags);
 		return;
 	}
 	if (highpri_boosting && p->scx.weight >= HIGHPRI_WEIGHT) {
 		tctx->highpri = true;
 		__sync_fetch_and_add(&nr_highpri_queued, 1);
 	}
 	__sync_fetch_and_add(&nr_enqueued, 1);
 }
 /*
 * The BPF queue map doesn't support removal and sched_ext can handle spurious
 * dispatches. qmap_dequeue() is only used to collect statistics.
 */
 void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
 {
 	__sync_fetch_and_add(&nr_dequeued, 1);
 	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
 		__sync_fetch_and_add(&nr_core_sched_execed, 1);
 }
 static void update_core_sched_head_seq(struct task_struct *p)
 {
 	int idx = weight_to_idx(p->scx.weight);
 	struct task_ctx *tctx;
 	if ((tctx = lookup_task_ctx(p)))
 		core_sched_head_seqs[idx] = tctx->core_sched_seq;
 }
 /*
 * To demonstrate the use of scx_bpf_dispatch_from_dsq(), implement a silly
 * selective priority boosting mechanism by scanning SHARED_DSQ looking for
 * highpri tasks, moving them to HIGHPRI_DSQ and then consuming them first. This
 * makes a minor difference, and only when dsp_batch is larger than 1.
 *
 * scx_bpf_dispatch[_vtime]_from_dsq() are allowed both from ops.dispatch() and
 * non-rq-lock holding BPF programs. As demonstration, this function is called
 * from qmap_dispatch() and monitor_timerfn().
 */
 static bool dispatch_highpri(bool from_timer)
 {
 	struct task_struct *p;
 	s32 this_cpu = bpf_get_smp_processor_id();
 	/* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
 	bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
 		static u64 highpri_seq;
 		struct task_ctx *tctx;
 		if (!(tctx = lookup_task_ctx(p)))
 			return false;
 		if (tctx->highpri) {
 			/* exercise the set_*() and vtime interface too */
 			scx_bpf_dispatch_from_dsq_set_slice(
 				BPF_FOR_EACH_ITER, slice_ns * 2);
 			scx_bpf_dispatch_from_dsq_set_vtime(
 				BPF_FOR_EACH_ITER, highpri_seq++);
 			scx_bpf_dispatch_vtime_from_dsq(
 				BPF_FOR_EACH_ITER, p, HIGHPRI_DSQ, 0);
 		}
 	}
 	/*
 	 * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
 	 * is found.
 	 */
 	bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
 		bool dispatched = false;
 		s32 cpu;
 		if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
 			cpu = this_cpu;
 		else
 			cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
 		if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
 					      SCX_DSQ_LOCAL_ON | cpu,
 					      SCX_ENQ_PREEMPT)) {
 			if (cpu == this_cpu) {
 				dispatched = true;
 				__sync_fetch_and_add(&nr_expedited_local, 1);
 			} else {
 				__sync_fetch_and_add(&nr_expedited_remote, 1);
 			}
 			if (from_timer)
 				__sync_fetch_and_add(&nr_expedited_from_timer, 1);
 		} else {
 			__sync_fetch_and_add(&nr_expedited_lost, 1);
 		}
 		if (dispatched)
 			return true;
 	}
 	return false;
 }
 void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct cpu_ctx *cpuc;
 	struct task_ctx *tctx;
 	u32 zero = 0, batch = dsp_batch ?: 1;
 	void *fifo;
 	s32 i, pid;
 	if (dispatch_highpri(false))
 		return;
 	if (!nr_highpri_queued && scx_bpf_consume(SHARED_DSQ))
 		return;
 	if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
 		/*
 		 * PID 2 should be kthreadd which should mostly be idle and off
 		 * the scheduler. Let's keep dispatching it to force the kernel
 		 * to call this function over and over again.
 		 */
 		p = bpf_task_from_pid(2);
 		if (p) {
 			scx_bpf_dispatch(p, SCX_DSQ_LOCAL, slice_ns, 0);
 			bpf_task_release(p);
 			return;
 		}
 	}
 	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
 		scx_bpf_error("failed to look up cpu_ctx");
 		return;
 	}
 	for (i = 0; i < 5; i++) {
 		/* Advance the dispatch cursor and pick the fifo. */
 		if (!cpuc->dsp_cnt) {
 			cpuc->dsp_idx = (cpuc->dsp_idx + 1) % 5;
 			cpuc->dsp_cnt = 1 << cpuc->dsp_idx;
 		}
 		fifo = bpf_map_lookup_elem(&queue_arr, &cpuc->dsp_idx);
 		if (!fifo) {
 			scx_bpf_error("failed to find ring %llu", cpuc->dsp_idx);
 			return;
 		}
 		/* Dispatch or advance. */
 		bpf_repeat(BPF_MAX_LOOPS) {
 			struct task_ctx *tctx;
 			if (bpf_map_pop_elem(fifo, &pid))
 				break;
 			p = bpf_task_from_pid(pid);
 			if (!p)
 				continue;
 			if (!(tctx = lookup_task_ctx(p))) {
 				bpf_task_release(p);
 				return;
 			}
 			if (tctx->highpri)
 				__sync_fetch_and_sub(&nr_highpri_queued, 1);
 			update_core_sched_head_seq(p);
 			__sync_fetch_and_add(&nr_dispatched, 1);
 			scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
 			bpf_task_release(p);
 			batch--;
 			cpuc->dsp_cnt--;
 			if (!batch || !scx_bpf_dispatch_nr_slots()) {
 				if (dispatch_highpri(false))
 					return;
 				scx_bpf_consume(SHARED_DSQ);
 				return;
 			}
 			if (!cpuc->dsp_cnt)
 				break;
 		}
 		cpuc->dsp_cnt = 0;
 	}
 	/*
 	 * No other tasks. @prev will keep running. Update its core_sched_seq as
 	 * if the task were enqueued and dispatched immediately.
 	 */
 	if (prev) {
 		tctx = bpf_task_storage_get(&task_ctx_stor, prev, 0, 0);
 		if (!tctx) {
 			scx_bpf_error("task_ctx lookup failed");
 			return;
 		}
 		tctx->core_sched_seq =
 			core_sched_tail_seqs[weight_to_idx(prev->scx.weight)]++;
 	}
 }
 void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
 {
 	struct cpu_ctx *cpuc;
 	u32 zero = 0;
 	int idx;
 	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
 		scx_bpf_error("failed to look up cpu_ctx");
 		return;
 	}
 	/*
 	 * Use the running avg of weights to select the target cpuperf level.
 	 * This is a demonstration of the cpuperf feature rather than a
 	 * practical strategy to regulate CPU frequency.
 	 */
 	cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
 	idx = weight_to_idx(cpuc->avg_weight);
 	cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
 	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
 }
 /*
 * The distance from the head of the queue scaled by the weight of the queue.
 * The lower the number, the older the task and the higher the priority.
 */
 static s64 task_qdist(struct task_struct *p)
 {
 	int idx = weight_to_idx(p->scx.weight);
 	struct task_ctx *tctx;
 	s64 qdist;
 	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
 	if (!tctx) {
 		scx_bpf_error("task_ctx lookup failed");
 		return 0;
 	}
 	qdist = tctx->core_sched_seq - core_sched_head_seqs[idx];
 	/*
 	 * As the queue index increments, the priority doubles. The queue w/
 	 * index 3 is dispatched twice as frequently as the one w/ index 2.
 	 * Reflect the difference by scaling qdists accordingly. Note that the
 	 * shift amount needs to be flipped depending on the sign to avoid
 	 * flipping priority direction.
 	 */
 	if (qdist >= 0)
 		return qdist << (4 - idx);
 	else
 		return qdist << idx;
 }
 /*
 * This is called to determine the task ordering when core-sched is picking
 * tasks to execute on SMT siblings and should encode about the same ordering as
 * the regular scheduling path. Use the priority-scaled distances from the head
 * of the queues to compare the two tasks which should be consistent with the
 * dispatch path behavior.
 */
 bool BPF_STRUCT_OPS(qmap_core_sched_before,
 		    struct task_struct *a, struct task_struct *b)
 {
 	return task_qdist(a) > task_qdist(b);
 }
 void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
 {
 	u32 cnt;
 	/*
 	 * Called when @cpu is taken by a higher priority scheduling class. This
 	 * makes @cpu no longer available for executing sched_ext tasks. As we
 	 * don't want the tasks in @cpu's local dsq to sit there until @cpu
 	 * becomes available again, re-enqueue them into the global dsq. See
 	 * %SCX_ENQ_REENQ handling in qmap_enqueue().
 	 */
 	cnt = scx_bpf_reenqueue_local();
 	if (cnt)
 		__sync_fetch_and_add(&nr_reenqueued, cnt);
 }
 s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
 		   struct scx_init_task_args *args)
 {
 	if (p->tgid == disallow_tgid)
 		p->scx.disallow = true;
 	/*
 	 * @p is new. Let's ensure that its task_ctx is available. We can sleep
 	 * in this function and the following will automatically use GFP_KERNEL.
 	 */
 	if (bpf_task_storage_get(&task_ctx_stor, p, 0,
 				 BPF_LOCAL_STORAGE_GET_F_CREATE))
 		return 0;
 	else
 		return -ENOMEM;
 }
 void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
 {
 	s32 i, pid;
 	if (suppress_dump)
 		return;
 	bpf_for(i, 0, 5) {
 		void *fifo;
 		if (!(fifo = bpf_map_lookup_elem(&queue_arr, &i)))
 			return;
 		scx_bpf_dump("QMAP FIFO[%d]:", i);
 		bpf_repeat(4096) {
 			if (bpf_map_pop_elem(fifo, &pid))
 				break;
 			scx_bpf_dump(" %d", pid);
 		}
 		scx_bpf_dump("\n");
 	}
 }
 void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle)
 {
 	u32 zero = 0;
 	struct cpu_ctx *cpuc;
 	if (suppress_dump || idle)
 		return;
 	if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
 		return;
 	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
 		     cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
 		     cpuc->cpuperf_target);
 }
 void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
 {
 	struct task_ctx *taskc;
 	if (suppress_dump)
 		return;
 	if (!(taskc = bpf_task_storage_get(&task_ctx_stor, p, 0, 0)))
 		return;
 	scx_bpf_dump("QMAP: force_local=%d core_sched_seq=%llu",
 		     taskc->force_local, taskc->core_sched_seq);
 }
 /*
 * Print out the online and possible CPU map using bpf_printk() as a
 * demonstration of using the cpumask kfuncs and ops.cpu_on/offline().
 */
 static void print_cpus(void)
 {
 	const struct cpumask *possible, *online;
 	s32 cpu;
 	char buf[128] = "", *p;
 	int idx;
 	possible = scx_bpf_get_possible_cpumask();
 	online = scx_bpf_get_online_cpumask();
 	idx = 0;
 	bpf_for(cpu, 0, scx_bpf_nr_cpu_ids()) {
 		if (!(p = MEMBER_VPTR(buf, [idx++])))
 			break;
 		if (bpf_cpumask_test_cpu(cpu, online))
 			*p++ = 'O';
 		else if (bpf_cpumask_test_cpu(cpu, possible))
 			*p++ = 'X';
 		else
 			*p++ = ' ';
 		if ((cpu & 7) == 7) {
 			if (!(p = MEMBER_VPTR(buf, [idx++])))
 				break;
 			*p++ = '|';
 		}
 	}
 	buf[sizeof(buf) - 1] = '\0';
 	scx_bpf_put_cpumask(online);
 	scx_bpf_put_cpumask(possible);
 	bpf_printk("CPUS: |%s", buf);
 }
 void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
 {
 	bpf_printk("CPU %d coming online", cpu);
 	/* @cpu is already online at this point */
 	print_cpus();
 }
 void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
 {
 	bpf_printk("CPU %d going offline", cpu);
 	/* @cpu is still online at this point */
 	print_cpus();
 }
 struct monitor_timer {
 	struct bpf_timer timer;
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
 	__uint(max_entries, 1);
 	__type(key, u32);
 	__type(value, struct monitor_timer);
 } monitor_timer SEC(".maps");
 /*
 * Print out the min, avg and max performance levels of CPUs every second to
 * demonstrate the cpuperf interface.
 */
 static void monitor_cpuperf(void)
 {
 	u32 zero = 0, nr_cpu_ids;
 	u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
 	u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
 	const struct cpumask *online;
 	int i, nr_online_cpus = 0;
 	nr_cpu_ids = scx_bpf_nr_cpu_ids();
 	online = scx_bpf_get_online_cpumask();
 	bpf_for(i, 0, nr_cpu_ids) {
 		struct cpu_ctx *cpuc;
 		u32 cap, cur;
 		if (!bpf_cpumask_test_cpu(i, online))
 			continue;
 		nr_online_cpus++;
 		/* collect the capacity and current cpuperf */
 		cap = scx_bpf_cpuperf_cap(i);
 		cur = scx_bpf_cpuperf_cur(i);
 		cur_min = cur < cur_min ? cur : cur_min;
 		cur_max = cur > cur_max ? cur : cur_max;
 		/*
 		 * $cur is relative to $cap. Scale it down accordingly so that
 		 * it's in the same scale as other CPUs and $cur_sum/$cap_sum
 		 * makes sense.
 		 */
 		cur_sum += cur * cap / SCX_CPUPERF_ONE;
 		cap_sum += cap;
 		if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
 			scx_bpf_error("failed to look up cpu_ctx");
 			goto out;
 		}
 		/* collect target */
 		cur = cpuc->cpuperf_target;
 		target_sum += cur;
 		target_min = cur < target_min ? cur : target_min;
 		target_max = cur > target_max ? cur : target_max;
 	}
 	cpuperf_min = cur_min;
 	cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
 	cpuperf_max = cur_max;
 	cpuperf_target_min = target_min;
 	cpuperf_target_avg = target_sum / nr_online_cpus;
 	cpuperf_target_max = target_max;
 out:
 	scx_bpf_put_cpumask(online);
 }
 /*
 * Dump the currently queued tasks in the shared DSQ to demonstrate the usage of
 * scx_bpf_dsq_nr_queued() and DSQ iterator. Raise the dispatch batch count to
 * see meaningful dumps in the trace pipe.
 */
 static void dump_shared_dsq(void)
 {
 	struct task_struct *p;
 	s32 nr;
 	if (!(nr = scx_bpf_dsq_nr_queued(SHARED_DSQ)))
 		return;
 	bpf_printk("Dumping %d tasks in SHARED_DSQ in reverse order", nr);
 	bpf_rcu_read_lock();
 	bpf_for_each(scx_dsq, p, SHARED_DSQ, SCX_DSQ_ITER_REV)
 		bpf_printk("%s[%d]", p->comm, p->pid);
 	bpf_rcu_read_unlock();
 }
 static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
 {
 	bpf_rcu_read_lock();
 	dispatch_highpri(true);
 	bpf_rcu_read_unlock();
 	monitor_cpuperf();
 	if (print_shared_dsq)
 		dump_shared_dsq();
 	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
 	return 0;
 }
 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 {
 	u32 key = 0;
 	struct bpf_timer *timer;
 	s32 ret;
 	print_cpus();
 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
 	if (ret)
 		return ret;
 	ret = scx_bpf_create_dsq(HIGHPRI_DSQ, -1);
 	if (ret)
 		return ret;
 	timer = bpf_map_lookup_elem(&monitor_timer, &key);
 	if (!timer)
 		return -ESRCH;
 	bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
 	bpf_timer_set_callback(timer, monitor_timerfn);
 	return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
 }
 void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SCX_OPS_DEFINE(qmap_ops,
 	       .select_cpu		= (void *)qmap_select_cpu,
 	       .enqueue			= (void *)qmap_enqueue,
 	       .dequeue			= (void *)qmap_dequeue,
 	       .dispatch		= (void *)qmap_dispatch,
 	       .tick			= (void *)qmap_tick,
 	       .core_sched_before	= (void *)qmap_core_sched_before,
 	       .cpu_release		= (void *)qmap_cpu_release,
 	       .init_task		= (void *)qmap_init_task,
 	       .dump			= (void *)qmap_dump,
 	       .dump_cpu		= (void *)qmap_dump_cpu,
 	       .dump_task		= (void *)qmap_dump_task,
 	       .cpu_online		= (void *)qmap_cpu_online,
 	       .cpu_offline		= (void *)qmap_cpu_offline,
 	       .init			= (void *)qmap_init,
 	       .exit			= (void *)qmap_exit,
 	       .timeout_ms		= 5000U,
 	       .name			= "qmap");
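The averaging in monitor_cpuperf() weights each CPU's current performance level by its capacity before dividing by the summed capacity. Below is a minimal userspace sketch of that arithmetic only; it is not part of the patch. CPUPERF_ONE stands in for SCX_CPUPERF_ONE and its value of 1024 is an assumption, and struct cpu_sample and cpuperf_avg() are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for SCX_CPUPERF_ONE; the kernel defines the real value. */
#define CPUPERF_ONE 1024U

/* Hypothetical per-CPU sample: capacity and current level, both 0..CPUPERF_ONE. */
struct cpu_sample {
	uint32_t cap;	/* capacity relative to the fastest CPU */
	uint32_t cur;	/* current level relative to this CPU's own capacity */
};

/*
 * Capacity-weighted average of the current performance level, following the
 * cur_sum / cap_sum computation in monitor_cpuperf().
 */
static inline uint32_t cpuperf_avg(const struct cpu_sample *s, size_t n)
{
	uint64_t cap_sum = 0, cur_sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* cur is relative to cap, so scale it onto the common scale */
		cur_sum += (uint64_t)s[i].cur * s[i].cap / CPUPERF_ONE;
		cap_sum += s[i].cap;
	}

	/* Guard added for the userspace sketch only; not in the patch. */
	if (!cap_sum)
		return 0;

	return (uint32_t)(cur_sum * CPUPERF_ONE / cap_sum);
}
```

With one full-capacity CPU running at half speed and one half-capacity CPU running at full speed, the sketch returns 682 out of 1024, that is, about two thirds of the machine's total capacity is in use.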
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -0,0 +1,153 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <inttypes.h>
 #include <signal.h>
 #include <libgen.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "scx_qmap.bpf.skel.h"
 const char help_fmt[] =
 "A simple five-level FIFO queue sched_ext scheduler.\n"
 "\n"
 "See the top-level comment in .bpf.c for more details.\n"
 "\n"
 "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
 "       [-P] [-H] [-d PID] [-D LEN] [-S] [-p] [-v]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
 "  -t COUNT      Stall every COUNT'th user thread\n"
 "  -T COUNT      Stall every COUNT'th kernel thread\n"
 "  -l COUNT      Trigger dispatch infinite looping after COUNT dispatches\n"
 "  -b COUNT      Dispatch up to COUNT tasks together\n"
 "  -P            Print out DSQ content to trace_pipe every second, use with -b\n"
 "  -H            Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
 "  -d PID        Disallow a process from switching into SCHED_EXT (-1 for self)\n"
 "  -D LEN        Set scx_exit_info.dump buffer length\n"
 "  -S            Suppress qmap-specific debug dump\n"
 "  -p            Switch only tasks on SCHED_EXT policy instead of all\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 static bool verbose;
 static volatile int exit_req;
 static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
 {
 	if (level == LIBBPF_DEBUG && !verbose)
 		return 0;
 	return vfprintf(stderr, format, args);
 }
 static void sigint_handler(int dummy)
 {
 	exit_req = 1;
 }
 int main(int argc, char **argv)
 {
 	struct scx_qmap *skel;
 	struct bpf_link *link;
 	int opt;
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
 	skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
 	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
 		switch (opt) {
 		case 's':
 			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
 			break;
 		case 'e':
 			skel->bss->test_error_cnt = strtoul(optarg, NULL, 0);
 			break;
 		case 't':
 			skel->rodata->stall_user_nth = strtoul(optarg, NULL, 0);
 			break;
 		case 'T':
 			skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
 			break;
 		case 'l':
 			skel->rodata->dsp_inf_loop_after = strtoul(optarg, NULL, 0);
 			break;
 		case 'b':
 			skel->rodata->dsp_batch = strtoul(optarg, NULL, 0);
 			break;
 		case 'P':
 			skel->rodata->print_shared_dsq = true;
 			break;
 		case 'H':
 			skel->rodata->highpri_boosting = true;
 			break;
 		case 'd':
 			skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
 			if (skel->rodata->disallow_tgid < 0)
 				skel->rodata->disallow_tgid = getpid();
 			break;
 		case 'D':
 			skel->struct_ops.qmap_ops->exit_dump_len = strtoul(optarg, NULL, 0);
 			break;
 		case 'S':
 			skel->rodata->suppress_dump = true;
 			break;
 		case 'p':
 			skel->struct_ops.qmap_ops->flags |= SCX_OPS_SWITCH_PARTIAL;
 			break;
 		case 'v':
 			verbose = true;
 			break;
 		default:
 			fprintf(stderr, help_fmt, basename(argv[0]));
 			return opt != 'h';
 		}
 	}
 	SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei);
 	link = SCX_OPS_ATTACH(skel, qmap_ops, scx_qmap);
 	while (!exit_req && !UEI_EXITED(skel, uei)) {
 		long nr_enqueued = skel->bss->nr_enqueued;
 		long nr_dispatched = skel->bss->nr_dispatched;
 		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
 		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
 		       skel->bss->nr_core_sched_execed,
 		       skel->bss->nr_ddsp_from_enq);
 		printf("         exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
 		       skel->bss->nr_expedited_local,
 		       skel->bss->nr_expedited_remote,
 		       skel->bss->nr_expedited_from_timer,
 		       skel->bss->nr_expedited_lost);
 		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
 			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
 			       skel->bss->cpuperf_min,
 			       skel->bss->cpuperf_avg,
 			       skel->bss->cpuperf_max,
 			       skel->bss->cpuperf_target_min,
 			       skel->bss->cpuperf_target_avg,
 			       skel->bss->cpuperf_target_max);
 		fflush(stdout);
 		sleep(1);
 	}
 	bpf_link__destroy(link);
 	UEI_REPORT(skel, uei);
 	scx_qmap__destroy(skel);
 	/*
 	 * scx_qmap implements ops.cpu_on/offline() and doesn't need to restart
 	 * on CPU hotplug events.
 	 */
 	return 0;
 }
--- a/tools/sched_ext/scx_show_state.py
+++ b/tools/sched_ext/scx_show_state.py
@@ -0,0 +1,39 @@
 #!/usr/bin/env drgn
 #
 # Copyright (C) 2024 Tejun Heo <tj@kernel.org>
 # Copyright (C) 2024 Meta Platforms, Inc. and affiliates.
 desc = """
 This is a drgn script to show the current sched_ext state.
 For more info on drgn, visit https://github.com/osandov/drgn.
 """
 import drgn
 import sys
 def err(s):
    print(s, file=sys.stderr, flush=True)
    sys.exit(1)
 def read_int(name):
    return int(prog[name].value_())
 def read_atomic(name):
    return prog[name].counter.value_()
 def read_static_key(name):
    return prog[name].key.enabled.counter.value_()
 def ops_state_str(state):
    return prog['scx_ops_enable_state_str'][state].string_().decode()
 ops = prog['scx_ops']
 enable_state = read_atomic("scx_ops_enable_state_var")
 print(f'ops           : {ops.name.string_().decode()}')
 print(f'enabled       : {read_static_key("__scx_ops_enabled")}')
 print(f'switching_all : {read_int("scx_switching_all")}')
 print(f'switched_all  : {read_static_key("__scx_switched_all")}')
 print(f'enable_state  : {ops_state_str(enable_state)} ({enable_state})')
 print(f'bypass_depth  : {read_atomic("scx_ops_bypass_depth")}')
 print(f'nr_rejected   : {read_atomic("scx_nr_rejected")}')
--- a/tools/sched_ext/scx_simple.bpf.c
+++ b/tools/sched_ext/scx_simple.bpf.c
@@ -0,0 +1,156 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A simple scheduler.
 *
 * By default, it operates as a simple global weighted vtime scheduler and can
 * be switched to FIFO scheduling. It also demonstrates the following niceties.
 *
 * - Statistics tracking how many tasks are queued to local and global dsq's.
 * - Termination notification for userspace.
 *
 * While very simple, this scheduler should work reasonably well on CPUs with a
 * uniform L3 cache topology. While preemption is not implemented, the fact that
 * the scheduling queue is shared across all CPUs means that whatever is at the
 * front of the queue is likely to be executed fairly quickly given enough
 * CPUs. The FIFO scheduling mode may be beneficial to some workloads
 * but comes with the usual problems with FIFO scheduling where saturating
 * threads can easily drown out interactive ones.
 *
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 const volatile bool fifo_sched;
 static u64 vtime_now;
 UEI_DEFINE(uei);
 /*
 * Built-in DSQs such as SCX_DSQ_GLOBAL cannot be used as priority queues
 * (meaning, cannot be dispatched to with scx_bpf_dispatch_vtime()). We
 * therefore create a separate DSQ with ID 0 that we dispatch to and consume
 * from. If scx_simple only supported global FIFO scheduling, then we could
 * just use SCX_DSQ_GLOBAL.
 */
 #define SHARED_DSQ 0
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 	__uint(key_size, sizeof(u32));
 	__uint(value_size, sizeof(u64));
 	__uint(max_entries, 2);			/* [local, global] */
 } stats SEC(".maps");
 static void stat_inc(u32 idx)
 {
 	u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
 	if (cnt_p)
 		(*cnt_p)++;
 }
 static inline bool vtime_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
 s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
 {
 	bool is_idle = false;
 	s32 cpu;
 	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
 	if (is_idle) {
 		stat_inc(0);	/* count local queueing */
 		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
 	}
 	return cpu;
 }
 void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
 {
 	stat_inc(1);	/* count global queueing */
 	if (fifo_sched) {
 		scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
 	} else {
 		u64 vtime = p->scx.dsq_vtime;
 		/*
 		 * Limit the amount of budget that an idling task can accumulate
 		 * to one slice.
 		 */
 		if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
 			vtime = vtime_now - SCX_SLICE_DFL;
 		scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
 				       enq_flags);
 	}
 }
 void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
 {
 	scx_bpf_consume(SHARED_DSQ);
 }
 void BPF_STRUCT_OPS(simple_running, struct task_struct *p)
 {
 	if (fifo_sched)
 		return;
 	/*
 	 * Global vtime always progresses forward as tasks start executing. The
 	 * test and update can be performed concurrently from multiple CPUs and
 	 * are thus racy. Any error should be contained and temporary. Let's just
 	 * live with it.
 	 */
 	if (vtime_before(vtime_now, p->scx.dsq_vtime))
 		vtime_now = p->scx.dsq_vtime;
 }
 void BPF_STRUCT_OPS(simple_stopping, struct task_struct *p, bool runnable)
 {
 	if (fifo_sched)
 		return;
 	/*
 	 * Scale the execution time by the inverse of the weight and charge.
 	 *
 	 * Note that the default yield implementation yields by setting
 	 * @p->scx.slice to zero and the following would treat the yielding task
 	 * as if it has consumed all its slice. If this penalizes yielding tasks
 	 * too much, determine the execution time by taking explicit timestamps
 	 * instead of depending on @p->scx.slice.
 	 */
 	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
 }
 void BPF_STRUCT_OPS(simple_enable, struct task_struct *p)
 {
 	p->scx.dsq_vtime = vtime_now;
 }
 s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
 {
 	return scx_bpf_create_dsq(SHARED_DSQ, -1);
 }
 void BPF_STRUCT_OPS(simple_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SCX_OPS_DEFINE(simple_ops,
 	       .select_cpu		= (void *)simple_select_cpu,
 	       .enqueue			= (void *)simple_enqueue,
 	       .dispatch		= (void *)simple_dispatch,
 	       .running			= (void *)simple_running,
 	       .stopping		= (void *)simple_stopping,
 	       .enable			= (void *)simple_enable,
 	       .init			= (void *)simple_init,
 	       .exit			= (void *)simple_exit,
 	       .name			= "simple");
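The vtime arithmetic used by scx_simple can be exercised outside the kernel. The following is a minimal userspace sketch, not part of the patch. SLICE_DFL_NS is a stand-in for SCX_SLICE_DFL and its 20 ms value is an assumption; charge_vtime() and clamp_vtime() are hypothetical helpers that mirror the expressions in simple_stopping() and simple_enqueue().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the kernel defines SCX_SLICE_DFL itself. */
#define SLICE_DFL_NS (20ULL * 1000 * 1000)

/*
 * Wraparound-safe "a is earlier than b": the unsigned difference is
 * reinterpreted as signed, so the comparison still holds after the
 * 64-bit counter wraps. Same expression as vtime_before() in the patch.
 */
static inline bool vtime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Hypothetical helper mirroring simple_stopping(): charge the consumed part
 * of the slice, scaled by the inverse of the weight (100 is the default).
 */
static inline uint64_t charge_vtime(uint64_t vtime, uint64_t slice_left,
				    uint32_t weight)
{
	return vtime + (SLICE_DFL_NS - slice_left) * 100 / weight;
}

/*
 * Hypothetical helper mirroring simple_enqueue(): a task that idled for a
 * long time may lag the global vtime by at most one slice.
 */
static inline uint64_t clamp_vtime(uint64_t vtime, uint64_t vtime_now)
{
	if (vtime_before(vtime, vtime_now - SLICE_DFL_NS))
		return vtime_now - SLICE_DFL_NS;
	return vtime;
}
```

A task with weight 200 that used its whole slice is charged half as much vtime as a weight-100 task, which is how heavier tasks end up running more often under the vtime ordering.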
--- a/tools/sched_ext/scx_simple.c
+++ b/tools/sched_ext/scx_simple.c
@@ -0,0 +1,107 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2022 David Vernet <dvernet@meta.com>
 */
 #include <stdio.h>
 #include <unistd.h>
 #include <signal.h>
 #include <libgen.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "scx_simple.bpf.skel.h"
 const char help_fmt[] =
 "A simple sched_ext scheduler.\n"
 "\n"
 "See the top-level comment in .bpf.c for more details.\n"
 "\n"
 "Usage: %s [-f] [-v]\n"
 "\n"
 "  -f            Use FIFO scheduling instead of weighted vtime scheduling\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 static bool verbose;
 static volatile int exit_req;
 static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
 {
 	if (level == LIBBPF_DEBUG && !verbose)
 		return 0;
 	return vfprintf(stderr, format, args);
 }
 static void sigint_handler(int simple)
 {
 	exit_req = 1;
 }
 static void read_stats(struct scx_simple *skel, __u64 *stats)
 {
 	int nr_cpus = libbpf_num_possible_cpus();
 	__u64 cnts[2][nr_cpus];
 	__u32 idx;
 	memset(stats, 0, sizeof(stats[0]) * 2);
 	for (idx = 0; idx < 2; idx++) {
 		int ret, cpu;
 		ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
 					  &idx, cnts[idx]);
 		if (ret < 0)
 			continue;
 		for (cpu = 0; cpu < nr_cpus; cpu++)
 			stats[idx] += cnts[idx][cpu];
 	}
 }
 int main(int argc, char **argv)
 {
 	struct scx_simple *skel;
 	struct bpf_link *link;
 	__u32 opt;
 	__u64 ecode;
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
 restart:
 	skel = SCX_OPS_OPEN(simple_ops, scx_simple);
 	while ((opt = getopt(argc, argv, "fvh")) != -1) {
 		switch (opt) {
 		case 'f':
 			skel->rodata->fifo_sched = true;
 			break;
 		case 'v':
 			verbose = true;
 			break;
 		default:
 			fprintf(stderr, help_fmt, basename(argv[0]));
 			return opt != 'h';
 		}
 	}
 	SCX_OPS_LOAD(skel, simple_ops, scx_simple, uei);
 	link = SCX_OPS_ATTACH(skel, simple_ops, scx_simple);
 	while (!exit_req && !UEI_EXITED(skel, uei)) {
 		__u64 stats[2];
 		read_stats(skel, stats);
 		printf("local=%llu global=%llu\n", stats[0], stats[1]);
 		fflush(stdout);
 		sleep(1);
 	}
 	bpf_link__destroy(link);
 	ecode = UEI_REPORT(skel, uei);
 	scx_simple__destroy(skel);
 	if (UEI_ECODE_RESTART(ecode))
 		goto restart;
 	return 0;
 }
--- a/tools/testing/selftests/sched_ext/.gitignore
+++ b/tools/testing/selftests/sched_ext/.gitignore
@@ -0,0 +1,6 @@
 *
 !*.c
 !*.h
 !Makefile
 !.gitignore
 !config
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -0,0 +1,218 @@
 # SPDX-License-Identifier: GPL-2.0
 # Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
 include ../../../build/Build.include
 include ../../../scripts/Makefile.arch
 include ../../../scripts/Makefile.include
 include ../lib.mk
 ifneq ($(LLVM),)
 ifneq ($(filter %/,$(LLVM)),)
 LLVM_PREFIX := $(LLVM)
 else ifneq ($(filter -%,$(LLVM)),)
 LLVM_SUFFIX := $(LLVM)
 endif
 CC := $(LLVM_PREFIX)clang$(LLVM_SUFFIX) $(CLANG_FLAGS) -fintegrated-as
 else
 CC := gcc
 endif # LLVM
 ifneq ($(CROSS_COMPILE),)
 $(error CROSS_COMPILE not supported for scx selftests)
 endif # CROSS_COMPILE
 CURDIR := $(abspath .)
 REPOROOT := $(abspath ../../../..)
 TOOLSDIR := $(REPOROOT)/tools
 LIBDIR := $(TOOLSDIR)/lib
 BPFDIR := $(LIBDIR)/bpf
 TOOLSINCDIR := $(TOOLSDIR)/include
 BPFTOOLDIR := $(TOOLSDIR)/bpf/bpftool
 APIDIR := $(TOOLSINCDIR)/uapi
 GENDIR := $(REPOROOT)/include/generated
 GENHDR := $(GENDIR)/autoconf.h
 SCXTOOLSDIR := $(TOOLSDIR)/sched_ext
 SCXTOOLSINCDIR := $(TOOLSDIR)/sched_ext/include
 OUTPUT_DIR := $(CURDIR)/build
 OBJ_DIR := $(OUTPUT_DIR)/obj
 INCLUDE_DIR := $(OUTPUT_DIR)/include
 BPFOBJ_DIR := $(OBJ_DIR)/libbpf
 SCXOBJ_DIR := $(OBJ_DIR)/sched_ext
 BPFOBJ := $(BPFOBJ_DIR)/libbpf.a
 LIBBPF_OUTPUT := $(OBJ_DIR)/libbpf/libbpf.a
 DEFAULT_BPFTOOL := $(OUTPUT_DIR)/sbin/bpftool
 HOST_BUILD_DIR := $(OBJ_DIR)
 HOST_OUTPUT_DIR := $(OUTPUT_DIR)
 VMLINUX_BTF_PATHS ?= ../../../../vmlinux					\
 		     /sys/kernel/btf/vmlinux					\
 		     /boot/vmlinux-$(shell uname -r)
 VMLINUX_BTF ?= $(abspath $(firstword $(wildcard $(VMLINUX_BTF_PATHS))))
 ifeq ($(VMLINUX_BTF),)
 $(error Cannot find a vmlinux for VMLINUX_BTF at any of "$(VMLINUX_BTF_PATHS)")
 endif
 BPFTOOL ?= $(DEFAULT_BPFTOOL)
 ifneq ($(wildcard $(GENHDR)),)
  GENFLAGS := -DHAVE_GENHDR
 endif
 CFLAGS += -g -O2 -rdynamic -pthread -Wall -Werror $(GENFLAGS)			\
 	  -I$(INCLUDE_DIR) -I$(GENDIR) -I$(LIBDIR)				\
 	  -I$(TOOLSINCDIR) -I$(APIDIR) -I$(CURDIR)/include -I$(SCXTOOLSINCDIR)
 # Silence some warnings when compiled with clang
 ifneq ($(LLVM),)
 CFLAGS += -Wno-unused-command-line-argument
 endif
 LDFLAGS = -lelf -lz -lpthread -lzstd
 IS_LITTLE_ENDIAN = $(shell $(CC) -dM -E - </dev/null |				\
 			grep 'define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__')
 # Get Clang's default includes on this system, as opposed to those seen by
 # '-target bpf'. This fixes "missing" files on some architectures/distros,
 # such as asm/byteorder.h, asm/socket.h, asm/sockios.h, sys/cdefs.h etc.
 #
 # Use '-idirafter': Don't interfere with include mechanics except where the
 # build would have failed anyways.
 define get_sys_includes
 $(shell $(1) -v -E - </dev/null 2>&1 \
 	| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') \
 $(shell $(1) -dM -E - </dev/null | grep '__riscv_xlen ' | awk '{printf("-D__riscv_xlen=%d -D__BITS_PER_LONG=%d", $$3, $$3)}')
 endef
 BPF_CFLAGS = -g -D__TARGET_ARCH_$(SRCARCH)					\
 	     $(if $(IS_LITTLE_ENDIAN),-mlittle-endian,-mbig-endian)		\
 	     -I$(CURDIR)/include -I$(CURDIR)/include/bpf-compat			\
 	     -I$(INCLUDE_DIR) -I$(APIDIR) -I$(SCXTOOLSINCDIR)			\
 	     -I$(REPOROOT)/include						\
 	     $(call get_sys_includes,$(CLANG))					\
 	     -Wall -Wno-compare-distinct-pointer-types				\
 	     -Wno-incompatible-function-pointer-types				\
 	     -O2 -mcpu=v3
 # sort removes libbpf duplicates when not cross-building
 MAKE_DIRS := $(sort $(OBJ_DIR)/libbpf $(OBJ_DIR)/libbpf				\
 	       $(OBJ_DIR)/bpftool $(OBJ_DIR)/resolve_btfids			\
 	       $(INCLUDE_DIR) $(SCXOBJ_DIR))
 $(MAKE_DIRS):
 	$(call msg,MKDIR,,$@)
 	$(Q)mkdir -p $@
 $(BPFOBJ): $(wildcard $(BPFDIR)/*.[ch] $(BPFDIR)/Makefile)			\
 	   $(APIDIR)/linux/bpf.h						\
 	   | $(OBJ_DIR)/libbpf
 	$(Q)$(MAKE) $(submake_extras) -C $(BPFDIR) OUTPUT=$(OBJ_DIR)/libbpf/	\
 		    EXTRA_CFLAGS='-g -O0 -fPIC'					\
 		    DESTDIR=$(OUTPUT_DIR) prefix= all install_headers
 $(DEFAULT_BPFTOOL): $(wildcard $(BPFTOOLDIR)/*.[ch] $(BPFTOOLDIR)/Makefile)	\
 		    $(LIBBPF_OUTPUT) | $(OBJ_DIR)/bpftool
 	$(Q)$(MAKE) $(submake_extras)  -C $(BPFTOOLDIR)				\
 		    ARCH= CROSS_COMPILE= CC=$(HOSTCC) LD=$(HOSTLD)		\
 		    EXTRA_CFLAGS='-g -O0'					\
 		    OUTPUT=$(OBJ_DIR)/bpftool/					\
 		    LIBBPF_OUTPUT=$(OBJ_DIR)/libbpf/				\
 		    LIBBPF_DESTDIR=$(OUTPUT_DIR)/				\
 		    prefix= DESTDIR=$(OUTPUT_DIR)/ install-bin
 $(INCLUDE_DIR)/vmlinux.h: $(VMLINUX_BTF) $(BPFTOOL) | $(INCLUDE_DIR)
 ifeq ($(VMLINUX_H),)
 	$(call msg,GEN,,$@)
 	$(Q)$(BPFTOOL) btf dump file $(VMLINUX_BTF) format c > $@
 else
 	$(call msg,CP,,$@)
 	$(Q)cp "$(VMLINUX_H)" $@
 endif
 $(SCXOBJ_DIR)/%.bpf.o: %.bpf.c $(INCLUDE_DIR)/vmlinux.h	| $(BPFOBJ) $(SCXOBJ_DIR)
 	$(call msg,CLNG-BPF,,$(notdir $@))
 	$(Q)$(CLANG) $(BPF_CFLAGS) -target bpf -c $< -o $@
 $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BPFTOOL) | $(INCLUDE_DIR)
 	$(eval sched=$(notdir $@))
 	$(call msg,GEN-SKEL,,$(sched))
 	$(Q)$(BPFTOOL) gen object $(<:.o=.linked1.o) $<
 	$(Q)$(BPFTOOL) gen object $(<:.o=.linked2.o) $(<:.o=.linked1.o)
 	$(Q)$(BPFTOOL) gen object $(<:.o=.linked3.o) $(<:.o=.linked2.o)
 	$(Q)diff $(<:.o=.linked2.o) $(<:.o=.linked3.o)
 	$(Q)$(BPFTOOL) gen skeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $@
 	$(Q)$(BPFTOOL) gen subskeleton $(<:.o=.linked3.o) name $(subst .bpf.skel.h,,$(sched)) > $(@:.skel.h=.subskel.h)
 ################
 # C schedulers #
 ################
 override define CLEAN
 	rm -rf $(OUTPUT_DIR)
 	rm -f *.o *.bpf.o *.bpf.skel.h *.bpf.subskel.h
 	rm -f $(TEST_GEN_PROGS)
 	rm -f runner
 endef
 # Every testcase takes all of the BPF progs as dependencies by default. This
 # allows testcases to load any BPF scheduler, which is useful for testcases
 # that don't need their own prog to run their test.
 all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubst %.c,%.skel.h,$(prog)))
 auto-test-targets :=			\
 	create_dsq			\
 	enq_last_no_enq_fails		\
 	enq_select_cpu_fails		\
 	ddsp_bogus_dsq_fail		\
 	ddsp_vtimelocal_fail		\
 	dsp_local_on			\
 	exit				\
 	hotplug				\
 	init_enable_count		\
 	maximal				\
 	maybe_null			\
 	minimal				\
 	prog_run			\
 	reload_loop			\
 	select_cpu_dfl			\
 	select_cpu_dfl_nodispatch	\
 	select_cpu_dispatch		\
 	select_cpu_dispatch_bad_dsq	\
 	select_cpu_dispatch_dbl_dsp	\
 	select_cpu_vtime		\
 	test_example			\

 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
 $(SCXOBJ_DIR)/runner.o: runner.c | $(SCXOBJ_DIR)
 	$(CC) $(CFLAGS) -c $< -o $@
 # Create all of the test targets object files, whose testcase objects will be
 # registered into the runner in ELF constructors.
 #
 # Note that we must do double expansion here in order to support conditionally
 # compiling BPF object files only if one is present, as the wildcard Make
 # function doesn't support using implicit rules otherwise.
 $(testcase-targets): $(SCXOBJ_DIR)/%.o: %.c $(SCXOBJ_DIR)/runner.o $(all_test_bpfprogs) | $(SCXOBJ_DIR)
 	$(eval test=$(patsubst %.o,%.c,$(notdir $@)))
 	$(CC) $(CFLAGS) -c $< -o $@ $(SCXOBJ_DIR)/runner.o
 $(SCXOBJ_DIR)/util.o: util.c | $(SCXOBJ_DIR)
 	$(CC) $(CFLAGS) -c $< -o $@
 runner: $(SCXOBJ_DIR)/runner.o $(SCXOBJ_DIR)/util.o $(BPFOBJ) $(testcase-targets)
 	@echo "$(testcase-targets)"
 	$(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS)
 TEST_GEN_PROGS := runner
 all: runner
 .PHONY: all clean help
 .DEFAULT_GOAL := all
 .DELETE_ON_ERROR:
 .SECONDARY:
--- a/tools/testing/selftests/sched_ext/config
+++ b/tools/testing/selftests/sched_ext/config
@@ -0,0 +1,9 @@
 CONFIG_SCHED_DEBUG=y
 CONFIG_SCHED_CLASS_EXT=y
 CONFIG_CGROUPS=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_EXT_GROUP_SCHED=y
 CONFIG_BPF=y
 CONFIG_BPF_SYSCALL=y
 CONFIG_DEBUG_INFO=y
 CONFIG_DEBUG_INFO_BTF=y
--- a/tools/testing/selftests/sched_ext/create_dsq.bpf.c
+++ b/tools/testing/selftests/sched_ext/create_dsq.bpf.c
@@ -0,0 +1,58 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Create and destroy DSQs in a loop.
 *
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 void BPF_STRUCT_OPS(create_dsq_exit_task, struct task_struct *p,
 		    struct scx_exit_task_args *args)
 {
 	scx_bpf_destroy_dsq(p->pid);
 }
 s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init_task, struct task_struct *p,
 			     struct scx_init_task_args *args)
 {
 	s32 err;
 	err = scx_bpf_create_dsq(p->pid, -1);
 	if (err)
 		scx_bpf_error("Failed to create DSQ for %s[%d]",
 			      p->comm, p->pid);
 	return err;
 }
 s32 BPF_STRUCT_OPS_SLEEPABLE(create_dsq_init)
 {
 	u32 i;
 	s32 err;
 	bpf_for(i, 0, 1024) {
 		err = scx_bpf_create_dsq(i, -1);
 		if (err) {
 			scx_bpf_error("Failed to create DSQ %d", i);
 			return 0;
 		}
 	}
 	bpf_for(i, 0, 1024) {
 		scx_bpf_destroy_dsq(i);
 	}
 	return 0;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops create_dsq_ops = {
 	.init_task		= create_dsq_init_task,
 	.exit_task		= create_dsq_exit_task,
 	.init			= create_dsq_init,
 	.name			= "create_dsq",
 };
--- a/tools/testing/selftests/sched_ext/create_dsq.c
+++ b/tools/testing/selftests/sched_ext/create_dsq.c
@@ -0,0 +1,57 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "create_dsq.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct create_dsq *skel;
 	skel = create_dsq__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load skel");
 		return SCX_TEST_FAIL;
 	}
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct create_dsq *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.create_dsq_ops);
 	if (!link) {
 		SCX_ERR("Failed to attach scheduler");
 		return SCX_TEST_FAIL;
 	}
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct create_dsq *skel = ctx;
 	create_dsq__destroy(skel);
 }
 struct scx_test create_dsq = {
 	.name = "create_dsq",
 	.description = "Create and destroy a dsq in a loop",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&create_dsq)
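The selftests Makefile notes that testcase objects are registered into the runner from ELF constructors. The following is a minimal sketch of that pattern, assuming a GCC or Clang toolchain; it is not the code in scx_test.h. struct test, register_test(), REGISTER_TEST() and run_all() are hypothetical names introduced for illustration.

```c
#include <stddef.h>

enum test_status { TEST_PASS, TEST_FAIL };

/* Hypothetical, trimmed-down counterpart of a testcase descriptor. */
struct test {
	const char *name;
	enum test_status (*run)(void *ctx);
};

#define MAX_TESTS 16
static struct test *tests[MAX_TESTS];
static size_t nr_tests;

/* Append a testcase to the runner's table; extra entries are dropped. */
static void register_test(struct test *t)
{
	if (nr_tests < MAX_TESTS)
		tests[nr_tests++] = t;
}

/*
 * Emit a constructor that runs before main() and registers the testcase,
 * so that linking an object file in is enough to add its test.
 */
#define REGISTER_TEST(t)						\
	__attribute__((constructor))					\
	static void register_##t(void) { register_test(&t); }

static enum test_status always_pass_run(void *ctx)
{
	(void)ctx;
	return TEST_PASS;
}

static struct test always_pass = {
	.name = "always_pass",
	.run = always_pass_run,
};
REGISTER_TEST(always_pass)

/* Run every registered testcase and return the number of failures. */
static size_t run_all(void)
{
	size_t i, failed = 0;

	for (i = 0; i < nr_tests; i++)
		if (tests[i]->run(NULL) != TEST_PASS)
			failed++;
	return failed;
}
```

Because registration happens at load time, the runner does not need a central list of testcases; each testcase file only has to be compiled and linked into the runner binary.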
--- a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c
+++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.bpf.c
@@ -0,0 +1,42 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 UEI_DEFINE(uei);
 s32 BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
 	if (cpu >= 0) {
 		/*
 		 * If we dispatch to a bogus DSQ that will fall back to the
 		 * builtin global DSQ, we fail gracefully.
 		 */
 		scx_bpf_dispatch_vtime(p, 0xcafef00d, SCX_SLICE_DFL,
 				       p->scx.dsq_vtime, 0);
 		return cpu;
 	}
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(ddsp_bogus_dsq_fail_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops ddsp_bogus_dsq_fail_ops = {
 	.select_cpu		= ddsp_bogus_dsq_fail_select_cpu,
 	.exit			= ddsp_bogus_dsq_fail_exit,
 	.name			= "ddsp_bogus_dsq_fail",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c
+++ b/tools/testing/selftests/sched_ext/ddsp_bogus_dsq_fail.c
@@ -0,0 +1,57 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "ddsp_bogus_dsq_fail.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct ddsp_bogus_dsq_fail *skel;
 	skel = ddsp_bogus_dsq_fail__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct ddsp_bogus_dsq_fail *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.ddsp_bogus_dsq_fail_ops);
 	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
 	sleep(1);
 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct ddsp_bogus_dsq_fail *skel = ctx;
 	ddsp_bogus_dsq_fail__destroy(skel);
 }
 struct scx_test ddsp_bogus_dsq_fail = {
 	.name = "ddsp_bogus_dsq_fail",
 	.description = "Verify we gracefully fail, and fall back to using a "
 		       "built-in DSQ, if we do a direct dispatch to an invalid"
 		       " DSQ in ops.select_cpu()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&ddsp_bogus_dsq_fail)
--- a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c
+++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.bpf.c
@@ -0,0 +1,39 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 UEI_DEFINE(uei);
 s32 BPF_STRUCT_OPS(ddsp_vtimelocal_fail_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
 	if (cpu >= 0) {
 		/* Shouldn't be allowed to vtime dispatch to a builtin DSQ. */
 		scx_bpf_dispatch_vtime(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL,
 				       p->scx.dsq_vtime, 0);
 		return cpu;
 	}
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(ddsp_vtimelocal_fail_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops ddsp_vtimelocal_fail_ops = {
 	.select_cpu		= ddsp_vtimelocal_fail_select_cpu,
 	.exit			= ddsp_vtimelocal_fail_exit,
 	.name			= "ddsp_vtimelocal_fail",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c
+++ b/tools/testing/selftests/sched_ext/ddsp_vtimelocal_fail.c
@ -0,0 +1,56 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <unistd.h>
 #include "ddsp_vtimelocal_fail.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct ddsp_vtimelocal_fail *skel;
 	skel = ddsp_vtimelocal_fail__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct ddsp_vtimelocal_fail *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.ddsp_vtimelocal_fail_ops);
 	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
 	sleep(1);
 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct ddsp_vtimelocal_fail *skel = ctx;
 	ddsp_vtimelocal_fail__destroy(skel);
 }
 struct scx_test ddsp_vtimelocal_fail = {
 	.name = "ddsp_vtimelocal_fail",
 	.description = "Verify we gracefully fail, and fall back to using a "
 		       "built-in DSQ, if we do a direct vtime dispatch to a "
 		       "built-in DSQ in ops.select_cpu()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&ddsp_vtimelocal_fail)
--- a/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c
+++ b/tools/testing/selftests/sched_ext/dsp_local_on.bpf.c
@ -0,0 +1,65 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 const volatile s32 nr_cpus;
 UEI_DEFINE(uei);
 struct {
 	__uint(type, BPF_MAP_TYPE_QUEUE);
 	__uint(max_entries, 8192);
 	__type(value, s32);
 } queue SEC(".maps");
 s32 BPF_STRUCT_OPS(dsp_local_on_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(dsp_local_on_enqueue, struct task_struct *p,
 		    u64 enq_flags)
 {
 	s32 pid = p->pid;
 	if (bpf_map_push_elem(&queue, &pid, 0))
 		scx_bpf_error("Failed to enqueue %s[%d]", p->comm, p->pid);
 }
 void BPF_STRUCT_OPS(dsp_local_on_dispatch, s32 cpu, struct task_struct *prev)
 {
 	s32 pid, target;
 	struct task_struct *p;
 	if (bpf_map_pop_elem(&queue, &pid))
 		return;
 	p = bpf_task_from_pid(pid);
 	if (!p)
 		return;
 	target = bpf_get_prandom_u32() % nr_cpus;
 	scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_DFL, 0);
 	bpf_task_release(p);
 }
 void BPF_STRUCT_OPS(dsp_local_on_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops dsp_local_on_ops = {
 	.select_cpu		= dsp_local_on_select_cpu,
 	.enqueue		= dsp_local_on_enqueue,
 	.dispatch		= dsp_local_on_dispatch,
 	.exit			= dsp_local_on_exit,
 	.name			= "dsp_local_on",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/dsp_local_on.c
+++ b/tools/testing/selftests/sched_ext/dsp_local_on.c
@ -0,0 +1,58 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <unistd.h>
 #include "dsp_local_on.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct dsp_local_on *skel;
 	skel = dsp_local_on__open();
 	SCX_FAIL_IF(!skel, "Failed to open");
 	skel->rodata->nr_cpus = libbpf_num_possible_cpus();
 	SCX_FAIL_IF(dsp_local_on__load(skel), "Failed to load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct dsp_local_on *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.dsp_local_on_ops);
 	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
 	/* Just sleeping is fine, plenty of scheduling events happening */
 	sleep(1);
 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct dsp_local_on *skel = ctx;
 	dsp_local_on__destroy(skel);
 }
 struct scx_test dsp_local_on = {
 	.name = "dsp_local_on",
 	.description = "Verify we can directly dispatch tasks to local DSQs "
 		       "from ops.dispatch()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&dsp_local_on)
--- a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c
+++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.bpf.c
@ -0,0 +1,21 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates that specifying SCX_OPS_ENQ_LAST without
 * defining ops.enqueue() fails.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 SEC(".struct_ops.link")
 struct sched_ext_ops enq_last_no_enq_fails_ops = {
 	.name			= "enq_last_no_enq_fails",
 	/* Need to define ops.enqueue() with SCX_OPS_ENQ_LAST */
 	.flags			= SCX_OPS_ENQ_LAST,
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c
+++ b/tools/testing/selftests/sched_ext/enq_last_no_enq_fails.c
@ -0,0 +1,60 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "enq_last_no_enq_fails.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct enq_last_no_enq_fails *skel;
 	skel = enq_last_no_enq_fails__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load skel");
 		return SCX_TEST_FAIL;
 	}
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct enq_last_no_enq_fails *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.enq_last_no_enq_fails_ops);
 	if (link) {
 		SCX_ERR("Incorrectly succeeded in attaching scheduler");
 		return SCX_TEST_FAIL;
 	}
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct enq_last_no_enq_fails *skel = ctx;
 	enq_last_no_enq_fails__destroy(skel);
 }
 struct scx_test enq_last_no_enq_fails = {
 	.name = "enq_last_no_enq_fails",
 	.description = "Verify we fail to load a scheduler if we specify "
 		       "the SCX_OPS_ENQ_LAST flag without defining "
 		       "ops.enqueue()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&enq_last_no_enq_fails)
--- a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c
+++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.bpf.c
@ -0,0 +1,43 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 /* Manually specify the signature until the kfunc is added to the scx repo. */
 s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
 			   bool *found) __ksym;
 s32 BPF_STRUCT_OPS(enq_select_cpu_fails_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(enq_select_cpu_fails_enqueue, struct task_struct *p,
 		    u64 enq_flags)
 {
 	/*
 	 * Need to initialize the variable or the verifier will fail to load.
 	 * Improving these semantics is actively being worked on.
 	 */
 	bool found = false;
 	/* Can only call from ops.select_cpu() */
 	scx_bpf_select_cpu_dfl(p, 0, 0, &found);
 	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops enq_select_cpu_fails_ops = {
 	.select_cpu		= enq_select_cpu_fails_select_cpu,
 	.enqueue		= enq_select_cpu_fails_enqueue,
 	.name			= "enq_select_cpu_fails",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
+++ b/tools/testing/selftests/sched_ext/enq_select_cpu_fails.c
@ -0,0 +1,61 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "enq_select_cpu_fails.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct enq_select_cpu_fails *skel;
 	skel = enq_select_cpu_fails__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load skel");
 		return SCX_TEST_FAIL;
 	}
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct enq_select_cpu_fails *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.enq_select_cpu_fails_ops);
 	if (!link) {
 		SCX_ERR("Failed to attach scheduler");
 		return SCX_TEST_FAIL;
 	}
 	sleep(1);
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct enq_select_cpu_fails *skel = ctx;
 	enq_select_cpu_fails__destroy(skel);
 }
 struct scx_test enq_select_cpu_fails = {
 	.name = "enq_select_cpu_fails",
 	.description = "Verify we fail to call scx_bpf_select_cpu_dfl() "
 		       "from ops.enqueue()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&enq_select_cpu_fails)
--- a/tools/testing/selftests/sched_ext/exit.bpf.c
+++ b/tools/testing/selftests/sched_ext/exit.bpf.c
@ -0,0 +1,84 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 #include "exit_test.h"
 const volatile int exit_point;
 UEI_DEFINE(uei);
 #define EXIT_CLEANLY() scx_bpf_exit(exit_point, "%d", exit_point)
 s32 BPF_STRUCT_OPS(exit_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	bool found;
 	if (exit_point == EXIT_SELECT_CPU)
 		EXIT_CLEANLY();
 	return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
 }
 void BPF_STRUCT_OPS(exit_enqueue, struct task_struct *p, u64 enq_flags)
 {
 	if (exit_point == EXIT_ENQUEUE)
 		EXIT_CLEANLY();
 	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
 }
 void BPF_STRUCT_OPS(exit_dispatch, s32 cpu, struct task_struct *p)
 {
 	if (exit_point == EXIT_DISPATCH)
 		EXIT_CLEANLY();
 	scx_bpf_consume(SCX_DSQ_GLOBAL);
 }
 void BPF_STRUCT_OPS(exit_enable, struct task_struct *p)
 {
 	if (exit_point == EXIT_ENABLE)
 		EXIT_CLEANLY();
 }
 s32 BPF_STRUCT_OPS(exit_init_task, struct task_struct *p,
 		    struct scx_init_task_args *args)
 {
 	if (exit_point == EXIT_INIT_TASK)
 		EXIT_CLEANLY();
 	return 0;
 }
 void BPF_STRUCT_OPS(exit_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 s32 BPF_STRUCT_OPS_SLEEPABLE(exit_init)
 {
 	if (exit_point == EXIT_INIT)
 		EXIT_CLEANLY();
 	return 0;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops exit_ops = {
 	.select_cpu		= exit_select_cpu,
 	.enqueue		= exit_enqueue,
 	.dispatch		= exit_dispatch,
 	.init_task		= exit_init_task,
 	.enable			= exit_enable,
 	.exit			= exit_exit,
 	.init			= exit_init,
 	.name			= "exit",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/exit.c
+++ b/tools/testing/selftests/sched_ext/exit.c
@ -0,0 +1,55 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <sched.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "exit.bpf.skel.h"
 #include "scx_test.h"
 #include "exit_test.h"
 static enum scx_test_status run(void *ctx)
 {
 	enum exit_test_case tc;
 	for (tc = 0; tc < NUM_EXITS; tc++) {
 		struct exit *skel;
 		struct bpf_link *link;
 		char buf[16];
 		skel = exit__open();
 		skel->rodata->exit_point = tc;
 		exit__load(skel);
 		link = bpf_map__attach_struct_ops(skel->maps.exit_ops);
 		if (!link) {
 			SCX_ERR("Failed to attach scheduler");
 			exit__destroy(skel);
 			return SCX_TEST_FAIL;
 		}
 		/* Assumes uei.kind is written last */
 		while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE))
 			sched_yield();
 		SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF));
 		SCX_EQ(skel->data->uei.exit_code, tc);
 		sprintf(buf, "%d", tc);
 		SCX_ASSERT(!strcmp(skel->data->uei.msg, buf));
 		bpf_link__destroy(link);
 		exit__destroy(skel);
 	}
 	return SCX_TEST_PASS;
 }
 struct scx_test exit_test = {
 	.name = "exit",
 	.description = "Verify we can cleanly exit a scheduler in multiple places",
 	.run = run,
 };
 REGISTER_SCX_TEST(&exit_test)
--- a/tools/testing/selftests/sched_ext/exit_test.h
+++ b/tools/testing/selftests/sched_ext/exit_test.h
@ -0,0 +1,20 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #ifndef __EXIT_TEST_H__
 #define __EXIT_TEST_H__
 enum exit_test_case {
 	EXIT_SELECT_CPU,
 	EXIT_ENQUEUE,
 	EXIT_DISPATCH,
 	EXIT_ENABLE,
 	EXIT_INIT_TASK,
 	EXIT_INIT,
 	NUM_EXITS,
 };
 #endif  // # __EXIT_TEST_H__
--- a/tools/testing/selftests/sched_ext/hotplug.bpf.c
+++ b/tools/testing/selftests/sched_ext/hotplug.bpf.c
@ -0,0 +1,61 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 #include "hotplug_test.h"
 UEI_DEFINE(uei);
 void BPF_STRUCT_OPS(hotplug_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 static void exit_from_hotplug(s32 cpu, bool onlining)
 {
 	/*
 	 * Ignored, just used to verify that we can invoke blocking kfuncs
 	 * from the hotplug path.
 	 */
 	scx_bpf_create_dsq(0, -1);
 	s64 code = SCX_ECODE_ACT_RESTART | HOTPLUG_EXIT_RSN;
 	if (onlining)
 		code |= HOTPLUG_ONLINING;
 	scx_bpf_exit(code, "hotplug event detected (%d going %s)", cpu,
 		     onlining ? "online" : "offline");
 }
 void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_online, s32 cpu)
 {
 	exit_from_hotplug(cpu, true);
 }
 void BPF_STRUCT_OPS_SLEEPABLE(hotplug_cpu_offline, s32 cpu)
 {
 	exit_from_hotplug(cpu, false);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops hotplug_cb_ops = {
 	.cpu_online		= hotplug_cpu_online,
 	.cpu_offline		= hotplug_cpu_offline,
 	.exit			= hotplug_exit,
 	.name			= "hotplug_cbs",
 	.timeout_ms		= 1000U,
 };
 SEC(".struct_ops.link")
 struct sched_ext_ops hotplug_nocb_ops = {
 	.exit			= hotplug_exit,
 	.name			= "hotplug_nocbs",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/hotplug.c
+++ b/tools/testing/selftests/sched_ext/hotplug.c
@ -0,0 +1,168 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <sched.h>
 #include <scx/common.h>
 #include <sched.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "hotplug_test.h"
 #include "hotplug.bpf.skel.h"
 #include "scx_test.h"
 #include "util.h"
 const char *online_path = "/sys/devices/system/cpu/cpu1/online";
 static bool is_cpu_online(void)
 {
 	return file_read_long(online_path) > 0;
 }
 static void toggle_online_status(bool online)
 {
 	long val = online ? 1 : 0;
 	int ret;
 	ret = file_write_long(online_path, val);
 	if (ret != 0)
 		fprintf(stderr, "Failed to bring CPU %s (%s)\n",
 			online ? "online" : "offline", strerror(errno));
 }
 static enum scx_test_status setup(void **ctx)
 {
 	if (!is_cpu_online())
 		return SCX_TEST_SKIP;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status test_hotplug(bool onlining, bool cbs_defined)
 {
 	struct hotplug *skel;
 	struct bpf_link *link;
 	long kind, code;
 	SCX_ASSERT(is_cpu_online());
 	skel = hotplug__open_and_load();
 	SCX_ASSERT(skel);
 	/* Testing the offline -> online path, so go offline before starting */
 	if (onlining)
 		toggle_online_status(0);
 	if (cbs_defined) {
 		kind = SCX_KIND_VAL(SCX_EXIT_UNREG_BPF);
 		code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) | HOTPLUG_EXIT_RSN;
 		if (onlining)
 			code |= HOTPLUG_ONLINING;
 	} else {
 		kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN);
 		code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) |
 		       SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG);
 	}
 	if (cbs_defined)
 		link = bpf_map__attach_struct_ops(skel->maps.hotplug_cb_ops);
 	else
 		link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops);
 	if (!link) {
 		SCX_ERR("Failed to attach scheduler");
 		hotplug__destroy(skel);
 		return SCX_TEST_FAIL;
 	}
 	toggle_online_status(onlining ? 1 : 0);
 	while (!UEI_EXITED(skel, uei))
 		sched_yield();
 	SCX_EQ(skel->data->uei.kind, kind);
 	SCX_EQ(UEI_REPORT(skel, uei), code);
 	if (!onlining)
 		toggle_online_status(1);
 	bpf_link__destroy(link);
 	hotplug__destroy(skel);
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status test_hotplug_attach(void)
 {
 	struct hotplug *skel;
 	struct bpf_link *link;
 	enum scx_test_status status = SCX_TEST_PASS;
 	long kind, code;
 	SCX_ASSERT(is_cpu_online());
 	SCX_ASSERT(scx_hotplug_seq() > 0);
 	skel = SCX_OPS_OPEN(hotplug_nocb_ops, hotplug);
 	SCX_ASSERT(skel);
 	SCX_OPS_LOAD(skel, hotplug_nocb_ops, hotplug, uei);
 	/*
 	 * Take the CPU offline to increment the global hotplug seq, which
 	 * should cause the scheduler to exit right after attach, since the
 	 * hotplug seq set above is now stale.
 	toggle_online_status(0);
 	link = bpf_map__attach_struct_ops(skel->maps.hotplug_nocb_ops);
 	toggle_online_status(1);
 	SCX_ASSERT(link);
 	while (!UEI_EXITED(skel, uei))
 		sched_yield();
 	kind = SCX_KIND_VAL(SCX_EXIT_UNREG_KERN);
 	code = SCX_ECODE_VAL(SCX_ECODE_ACT_RESTART) |
 	       SCX_ECODE_VAL(SCX_ECODE_RSN_HOTPLUG);
 	SCX_EQ(skel->data->uei.kind, kind);
 	SCX_EQ(UEI_REPORT(skel, uei), code);
 	bpf_link__destroy(link);
 	hotplug__destroy(skel);
 	return status;
 }
 static enum scx_test_status run(void *ctx)
 {
 #define HP_TEST(__onlining, __cbs_defined) ({				\
 	if (test_hotplug(__onlining, __cbs_defined) != SCX_TEST_PASS)	\
 		return SCX_TEST_FAIL;					\
 })
 	HP_TEST(true, true);
 	HP_TEST(false, true);
 	HP_TEST(true, false);
 	HP_TEST(false, false);
 #undef HP_TEST
 	return test_hotplug_attach();
 }
 static void cleanup(void *ctx)
 {
 	toggle_online_status(1);
 }
 struct scx_test hotplug_test = {
 	.name = "hotplug",
 	.description = "Verify hotplug behavior",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&hotplug_test)
--- a/tools/testing/selftests/sched_ext/hotplug_test.h
+++ b/tools/testing/selftests/sched_ext/hotplug_test.h
@ -0,0 +1,15 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #ifndef __HOTPLUG_TEST_H__
 #define __HOTPLUG_TEST_H__
 enum hotplug_test_flags {
 	HOTPLUG_EXIT_RSN = 1LLU << 0,
 	HOTPLUG_ONLINING = 1LLU << 1,
 };
 #endif  // # __HOTPLUG_TEST_H__
--- a/tools/testing/selftests/sched_ext/init_enable_count.bpf.c
+++ b/tools/testing/selftests/sched_ext/init_enable_count.bpf.c
@ -0,0 +1,53 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that verifies that we do proper counting of init, enable, etc
 * callbacks.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 u64 init_task_cnt, exit_task_cnt, enable_cnt, disable_cnt;
 u64 init_fork_cnt, init_transition_cnt;
 s32 BPF_STRUCT_OPS_SLEEPABLE(cnt_init_task, struct task_struct *p,
 			     struct scx_init_task_args *args)
 {
 	__sync_fetch_and_add(&init_task_cnt, 1);
 	if (args->fork)
 		__sync_fetch_and_add(&init_fork_cnt, 1);
 	else
 		__sync_fetch_and_add(&init_transition_cnt, 1);
 	return 0;
 }
 void BPF_STRUCT_OPS(cnt_exit_task, struct task_struct *p)
 {
 	__sync_fetch_and_add(&exit_task_cnt, 1);
 }
 void BPF_STRUCT_OPS(cnt_enable, struct task_struct *p)
 {
 	__sync_fetch_and_add(&enable_cnt, 1);
 }
 void BPF_STRUCT_OPS(cnt_disable, struct task_struct *p)
 {
 	__sync_fetch_and_add(&disable_cnt, 1);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops init_enable_count_ops = {
 	.init_task	= cnt_init_task,
 	.exit_task	= cnt_exit_task,
 	.enable		= cnt_enable,
 	.disable	= cnt_disable,
 	.name		= "init_enable_count",
 };
--- a/tools/testing/selftests/sched_ext/init_enable_count.c
+++ b/tools/testing/selftests/sched_ext/init_enable_count.c
@ -0,0 +1,166 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <stdio.h>
 #include <unistd.h>
 #include <sched.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include "scx_test.h"
 #include "init_enable_count.bpf.skel.h"
 #define SCHED_EXT 7
 static struct init_enable_count *
 open_load_prog(bool global)
 {
 	struct init_enable_count *skel;
 	skel = init_enable_count__open();
 	SCX_BUG_ON(!skel, "Failed to open skel");
 	if (!global)
 		skel->struct_ops.init_enable_count_ops->flags |= SCX_OPS_SWITCH_PARTIAL;
 	SCX_BUG_ON(init_enable_count__load(skel), "Failed to load skel");
 	return skel;
 }
 static enum scx_test_status run_test(bool global)
 {
 	struct init_enable_count *skel;
 	struct bpf_link *link;
 	const u32 num_children = 5, num_pre_forks = 1024;
 	int ret, i, status;
 	struct sched_param param = {};
 	pid_t pids[num_pre_forks];
 	skel = open_load_prog(global);
 	/*
 	 * Fork a bunch of children before we attach the scheduler so that we
 	 * ensure (at least in practical terms) that there are more tasks that
 	 * transition from SCHED_OTHER -> SCHED_EXT than there are tasks that
 	 * take the fork() path either below or in other processes.
 	 */
 	for (i = 0; i < num_pre_forks; i++) {
 		pids[i] = fork();
 		SCX_FAIL_IF(pids[i] < 0, "Failed to fork child");
 		if (pids[i] == 0) {
 			sleep(1);
 			exit(0);
 		}
 	}
 	link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops);
 	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
 	for (i = 0; i < num_pre_forks; i++) {
 		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
 			    "Failed to wait for pre-forked child\n");
 		SCX_FAIL_IF(status != 0, "Pre-forked child %d exited with status %d\n", i,
 			    status);
 	}
 	bpf_link__destroy(link);
 	SCX_GE(skel->bss->init_task_cnt, num_pre_forks);
 	SCX_GE(skel->bss->exit_task_cnt, num_pre_forks);
 	link = bpf_map__attach_struct_ops(skel->maps.init_enable_count_ops);
 	SCX_FAIL_IF(!link, "Failed to attach struct_ops");
 	/* SCHED_EXT children */
 	for (i = 0; i < num_children; i++) {
 		pids[i] = fork();
 		SCX_FAIL_IF(pids[i] < 0, "Failed to fork child");
 		if (pids[i] == 0) {
 			ret = sched_setscheduler(0, SCHED_EXT, &param);
 			SCX_BUG_ON(ret, "Failed to set sched to sched_ext");
 			/*
 			 * Reset to SCHED_OTHER for half of them. Counts for
 			 * everything should still be the same regardless, as
 			 * ops.disable() is invoked even if a task is still on
 			 * SCHED_EXT before it exits.
 			 */
 			if (i % 2 == 0) {
 				ret = sched_setscheduler(0, SCHED_OTHER, &param);
 				SCX_BUG_ON(ret, "Failed to reset sched to normal");
 			}
 			exit(0);
 		}
 	}
 	for (i = 0; i < num_children; i++) {
 		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
 			    "Failed to wait for SCX child\n");
 		SCX_FAIL_IF(status != 0, "SCX child %d exited with status %d\n", i,
 			    status);
 	}
 	/* SCHED_OTHER children */
 	for (i = 0; i < num_children; i++) {
 		pids[i] = fork();
 		if (pids[i] == 0)
 			exit(0);
 	}
 	for (i = 0; i < num_children; i++) {
 		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
 			    "Failed to wait for normal child\n");
 		SCX_FAIL_IF(status != 0, "Normal child %d exited with status %d\n", i,
 			    status);
 	}
 	bpf_link__destroy(link);
 	SCX_GE(skel->bss->init_task_cnt, 2 * num_children);
 	SCX_GE(skel->bss->exit_task_cnt, 2 * num_children);
 	if (global) {
 		SCX_GE(skel->bss->enable_cnt, 2 * num_children);
 		SCX_GE(skel->bss->disable_cnt, 2 * num_children);
 	} else {
 		SCX_EQ(skel->bss->enable_cnt, num_children);
 		SCX_EQ(skel->bss->disable_cnt, num_children);
 	}
 	/*
 	 * We forked a ton of tasks before we attached the scheduler above, so
 	 * this should be fine. Technically it could be flaky if a ton of forks
 	 * are happening at the same time in other processes, but that should
 	 * be exceedingly unlikely.
 	 */
 	SCX_GT(skel->bss->init_transition_cnt, skel->bss->init_fork_cnt);
 	SCX_GE(skel->bss->init_fork_cnt, 2 * num_children);
 	init_enable_count__destroy(skel);
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	enum scx_test_status status;
 	status = run_test(true);
 	if (status != SCX_TEST_PASS)
 		return status;
 	return run_test(false);
 }
 struct scx_test init_enable_count = {
 	.name = "init_enable_count",
 	.description = "Verify we do the correct amount of counting of init, "
 		       "enable, etc callbacks.",
 	.run = run,
 };
 REGISTER_SCX_TEST(&init_enable_count)
--- a/tools/testing/selftests/sched_ext/maximal.bpf.c
+++ b/tools/testing/selftests/sched_ext/maximal.bpf.c
@ -0,0 +1,164 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler with every callback defined.
 *
 * Every callback is a no-op or the simplest valid implementation.
 *
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 s32 BPF_STRUCT_OPS(maximal_select_cpu, struct task_struct *p, s32 prev_cpu,
 		   u64 wake_flags)
 {
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(maximal_enqueue, struct task_struct *p, u64 enq_flags)
 {
 	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
 }
 void BPF_STRUCT_OPS(maximal_dequeue, struct task_struct *p, u64 deq_flags)
 {}
 void BPF_STRUCT_OPS(maximal_dispatch, s32 cpu, struct task_struct *prev)
 {
 	scx_bpf_consume(SCX_DSQ_GLOBAL);
 }
 void BPF_STRUCT_OPS(maximal_runnable, struct task_struct *p, u64 enq_flags)
 {}
 void BPF_STRUCT_OPS(maximal_running, struct task_struct *p)
 {}
 void BPF_STRUCT_OPS(maximal_stopping, struct task_struct *p, bool runnable)
 {}
 void BPF_STRUCT_OPS(maximal_quiescent, struct task_struct *p, u64 deq_flags)
 {}
 bool BPF_STRUCT_OPS(maximal_yield, struct task_struct *from,
 		    struct task_struct *to)
 {
 	return false;
 }
 bool BPF_STRUCT_OPS(maximal_core_sched_before, struct task_struct *a,
 		    struct task_struct *b)
 {
 	return false;
 }
 void BPF_STRUCT_OPS(maximal_set_weight, struct task_struct *p, u32 weight)
 {}
 void BPF_STRUCT_OPS(maximal_set_cpumask, struct task_struct *p,
 		    const struct cpumask *cpumask)
 {}
 void BPF_STRUCT_OPS(maximal_update_idle, s32 cpu, bool idle)
 {}
 void BPF_STRUCT_OPS(maximal_cpu_acquire, s32 cpu,
 		    struct scx_cpu_acquire_args *args)
 {}
 void BPF_STRUCT_OPS(maximal_cpu_release, s32 cpu,
 		    struct scx_cpu_release_args *args)
 {}
 void BPF_STRUCT_OPS(maximal_cpu_online, s32 cpu)
 {}
 void BPF_STRUCT_OPS(maximal_cpu_offline, s32 cpu)
 {}
 s32 BPF_STRUCT_OPS(maximal_init_task, struct task_struct *p,
 		   struct scx_init_task_args *args)
 {
 	return 0;
 }
 void BPF_STRUCT_OPS(maximal_enable, struct task_struct *p)
 {}
 void BPF_STRUCT_OPS(maximal_exit_task, struct task_struct *p,
 		    struct scx_exit_task_args *args)
 {}
 void BPF_STRUCT_OPS(maximal_disable, struct task_struct *p)
 {}
 s32 BPF_STRUCT_OPS(maximal_cgroup_init, struct cgroup *cgrp,
 		   struct scx_cgroup_init_args *args)
 {
 	return 0;
 }
 void BPF_STRUCT_OPS(maximal_cgroup_exit, struct cgroup *cgrp)
 {}
 s32 BPF_STRUCT_OPS(maximal_cgroup_prep_move, struct task_struct *p,
 		   struct cgroup *from, struct cgroup *to)
 {
 	return 0;
 }
 void BPF_STRUCT_OPS(maximal_cgroup_move, struct task_struct *p,
 		    struct cgroup *from, struct cgroup *to)
 {}
 void BPF_STRUCT_OPS(maximal_cgroup_cancel_move, struct task_struct *p,
 	       struct cgroup *from, struct cgroup *to)
 {}
 void BPF_STRUCT_OPS(maximal_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
 {}
 s32 BPF_STRUCT_OPS_SLEEPABLE(maximal_init)
 {
 	return 0;
 }
 void BPF_STRUCT_OPS(maximal_exit, struct scx_exit_info *info)
 {}
 SEC(".struct_ops.link")
 struct sched_ext_ops maximal_ops = {
 	.select_cpu		= maximal_select_cpu,
 	.enqueue		= maximal_enqueue,
 	.dequeue		= maximal_dequeue,
 	.dispatch		= maximal_dispatch,
 	.runnable		= maximal_runnable,
 	.running		= maximal_running,
 	.stopping		= maximal_stopping,
 	.quiescent		= maximal_quiescent,
 	.yield			= maximal_yield,
 	.core_sched_before	= maximal_core_sched_before,
 	.set_weight		= maximal_set_weight,
 	.set_cpumask		= maximal_set_cpumask,
 	.update_idle		= maximal_update_idle,
 	.cpu_acquire		= maximal_cpu_acquire,
 	.cpu_release		= maximal_cpu_release,
 	.cpu_online		= maximal_cpu_online,
 	.cpu_offline		= maximal_cpu_offline,
 	.init_task		= maximal_init_task,
 	.enable			= maximal_enable,
 	.exit_task		= maximal_exit_task,
 	.disable		= maximal_disable,
 	.cgroup_init		= maximal_cgroup_init,
 	.cgroup_exit		= maximal_cgroup_exit,
 	.cgroup_prep_move	= maximal_cgroup_prep_move,
 	.cgroup_move		= maximal_cgroup_move,
 	.cgroup_cancel_move	= maximal_cgroup_cancel_move,
 	.cgroup_set_weight	= maximal_cgroup_set_weight,
 	.init			= maximal_init,
 	.exit			= maximal_exit,
 	.name			= "maximal",
 };
--- a/tools/testing/selftests/sched_ext/maximal.c
+++ b/tools/testing/selftests/sched_ext/maximal.c
@ -0,0 +1,51 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "maximal.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct maximal *skel;
 	skel = maximal__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct maximal *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.maximal_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct maximal *skel = ctx;
 	maximal__destroy(skel);
 }
 struct scx_test maximal = {
 	.name = "maximal",
 	.description = "Verify we can load a scheduler with every callback defined",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&maximal)
--- a/tools/testing/selftests/sched_ext/maybe_null.bpf.c
+++ b/tools/testing/selftests/sched_ext/maybe_null.bpf.c
@ -0,0 +1,36 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 u64 vtime_test;
 void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
 {}
 void BPF_STRUCT_OPS(maybe_null_success_dispatch, s32 cpu, struct task_struct *p)
 {
 	if (p != NULL)
 		vtime_test = p->scx.dsq_vtime;
 }
 bool BPF_STRUCT_OPS(maybe_null_success_yield, struct task_struct *from,
 		    struct task_struct *to)
 {
 	if (to)
 		bpf_printk("Yielding to %s[%d]", to->comm, to->pid);
 	return false;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops maybe_null_success = {
 	.dispatch		= maybe_null_success_dispatch,
 	.yield			= maybe_null_success_yield,
 	.enable			= maybe_null_running,
 	.name			= "maybe_null_success",
 };
--- a/tools/testing/selftests/sched_ext/maybe_null.c
+++ b/tools/testing/selftests/sched_ext/maybe_null.c
@ -0,0 +1,49 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "maybe_null.bpf.skel.h"
 #include "maybe_null_fail_dsp.bpf.skel.h"
 #include "maybe_null_fail_yld.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status run(void *ctx)
 {
 	struct maybe_null *skel;
 	struct maybe_null_fail_dsp *fail_dsp;
 	struct maybe_null_fail_yld *fail_yld;
 	skel = maybe_null__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load maybe_null skel");
 		return SCX_TEST_FAIL;
 	}
 	maybe_null__destroy(skel);
 	fail_dsp = maybe_null_fail_dsp__open_and_load();
 	if (fail_dsp) {
 		maybe_null_fail_dsp__destroy(fail_dsp);
 		SCX_ERR("Should fail to open and load maybe_null_fail_dsp skel");
 		return SCX_TEST_FAIL;
 	}
 	fail_yld = maybe_null_fail_yld__open_and_load();
 	if (fail_yld) {
 		maybe_null_fail_yld__destroy(fail_yld);
 		SCX_ERR("Should fail to open and load maybe_null_fail_yld skel");
 		return SCX_TEST_FAIL;
 	}
 	return SCX_TEST_PASS;
 }
 struct scx_test maybe_null = {
 	.name = "maybe_null",
 	.description = "Verify that PTR_MAYBE_NULL works for .dispatch and .yield",
 	.run = run,
 };
 REGISTER_SCX_TEST(&maybe_null)
--- a/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c
+++ b/tools/testing/selftests/sched_ext/maybe_null_fail_dsp.bpf.c
@ -0,0 +1,25 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 u64 vtime_test;
 void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
 {}
 void BPF_STRUCT_OPS(maybe_null_fail_dispatch, s32 cpu, struct task_struct *p)
 {
 	vtime_test = p->scx.dsq_vtime;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops maybe_null_fail = {
 	.dispatch		= maybe_null_fail_dispatch,
 	.enable			= maybe_null_running,
 	.name			= "maybe_null_fail_dispatch",
 };
--- a/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c
+++ b/tools/testing/selftests/sched_ext/maybe_null_fail_yld.bpf.c
@ -0,0 +1,28 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 u64 vtime_test;
 void BPF_STRUCT_OPS(maybe_null_running, struct task_struct *p)
 {}
 bool BPF_STRUCT_OPS(maybe_null_fail_yield, struct task_struct *from,
 		    struct task_struct *to)
 {
 	bpf_printk("Yielding to %s[%d]", to->comm, to->pid);
 	return false;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops maybe_null_fail = {
 	.yield			= maybe_null_fail_yield,
 	.enable			= maybe_null_running,
 	.name			= "maybe_null_fail_yield",
 };
--- a/tools/testing/selftests/sched_ext/minimal.bpf.c
+++ b/tools/testing/selftests/sched_ext/minimal.bpf.c
@ -0,0 +1,21 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A completely minimal scheduler.
 *
 * This scheduler defines the absolute minimal set of struct sched_ext_ops
 * fields: its name. It should _not_ fail to be loaded, and can be used to
 * exercise the default scheduling paths in ext.c.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 SEC(".struct_ops.link")
 struct sched_ext_ops minimal_ops = {
 	.name			= "minimal",
 };
--- a/tools/testing/selftests/sched_ext/minimal.c
+++ b/tools/testing/selftests/sched_ext/minimal.c
@ -0,0 +1,58 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "minimal.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct minimal *skel;
 	skel = minimal__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load skel");
 		return SCX_TEST_FAIL;
 	}
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct minimal *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.minimal_ops);
 	if (!link) {
 		SCX_ERR("Failed to attach scheduler");
 		return SCX_TEST_FAIL;
 	}
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct minimal *skel = ctx;
 	minimal__destroy(skel);
 }
 struct scx_test minimal = {
 	.name = "minimal",
 	.description = "Verify we can load a fully minimal scheduler",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&minimal)
--- a/tools/testing/selftests/sched_ext/prog_run.bpf.c
+++ b/tools/testing/selftests/sched_ext/prog_run.bpf.c
@ -0,0 +1,33 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates that we can invoke sched_ext kfuncs in
 * BPF_PROG_TYPE_SYSCALL programs.
 *
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <scx/common.bpf.h>
 UEI_DEFINE(uei);
 char _license[] SEC("license") = "GPL";
 SEC("syscall")
 int BPF_PROG(prog_run_syscall)
 {
 	scx_bpf_create_dsq(0, -1);
 	scx_bpf_exit(0xdeadbeef, "Exited from PROG_RUN");
 	return 0;
 }
 void BPF_STRUCT_OPS(prog_run_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops prog_run_ops = {
 	.exit			= prog_run_exit,
 	.name			= "prog_run",
 };
--- a/tools/testing/selftests/sched_ext/prog_run.c
+++ b/tools/testing/selftests/sched_ext/prog_run.c
@ -0,0 +1,78 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <sched.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "prog_run.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct prog_run *skel;
 	skel = prog_run__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load skel");
 		return SCX_TEST_FAIL;
 	}
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct prog_run *skel = ctx;
 	struct bpf_link *link;
 	int prog_fd, err = 0;
 	prog_fd = bpf_program__fd(skel->progs.prog_run_syscall);
 	if (prog_fd < 0) {
 		SCX_ERR("Failed to get BPF_PROG_RUN prog");
 		return SCX_TEST_FAIL;
 	}
 	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	link = bpf_map__attach_struct_ops(skel->maps.prog_run_ops);
 	if (!link) {
 		SCX_ERR("Failed to attach scheduler");
 		close(prog_fd);
 		return SCX_TEST_FAIL;
 	}
 	err = bpf_prog_test_run_opts(prog_fd, &topts);
 	SCX_EQ(err, 0);
 	/* Assumes uei.kind is written last */
 	while (skel->data->uei.kind == EXIT_KIND(SCX_EXIT_NONE))
 		sched_yield();
 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG_BPF));
 	SCX_EQ(skel->data->uei.exit_code, 0xdeadbeef);
 	close(prog_fd);
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct prog_run *skel = ctx;
 	prog_run__destroy(skel);
 }
 struct scx_test prog_run = {
 	.name = "prog_run",
 	.description = "Verify we can call into a scheduler with BPF_PROG_RUN, and invoke kfuncs",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&prog_run)
--- a/tools/testing/selftests/sched_ext/reload_loop.c
+++ b/tools/testing/selftests/sched_ext/reload_loop.c
@ -0,0 +1,75 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <pthread.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "maximal.bpf.skel.h"
 #include "scx_test.h"
 static struct maximal *skel;
 static pthread_t threads[2];
 bool force_exit = false;
 static enum scx_test_status setup(void **ctx)
 {
 	skel = maximal__open_and_load();
 	if (!skel) {
 		SCX_ERR("Failed to open and load skel");
 		return SCX_TEST_FAIL;
 	}
 	return SCX_TEST_PASS;
 }
 static void *do_reload_loop(void *arg)
 {
 	u32 i;
 	for (i = 0; i < 1024 && !force_exit; i++) {
 		struct bpf_link *link;
 		link = bpf_map__attach_struct_ops(skel->maps.maximal_ops);
 		if (link)
 			bpf_link__destroy(link);
 	}
 	return NULL;
 }
 static enum scx_test_status run(void *ctx)
 {
 	int err;
 	void *ret;
 	err = pthread_create(&threads[0], NULL, do_reload_loop, NULL);
 	SCX_FAIL_IF(err, "Failed to create thread 0");
 	err = pthread_create(&threads[1], NULL, do_reload_loop, NULL);
 	SCX_FAIL_IF(err, "Failed to create thread 1");
 	SCX_FAIL_IF(pthread_join(threads[0], &ret), "thread 0 failed");
 	SCX_FAIL_IF(pthread_join(threads[1], &ret), "thread 1 failed");
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	force_exit = true;
 	maximal__destroy(skel);
 }
 struct scx_test reload_loop = {
 	.name = "reload_loop",
 	.description = "Stress test loading and unloading schedulers repeatedly in a tight loop",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&reload_loop)
--- a/tools/testing/selftests/sched_ext/runner.c
+++ b/tools/testing/selftests/sched_ext/runner.c
@ -0,0 +1,201 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <stdio.h>
 #include <unistd.h>
 #include <signal.h>
 #include <string.h>
 #include <libgen.h>
 #include <bpf/bpf.h>
 #include "scx_test.h"
 const char help_fmt[] =
 "The runner for sched_ext tests.\n"
 "\n"
 "The runner is statically linked against all testcases, and runs them all serially.\n"
 "It's required for the testcases to be serial, as only a single host-wide sched_ext\n"
 "scheduler may be loaded at any given time.\n"
 "\n"
 "Usage: %s [-t TEST] [-s] [-q] [-h]\n"
 "\n"
 "  -t TEST       Only run tests whose name includes this string\n"
 "  -s            Include print output for skipped tests\n"
 "  -q            Don't print the test descriptions during run\n"
 "  -h            Display this help and exit\n";
 static volatile int exit_req;
 static bool quiet, print_skipped;
 #define MAX_SCX_TESTS 2048
 static struct scx_test __scx_tests[MAX_SCX_TESTS];
 static unsigned __scx_num_tests = 0;
 static void sigint_handler(int simple)
 {
 	exit_req = 1;
 }
 static void print_test_preamble(const struct scx_test *test, bool quiet)
 {
 	printf("===== START =====\n");
 	printf("TEST: %s\n", test->name);
 	if (!quiet)
 		printf("DESCRIPTION: %s\n", test->description);
 	printf("OUTPUT:\n");
 }
 static const char *status_to_result(enum scx_test_status status)
 {
 	switch (status) {
 	case SCX_TEST_PASS:
 	case SCX_TEST_SKIP:
 		return "ok";
 	case SCX_TEST_FAIL:
 		return "not ok";
 	default:
 		return "<UNKNOWN>";
 	}
 }
 static void print_test_result(const struct scx_test *test,
 			      enum scx_test_status status,
 			      unsigned int testnum)
 {
 	const char *result = status_to_result(status);
 	const char *directive = status == SCX_TEST_SKIP ? "SKIP " : "";
 	printf("%s %u %s # %s\n", result, testnum, test->name, directive);
 	printf("=====  END  =====\n");
 }
 static bool should_skip_test(const struct scx_test *test, const char *filter)
 {
 	return !strstr(test->name, filter);
 }
 static enum scx_test_status run_test(const struct scx_test *test)
 {
 	enum scx_test_status status;
 	void *context = NULL;
 	if (test->setup) {
 		status = test->setup(&context);
 		if (status != SCX_TEST_PASS)
 			return status;
 	}
 	status = test->run(context);
 	if (test->cleanup)
 		test->cleanup(context);
 	return status;
 }
 static bool test_valid(const struct scx_test *test)
 {
 	if (!test) {
 		fprintf(stderr, "NULL test detected\n");
 		return false;
 	}
 	if (!test->name) {
 		fprintf(stderr,
 			"Test with no name found. Must specify test name.\n");
 		return false;
 	}
 	if (!test->description) {
 		fprintf(stderr, "Test %s requires description.\n", test->name);
 		return false;
 	}
 	if (!test->run) {
 		fprintf(stderr, "Test %s has no run() callback\n", test->name);
 		return false;
 	}
 	return true;
 }
 int main(int argc, char **argv)
 {
 	const char *filter = NULL;
 	unsigned testnum = 0, i;
 	unsigned passed = 0, skipped = 0, failed = 0;
 	int opt;
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
 	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
 	while ((opt = getopt(argc, argv, "qst:h")) != -1) {
 		switch (opt) {
 		case 'q':
 			quiet = true;
 			break;
 		case 's':
 			print_skipped = true;
 			break;
 		case 't':
 			filter = optarg;
 			break;
 		default:
 			fprintf(stderr, help_fmt, basename(argv[0]));
 			return opt != 'h';
 		}
 	}
 	for (i = 0; i < __scx_num_tests; i++) {
 		enum scx_test_status status;
 		struct scx_test *test = &__scx_tests[i];
 		if (filter && should_skip_test(test, filter)) {
 			/*
 			 * Printing the skipped tests and their preambles can
 			 * add a lot of noise to the runner output. Printing
 			 * this is only really useful for CI, so let's skip it
 			 * by default.
 			 */
 			if (print_skipped) {
 				print_test_preamble(test, quiet);
 				print_test_result(test, SCX_TEST_SKIP, ++testnum);
 			}
 			continue;
 		}
 		print_test_preamble(test, quiet);
 		status = run_test(test);
 		print_test_result(test, status, ++testnum);
 		switch (status) {
 		case SCX_TEST_PASS:
 			passed++;
 			break;
 		case SCX_TEST_SKIP:
 			skipped++;
 			break;
 		case SCX_TEST_FAIL:
 			failed++;
 			break;
 		}
 	}
 	printf("\n\n=============================\n\n");
 	printf("RESULTS:\n\n");
 	printf("PASSED:  %u\n", passed);
 	printf("SKIPPED: %u\n", skipped);
 	printf("FAILED:  %u\n", failed);
 	return 0;
 }
 void scx_test_register(struct scx_test *test)
 {
 	SCX_BUG_ON(!test_valid(test), "Invalid test found");
 	SCX_BUG_ON(__scx_num_tests >= MAX_SCX_TESTS, "Maximum tests exceeded");
 	__scx_tests[__scx_num_tests++] = *test;
 }
--- a/tools/testing/selftests/sched_ext/scx_test.h
+++ b/tools/testing/selftests/sched_ext/scx_test.h
@ -0,0 +1,131 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 */
 #ifndef __SCX_TEST_H__
 #define __SCX_TEST_H__
 #include <errno.h>
 #include <scx/common.h>
 #include <scx/compat.h>
 enum scx_test_status {
 	SCX_TEST_PASS = 0,
 	SCX_TEST_SKIP,
 	SCX_TEST_FAIL,
 };
 #define EXIT_KIND(__ent) __COMPAT_ENUM_OR_ZERO("scx_exit_kind", #__ent)
 struct scx_test {
 	/**
 	 * name - The name of the testcase.
 	 */
 	const char *name;
 	/**
 	 * description - A description of your testcase: what it tests and is
 	 * meant to validate.
 	 */
 	const char *description;
 	/*
 	 * setup - Setup the test.
 	 * @ctx: A pointer to a context object that will be passed to run and
 	 *	 cleanup.
 	 *
 	 * An optional callback that allows a testcase to perform setup for its
 	 * run. A test may return SCX_TEST_SKIP to skip the run.
 	 */
 	enum scx_test_status (*setup)(void **ctx);
 	/*
 	 * run - Run the test.
 	 * @ctx: Context set in the setup() callback. If @ctx was not set in
 	 *	 setup(), it is NULL.
 	 *
 	 * The main test. The callback should return one of:
 	 *
 	 * - SCX_TEST_PASS: Test passed
 	 * - SCX_TEST_SKIP: Test should be skipped
 	 * - SCX_TEST_FAIL: Test failed
 	 *
 	 * This callback must be defined.
 	 */
 	enum scx_test_status (*run)(void *ctx);
 	/*
 	 * cleanup - Perform cleanup following the test
 	 * @ctx: Context set in the setup() callback. If @ctx was not set in
 	 *	 setup(), it is NULL.
 	 *
 	 * An optional callback that allows a test to perform cleanup after
 	 * being run. This callback is run even if the run() callback returns
 	 * SCX_TEST_SKIP or SCX_TEST_FAIL. It is not run if setup() returns
 	 * SCX_TEST_SKIP or SCX_TEST_FAIL.
 	 */
 	void (*cleanup)(void *ctx);
 };
 void scx_test_register(struct scx_test *test);
 #define REGISTER_SCX_TEST(__test)			\
 	__attribute__((constructor))			\
 	static void ___scxregister##__LINE__(void)	\
 	{						\
 		scx_test_register(__test);		\
 	}
 #define SCX_ERR(__fmt, ...)						\
 	do {								\
 		fprintf(stderr, "ERR: %s:%d\n", __FILE__, __LINE__);	\
 		fprintf(stderr, __fmt"\n", ##__VA_ARGS__);			\
 	} while (0)
 #define SCX_FAIL(__fmt, ...)						\
 	do {								\
 		SCX_ERR(__fmt, ##__VA_ARGS__);				\
 		return SCX_TEST_FAIL;					\
 	} while (0)
 #define SCX_FAIL_IF(__cond, __fmt, ...)					\
 	do {								\
 		if (__cond)						\
 			SCX_FAIL(__fmt, ##__VA_ARGS__);			\
 	} while (0)
 #define SCX_GT(_x, _y) SCX_FAIL_IF((_x) <= (_y), "Expected %s > %s (%lu > %lu)",	\
 				   #_x, #_y, (u64)(_x), (u64)(_y))
 #define SCX_GE(_x, _y) SCX_FAIL_IF((_x) < (_y), "Expected %s >= %s (%lu >= %lu)",	\
 				   #_x, #_y, (u64)(_x), (u64)(_y))
 #define SCX_LT(_x, _y) SCX_FAIL_IF((_x) >= (_y), "Expected %s < %s (%lu < %lu)",	\
 				   #_x, #_y, (u64)(_x), (u64)(_y))
 #define SCX_LE(_x, _y) SCX_FAIL_IF((_x) > (_y), "Expected %s <= %s (%lu <= %lu)",	\
 				   #_x, #_y, (u64)(_x), (u64)(_y))
 #define SCX_EQ(_x, _y) SCX_FAIL_IF((_x) != (_y), "Expected %s == %s (%lu == %lu)",	\
 				   #_x, #_y, (u64)(_x), (u64)(_y))
 #define SCX_ASSERT(_x) SCX_FAIL_IF(!(_x), "Expected %s to be true (%lu)",		\
 				   #_x, (u64)(_x))
 #define SCX_ECODE_VAL(__ecode) ({						\
 	u64 __val = 0;								\
 	bool __found = false;							\
 										\
 	__found = __COMPAT_read_enum("scx_exit_code", #__ecode, &__val);	\
 	SCX_ASSERT(__found);							\
 	(s64)__val;								\
 })
 #define SCX_KIND_VAL(__kind) ({							\
 	u64 __val = 0;								\
 	bool __found = false;							\
 										\
 	__found = __COMPAT_read_enum("scx_exit_kind", #__kind, &__val);		\
 	SCX_ASSERT(__found);							\
 	__val;									\
 })
 #endif /* __SCX_TEST_H__ */
--- a/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.bpf.c
@ -0,0 +1,40 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates the behavior of direct dispatching with a default
 * select_cpu implementation.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 bool saw_local = false;
 static bool task_is_test(const struct task_struct *p)
 {
 	return !bpf_strncmp(p->comm, 9, "select_cpu");
 }
 void BPF_STRUCT_OPS(select_cpu_dfl_enqueue, struct task_struct *p,
 		    u64 enq_flags)
 {
 	const struct cpumask *idle_mask = scx_bpf_get_idle_cpumask();
 	if (task_is_test(p) &&
 	    bpf_cpumask_test_cpu(scx_bpf_task_cpu(p), idle_mask)) {
 		saw_local = true;
 	}
 	scx_bpf_put_idle_cpumask(idle_mask);
 	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops select_cpu_dfl_ops = {
 	.enqueue		= select_cpu_dfl_enqueue,
 	.name			= "select_cpu_dfl",
 };
--- a/tools/testing/selftests/sched_ext/select_cpu_dfl.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl.c
@ -0,0 +1,72 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "select_cpu_dfl.bpf.skel.h"
 #include "scx_test.h"
 #define NUM_CHILDREN 1028
 static enum scx_test_status setup(void **ctx)
 {
 	struct select_cpu_dfl *skel;
 	skel = select_cpu_dfl__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct select_cpu_dfl *skel = ctx;
 	struct bpf_link *link;
 	pid_t pids[NUM_CHILDREN];
 	int i, status;
 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	for (i = 0; i < NUM_CHILDREN; i++) {
 		pids[i] = fork();
 		if (pids[i] == 0) {
 			sleep(1);
 			exit(0);
 		}
 	}
 	for (i = 0; i < NUM_CHILDREN; i++) {
 		SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
 		SCX_EQ(status, 0);
 	}
 	SCX_ASSERT(!skel->bss->saw_local);
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct select_cpu_dfl *skel = ctx;
 	select_cpu_dfl__destroy(skel);
 }
 struct scx_test select_cpu_dfl = {
 	.name = "select_cpu_dfl",
 	.description = "Verify the default ops.select_cpu() dispatches tasks "
 		       "when idle cores are found, and skips ops.enqueue()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&select_cpu_dfl)
--- a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.bpf.c
@ -0,0 +1,89 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates the behavior of direct dispatching with a default
 * select_cpu implementation, and with the SCX_OPS_ENQ_DFL_NO_DISPATCH ops flag
 * specified.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 bool saw_local = false;
 /* Per-task scheduling context */
 struct task_ctx {
 	bool	force_local;	/* CPU changed by ops.select_cpu() */
 };
 struct {
 	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
 	__uint(map_flags, BPF_F_NO_PREALLOC);
 	__type(key, int);
 	__type(value, struct task_ctx);
 } task_ctx_stor SEC(".maps");
 /* Manually specify the signature until the kfunc is added to the scx repo. */
 s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
 			   bool *found) __ksym;
 s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	struct task_ctx *tctx;
 	s32 cpu;
 	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
 	if (!tctx) {
 		scx_bpf_error("task_ctx lookup failed");
 		return -ESRCH;
 	}
 	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags,
 				     &tctx->force_local);
 	return cpu;
 }
 void BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_enqueue, struct task_struct *p,
 		    u64 enq_flags)
 {
 	u64 dsq_id = SCX_DSQ_GLOBAL;
 	struct task_ctx *tctx;
 	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
 	if (!tctx) {
 		scx_bpf_error("task_ctx lookup failed");
 		return;
 	}
 	if (tctx->force_local) {
 		dsq_id = SCX_DSQ_LOCAL;
 		tctx->force_local = false;
 		saw_local = true;
 	}
 	scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, enq_flags);
 }
 s32 BPF_STRUCT_OPS(select_cpu_dfl_nodispatch_init_task,
 		   struct task_struct *p, struct scx_init_task_args *args)
 {
 	if (bpf_task_storage_get(&task_ctx_stor, p, 0,
 				 BPF_LOCAL_STORAGE_GET_F_CREATE))
 		return 0;
 	else
 		return -ENOMEM;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops select_cpu_dfl_nodispatch_ops = {
 	.select_cpu		= select_cpu_dfl_nodispatch_select_cpu,
 	.enqueue		= select_cpu_dfl_nodispatch_enqueue,
 	.init_task		= select_cpu_dfl_nodispatch_init_task,
 	.name			= "select_cpu_dfl_nodispatch",
 };
--- a/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dfl_nodispatch.c
@ -0,0 +1,72 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "select_cpu_dfl_nodispatch.bpf.skel.h"
 #include "scx_test.h"
 #define NUM_CHILDREN 1028
 static enum scx_test_status setup(void **ctx)
 {
 	struct select_cpu_dfl_nodispatch *skel;
 	skel = select_cpu_dfl_nodispatch__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct select_cpu_dfl_nodispatch *skel = ctx;
 	struct bpf_link *link;
 	pid_t pids[NUM_CHILDREN];
 	int i, status;
 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dfl_nodispatch_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	for (i = 0; i < NUM_CHILDREN; i++) {
 		pids[i] = fork();
 		if (pids[i] == 0) {
 			sleep(1);
 			exit(0);
 		}
 	}
 	for (i = 0; i < NUM_CHILDREN; i++) {
 		SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
 		SCX_EQ(status, 0);
 	}
 	SCX_ASSERT(skel->bss->saw_local);
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct select_cpu_dfl_nodispatch *skel = ctx;
 	select_cpu_dfl_nodispatch__destroy(skel);
 }
 struct scx_test select_cpu_dfl_nodispatch = {
 	.name = "select_cpu_dfl_nodispatch",
 	.description = "Verify behavior of scx_bpf_select_cpu_dfl() in "
 		       "ops.select_cpu()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&select_cpu_dfl_nodispatch)
--- a/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.bpf.c
@ -0,0 +1,41 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates direct dispatching to the built-in DSQs from a
 * custom ops.select_cpu() implementation.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 s32 BPF_STRUCT_OPS(select_cpu_dispatch_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	u64 dsq_id = SCX_DSQ_LOCAL;
 	s32 cpu = prev_cpu;
 	if (scx_bpf_test_and_clear_cpu_idle(cpu))
 		goto dispatch;
 	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
 	if (cpu >= 0)
 		goto dispatch;
 	dsq_id = SCX_DSQ_GLOBAL;
 	cpu = prev_cpu;
 dispatch:
 	scx_bpf_dispatch(p, dsq_id, SCX_SLICE_DFL, 0);
 	return cpu;
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops select_cpu_dispatch_ops = {
 	.select_cpu		= select_cpu_dispatch_select_cpu,
 	.name			= "select_cpu_dispatch",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/select_cpu_dispatch.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch.c
@ -0,0 +1,70 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "select_cpu_dispatch.bpf.skel.h"
 #include "scx_test.h"
 #define NUM_CHILDREN 1028
 static enum scx_test_status setup(void **ctx)
 {
 	struct select_cpu_dispatch *skel;
 	skel = select_cpu_dispatch__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct select_cpu_dispatch *skel = ctx;
 	struct bpf_link *link;
 	pid_t pids[NUM_CHILDREN];
 	int i, status;
 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	for (i = 0; i < NUM_CHILDREN; i++) {
 		pids[i] = fork();
 		if (pids[i] == 0) {
 			sleep(1);
 			exit(0);
 		}
 	}
 	for (i = 0; i < NUM_CHILDREN; i++) {
 		SCX_EQ(waitpid(pids[i], &status, 0), pids[i]);
 		SCX_EQ(status, 0);
 	}
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct select_cpu_dispatch *skel = ctx;
 	select_cpu_dispatch__destroy(skel);
 }
 struct scx_test select_cpu_dispatch = {
 	.name = "select_cpu_dispatch",
 	.description = "Test direct dispatching to built-in DSQs from "
 		       "ops.select_cpu()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&select_cpu_dispatch)
--- a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.bpf.c
@ -0,0 +1,37 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates that direct-dispatching to a nonexistent DSQ from
 * ops.select_cpu() fails gracefully.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 UEI_DEFINE(uei);
 s32 BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	/* Dispatching to a random DSQ should fail. */
 	scx_bpf_dispatch(p, 0xcafef00d, SCX_SLICE_DFL, 0);
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(select_cpu_dispatch_bad_dsq_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops select_cpu_dispatch_bad_dsq_ops = {
 	.select_cpu		= select_cpu_dispatch_bad_dsq_select_cpu,
 	.exit			= select_cpu_dispatch_bad_dsq_exit,
 	.name			= "select_cpu_dispatch_bad_dsq",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_bad_dsq.c
@ -0,0 +1,56 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "select_cpu_dispatch_bad_dsq.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct select_cpu_dispatch_bad_dsq *skel;
 	skel = select_cpu_dispatch_bad_dsq__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct select_cpu_dispatch_bad_dsq *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_bad_dsq_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	sleep(1);
 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct select_cpu_dispatch_bad_dsq *skel = ctx;
 	select_cpu_dispatch_bad_dsq__destroy(skel);
 }
 struct scx_test select_cpu_dispatch_bad_dsq = {
 	.name = "select_cpu_dispatch_bad_dsq",
 	.description = "Verify graceful failure if we direct-dispatch to a "
 		       "bogus DSQ in ops.select_cpu()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&select_cpu_dispatch_bad_dsq)
--- a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.bpf.c
@@ -0,0 +1,38 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates that dispatching the same task twice in a row
 * from ops.select_cpu() is rejected with an error exit.
 *
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 UEI_DEFINE(uei);
 s32 BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	/* Dispatching twice in a row is disallowed. */
 	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
 	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
 	return prev_cpu;
 }
 void BPF_STRUCT_OPS(select_cpu_dispatch_dbl_dsp_exit, struct scx_exit_info *ei)
 {
 	UEI_RECORD(uei, ei);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops select_cpu_dispatch_dbl_dsp_ops = {
 	.select_cpu		= select_cpu_dispatch_dbl_dsp_select_cpu,
 	.exit			= select_cpu_dispatch_dbl_dsp_exit,
 	.name			= "select_cpu_dispatch_dbl_dsp",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_dispatch_dbl_dsp.c
@@ -0,0 +1,56 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2023 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2023 David Vernet <dvernet@meta.com>
 * Copyright (c) 2023 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "select_cpu_dispatch_dbl_dsp.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct select_cpu_dispatch_dbl_dsp *skel;
 	skel = select_cpu_dispatch_dbl_dsp__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct select_cpu_dispatch_dbl_dsp *skel = ctx;
 	struct bpf_link *link;
 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_dispatch_dbl_dsp_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	sleep(1);
 	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct select_cpu_dispatch_dbl_dsp *skel = ctx;
 	select_cpu_dispatch_dbl_dsp__destroy(skel);
 }
 struct scx_test select_cpu_dispatch_dbl_dsp = {
 	.name = "select_cpu_dispatch_dbl_dsp",
 	.description = "Verify graceful failure if we dispatch twice to a "
 		       "DSQ in ops.select_cpu()",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&select_cpu_dispatch_dbl_dsp)
--- a/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.bpf.c
@@ -0,0 +1,92 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * A scheduler that validates that enqueue flags are properly stored and
 * applied at dispatch time when a task is directly dispatched from
 * ops.select_cpu(). We validate this by using scx_bpf_dispatch_vtime(), and
 * making the test a very basic vtime scheduler.
 *
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <scx/common.bpf.h>
 char _license[] SEC("license") = "GPL";
 volatile bool consumed;
 static u64 vtime_now;
 #define VTIME_DSQ 0
 static inline bool vtime_before(u64 a, u64 b)
 {
 	return (s64)(a - b) < 0;
 }
 static inline u64 task_vtime(const struct task_struct *p)
 {
 	u64 vtime = p->scx.dsq_vtime;
 	if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
 		return vtime_now - SCX_SLICE_DFL;
 	else
 		return vtime;
 }
 s32 BPF_STRUCT_OPS(select_cpu_vtime_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
 {
 	s32 cpu;
 	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
 	if (cpu >= 0)
 		goto ddsp;
 	cpu = prev_cpu;
 	scx_bpf_test_and_clear_cpu_idle(cpu);
 ddsp:
 	scx_bpf_dispatch_vtime(p, VTIME_DSQ, SCX_SLICE_DFL, task_vtime(p), 0);
 	return cpu;
 }
 void BPF_STRUCT_OPS(select_cpu_vtime_dispatch, s32 cpu, struct task_struct *p)
 {
 	if (scx_bpf_consume(VTIME_DSQ))
 		consumed = true;
 }
 void BPF_STRUCT_OPS(select_cpu_vtime_running, struct task_struct *p)
 {
 	if (vtime_before(vtime_now, p->scx.dsq_vtime))
 		vtime_now = p->scx.dsq_vtime;
 }
 void BPF_STRUCT_OPS(select_cpu_vtime_stopping, struct task_struct *p,
 		    bool runnable)
 {
 	p->scx.dsq_vtime += (SCX_SLICE_DFL - p->scx.slice) * 100 / p->scx.weight;
 }
 void BPF_STRUCT_OPS(select_cpu_vtime_enable, struct task_struct *p)
 {
 	p->scx.dsq_vtime = vtime_now;
 }
 s32 BPF_STRUCT_OPS_SLEEPABLE(select_cpu_vtime_init)
 {
 	return scx_bpf_create_dsq(VTIME_DSQ, -1);
 }
 SEC(".struct_ops.link")
 struct sched_ext_ops select_cpu_vtime_ops = {
 	.select_cpu		= select_cpu_vtime_select_cpu,
 	.dispatch		= select_cpu_vtime_dispatch,
 	.running		= select_cpu_vtime_running,
 	.stopping		= select_cpu_vtime_stopping,
 	.enable			= select_cpu_vtime_enable,
 	.init			= select_cpu_vtime_init,
 	.name			= "select_cpu_vtime",
 	.timeout_ms		= 1000U,
 };
--- a/tools/testing/selftests/sched_ext/select_cpu_vtime.c
+++ b/tools/testing/selftests/sched_ext/select_cpu_vtime.c
@@ -0,0 +1,59 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include <sys/wait.h>
 #include <unistd.h>
 #include "select_cpu_vtime.bpf.skel.h"
 #include "scx_test.h"
 static enum scx_test_status setup(void **ctx)
 {
 	struct select_cpu_vtime *skel;
 	skel = select_cpu_vtime__open_and_load();
 	SCX_FAIL_IF(!skel, "Failed to open and load skel");
 	*ctx = skel;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	struct select_cpu_vtime *skel = ctx;
 	struct bpf_link *link;
 	SCX_ASSERT(!skel->bss->consumed);
 	link = bpf_map__attach_struct_ops(skel->maps.select_cpu_vtime_ops);
 	SCX_FAIL_IF(!link, "Failed to attach scheduler");
 	sleep(1);
 	SCX_ASSERT(skel->bss->consumed);
 	bpf_link__destroy(link);
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	struct select_cpu_vtime *skel = ctx;
 	select_cpu_vtime__destroy(skel);
 }
 struct scx_test select_cpu_vtime = {
 	.name = "select_cpu_vtime",
 	.description = "Test doing direct vtime-dispatching from "
 		       "ops.select_cpu(), to a non-built-in DSQ",
 	.setup = setup,
 	.run = run,
 	.cleanup = cleanup,
 };
 REGISTER_SCX_TEST(&select_cpu_vtime)
--- a/tools/testing/selftests/sched_ext/test_example.c
+++ b/tools/testing/selftests/sched_ext/test_example.c
@@ -0,0 +1,49 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 Tejun Heo <tj@kernel.org>
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "scx_test.h"
 static bool setup_called = false;
 static bool run_called = false;
 static bool cleanup_called = false;
 static int context = 10;
 static enum scx_test_status setup(void **ctx)
 {
 	setup_called = true;
 	*ctx = &context;
 	return SCX_TEST_PASS;
 }
 static enum scx_test_status run(void *ctx)
 {
 	int *arg = ctx;
 	SCX_ASSERT(setup_called);
 	SCX_ASSERT(!run_called && !cleanup_called);
 	SCX_EQ(*arg, context);
 	run_called = true;
 	return SCX_TEST_PASS;
 }
 static void cleanup(void *ctx)
 {
 	SCX_BUG_ON(!run_called || cleanup_called, "Wrong callbacks invoked");
 }
 struct scx_test example = {
 	.name		= "example",
 	.description	= "Validate the basic function of the test suite itself",
 	.setup		= setup,
 	.run		= run,
 	.cleanup	= cleanup,
 };
 REGISTER_SCX_TEST(&example)
--- a/tools/testing/selftests/sched_ext/util.c
+++ b/tools/testing/selftests/sched_ext/util.c
@@ -0,0 +1,71 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <dvernet@meta.com>
 */
 #include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 /* Returns read len on success, or -errno on failure. */
 static ssize_t read_text(const char *path, char *buf, size_t max_len)
 {
 	ssize_t len;
 	int fd;
 	fd = open(path, O_RDONLY);
 	if (fd < 0)
 		return -errno;
 	len = read(fd, buf, max_len - 1);
 	if (len < 0)
 		len = -errno;	/* save errno before close() can clobber it */
 	else
 		buf[len] = 0;
 	close(fd);
 	return len;
 }
 /* Returns written len on success, or -errno on failure. */
 static ssize_t write_text(const char *path, char *buf, ssize_t len)
 {
 	int fd;
 	ssize_t written;
 	fd = open(path, O_WRONLY | O_APPEND);
 	if (fd < 0)
 		return -errno;
 	written = write(fd, buf, len);
 	if (written < 0)
 		written = -errno;	/* save errno before close() can clobber it */
 	close(fd);
 	return written;
 }
 long file_read_long(const char *path)
 {
 	char buf[128];
 	if (read_text(path, buf, sizeof(buf)) <= 0)
 		return -1;
 	return atol(buf);
 }
 int file_write_long(const char *path, long val)
 {
 	char buf[64];
 	int ret;
 	ret = sprintf(buf, "%ld", val);
 	if (ret < 0)
 		return ret;
 	/* Write only the formatted digits, not the whole buffer. */
 	if (write_text(path, buf, ret) <= 0)
 		return -1;
 	return 0;
 }
--- a/tools/testing/selftests/sched_ext/util.h
+++ b/tools/testing/selftests/sched_ext/util.h
@@ -0,0 +1,13 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
 * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
 * Copyright (c) 2024 David Vernet <void@manifault.com>
 */
 #ifndef __SCX_TEST_UTIL_H__
 #define __SCX_TEST_UTIL_H__
 long file_read_long(const char *path);
 int file_write_long(const char *path, long val);
 #endif /* __SCX_TEST_UTIL_H__ */