1164 lines
55 KiB
ReStructuredText
1164 lines
55 KiB
ReStructuredText
===================================================
|
||
A Tour Through TREE_RCU's Data Structures [LWN.net]
|
||
===================================================
|
||
|
||
December 18, 2016
|
||
|
||
This article was contributed by Paul E. McKenney
|
||
|
||
Introduction
|
||
============
|
||
|
||
This document describes RCU's major data structures and their relationship
|
||
to each other.
|
||
|
||
Data-Structure Relationships
|
||
============================
|
||
|
||
RCU is for all intents and purposes a large state machine, and its
|
||
data structures maintain the state in such a way as to allow RCU readers
|
||
to execute extremely quickly, while also processing the RCU grace periods
|
||
requested by updaters in an efficient and extremely scalable fashion.
|
||
The efficiency and scalability of RCU updaters is provided primarily
|
||
by a combining tree, as shown below:
|
||
|
||
.. kernel-figure:: BigTreeClassicRCU.svg
|
||
|
||
This diagram shows an enclosing ``rcu_state`` structure containing a tree
|
||
of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up
|
||
to 16 ``rcu_data`` structures associated with it, so that there are
|
||
``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU.
|
||
This structure is adjusted at boot time, if needed, to handle the common
|
||
case where ``nr_cpu_ids`` is much less than ``NR_CPUs``.
|
||
For example, a number of Linux distributions set ``NR_CPUs=4096``,
|
||
which results in a three-level ``rcu_node`` tree.
|
||
If the actual hardware has only 16 CPUs, RCU will adjust itself
|
||
at boot time, resulting in an ``rcu_node`` tree with only a single node.
|
||
|
||
The purpose of this combining tree is to allow per-CPU events
|
||
such as quiescent states, dyntick-idle transitions,
|
||
and CPU hotplug operations to be processed efficiently
|
||
and scalably.
|
||
Quiescent states are recorded by the per-CPU ``rcu_data`` structures,
|
||
and other events are recorded by the leaf-level ``rcu_node``
|
||
structures.
|
||
All of these events are combined at each level of the tree until finally
|
||
grace periods are completed at the tree's root ``rcu_node``
|
||
structure.
|
||
A grace period can be completed at the root once every CPU
|
||
(or, in the case of ``CONFIG_PREEMPT_RCU``, task)
|
||
has passed through a quiescent state.
|
||
Once a grace period has completed, record of that fact is propagated
|
||
back down the tree.
|
||
|
||
As can be seen from the diagram, on a 64-bit system
|
||
a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
|
||
of 64 at the root and a fanout of 16 at the leaves.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Why isn't the fanout at the leaves also 64? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Because there are more types of events that affect the leaf-level |
|
||
| ``rcu_node`` structures than further up the tree. Therefore, if the |
|
||
| leaf ``rcu_node`` structures have fanout of 64, the contention on |
|
||
| these structures' ``->structures`` becomes excessive. Experimentation |
|
||
| on a wide variety of systems has shown that a fanout of 16 works well |
|
||
| for the leaves of the ``rcu_node`` tree. |
|
||
| |
|
||
| Of course, further experience with systems having hundreds or |
|
||
| thousands of CPUs may demonstrate that the fanout for the non-leaf |
|
||
| ``rcu_node`` structures must also be reduced. Such reduction can be |
|
||
| easily carried out when and if it proves necessary. In the meantime, |
|
||
| if you are using such a system and running into contention problems |
|
||
| on the non-leaf ``rcu_node`` structures, you may use the |
|
||
| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the |
|
||
| non-leaf fanout as needed. |
|
||
| |
|
||
| Kernels built for systems with strong NUMA characteristics might |
|
||
| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of |
|
||
| the ``rcu_node`` structures align with hardware boundaries. |
|
||
| However, there has thus far been no need for this. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
If your system has more than 1,024 CPUs (or more than 512 CPUs on a
|
||
32-bit system), then RCU will automatically add more levels to the tree.
|
||
For example, if you are crazy enough to build a 64-bit system with
|
||
65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows:
|
||
|
||
.. kernel-figure:: HugeTreeClassicRCU.svg
|
||
|
||
RCU currently permits up to a four-level tree, which on a 64-bit system
|
||
accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
|
||
32-bit systems. On the other hand, you can set both
|
||
``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as
|
||
2, which would result in a 16-CPU test using a 4-level tree. This can be
|
||
useful for testing large-system capabilities on small test machines.
|
||
|
||
This multi-level combining tree allows us to get most of the performance
|
||
and scalability benefits of partitioning, even though RCU grace-period
|
||
detection is inherently a global operation. The trick here is that only
|
||
the last CPU to report a quiescent state into a given ``rcu_node``
|
||
structure need advance to the ``rcu_node`` structure at the next level
|
||
up the tree. This means that at the leaf-level ``rcu_node`` structure,
|
||
only one access out of sixteen will progress up the tree. For the
|
||
internal ``rcu_node`` structures, the situation is even more extreme:
|
||
Only one access out of sixty-four will progress up the tree. Because the
|
||
vast majority of the CPUs do not progress up the tree, the lock
|
||
contention remains roughly constant up the tree. No matter how many CPUs
|
||
there are in the system, at most 64 quiescent-state reports per grace
|
||
period will progress all the way to the root ``rcu_node`` structure,
|
||
thus ensuring that the lock contention on that root ``rcu_node``
|
||
structure remains acceptably low.
|
||
|
||
In effect, the combining tree acts like a big shock absorber, keeping
|
||
lock contention under control at all tree levels regardless of the level
|
||
of loading on the system.
|
||
|
||
RCU updaters wait for normal grace periods by registering RCU callbacks,
|
||
either directly via ``call_rcu()`` or indirectly via
|
||
``synchronize_rcu()`` and friends. RCU callbacks are represented by
|
||
``rcu_head`` structures, which are queued on ``rcu_data`` structures
|
||
while they are waiting for a grace period to elapse, as shown in the
|
||
following figure:
|
||
|
||
.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg
|
||
|
||
This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data
|
||
structures are related. Lesser data structures will be introduced with
|
||
the algorithms that make use of them.
|
||
|
||
Note that each of the data structures in the above figure has its own
|
||
synchronization:
|
||
|
||
#. Each ``rcu_state`` structures has a lock and a mutex, and some fields
|
||
are protected by the corresponding root ``rcu_node`` structure's lock.
|
||
#. Each ``rcu_node`` structure has a spinlock.
|
||
#. The fields in ``rcu_data`` are private to the corresponding CPU,
|
||
although a few can be read and written by other CPUs.
|
||
|
||
It is important to note that different data structures can have very
|
||
different ideas about the state of RCU at any given time. For but one
|
||
example, awareness of the start or end of a given RCU grace period
|
||
propagates slowly through the data structures. This slow propagation is
|
||
absolutely necessary for RCU to have good read-side performance. If this
|
||
balkanized implementation seems foreign to you, one useful trick is to
|
||
consider each instance of these data structures to be a different
|
||
person, each having the usual slightly different view of reality.
|
||
|
||
The general role of each of these data structures is as follows:
|
||
|
||
#. ``rcu_state``: This structure forms the interconnection between the
|
||
``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
|
||
serves as short-term repository for callbacks orphaned by CPU-hotplug
|
||
events, maintains ``rcu_barrier()`` state, tracks expedited
|
||
grace-period state, and maintains state used to force quiescent
|
||
states when grace periods extend too long,
|
||
#. ``rcu_node``: This structure forms the combining tree that propagates
|
||
quiescent-state information from the leaves to the root, and also
|
||
propagates grace-period information from the root to the leaves. It
|
||
provides local copies of the grace-period state in order to allow
|
||
this information to be accessed in a synchronized manner without
|
||
suffering the scalability limitations that would otherwise be imposed
|
||
by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
|
||
lists of tasks that have blocked while in their current RCU read-side
|
||
critical section. In ``CONFIG_PREEMPT_RCU`` with
|
||
``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
|
||
priority-boosting kernel threads (kthreads) and state. Finally, it
|
||
records CPU-hotplug state in order to determine which CPUs should be
|
||
ignored during a given grace period.
|
||
#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
|
||
detection and RCU callback queuing. It also tracks its relationship
|
||
to the corresponding leaf ``rcu_node`` structure to allow
|
||
more-efficient propagation of quiescent states up the ``rcu_node``
|
||
combining tree. Like the ``rcu_node`` structure, it provides a local
|
||
copy of the grace-period information to allow for-free synchronized
|
||
access to this information from the corresponding CPU. Finally, this
|
||
structure records past dyntick-idle state for the corresponding CPU
|
||
and also tracks statistics.
|
||
#. ``rcu_head``: This structure represents RCU callbacks, and is the
|
||
only structure allocated and managed by RCU users. The ``rcu_head``
|
||
structure is normally embedded within the RCU-protected data
|
||
structure.
|
||
|
||
If all you wanted from this article was a general notion of how RCU's
|
||
data structures are related, you are done. Otherwise, each of the
|
||
following sections give more details on the ``rcu_state``, ``rcu_node``
|
||
and ``rcu_data`` data structures.
|
||
|
||
The ``rcu_state`` Structure
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The ``rcu_state`` structure is the base structure that represents the
|
||
state of RCU in the system. This structure forms the interconnection
|
||
between the ``rcu_node`` and ``rcu_data`` structures, tracks grace
|
||
periods, contains the lock used to synchronize with CPU-hotplug events,
|
||
and maintains state used to force quiescent states when grace periods
|
||
extend too long,
|
||
|
||
A few of the ``rcu_state`` structure's fields are discussed, singly and
|
||
in groups, in the following sections. The more specialized fields are
|
||
covered in the discussion of their use.
|
||
|
||
Relationship to rcu_node and rcu_data Structures
|
||
''''''''''''''''''''''''''''''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_state`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 struct rcu_node node[NUM_RCU_NODES];
|
||
2 struct rcu_node *level[NUM_RCU_LVLS + 1];
|
||
3 struct rcu_data __percpu *rda;
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Wait a minute! You said that the ``rcu_node`` structures formed a |
|
||
| tree, but they are declared as a flat array! What gives? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| The tree is laid out in the array. The first node In the array is the |
|
||
| head, the next set of nodes in the array are children of the head |
|
||
| node, and so on until the last set of nodes in the array are the |
|
||
| leaves. |
|
||
| See the following diagrams to see how this works. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
|
||
in the following figure:
|
||
|
||
.. kernel-figure:: TreeMapping.svg
|
||
|
||
One interesting consequence of this mapping is that a breadth-first
|
||
traversal of the tree is implemented as a simple linear scan of the
|
||
array, which is in fact what the ``rcu_for_each_node_breadth_first()``
|
||
macro does. This macro is used at the beginning and ends of grace
|
||
periods.
|
||
|
||
Each entry of the ``->level`` array references the first ``rcu_node``
|
||
structure on the corresponding level of the tree, for example, as shown
|
||
below:
|
||
|
||
.. kernel-figure:: TreeMappingLevel.svg
|
||
|
||
The zero\ :sup:`th` element of the array references the root
|
||
``rcu_node`` structure, the first element references the first child of
|
||
the root ``rcu_node``, and finally the second element references the
|
||
first leaf ``rcu_node`` structure.
|
||
|
||
For whatever it is worth, if you draw the tree to be tree-shaped rather
|
||
than array-shaped, it is easy to draw a planar representation:
|
||
|
||
.. kernel-figure:: TreeLevel.svg
|
||
|
||
Finally, the ``->rda`` field references a per-CPU pointer to the
|
||
corresponding CPU's ``rcu_data`` structure.
|
||
|
||
All of these fields are constant once initialization is complete, and
|
||
therefore need no protection.
|
||
|
||
Grace-Period Tracking
|
||
'''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_state`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 unsigned long gp_seq;
|
||
|
||
RCU grace periods are numbered, and the ``->gp_seq`` field contains the
|
||
current grace-period sequence number. The bottom two bits are the state
|
||
of the current grace period, which can be zero for not yet started or
|
||
one for in progress. In other words, if the bottom two bits of
|
||
``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom
|
||
two bits indicates that something is broken. This field is protected by
|
||
the root ``rcu_node`` structure's ``->lock`` field.
|
||
|
||
There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data``
|
||
structures as well. The fields in the ``rcu_state`` structure represent
|
||
the most current value, and those of the other structures are compared
|
||
in order to detect the beginnings and ends of grace periods in a
|
||
distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
|
||
(down the tree from the root to the leaves) to ``rcu_data``.
|
||
|
||
Miscellaneous
|
||
'''''''''''''
|
||
|
||
This portion of the ``rcu_state`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 unsigned long gp_max;
|
||
2 char abbr;
|
||
3 char *name;
|
||
|
||
The ``->gp_max`` field tracks the duration of the longest grace period
|
||
in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.
|
||
|
||
The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
|
||
(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
|
||
These fields are used for diagnostic and tracing purposes.
|
||
|
||
The ``rcu_node`` Structure
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The ``rcu_node`` structures form the combining tree that propagates
|
||
quiescent-state information from the leaves to the root and also that
|
||
propagates grace-period information from the root down to the leaves.
|
||
They provides local copies of the grace-period state in order to allow
|
||
this information to be accessed in a synchronized manner without
|
||
suffering the scalability limitations that would otherwise be imposed by
|
||
global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists
|
||
of tasks that have blocked while in their current RCU read-side critical
|
||
section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they
|
||
manage the per-\ ``rcu_node`` priority-boosting kernel threads
|
||
(kthreads) and state. Finally, they record CPU-hotplug state in order to
|
||
determine which CPUs should be ignored during a given grace period.
|
||
|
||
The ``rcu_node`` structure's fields are discussed, singly and in groups,
|
||
in the following sections.
|
||
|
||
Connection to Combining Tree
|
||
''''''''''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_node`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 struct rcu_node *parent;
|
||
2 u8 level;
|
||
3 u8 grpnum;
|
||
4 unsigned long grpmask;
|
||
5 int grplo;
|
||
6 int grphi;
|
||
|
||
The ``->parent`` pointer references the ``rcu_node`` one level up in the
|
||
tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation
|
||
makes heavy use of this field to push quiescent states up the tree. The
|
||
``->level`` field gives the level in the tree, with the root being at
|
||
level zero, its children at level one, and so on. The ``->grpnum`` field
|
||
gives this node's position within the children of its parent, so this
|
||
number can range between 0 and 31 on 32-bit systems and between 0 and 63
|
||
on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only
|
||
during initialization and for tracing. The ``->grpmask`` field is the
|
||
bitmask counterpart of ``->grpnum``, and therefore always has exactly
|
||
one bit set. This mask is used to clear the bit corresponding to this
|
||
``rcu_node`` structure in its parent's bitmasks, which are described
|
||
later. Finally, the ``->grplo`` and ``->grphi`` fields contain the
|
||
lowest and highest numbered CPU served by this ``rcu_node`` structure,
|
||
respectively.
|
||
|
||
All of these fields are constant, and thus do not require any
|
||
synchronization.
|
||
|
||
Synchronization
|
||
'''''''''''''''
|
||
|
||
This field of the ``rcu_node`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 raw_spinlock_t lock;
|
||
|
||
This field is used to protect the remaining fields in this structure,
|
||
unless otherwise stated. That said, all of the fields in this structure
|
||
can be accessed without locking for tracing purposes. Yes, this can
|
||
result in confusing traces, but better some tracing confusion than to be
|
||
heisenbugged out of existence.
|
||
|
||
.. _grace-period-tracking-1:
|
||
|
||
Grace-Period Tracking
|
||
'''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_node`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 unsigned long gp_seq;
|
||
2 unsigned long gp_seq_needed;
|
||
|
||
The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of
|
||
the field of the same name in the ``rcu_state`` structure. They each may
|
||
lag up to one step behind their ``rcu_state`` counterpart. If the bottom
|
||
two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero,
|
||
then this ``rcu_node`` structure believes that RCU is idle.
|
||
|
||
The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the
|
||
beginning and the end of each grace period.
|
||
|
||
The ``->gp_seq_needed`` fields record the furthest-in-the-future grace
|
||
period request seen by the corresponding ``rcu_node`` structure. The
|
||
request is considered fulfilled when the value of the ``->gp_seq`` field
|
||
equals or exceeds that of the ``->gp_seq_needed`` field.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Suppose that this ``rcu_node`` structure doesn't see a request for a |
|
||
| very long time. Won't wrapping of the ``->gp_seq`` field cause |
|
||
| problems? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| No, because if the ``->gp_seq_needed`` field lags behind the |
|
||
| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at |
|
||
| the end of the grace period. Modulo-arithmetic comparisons therefore |
|
||
| will always get the correct answer, even with wrapping. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
Quiescent-State Tracking
|
||
''''''''''''''''''''''''
|
||
|
||
These fields manage the propagation of quiescent states up the combining
|
||
tree.
|
||
|
||
This portion of the ``rcu_node`` structure has fields as follows:
|
||
|
||
::
|
||
|
||
1 unsigned long qsmask;
|
||
2 unsigned long expmask;
|
||
3 unsigned long qsmaskinit;
|
||
4 unsigned long expmaskinit;
|
||
|
||
The ``->qsmask`` field tracks which of this ``rcu_node`` structure's
|
||
children still need to report quiescent states for the current normal
|
||
grace period. Such children will have a value of 1 in their
|
||
corresponding bit. Note that the leaf ``rcu_node`` structures should be
|
||
thought of as having ``rcu_data`` structures as their children.
|
||
Similarly, the ``->expmask`` field tracks which of this ``rcu_node``
|
||
structure's children still need to report quiescent states for the
|
||
current expedited grace period. An expedited grace period has the same
|
||
conceptual properties as a normal grace period, but the expedited
|
||
implementation accepts extreme CPU overhead to obtain much lower
|
||
grace-period latency, for example, consuming a few tens of microseconds
|
||
worth of CPU time to reduce grace-period duration from milliseconds to
|
||
tens of microseconds. The ``->qsmaskinit`` field tracks which of this
|
||
``rcu_node`` structure's children cover for at least one online CPU.
|
||
This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
|
||
used to initialize ``->expmask`` and the beginning of the normal and
|
||
expedited grace periods, respectively.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Why are these bitmasks protected by locking? Come on, haven't you |
|
||
| heard of atomic instructions??? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Lockless grace-period computation! Such a tantalizing possibility! |
|
||
| But consider the following sequence of events: |
|
||
| |
|
||
| #. CPU 0 has been in dyntick-idle mode for quite some time. When it |
|
||
| wakes up, it notices that the current RCU grace period needs it to |
|
||
| report in, so it sets a flag where the scheduling clock interrupt |
|
||
| will find it. |
|
||
| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and |
|
||
| notices that CPU 0 has been in dyntick idle mode, which qualifies |
|
||
| as an extended quiescent state. |
|
||
| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU |
|
||
| read-side critical section, and notices that the RCU core needs |
|
||
| something, so commences RCU softirq processing. |
|
||
| #. CPU 0's softirq handler executes and is just about ready to report |
|
||
| its quiescent state up the ``rcu_node`` tree. |
|
||
| #. But CPU 1 beats it to the punch, completing the current grace |
|
||
| period and starting a new one. |
|
||
| #. CPU 0 now reports its quiescent state for the wrong grace period. |
|
||
| That grace period might now end before the RCU read-side critical |
|
||
| section. If that happens, disaster will ensue. |
|
||
| |
|
||
| So the locking is absolutely required in order to coordinate clearing |
|
||
| of the bits with updating of the grace-period sequence number in |
|
||
| ``->gp_seq``. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
Blocked-Task Management
|
||
'''''''''''''''''''''''
|
||
|
||
``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU
|
||
read-side critical sections, and these tasks must be tracked explicitly.
|
||
The details of exactly why and how they are tracked will be covered in a
|
||
separate article on RCU read-side processing. For now, it is enough to
|
||
know that the ``rcu_node`` structure tracks them.
|
||
|
||
::
|
||
|
||
1 struct list_head blkd_tasks;
|
||
2 struct list_head *gp_tasks;
|
||
3 struct list_head *exp_tasks;
|
||
4 bool wait_blkd_tasks;
|
||
|
||
The ``->blkd_tasks`` field is a list header for the list of blocked and
|
||
preempted tasks. As tasks undergo context switches within RCU read-side
|
||
critical sections, their ``task_struct`` structures are enqueued (via
|
||
the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the
|
||
``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding
|
||
to the CPU on which the outgoing context switch executed. As these tasks
|
||
later exit their RCU read-side critical sections, they remove themselves
|
||
from the list. This list is therefore in reverse time order, so that if
|
||
one of the tasks is blocking the current grace period, all subsequent
|
||
tasks must also be blocking that same grace period. Therefore, a single
|
||
pointer into this list suffices to track all tasks blocking a given
|
||
grace period. That pointer is stored in ``->gp_tasks`` for normal grace
|
||
periods and in ``->exp_tasks`` for expedited grace periods. These last
|
||
two fields are ``NULL`` if either there is no grace period in flight or
|
||
if there are no blocked tasks preventing that grace period from
|
||
completing. If either of these two pointers is referencing a task that
|
||
removes itself from the ``->blkd_tasks`` list, then that task must
|
||
advance the pointer to the next task on the list, or set the pointer to
|
||
``NULL`` if there are no subsequent tasks on the list.
|
||
|
||
For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
|
||
to the largest-numbered CPU in the system. Then if task T1 blocked in an
|
||
RCU read-side critical section, then an expedited grace period started,
|
||
then task T2 blocked in an RCU read-side critical section, then a normal
|
||
grace period started, and finally task 3 blocked in an RCU read-side
|
||
critical section, then the state of the last leaf ``rcu_node``
|
||
structure's blocked-task list would be as shown below:
|
||
|
||
.. kernel-figure:: blkd_task.svg
|
||
|
||
Task T1 is blocking both grace periods, task T2 is blocking only the
|
||
normal grace period, and task T3 is blocking neither grace period. Note
|
||
that these tasks will not remove themselves from this list immediately
|
||
upon resuming execution. They will instead remain on the list until they
|
||
execute the outermost ``rcu_read_unlock()`` that ends their RCU
|
||
read-side critical section.
|
||
|
||
The ``->wait_blkd_tasks`` field indicates whether or not the current
|
||
grace period is waiting on a blocked task.
|
||
|
||
Sizing the ``rcu_node`` Array
|
||
'''''''''''''''''''''''''''''
|
||
|
||
The ``rcu_node`` array is sized via a series of C-preprocessor
|
||
expressions as follows:
|
||
|
||
::
|
||
|
||
1 #ifdef CONFIG_RCU_FANOUT
|
||
2 #define RCU_FANOUT CONFIG_RCU_FANOUT
|
||
3 #else
|
||
4 # ifdef CONFIG_64BIT
|
||
5 # define RCU_FANOUT 64
|
||
6 # else
|
||
7 # define RCU_FANOUT 32
|
||
8 # endif
|
||
9 #endif
|
||
10
|
||
11 #ifdef CONFIG_RCU_FANOUT_LEAF
|
||
12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
|
||
13 #else
|
||
14 # ifdef CONFIG_64BIT
|
||
15 # define RCU_FANOUT_LEAF 64
|
||
16 # else
|
||
17 # define RCU_FANOUT_LEAF 32
|
||
18 # endif
|
||
19 #endif
|
||
20
|
||
21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
|
||
22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
|
||
23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
|
||
24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
|
||
25
|
||
26 #if NR_CPUS <= RCU_FANOUT_1
|
||
27 # define RCU_NUM_LVLS 1
|
||
28 # define NUM_RCU_LVL_0 1
|
||
29 # define NUM_RCU_NODES NUM_RCU_LVL_0
|
||
30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
|
||
31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
|
||
32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
|
||
33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
|
||
34 #elif NR_CPUS <= RCU_FANOUT_2
|
||
35 # define RCU_NUM_LVLS 2
|
||
36 # define NUM_RCU_LVL_0 1
|
||
37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
|
||
38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
|
||
39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
|
||
40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
|
||
41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
|
||
42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
|
||
43 #elif NR_CPUS <= RCU_FANOUT_3
|
||
44 # define RCU_NUM_LVLS 3
|
||
45 # define NUM_RCU_LVL_0 1
|
||
46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
|
||
47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
|
||
48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
|
||
49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
|
||
50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
|
||
51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
|
||
52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
|
||
53 #elif NR_CPUS <= RCU_FANOUT_4
|
||
54 # define RCU_NUM_LVLS 4
|
||
55 # define NUM_RCU_LVL_0 1
|
||
56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
|
||
57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
|
||
58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
|
||
59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
|
||
60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
|
||
61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
|
||
62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
|
||
63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
|
||
64 #else
|
||
65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
|
||
66 #endif
|
||
|
||
The maximum number of levels in the ``rcu_node`` structure is currently
|
||
limited to four, as specified by lines 21-24 and the structure of the
|
||
subsequent “if” statement. For 32-bit systems, this allows
|
||
16*32*32*32=524,288 CPUs, which should be sufficient for the next few
|
||
years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
|
||
allowed, which should see us through the next decade or so. This
|
||
four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8``
|
||
to support up to 4096 CPUs, which might be useful in very large systems
|
||
having eight CPUs per socket (but please note that no one has yet shown
|
||
any measurable performance degradation due to misaligned socket and
|
||
``rcu_node`` boundaries). In addition, building kernels with a full four
|
||
levels of ``rcu_node`` tree permits better testing of RCU's
|
||
combining-tree code.
|
||
|
||
The ``RCU_FANOUT`` symbol controls how many children are permitted at
|
||
each non-leaf level of the ``rcu_node`` tree. If the
|
||
``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based
|
||
on the word size of the system, which is also the Kconfig default.
|
||
|
||
The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by
|
||
each leaf ``rcu_node`` structure. Experience has shown that allowing a
|
||
given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the
|
||
number of bits in the ``->qsmask`` field on a 64-bit system, results in
|
||
excessive contention for the leaf ``rcu_node`` structures' ``->lock``
|
||
fields. The number of CPUs per leaf ``rcu_node`` structure is therefore
|
||
limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If
|
||
``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based
|
||
on the word size of the system, just as for ``CONFIG_RCU_FANOUT``.
|
||
Lines 11-19 perform this computation.
|
||
|
||
Lines 21-24 compute the maximum number of CPUs supported by a
|
||
single-level (which contains a single ``rcu_node`` structure),
|
||
two-level, three-level, and four-level ``rcu_node`` tree, respectively,
|
||
given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
|
||
These numbers of CPUs are retained in the ``RCU_FANOUT_1``,
|
||
``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor
|
||
variables, respectively.
|
||
|
||
These variables are used to control the C-preprocessor ``#if`` statement
|
||
spanning lines 26-66 that computes the number of ``rcu_node`` structures
|
||
required for each level of the tree, as well as the number of levels
|
||
required. The number of levels is placed in the ``NUM_RCU_LVLS``
|
||
C-preprocessor variable by lines 27, 35, 44, and 54. The number of
|
||
``rcu_node`` structures for the topmost level of the tree is always
|
||
exactly one, and this value is unconditionally placed into
|
||
``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
|
||
(if any) of the ``rcu_node`` tree are computed by dividing the maximum
|
||
number of CPUs by the fanout supported by the number of levels from the
|
||
current level down, rounding up. This computation is performed by
|
||
lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
|
||
initializers for lockdep lock-class names. Finally, lines 64-66 produce
|
||
an error if the maximum number of CPUs is too large for the specified
|
||
fanout.
|
||
|
||
The ``rcu_segcblist`` Structure
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The ``rcu_segcblist`` structure maintains a segmented list of callbacks
|
||
as follows:
|
||
|
||
::
|
||
|
||
1 #define RCU_DONE_TAIL 0
|
||
2 #define RCU_WAIT_TAIL 1
|
||
3 #define RCU_NEXT_READY_TAIL 2
|
||
4 #define RCU_NEXT_TAIL 3
|
||
5 #define RCU_CBLIST_NSEGS 4
|
||
6
|
||
7 struct rcu_segcblist {
|
||
8 struct rcu_head *head;
|
||
9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
|
||
10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
|
||
11 long len;
|
||
12 long len_lazy;
|
||
13 };
|
||
|
||
The segments are as follows:
|
||
|
||
#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These
|
||
callbacks are ready to be invoked.
|
||
#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace
|
||
period. Note that different CPUs can have different ideas about which
|
||
grace period is current, hence the ``->gp_seq`` field.
|
||
#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period
|
||
to start.
|
||
#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a
|
||
grace period.
|
||
|
||
The ``->head`` pointer references the first callback or is ``NULL`` if
|
||
the list contains no callbacks (which is *not* the same as being empty).
|
||
Each element of the ``->tails[]`` array references the ``->next``
|
||
pointer of the last callback in the corresponding segment of the list,
|
||
or the list's ``->head`` pointer if that segment and all previous
|
||
segments are empty. If the corresponding segment is empty but some
|
||
previous segment is not empty, then the array element is identical to
|
||
its predecessor. Older callbacks are closer to the head of the list, and
|
||
new callbacks are added at the tail. This relationship between the
|
||
``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown
|
||
in this diagram:
|
||
|
||
.. kernel-figure:: nxtlist.svg
|
||
|
||
In this figure, the ``->head`` pointer references the first RCU callback
|
||
in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
|
||
``->head`` pointer itself, indicating that none of the callbacks is
|
||
ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
|
||
callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
|
||
are both waiting on the current grace period, give or take possible
|
||
disagreements about exactly which grace period is the current one. The
|
||
``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
|
||
callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
|
||
there are no callbacks waiting on the next RCU grace period. The
|
||
``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
|
||
pointer, indicating that all the remaining RCU callbacks have not yet
|
||
been assigned to an RCU grace period. Note that the
|
||
``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
|
||
callback's ``->next`` pointer unless the callback list is empty, in
|
||
which case it references the ``->head`` pointer.
|
||
|
||
There is one additional important special case for the
|
||
``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this
|
||
list is *disabled*. Lists are disabled when the corresponding CPU is
|
||
offline or when the corresponding CPU's callbacks are offloaded to a
|
||
kthread, both of which are described elsewhere.
|
||
|
||
CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the
|
||
``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the
|
||
``RCU_DONE_TAIL`` list segments as grace periods advance.
|
||
|
||
The ``->gp_seq[]`` array records grace-period numbers corresponding to
|
||
the list segments. This is what allows different CPUs to have different
|
||
ideas as to which is the current grace period while still avoiding
|
||
premature invocation of their callbacks. In particular, this allows CPUs
|
||
that go idle for extended periods to determine which of their callbacks
|
||
are ready to be invoked after reawakening.
|
||
|
||
The ``->len`` counter contains the number of callbacks in ``->head``,
|
||
and the ``->len_lazy`` contains the number of those callbacks that are
|
||
known to only free memory, and whose invocation can therefore be safely
|
||
deferred.
|
||
|
||
.. important::
|
||
|
||
It is the ``->len`` field that determines whether or
|
||
not there are callbacks associated with this ``rcu_segcblist``
|
||
structure, *not* the ``->head`` pointer. The reason for this is that all
|
||
the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL``
|
||
segment) are extracted all at once at callback-invocation time
|
||
(``rcu_do_batch``), due to which ``->head`` may be set to NULL if there
|
||
are no not-done callbacks remaining in the ``rcu_segcblist``. If
|
||
callback invocation must be postponed, for example, because a
|
||
high-priority process just woke up on this CPU, then the remaining
|
||
callbacks are placed back on the ``RCU_DONE_TAIL`` segment and
|
||
``->head`` once again points to the start of the segment. In short, the
|
||
head field can briefly be ``NULL`` even though the CPU has callbacks
|
||
present the entire time. Therefore, it is not appropriate to test the
|
||
``->head`` pointer for ``NULL``.
|
||
|
||
In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only
|
||
after the corresponding callbacks have been invoked. This means that the
|
||
``->len`` count is zero only if the ``rcu_segcblist`` structure really
|
||
is devoid of callbacks. Of course, off-CPU sampling of the ``->len``
|
||
count requires careful use of appropriate synchronization, for example,
|
||
memory barriers. This synchronization can be a bit subtle, particularly
|
||
in the case of ``rcu_barrier()``.
|
||
|
||
The ``rcu_data`` Structure
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The
|
||
fields in this structure may be accessed only from the corresponding CPU
|
||
(and from tracing) unless otherwise stated. This structure is the focus
|
||
of quiescent-state detection and RCU callback queuing. It also tracks
|
||
its relationship to the corresponding leaf ``rcu_node`` structure to
|
||
allow more-efficient propagation of quiescent states up the ``rcu_node``
|
||
combining tree. Like the ``rcu_node`` structure, it provides a local
|
||
copy of the grace-period information to allow for-free synchronized
|
||
access to this information from the corresponding CPU. Finally, this
|
||
structure records past dyntick-idle state for the corresponding CPU and
|
||
also tracks statistics.
|
||
|
||
The ``rcu_data`` structure's fields are discussed, singly and in groups,
|
||
in the following sections.
|
||
|
||
Connection to Other Data Structures
|
||
'''''''''''''''''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_data`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 int cpu;
|
||
2 struct rcu_node *mynode;
|
||
3 unsigned long grpmask;
|
||
4 bool beenonline;
|
||
|
||
The ``->cpu`` field contains the number of the corresponding CPU and the
|
||
``->mynode`` field references the corresponding ``rcu_node`` structure.
|
||
The ``->mynode`` is used to propagate quiescent states up the combining
|
||
tree. These two fields are constant and therefore do not require
|
||
synchronization.
|
||
|
||
The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
|
||
corresponding to this ``rcu_data`` structure, and is also used when
|
||
propagating quiescent states. The ``->beenonline`` flag is set whenever
|
||
the corresponding CPU comes online, which means that the debugfs tracing
|
||
need not dump out any ``rcu_data`` structure for which this flag is not
|
||
set.
|
||
|
||
Quiescent-State and Grace-Period Tracking
|
||
'''''''''''''''''''''''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_data`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 unsigned long gp_seq;
|
||
2 unsigned long gp_seq_needed;
|
||
3 bool cpu_no_qs;
|
||
4 bool core_needs_qs;
|
||
5 bool gpwrap;
|
||
|
||
The ``->gp_seq`` field is the counterpart of the field of the same name
|
||
in the ``rcu_state`` and ``rcu_node`` structures. The
|
||
``->gp_seq_needed`` field is the counterpart of the field of the same
|
||
name in the rcu_node structure. They may each lag up to one behind their
|
||
``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
|
||
``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
|
||
dyntick-idle mode (but these counters will catch up upon exit from
|
||
dyntick-idle mode). If the lower two bits of a given ``rcu_data``
|
||
structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
|
||
believes that RCU is idle.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| All this replication of the grace period numbers can only cause |
|
||
| massive confusion. Why not just keep a global sequence number and be |
|
||
| done with it??? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Because if there was only a single global sequence numbers, there |
|
||
| would need to be a single global lock to allow safely accessing and |
|
||
| updating it. And if we are not going to have a single global lock, we |
|
||
| need to carefully manage the numbers on a per-node basis. Recall from |
|
||
| the answer to a previous Quick Quiz that the consequences of applying |
|
||
| a previously sampled quiescent state to the wrong grace period are |
|
||
| quite severe. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
|
||
through a quiescent state, while the ``->core_needs_qs`` flag indicates
|
||
that the RCU core needs a quiescent state from the corresponding CPU.
|
||
The ``->gpwrap`` field indicates that the corresponding CPU has remained
|
||
idle for so long that the ``gp_seq`` counter is in danger of overflow,
|
||
which will cause the CPU to disregard the values of its counters on its
|
||
next exit from idle.
|
||
|
||
RCU Callback Handling
|
||
'''''''''''''''''''''
|
||
|
||
In the absence of CPU-hotplug events, RCU callbacks are invoked by the
|
||
same CPU that registered them. This is strictly a cache-locality
|
||
optimization: callbacks can and do get invoked on CPUs other than the
|
||
one that registered them. After all, if the CPU that registered a given
|
||
callback has gone offline before the callback can be invoked, there
|
||
really is no other choice.
|
||
|
||
This portion of the ``rcu_data`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 struct rcu_segcblist cblist;
|
||
2 long qlen_last_fqs_check;
|
||
3 unsigned long n_cbs_invoked;
|
||
4 unsigned long n_nocbs_invoked;
|
||
5 unsigned long n_cbs_orphaned;
|
||
6 unsigned long n_cbs_adopted;
|
||
7 unsigned long n_force_qs_snap;
|
||
8 long blimit;
|
||
|
||
The ``->cblist`` structure is the segmented callback list described
|
||
earlier. The CPU advances the callbacks in its ``rcu_data`` structure
|
||
whenever it notices that another RCU grace period has completed. The CPU
|
||
detects the completion of an RCU grace period by noticing that the value
|
||
of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
|
||
its leaf ``rcu_node`` structure. Recall that each ``rcu_node``
|
||
structure's ``->gp_seq`` field is updated at the beginnings and ends of
|
||
each grace period.
|
||
|
||
The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the
|
||
forcing of quiescent states from ``call_rcu()`` and friends when
|
||
callback lists grow excessively long.
|
||
|
||
The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
|
||
fields count the number of callbacks invoked, sent to other CPUs when
|
||
this CPU goes offline, and received from other CPUs when those other
|
||
CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's
|
||
callbacks are offloaded to a kthread.
|
||
|
||
Finally, the ``->blimit`` counter is the maximum number of RCU callbacks
|
||
that may be invoked at a given time.
|
||
|
||
Dyntick-Idle Handling
|
||
'''''''''''''''''''''
|
||
|
||
This portion of the ``rcu_data`` structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 int dynticks_snap;
|
||
2 unsigned long dynticks_fqs;
|
||
|
||
The ``->dynticks_snap`` field is used to take a snapshot of the
|
||
corresponding CPU's dyntick-idle state when forcing quiescent states,
|
||
and is therefore accessed from other CPUs. Finally, the
|
||
``->dynticks_fqs`` field is used to count the number of times this CPU
|
||
is determined to be in dyntick-idle state, and is used for tracing and
|
||
debugging purposes.
|
||
|
||
This portion of the rcu_data structure is declared as follows:
|
||
|
||
::
|
||
|
||
1 long dynticks_nesting;
|
||
2 long dynticks_nmi_nesting;
|
||
3 atomic_t dynticks;
|
||
4 bool rcu_need_heavy_qs;
|
||
5 bool rcu_urgent_qs;
|
||
|
||
These fields in the rcu_data structure maintain the per-CPU dyntick-idle
|
||
state for the corresponding CPU. The fields may be accessed only from
|
||
the corresponding CPU (and from tracing) unless otherwise stated.
|
||
|
||
The ``->dynticks_nesting`` field counts the nesting depth of process
|
||
execution, so that in normal circumstances this counter has value zero
|
||
or one. NMIs, irqs, and tracers are counted by the
|
||
``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
|
||
to this variable have to be undertaken carefully using an algorithm
|
||
provided by Andy Lutomirski. The initial transition from idle adds one,
|
||
and nested transitions add two, so that a nesting level of five is
|
||
represented by a ``->dynticks_nmi_nesting`` value of nine. This counter
|
||
can therefore be thought of as counting the number of reasons why this
|
||
CPU cannot be permitted to enter dyntick-idle mode, aside from
|
||
process-level transitions.
|
||
|
||
However, it turns out that when running in non-idle kernel context, the
|
||
Linux kernel is fully capable of entering interrupt handlers that never
|
||
exit and perhaps also vice versa. Therefore, whenever the
|
||
``->dynticks_nesting`` field is incremented up from zero, the
|
||
``->dynticks_nmi_nesting`` field is set to a large positive number, and
|
||
whenever the ``->dynticks_nesting`` field is decremented down to zero,
|
||
the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
|
||
the number of misnested interrupts is not sufficient to overflow the
|
||
counter, this approach corrects the ``->dynticks_nmi_nesting`` field
|
||
every time the corresponding CPU enters the idle loop from process
|
||
context.
|
||
|
||
The ``->dynticks`` field counts the corresponding CPU's transitions to
|
||
and from either dyntick-idle or user mode, so that this counter has an
|
||
even value when the CPU is in dyntick-idle mode or user mode and an odd
|
||
value otherwise. The transitions to/from user mode need to be counted
|
||
for user mode adaptive-ticks support (see timers/NO_HZ.txt).
|
||
|
||
The ``->rcu_need_heavy_qs`` field is used to record the fact that the
|
||
RCU core code would really like to see a quiescent state from the
|
||
corresponding CPU, so much so that it is willing to call for
|
||
heavy-weight dyntick-counter operations. This flag is checked by RCU's
|
||
context-switch and ``cond_resched()`` code, which provide a momentary
|
||
idle sojourn in response.
|
||
|
||
Finally, the ``->rcu_urgent_qs`` field is used to record the fact that
|
||
the RCU core code would really like to see a quiescent state from the
|
||
corresponding CPU, with the various other fields indicating just how
|
||
badly RCU wants this quiescent state. This flag is checked by RCU's
|
||
context-switch path (``rcu_note_context_switch``) and the cond_resched
|
||
code.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Why not simply combine the ``->dynticks_nesting`` and |
|
||
| ``->dynticks_nmi_nesting`` counters into a single counter that just |
|
||
| counts the number of reasons that the corresponding CPU is non-idle? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Because this would fail in the presence of interrupts whose handlers |
|
||
| never return and of handlers that manage to return from a made-up |
|
||
| interrupt. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
Additional fields are present for some special-purpose builds, and are
|
||
discussed separately.
|
||
|
||
The ``rcu_head`` Structure
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
Each ``rcu_head`` structure represents an RCU callback. These structures
|
||
are normally embedded within RCU-protected data structures whose
|
||
algorithms use asynchronous grace periods. In contrast, when using
|
||
algorithms that block waiting for RCU grace periods, RCU users need not
|
||
provide ``rcu_head`` structures.
|
||
|
||
The ``rcu_head`` structure has fields as follows:
|
||
|
||
::
|
||
|
||
1 struct rcu_head *next;
|
||
2 void (*func)(struct rcu_head *head);
|
||
|
||
The ``->next`` field is used to link the ``rcu_head`` structures
|
||
together in the lists within the ``rcu_data`` structures. The ``->func``
|
||
field is a pointer to the function to be called when the callback is
|
||
ready to be invoked, and this function is passed a pointer to the
|
||
``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func``
|
||
field to record the offset of the ``rcu_head`` structure within the
|
||
enclosing RCU-protected data structure.
|
||
|
||
Both of these fields are used internally by RCU. From the viewpoint of
|
||
RCU users, this structure is an opaque “cookie”.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| Given that the callback function ``->func`` is passed a pointer to |
|
||
| the ``rcu_head`` structure, how is that function supposed to find the |
|
||
| beginning of the enclosing RCU-protected data structure? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| In actual practice, there is a separate callback function per type of |
|
||
| RCU-protected data structure. The callback function can therefore use |
|
||
| the ``container_of()`` macro in the Linux kernel (or other |
|
||
| pointer-manipulation facilities in other software environments) to |
|
||
| find the beginning of the enclosing structure. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
RCU-Specific Fields in the ``task_struct`` Structure
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in
|
||
the ``task_struct`` structure:
|
||
|
||
::
|
||
|
||
1 #ifdef CONFIG_PREEMPT_RCU
|
||
2 int rcu_read_lock_nesting;
|
||
3 union rcu_special rcu_read_unlock_special;
|
||
4 struct list_head rcu_node_entry;
|
||
5 struct rcu_node *rcu_blocked_node;
|
||
6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
|
||
7 #ifdef CONFIG_TASKS_RCU
|
||
8 unsigned long rcu_tasks_nvcsw;
|
||
9 bool rcu_tasks_holdout;
|
||
10 struct list_head rcu_tasks_holdout_list;
|
||
11 int rcu_tasks_idle_cpu;
|
||
12 #endif /* #ifdef CONFIG_TASKS_RCU */
|
||
|
||
The ``->rcu_read_lock_nesting`` field records the nesting level for RCU
|
||
read-side critical sections, and the ``->rcu_read_unlock_special`` field
|
||
is a bitmask that records special conditions that require
|
||
``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
|
||
field is used to form lists of tasks that have blocked within
|
||
preemptible-RCU read-side critical sections and the
|
||
``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
|
||
list this task is a member of, or ``NULL`` if it is not blocked within a
|
||
preemptible-RCU read-side critical section.
|
||
|
||
The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
|
||
switches that this task had undergone at the beginning of the current
|
||
tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
|
||
tasks-RCU grace period is waiting on this task,
|
||
``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
|
||
the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
|
||
idle task is running, but only if the task is currently running, that
|
||
is, if the CPU is currently idle.
|
||
|
||
Accessor Functions
|
||
~~~~~~~~~~~~~~~~~~
|
||
|
||
The following listing shows the ``rcu_get_root()``,
|
||
``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()``
|
||
function and macros:
|
||
|
||
::
|
||
|
||
1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
|
||
2 {
|
||
3 return &rsp->node[0];
|
||
4 }
|
||
5
|
||
6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
|
||
7 for ((rnp) = &(rsp)->node[0]; \
|
||
8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
|
||
9
|
||
10 #define rcu_for_each_leaf_node(rsp, rnp) \
|
||
11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
|
||
12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
|
||
|
||
The ``rcu_get_root()`` simply returns a pointer to the first element of
|
||
the specified ``rcu_state`` structure's ``->node[]`` array, which is the
|
||
root ``rcu_node`` structure.
|
||
|
||
As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes
|
||
advantage of the layout of the ``rcu_node`` structures in the
|
||
``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
|
||
traversal by simply traversing the array in order. Similarly, the
|
||
``rcu_for_each_leaf_node()`` macro traverses only the last part of the
|
||
array, thus traversing only the leaf ``rcu_node`` structures.
|
||
|
||
+-----------------------------------------------------------------------+
|
||
| **Quick Quiz**: |
|
||
+-----------------------------------------------------------------------+
|
||
| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree |
|
||
| contains only a single node? |
|
||
+-----------------------------------------------------------------------+
|
||
| **Answer**: |
|
||
+-----------------------------------------------------------------------+
|
||
| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the |
|
||
| single node. |
|
||
+-----------------------------------------------------------------------+
|
||
|
||
Summary
|
||
~~~~~~~
|
||
|
||
So the state of RCU is represented by an ``rcu_state`` structure, which
|
||
contains a combining tree of ``rcu_node`` and ``rcu_data`` structures.
|
||
Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
|
||
is tracked by dynticks-related fields in the ``rcu_data`` structure. If
|
||
you made it this far, you are well prepared to read the code
|
||
walkthroughs in the other articles in this series.
|
||
|
||
Acknowledgments
|
||
~~~~~~~~~~~~~~~
|
||
|
||
I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
|
||
Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for
|
||
helping me get this document into a more human-readable state.
|
||
|
||
Legal Statement
|
||
~~~~~~~~~~~~~~~
|
||
|
||
This work represents the view of the author and does not necessarily
|
||
represent the view of IBM.
|
||
|
||
Linux is a registered trademark of Linus Torvalds.
|
||
|
||
Other company, product, and service names may be trademarks or service
|
||
marks of others.
|