linux-stable/Documentation/networking/netdevices.rst

300 lines
9.3 KiB
ReStructuredText
Raw Normal View History

.. SPDX-License-Identifier: GPL-2.0
=====================================
Network Devices, the Kernel, and You!
=====================================
Introduction
============
The following is a random collection of documentation regarding
network devices.
struct net_device lifetime rules
================================
Network device structures need to persist even after module is unloaded and
must be allocated with alloc_netdev_mqs() and friends.
If device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)
alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.
There are two groups of APIs for registering struct net_device.
First group can be used in normal contexts where ``rtnl_lock`` is not already
held: register_netdev(), unregister_netdev().
Second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdevice().
Simple drivers
--------------
Most drivers (especially device drivers) handle lifetime of struct net_device
in context where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).
In that case the struct net_device registration is done using
the register_netdev(), and unregister_netdev() functions:
.. code-block:: c
int probe()
{
struct my_device_priv *priv;
int err;
dev = alloc_netdev_mqs(...);
if (!dev)
return -ENOMEM;
priv = netdev_priv(dev);
/* ... do all device setup before calling register_netdev() ...
*/
err = register_netdev(dev);
if (err)
goto err_undo;
/* net_device is visible to the user! */
err_undo:
/* ... undo the device setup ... */
free_netdev(dev);
return err;
}
void remove()
{
unregister_netdev(dev);
free_netdev(dev);
}
Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.
unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.
free_netdev() can be called after unregister_netdev() returns on when
register_netdev() failed.
Device management under RTNL
----------------------------
Registering struct net_device while in context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.
Example flow of netdev handling under ``rtnl_lock``:
.. code-block:: c
static void my_setup(struct net_device *dev)
{
dev->needs_free_netdev = true;
}
static void my_destructor(struct net_device *dev)
{
some_obj_destroy(priv->obj);
some_uninit(priv);
}
int create_link()
{
struct my_device_priv *priv;
int err;
ASSERT_RTNL();
dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
if (!dev)
return -ENOMEM;
priv = netdev_priv(dev);
/* Implicit constructor */
err = some_init(priv);
if (err)
goto err_free_dev;
priv->obj = some_obj_create();
if (!priv->obj) {
err = -ENOMEM;
goto err_some_uninit;
}
/* End of constructor, set the destructor: */
dev->priv_destructor = my_destructor;
err = register_netdevice(dev);
if (err)
/* register_netdevice() calls destructor on failure */
goto err_free_dev;
/* If anything fails now unregister_netdevice() (or unregister_netdev())
* will take care of calling my_destructor and free_netdev().
*/
return 0;
err_some_uninit:
some_uninit(priv);
err_free_dev:
free_netdev(dev);
return err;
}
If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(), it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.
There is no explicit constructor callback, driver "constructs" the private
netdev state after allocating it and before registration.
Setting struct net_device.needs_free_netdev makes core call free_netdevice()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice()
so if register_netdevice() fails driver is responsible for calling
free_netdev().
free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of netdev (de)registration process
happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.
Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.
.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~
``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.
``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit``
runs during de-registering after device is closed but other subsystems
may still have outstanding references to the netdevice.
MTU
===
Each network device has a Maximum Transfer Unit. The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the mtu. The MTU does not include link layer header overhead, so
for example on Ethernet if the standard MTU is 1500 bytes used, the
actual skb will contain up to 1514 bytes because of the Ethernet
header. Devices should allow for the 4 byte VLAN header as well.
Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.
MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as mechanism to size receive
buffers, but the device should allow packets with VLAN header. With
standard Ethernet mtu of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may either:
drop, truncate, or pass up oversize packets, but dropping oversize
packets is preferred.
struct net_device synchronization rules
=======================================
ndo_open:
Synchronization: rtnl_lock() semaphore.
Context: process
ndo_stop:
Synchronization: rtnl_lock() semaphore.
Context: process
Note: netif_running() is guaranteed false
ndo_do_ioctl:
Synchronization: rtnl_lock() semaphore.
Context: process
This is only called by network subsystems internally,
not by user space calling ioctl as it was in before
linux-5.14.
ndo_siocbond:
Synchronization: rtnl_lock() semaphore.
Context: process
Used by the bonding driver for the SIOCBOND family of
ioctl commands.
ndo_siocwandev:
Synchronization: rtnl_lock() semaphore.
Context: process
Used by the drivers/net/wan framework to handle
the SIOCWANDEV ioctl with the if_settings structure.
ndo_siocdevprivate:
Synchronization: rtnl_lock() semaphore.
Context: process
This is used to implement SIOCDEVPRIVATE ioctl helpers.
These should not be added to new drivers, so don't use.
ndo_eth_ioctl:
Synchronization: rtnl_lock() semaphore.
Context: process
ndo_get_stats:
Synchronization: rtnl_lock() semaphore, or RCU.
Context: atomic (can't sleep under RCU)
ndo_start_xmit:
Synchronization: __netif_tx_lock spinlock.
When the driver sets NETIF_F_LLTX in dev->features this will be
[NET]: Add netif_tx_lock Various drivers use xmit_lock internally to synchronise with their transmission routines. They do so without setting xmit_lock_owner. This is fine as long as netpoll is not in use. With netpoll it is possible for deadlocks to occur if xmit_lock_owner isn't set. This is because if a printk occurs while xmit_lock is held and xmit_lock_owner is not set can cause netpoll to attempt to take xmit_lock recursively. While it is possible to resolve this by getting netpoll to use trylock, it is suboptimal because netpoll's sole objective is to maximise the chance of getting the printk out on the wire. So delaying or dropping the message is to be avoided as much as possible. So the only alternative is to always set xmit_lock_owner. The following patch does this by introducing the netif_tx_lock family of functions that take care of setting/unsetting xmit_lock_owner. I renamed xmit_lock to _xmit_lock to indicate that it should not be used directly. I didn't provide irq versions of the netif_tx_lock functions since xmit_lock is meant to be a BH-disabling lock. This is pretty much a straight text substitution except for a small bug fix in winbond. It currently uses netif_stop_queue/spin_unlock_wait to stop transmission. This is unsafe as an IRQ can potentially wake up the queue. So it is safer to use netif_tx_disable. The hamradio bits used spin_lock_irq but it is unnecessary as xmit_lock must never be taken in an IRQ handler. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-09 19:20:56 +00:00
called without holding netif_tx_lock. In this case the driver
has to lock by itself when needed.
The locking there should also properly protect against
set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
Don't use it for new drivers.
Context: Process with BHs disabled or BH (timer),
will be called with interrupts disabled by netconsole.
Return codes:
* NETDEV_TX_OK everything ok.
* NETDEV_TX_BUSY Cannot transmit packet, try later
Usually a bug, means queue start/stop flow control is broken in
the driver. Note: the driver must NOT put the skb in its DMA ring.
ndo_tx_timeout:
Synchronization: netif_tx_lock spinlock; all TX queues frozen.
Context: BHs disabled
Notes: netif_queue_stopped() is guaranteed true
ndo_set_rx_mode:
Synchronization: netif_addr_lock spinlock.
Context: BHs disabled
[NET]: Make NAPI polling independent of struct net_device objects. Several devices have multiple independant RX queues per net device, and some have a single interrupt doorbell for several queues. In either case, it's easier to support layouts like that if the structure representing the poll is independant from the net device itself. The signature of the ->poll() call back goes from: int foo_poll(struct net_device *dev, int *budget) to int foo_poll(struct napi_struct *napi, int budget) The caller is returned the number of RX packets processed (or the number of "NAPI credits" consumed if you want to get abstract). The callee no longer messes around bumping dev->quota, *budget, etc. because that is all handled in the caller upon return. The napi_struct is to be embedded in the device driver private data structures. Furthermore, it is the driver's responsibility to disable all NAPI instances in it's ->stop() device close handler. Since the napi_struct is privatized into the driver's private data structures, only the driver knows how to get at all of the napi_struct instances it may have per-device. With lots of help and suggestions from Rusty Russell, Roland Dreier, Michael Chan, Jeff Garzik, and Jamal Hadi Salim. Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra, Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan. [ Ported to current tree and all drivers converted. Integrated Stephen's follow-on kerneldoc additions, and restored poll_list handling to the old style to fix mutual exclusion issues. -DaveM ] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-03 23:41:36 +00:00
struct napi_struct synchronization rules
========================================
napi->poll:
Synchronization:
NAPI_STATE_SCHED bit in napi->state. Device
driver's ndo_stop method will invoke napi_disable() on
[NET]: Make NAPI polling independent of struct net_device objects. Several devices have multiple independant RX queues per net device, and some have a single interrupt doorbell for several queues. In either case, it's easier to support layouts like that if the structure representing the poll is independant from the net device itself. The signature of the ->poll() call back goes from: int foo_poll(struct net_device *dev, int *budget) to int foo_poll(struct napi_struct *napi, int budget) The caller is returned the number of RX packets processed (or the number of "NAPI credits" consumed if you want to get abstract). The callee no longer messes around bumping dev->quota, *budget, etc. because that is all handled in the caller upon return. The napi_struct is to be embedded in the device driver private data structures. Furthermore, it is the driver's responsibility to disable all NAPI instances in it's ->stop() device close handler. Since the napi_struct is privatized into the driver's private data structures, only the driver knows how to get at all of the napi_struct instances it may have per-device. With lots of help and suggestions from Rusty Russell, Roland Dreier, Michael Chan, Jeff Garzik, and Jamal Hadi Salim. Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra, Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan. [ Ported to current tree and all drivers converted. Integrated Stephen's follow-on kerneldoc additions, and restored poll_list handling to the old style to fix mutual exclusion issues. -DaveM ] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-03 23:41:36 +00:00
all NAPI instances which will do a sleeping poll on the
NAPI_STATE_SCHED napi->state bit, waiting for all pending
NAPI activity to cease.
Context:
softirq
will be called with interrupts disabled by netconsole.