39375 commits

Author SHA1 Message Date
Florian Westphal
81b4325eba netfilter: nf_queue: remove rcu_read_lock calls
All verdict handlers make use of the nfnetlink .call_rcu callback
so rcu readlock is already held.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:22:41 +02:00
Florian Westphal
ed78d09d59 netfilter: make nf_queue_entry_get_refs return void
We don't care if module is being unloaded anymore since hook unregister
handling will destroy queue entries using that hook.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:22:23 +02:00
Florian Westphal
2ffbceb2b0 netfilter: remove hook owner refcounting
Since commit 8405a8fff3 ("netfilter: nf_qeueue: Drop queue entries on
nf_unregister_hook") all pending queued entries are discarded.

So we can simply remove all of the owner handling -- when a module is
removed it also needs to unregister all its hooks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-16 18:21:39 +02:00
Pablo Neira
8cbc870829 netfilter: nfnetlink_log: validate dependencies to avoid breaking atomicity
Check that dependencies are fulfilled before updating the logger
instance, otherwise we can leave things in intermediate state on errors
in nfulnl_recv_config().

[ Ken-ichirou reports that this is also fixing missing instance refcnt drop
  on error introduced in his patch 914eebf2f4 ("netfilter: nfnetlink_log:
  autoload nf_conntrack_netlink module NFQA_CFG_F_CONNTRACK config flag"). ]

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
2015-10-15 06:45:03 +02:00
Pablo Neira Ayuso
336a3b3ee9 netfilter: nfnetlink_log: consolidate check for instance in nfulnl_recv_config()
This patch consolidates the check for valid logger instance once we have
passed the command handling:

The config message that we receive may contain the following info:

1) Command only: We always get a valid instance pointer if we just
   created it. In case that the instance is being destroyed or the
   command is unknown, we jump to exit path of nfulnl_recv_config().
   This patch doesn't modify this handling.

2) Config only: In this case, the instance must always exist since the
   user is asking for configuration updates. If the instance doesn't exist
   this returns -ENODEV.

3) No command and no configs are specified: This case is rare. The
   user is sending us a config message with neither commands nor
   config options. In this case, we have to check if the instance exists
   and bail out otherwise. Before this patch, it was possible to send a
   config message with no command and no config updates for a
   nonexistent instance without triggering an error. So this is the only
   case that changes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
2015-10-15 06:44:31 +02:00
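The three cases listed in the commit above can be modelled in user space. This is a minimal sketch, not the kernel code: `struct log_instance`, `enum log_cmd`, `recv_config()` and the `qthreshold` field are hypothetical names chosen for illustration, and the command handling is heavily simplified. It only shows how a single instance check placed after command handling covers the "config only" and "no command, no config" cases alike.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a logger instance, not the kernel structure. */
struct log_instance {
	unsigned int qthreshold;
};

enum log_cmd { CMD_NONE, CMD_BIND, CMD_UNBIND };

/*
 * Simplified model: handle the command first, then run one consolidated
 * instance check before applying any config update.
 * *slot is the currently registered instance (NULL if none); fresh is
 * the storage that a BIND command would register.
 */
static int recv_config(enum log_cmd cmd, bool has_config,
		       unsigned int qthreshold,
		       struct log_instance **slot,
		       struct log_instance *fresh)
{
	struct log_instance *inst = *slot;

	switch (cmd) {
	case CMD_BIND:
		if (!inst) {
			inst = fresh;	/* case 1: instance just created */
			*slot = inst;
		}
		break;
	case CMD_UNBIND:
		if (!inst)
			return -ENODEV;
		*slot = NULL;		/* case 1: destroyed, leave early */
		return 0;
	case CMD_NONE:
		break;
	}

	/* Consolidated check: covers cases 2 and 3 alike. */
	if (!inst)
		return -ENODEV;

	if (has_config)
		inst->qthreshold = qthreshold;
	return 0;
}
```

In this model, a message with no command and no config for a missing instance now fails with -ENODEV, which is the behaviour change the commit describes.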
Ian Morris
dbb526ebfe netfilter: ipv6: pointer cast layout
Correct whitespace layout of a pointer casting.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-14 12:30:08 +02:00
Ian Morris
4305ae44a9 netfilter: ip6_tables: improve if statements
Correct whitespace layout of if statements.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-14 12:29:51 +02:00
Ian Morris
544d9b17f9 netfilter: ip6_tables: ternary operator layout
Correct whitespace layout of ternary operators in the netfilter-ipv6
code.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13 14:12:38 +02:00
Ian Morris
f9527ea9b6 netfilter: ipv6: whitespace around operators
This patch cleanses whitespace around arithmetical operators.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13 14:12:38 +02:00
Ian Morris
7695495d5a netfilter: ipv6: code indentation
Use tabs instead of spaces to indent code.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13 14:12:38 +02:00
Ian Morris
cda219c6ad netfilter: ip6_tables: function definition layout
Use tabs instead of spaces to indent second line of parameters in
function definitions.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13 14:12:37 +02:00
Ian Morris
6ac94619b6 netfilter: ip6_tables: label placement
Whitespace cleansing: Labels should not be indented.

No changes detected by objdiff.

Signed-off-by: Ian Morris <ipm@chirality.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13 14:12:37 +02:00
Florian Westphal
7ceebfe46e netfilter: nfqueue: don't use prev pointer
Usage of ->prev seems buggy.  While the packet was out, our hook cannot be
removed, but we have no way to know if the previous one is still valid.

So better not use ->prev at all.  Since NF_REPEAT just asks to invoke
the same hook function again, just do so, and continue with nf_iterate
if we get an ACCEPT verdict.

A side effect of this change is that if nf_reinject(NF_REPEAT) causes
another REPEAT we will now drop the skb instead of looping in the kernel.

However, NF_REPEAT loops would be a bug so this should not happen anyway.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13 12:03:24 +02:00
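The reinject behaviour described in the commit above can be sketched as a small user-space model. Everything here is an assumption made for illustration: `enum verdict`, `hook_fn`, `reinject()` and the sample hooks are hypothetical and do not mirror the kernel's data structures. The sketch only shows the control flow: on a REPEAT verdict the same hook is invoked once more, an ACCEPT continues with the remaining hooks, and a second REPEAT ends in a drop rather than a loop.

```c
#include <stddef.h>

enum verdict { V_DROP, V_ACCEPT, V_REPEAT };

typedef enum verdict (*hook_fn)(int pkt);

/* Sample hooks, purely illustrative. */
static enum verdict always_accept(int pkt) { (void)pkt; return V_ACCEPT; }
static enum verdict always_repeat(int pkt) { (void)pkt; return V_REPEAT; }
static enum verdict drop_odd(int pkt) { return (pkt & 1) ? V_DROP : V_ACCEPT; }

/*
 * v:    verdict handed back for the queued packet
 * self: the hook that queued the packet
 * rest: hooks that would run after self
 */
static enum verdict reinject(enum verdict v, hook_fn self,
			     const hook_fn *rest, size_t nrest, int pkt)
{
	size_t i;

	if (v == V_REPEAT)
		v = self(pkt);	/* re-run the same hook, no previous pointer needed */

	if (v != V_ACCEPT)
		return V_DROP;	/* also reached when the hook asks to repeat again */

	for (i = 0; i < nrest; i++) {
		if (rest[i](pkt) != V_ACCEPT)
			return V_DROP;
	}
	return V_ACCEPT;
}
```

Treating every non-ACCEPT result from a later hook as a drop is a simplification of this model.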
Ken-ichirou MATSUZAWA
914eebf2f4 netfilter: nfnetlink_log: autoload nf_conntrack_netlink module NFQA_CFG_F_CONNTRACK config flag
This patch autoloads the nf_conntrack_netlink module if the
NFULNL_CFG_F_CONNTRACK config flag is specified.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 21:44:12 +02:00
Arnd Bergmann
c932245811 netfilter: bridge: avoid unused label warning
With the ARM mini2440_defconfig, the bridge netfilter code gets
built with both CONFIG_NF_DEFRAG_IPV4 and CONFIG_NF_DEFRAG_IPV6
disabled, which leads to a harmless gcc warning:

net/bridge/br_netfilter_hooks.c: In function 'br_nf_dev_queue_xmit':
net/bridge/br_netfilter_hooks.c:792:2: warning: label 'drop' defined but not used [-Wunused-label]

This gets rid of the warning by cleaning up the code to avoid
the respective #ifdefs causing this problem, and replacing them
with if(IS_ENABLED()) checks. I have verified that the resulting
object code is unchanged, and an additional advantage is that
we now get compile coverage of the unused functions in more
configurations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: dd302b59bd ("netfilter: bridge: don't leak skb in error paths")
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 17:48:36 +02:00
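The pattern in the commit above, replacing #ifdef blocks with an ordinary if on a compile-time constant, can be illustrated outside the kernel. This is a sketch under stated assumptions: `FEATURE_DEFRAG` is a plain 0/1 macro standing in for a Kconfig option (the kernel's IS_ENABLED() also copes with undefined symbols, which this does not), and `fragment_and_send()` and `queue_xmit()` are hypothetical names.

```c
/* Assumption: a plain 0/1 macro standing in for a Kconfig option. */
#ifndef FEATURE_DEFRAG
#define FEATURE_DEFRAG 0
#endif

/* Hypothetical helper; compiled and type-checked even when disabled. */
static int fragment_and_send(int len, int mtu)
{
	if (mtu <= 0)
		return -1;
	return (len + mtu - 1) / mtu;	/* number of fragments sent */
}

static int queue_xmit(int len, int mtu)
{
	/*
	 * The branch is dead code when FEATURE_DEFRAG is 0, but the
	 * compiler still sees the goto, so the label counts as used.
	 */
	if (FEATURE_DEFRAG && len > mtu) {
		if (fragment_and_send(len, mtu) < 0)
			goto drop;
		return 0;
	}
	return 0;	/* sent without fragmentation */

drop:
	return -1;
}
```

Had the branch been wrapped in #if FEATURE_DEFRAG instead, the only goto would vanish in the disabled configuration and the compiler would report the label as unused.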
Pablo Neira Ayuso
d53195c259 Merge tag 'ipvs4-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Fourth Round of IPVS Updates for v4.4

Please consider these build warning cleanups from David Ahern and myself.
They resolve some minor side effects of Eric Biederman's heroic work to
clean up IPVS which you recently pulled: it's queued up for v4.4 so no need
to worry about earlier kernel versions.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 17:38:54 +02:00
Pablo Neira Ayuso
4302f5eeb9 nfnetlink_cttimeout: add rcu_barrier() on module removal
Make sure kfree_rcu() released objects before leaving the module removal
exit path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 17:04:41 +02:00
Pablo Neira Ayuso
ae2d708ed8 netfilter: conntrack: fix crash on timeout object removal
The object and module refcounts are updated for each conntrack template.
However, if we delete the iptables rules and we flush the timeout
database, we may end up with invalid references to timeout objects that
are just gone.

Resolve this problem by setting the timeout reference to NULL when the
custom timeout entry is removed from our base. This patch requires some
RCU trickery to ensure safe pointer handling.

This handling is similar to what we already do with conntrack helpers,
the idea is to avoid bumping the timeout object reference counter from
the packet path to avoid the cost of atomic ops.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 17:04:34 +02:00
Pablo Neira Ayuso
403d89ad9c netfilter: xt_CT: don't put back reference to timeout policy object
On success, this shouldn't put back the timeout policy object, otherwise
we may have a module refcount overflow and we allow deletion of timeouts
that are still in use.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12 16:54:45 +02:00
Simon Horman
92240e8dc0 ipvs: Remove possibly unused variables from ip_vs_conn_net_{init,cleanup}
If CONFIG_PROC_FS is undefined then the arguments of proc_create()
and remove_proc_entry() are unused. As a result the net variables of
ip_vs_conn_net_{init,cleanup} are unused.

net/netfilter/ipvs//ip_vs_conn.c: In function ‘ip_vs_conn_net_init’:
net/netfilter/ipvs//ip_vs_conn.c:1350:14: warning: unused variable ‘net’ [-Wunused-variable]
net/netfilter/ipvs//ip_vs_conn.c: In function ‘ip_vs_conn_net_cleanup’:
net/netfilter/ipvs//ip_vs_conn.c:1361:14: warning: unused variable ‘net’ [-Wunused-variable]
...

Resolve this by dereferencing net as needed rather than storing it
in a variable.

Fixes: 3d99376689 ("ipvs: Pass ipvs not net into ip_vs_control_net_(init|cleanup)")
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: Julian Anastasov <ja@ssi.bg>
2015-10-07 10:12:00 +09:00
David Ahern
ed1c9f0e78 ipvs: Remove possibly unused variable from ip_vs_out
Eric's net namespace changes in 1b75097dd7 leave net unreferenced if
CONFIG_IP_VS_IPV6 is not enabled:

../net/netfilter/ipvs/ip_vs_core.c: In function ‘ip_vs_out’:
../net/netfilter/ipvs/ip_vs_core.c:1177:14: warning: unused variable ‘net’ [-Wunused-variable]

After the net refactoring there is only 1 user; push the reference to the
1 user. While the line length slightly exceeds 80 it seems to be the
best change.

Fixes: 1b75097dd7a26 ("ipvs: Pass ipvs into ip_vs_out")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
[horms: updated subject]
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-10-07 10:11:59 +09:00
Ken-ichirou MATSUZAWA
a29a9a585b netfilter: nfnetlink_log: allow to attach conntrack
This patch makes it possible to include the conntrack information together
with the packet that is sent to user-space via NFLOG, so that a
user-space program can acquire NATed information through the NFULA_CT
attribute.

Including the conntrack information is optional; you can enable it
via the NFULNL_CFG_F_CONNTRACK flag with the NFULA_CFG_FLAGS attribute,
as with NFQUEUE.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:32:14 +02:00
Ken-ichirou MATSUZAWA
224a05975e netfilter: ctnetlink: add const qualifier to nfnl_hook.get_ct
get_ct does not update its skb argument, and the only current user of
nfnl_ct_hook is nfqueue, so we can add a const qualifier.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
2015-10-05 17:32:13 +02:00
Ken-ichirou MATSUZAWA
83f3e94d34 netfilter: Kconfig rename QUEUE_CT to GLUE_CT
The conntrack information attaching infrastructure is now generic, and
the previous patch updated its name to use `glue'. This patch updates
the Kconfig symbol name and adds the NF_CT_NETLINK dependency.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:32:12 +02:00
Ken-ichirou MATSUZAWA
a4b4766c3c netfilter: nfnetlink_queue: rename related to nfqueue attaching conntrack info
The idea of this series of patches is to attach conntrack information to
nflog like nfqueue has already done. The nfqueue conntrack info attaching
basis is generic, so rename those names to a generic one, glue.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:32:11 +02:00
Pablo Neira Ayuso
b28b1e826f netfilter: nfnetlink_queue: use y2038 safe timestamp
The __build_packet_message function fills a nfulnl_msg_packet_timestamp
structure that uses 64-bit seconds and is therefore y2038 safe, but
it uses an intermediate 'struct timespec' which is not.

This trivially changes the code to use 'struct timespec64' instead,
to correct the result on 32-bit architectures.

This is a copy and paste of Arnd's original patch for nfnetlink_log.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:27:25 +02:00
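The problem addressed by the commit above can be reproduced in a small model. This is a sketch, assuming a 32-bit seconds field like the one 'struct timespec' has on 32-bit architectures; `struct ts32`, `struct ts64` and both helper functions are hypothetical names, not kernel types. It shows that a 64-bit wire field does not help when the value passes through a 32-bit intermediate first.

```c
#include <stdint.h>

/* Hypothetical model of a timespec whose seconds field is 32 bits wide. */
struct ts32 { int32_t tv_sec; int32_t tv_nsec; };

/* Hypothetical model of timespec64: seconds are always 64 bits wide. */
struct ts64 { int64_t tv_sec; int32_t tv_nsec; };

/* Seconds as they would reach a 64-bit wire field via the 32-bit type. */
static uint64_t wire_sec_via_ts32(int64_t now_sec)
{
	struct ts32 ts = { (int32_t)now_sec, 0 };	/* truncates past 2^31 - 1 */

	return (uint64_t)ts.tv_sec;
}

/* Same path through the 64-bit type: the value survives unchanged. */
static uint64_t wire_sec_via_ts64(int64_t now_sec)
{
	struct ts64 ts = { now_sec, 0 };

	return (uint64_t)ts.tv_sec;
}
```

Both helpers agree for any time up to INT32_MAX seconds after the epoch and diverge one second later, in January 2038.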
Pablo Neira Ayuso
2b5b1a01a7 Merge tag 'ipvs3-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Third Round of IPVS Updates for v4.4

Please consider this build fix from Eric Biederman which resolves
a build problem introduced in his excellent work to clean up IPVS which
you recently pulled: it's queued up for v4.4 so no need to worry
about earlier kernel versions.

I have another minor cleanup, to fix a build warning, pending.
However, I wanted to send this one to you now as it has hit nf-next,
net-next and in turn next, and a slow trickle of bug reports is appearing.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-05 17:27:06 +02:00
Pablo Neira Ayuso
32f40c5fa7 netfilter: rename nfnetlink_queue_core.c to nfnetlink_queue.c
Rename the file now that we have integrated the ct glue code into
nfnetlink_queue without introducing dependencies on the conntrack code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-04 21:45:44 +02:00
Pablo Neira Ayuso
b7bd1809e0 netfilter: nfnetlink_queue: get rid of nfnetlink_queue_ct.c
The original intention was to avoid dependencies between nfnetlink_queue and
conntrack without ifdef pollution. However, we can achieve this by moving the
conntrack dependent code into ctnetlink and keep some glue code to access the
nfq_ct indirection from nfqueue.

After this patch, the nfq_ct indirection is always compiled in the netfilter
core to avoid polluting nfqueue with ifdefs. Thus, if nf_conntrack is not
compiled, this results in only 8 bytes of wasted memory on x86_64.

This patch also adds ctnetlink_nfqueue_seqadj() to avoid exposing the nf_conn
structure layout to nf_queue, which would create another dependency on
nf_conntrack at compilation time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-04 21:45:44 +02:00
Eric Dumazet
e96f78ab27 tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets
Before letting request sockets be put in the regular TCP/DCCP
ehash table, we need to either:

- add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
- add an RCU grace period before freeing them.

Since we carefully respected the SLAB_DESTROY_BY_RCU protocol,
as for ESTABLISHED and TIMEWAIT sockets, use it here.

req_prot_init() being only used by TCP and DCCP, I did not add
a new slab_flags into their rsk_prot, but reuse prot->slab_flags

Since all reqsk_alloc() users are correctly dealing with a failure,
add the __GFP_NOWARN flag to avoid traces under pressure.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 13:25:20 -07:00
Daniel Borkmann
754f1e6a36 sched, bpf: make skb->priority writable
{cls,act}_bpf can now set the skb->priority from an eBPF program based
on various criteria, so that for example classful qdiscs like multiq can
update the skb's priority during enqueue time and further push it down
into subsequent qdiscs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:02:41 -07:00
Daniel Borkmann
c46646d048 sched, bpf: add helper for retrieving routing realms
Using routing realms as part of the classifier is quite useful, it
can be viewed as a tag for one or multiple routing entries (think of
an analogy to net_cls cgroup for processes), set by user space routing
daemons or via iproute2 as an indicator for traffic classifiers and
later on processed in the eBPF program.

Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that possibility,
but in case people know what they are doing, it can be used from there
as well (e.g. via devs that must keep dsts by design anyway).

If a realm is set, the handler returns the non-zero realm. User space
can set the full 32bit realm for the dst.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:02:41 -07:00
Daniel Borkmann
a91263d520 ebpf: migrate bpf_prog's flags to bitfield
As we need to add further flags to the bpf_prog structure, let's migrate
both bools to a bitfield representation. The size of the base structure
(excluding insns) remains unchanged at 40 bytes.

Add also tags for the kmemchecker, so that it doesn't throw false
positives. Even in case gcc would generate suboptimal code, it's not
being accessed in performance critical paths.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 05:02:39 -07:00
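The migration described in the commit above can be pictured with two toy structures. This is a sketch only: `struct prog_bools` and `struct prog_bits` are hypothetical and much smaller than the real bpf_prog, and the field names are borrowed for readability. It shows two bool members being replaced by two one-bit fields, which leaves spare bits in the same storage unit for future flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Before: each flag occupies its own bool member. */
struct prog_bools {
	unsigned short pages;
	bool jited;
	bool gpl_compatible;
	unsigned int len;
};

/* After: both flags share one storage unit as single-bit fields. */
struct prog_bits {
	unsigned short pages;
	unsigned int jited:1,
		     gpl_compatible:1;	/* spare bits remain for new flags */
	unsigned int len;
};

/* Sizes depend on the ABI, so they are exposed rather than hard-coded. */
static size_t size_with_bools(void) { return sizeof(struct prog_bools); }
static size_t size_with_bits(void) { return sizeof(struct prog_bits); }
```

Comparing `size_with_bools()` and `size_with_bits()` on a given compiler shows whether the layout stays the same size there; the result is ABI dependent and is not claimed here.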
Jiri Pirko
9e8f4a548a switchdev: push object ID back to object structure
Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:40 -07:00
Jiri Pirko
648b4a995a switchdev: bring back switchdev_obj and use it as a generic object param
Replace "void *obj" with a generic structure. Introduce a couple of
helpers along with it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:39 -07:00
Jiri Pirko
52ba57cfdc switchdev: rename switchdev_obj_fdb to switchdev_obj_port_fdb
Make the struct name in sync with object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:39 -07:00
Jiri Pirko
8f24f3095d switchdev: rename switchdev_obj_vlan to switchdev_obj_port_vlan
Make the struct name in sync with object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:38 -07:00
Jiri Pirko
1f86839874 switchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*
To be aligned with obj.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:37 -07:00
Jiri Pirko
57d80838da switchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*
Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:49:36 -07:00
Eric Dumazet
e994b2f0fb tcp: do not lock listener to process SYN packets
Everything should now be ready to finally allow SYN
packets processing without holding listener lock.

Tested:

3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.

Next bottleneck is the refcount taken on listener,
that could be avoided if we remove SLAB_DESTROY_BY_RCU
strict semantic for listeners, and use regular RCU.

    13.18%  [kernel]  [k] __inet_lookup_listener
     9.61%  [kernel]  [k] tcp_conn_request
     8.16%  [kernel]  [k] sha_transform
     5.30%  [kernel]  [k] inet_reqsk_alloc
     4.22%  [kernel]  [k] sock_put
     3.74%  [kernel]  [k] tcp_make_synack
     2.88%  [kernel]  [k] ipt_do_table
     2.56%  [kernel]  [k] memcpy_erms
     2.53%  [kernel]  [k] sock_wfree
     2.40%  [kernel]  [k] tcp_v4_rcv
     2.08%  [kernel]  [k] fib_table_lookup
     1.84%  [kernel]  [k] tcp_openreq_init_rwin

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:46 -07:00
Eric Dumazet
92d6f176fd tcp/dccp: add a reschedule point in inet_csk_listen_stop()
If a listener with thousands of children in accept queue
is dismantled, it can take a while to close all of them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:45 -07:00
Eric Dumazet
ef547f2ac1 tcp: remove max_qlen_log
This control variable was set at first listen(fd, backlog)
call, but not updated if application tried to increase or decrease
backlog. It made sense at the time listener had a non resizeable
hash table.

Also rounding to powers of two was not very friendly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:44 -07:00
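The remark in the commit above about rounding to powers of two can be made concrete. This is a sketch under assumptions: `qlen_log()` and `effective_limit()` are hypothetical helpers, and the floor of 8 entries is assumed for illustration rather than taken from the commit. It shows how storing only a base-2 logarithm of the backlog makes the effective limit jump between powers of two.

```c
/*
 * Smallest log such that (1 << log) >= backlog, with an assumed
 * floor of 3 (8 entries) and a cap that keeps the shift well defined.
 */
static unsigned int qlen_log(unsigned int backlog)
{
	unsigned int log = 3;

	while (log < 31 && (1u << log) < backlog)
		log++;
	return log;
}

/* Queue limit implied by the stored logarithm. */
static unsigned int effective_limit(unsigned int backlog)
{
	return 1u << qlen_log(backlog);
}
```

With this model a requested backlog of 128 maps to a limit of 128, while 129 maps to 256, which is the kind of coarse behaviour the commit calls unfriendly.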
Eric Dumazet
10cbc8f179 tcp/dccp: remove struct listen_sock
It is enough to check listener sk_state, no need for an extra
condition.

max_qlen_log can be moved into struct request_sock_queue

We can remove syn_wait_lock and the alignment it enforced.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet
ca6fb06518 tcp: attach SYNACK messages to request sockets instead of listener
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:43 -07:00
Eric Dumazet
81b496b31a tcp/dccp: shrink struct listen_sock
We no longer use hash_rnd, nr_table_entries and syn_table[]

For a listener with a backlog of 10 million sockets, this
saves 80 MBytes of vmalloced memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:42 -07:00
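The 80 MByte figure in the commit above is consistent with simple arithmetic. This is a sketch, assuming that the removed table held one 8-byte pointer per backlog slot on a 64-bit kernel; the commit does not spell this out, and `syn_table_bytes()` is a hypothetical helper.

```c
#include <stdint.h>

/* Bytes needed for a table with one pointer-sized slot per entry. */
static uint64_t syn_table_bytes(uint64_t entries, uint64_t ptr_size)
{
	return entries * ptr_size;
}
```

Under that assumption, `syn_table_bytes(10000000, 8)` gives 80,000,000 bytes, matching the saving quoted in the commit message.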
Eric Dumazet
079096f103 tcp/dccp: install syn_recv requests into ehash table
In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:41 -07:00
Eric Dumazet
2feda34192 tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument
This is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet
aa3a0c8ce6 tcp: get_openreq[46]() changes
When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:40 -07:00
Eric Dumazet
9cfd08601f tcp: remove BUG_ON() in tcp_check_req()
Once listener is lockless, its sk_state can change anytime.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:39 -07:00
Eric Dumazet
ba8e275a45 tcp: cleanup tcp_v[46]_inbound_md5_hash()
We'll soon have to call tcp_v[46]_inbound_md5_hash() twice.
Also add const attribute to the socket, as it might be the
unlocked listener for SYN packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03 04:32:38 -07:00