linux-stable

Commit Graph

Author	SHA1	Message	Date
Florian Westphal	b60f2f3d65	net: ipv4: un-inline ip_finish_output2 text data bss dec hex filename old: 16527 44 0 16571 40bb net/ipv4/ip_output.o new: 14935 44 0 14979 3a83 net/ipv4/ip_output.o Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-12 14:19:17 -07:00
Florian Westphal	33b1f31392	net: ip_fragment: remove BRIDGE_NETFILTER mtu special handling since commit `d6b915e29f` ("ip_fragment: don't forward defragmented DF packet") the largest fragment size is available in the IPCB. Therefore we no longer need to care about 'encapsulation' overhead of stripped PPPOE/VLAN headers since ip_do_fragment doesn't use device mtu in such cases. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-06-12 14:16:46 +02:00
Florian Westphal	d6b915e29f	ip_fragment: don't forward defragmented DF packet We currently always send fragments without DF bit set. Thus, given following setup: mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280 A R1 R2 B Where R1 and R2 run linux with netfilter defragmentation/conntrack enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1 will respond with icmp too big error if one of these fragments exceeded 1400 bytes. However, if R1 receives fragment sizes 1200 and 100, it would forward the reassembled packet without refragmenting, i.e. R2 will send an icmp error in response to a packet that was never sent, citing mtu that the original sender never exceeded. The other minor issue is that a refragmentation on R1 will conceal the MTU of R2-B since refragmentation does not set DF bit on the fragments. This modifies ip_fragment so that we track largest fragment size seen both for DF and non-DF packets, and set frag_max_size to the largest value. If the DF fragment size is larger or equal to the non-df one, we will consider the packet a path mtu probe: We set DF bit on the reassembled skb and also tag it with a new IPCB flag to force refragmentation even if skb fits outdev mtu. We will also set DF bit on each fragment in this case. Joint work with Hannes Frederic Sowa. Reported-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:03:31 -04:00
Florian Westphal	c5501eb340	net: ipv4: avoid repeated calls to ip_skb_dst_mtu helper ip_skb_dst_mtu is small inline helper, but its called in several places. before: 17061 44 0 17105 42d1 net/ipv4/ip_output.o after: 16805 44 0 16849 41d1 net/ipv4/ip_output.o Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 13:03:30 -04:00
Hannes Frederic Sowa	be12a1fe29	net: skbuff: add skb_append_pagefrags and use it Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-25 00:06:58 -04:00
Andy Zhou	49d16b23cd	bridge_netfilter: No ICMP packet on IPv4 fragmentation error When bridge netfilter re-fragments an IP packet for output, all packets that can not be re-fragmented to their original input size should be silently discarded. However, current bridge netfilter output path generates an ICMP packet with 'size exceeded MTU' message for such packets, this is a bug. This patch refactors the ip_fragment() API to allow two separate use cases. The bridge netfilter user case will not send ICMP, the routing output will, as before. Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-19 00:15:39 -04:00
Eric Dumazet	7d771aaac7	ipv4: __ip_local_out_sk() is static __ip_local_out_sk() is only used from net/ipv4/ip_output.c net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not declared. Should it be static? Fixes: `7026b1ddb6` ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:21:33 -04:00
Sheng Yong	8bc0034cf6	net: remove extra newlines Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-07 22:24:37 -04:00
David Miller	7026b1ddb6	netfilter: Pass socket pointer down through okfn(). On the output paths in particular, we have to sometimes deal with two socket contexts. First, and usually skb->sk, is the local socket that generated the frame. And second, is potentially the socket used to control a tunneling socket, such as one the encapsulates using UDP. We do not want to disassociate skb->sk when encapsulating in order to fix this, because that would break socket memory accounting. The most extreme case where this can cause huge problems is an AF_PACKET socket transmitting over a vxlan device. We hit code paths doing checks that assume they are dealing with an ipv4 socket, but are actually operating upon the AF_PACKET one. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-07 15:25:55 -04:00
Ian Morris	00db41243e	ipv4: coding style: comparison for inequality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-03 12:11:15 -04:00
Ian Morris	51456b2914	ipv4: coding style: comparison for equality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-03 12:11:15 -04:00
Hannes Frederic Sowa	b6a7719aed	ipv4: hash net ptr into fragmentation bucket selection As namespaces are sometimes used with overlapping ip address ranges, we should also use the namespace as input to the hash to select the ip fragmentation counter bucket. Cc: Eric Dumazet <edumazet@google.com> Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-25 14:07:04 -04:00
Florian Westphal	8d0451638a	netfilter: bridge: kill nf_bridge_pad The br_netfilter frag output function calls skb_cow_head() so in case it needs a larger headroom to e.g. re-add a previously stripped PPPOE or VLAN header things will still work (at cost of reallocation). We can then move nf_bridge_encap_header_len to br_netfilter. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-22 19:45:55 +01:00
Michal Kubeček	acf8dd0a9d	udp: only allow UFO for packets from SOCK_DGRAM sockets If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer checksum is to be computed on segmentation. However, in this case, skb->csum_start and skb->csum_offset are never set as raw socket transmit path bypasses udp_send_skb() where they are usually set. As a result, driver may access invalid memory when trying to calculate the checksum and store the result (as observed in virtio_net driver). Moreover, the very idea of modifying the userspace provided UDP header is IMHO against raw socket semantics (I wasn't able to find a document clearly stating this or the opposite, though). And while allowing CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit too intrusive change just to handle a corner case like this. Therefore disallowing UFO for packets from SOCK_DGRAM seems to be the best option. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 22:19:29 -05:00
David S. Miller	6e03f896b5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-05 14:33:28 -08:00
Al Viro	21226abb4e	net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter() That takes care of the majority of ->sendmsg() instances - most of them via memcpy_to_msg() or assorted getfrag() callbacks. One place where we still keep memcpy_fromiovecend() is tipc - there we potentially read the same data over and over; separate patch, that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-04 01:34:15 -05:00
Eric Dumazet	bdbbb8527b	ipv4: tcp: get rid of ugly unicast_sock In commit `be9f4a44e7` ("ipv4: tcp: remove per net tcp_sock") I tried to address contention on a socket lock, but the solution I chose was horrible : commit `3a7c384ffd` ("ipv4: tcp: unicast_sock should not land outside of TCP stack") addressed a selinux regression. commit `0980e56e50` ("ipv4: tcp: set unicast_sock uc_ttl to -1") took care of another regression. commit `b5ec8eeac4` ("ipv4: fix ip_send_skb()") fixed another regression. commit `811230cd85` ("tcp: ipv4: initialize unicast_sock sk_pacing_rate") was another shot in the dark. Really, just use a proper socket per cpu, and remove the skb_orphan() call, to re-enable flow control. This solves a serious problem with FQ packet scheduler when used in hostile environments, as we do not want to allocate a flow structure for every RST packet sent in response to a spoofed packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-01 23:06:19 -08:00
Eric Dumazet	811230cd85	tcp: ipv4: initialize unicast_sock sk_pacing_rate When I added sk_pacing_rate field, I forgot to initialize its value in the per cpu unicast_sock used in ip_send_unicast_reply() This means that for sch_fq users, RST packets, or ACK packets sent on behalf of TIME_WAIT sockets might be sent to slowly or even dropped once we reach the per flow limit. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `95bd09eb27` ("tcp: TSO packets automatic sizing") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 23:24:47 -08:00
Al Viro	c0371da604	put iov_iter into msghdr Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:29:03 -05:00
Al Viro	f69e6d131f	ip_generic_getfrag, udplite_getfrag: switch to passing msghdr Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:28:22 -05:00
Joe Perches	cbffccc970	net; ipv[46] - Remove 2 unnecessary NETDEBUG OOM messages These messages aren't useful as there's a generic dump_stack() on OOM. Neaten the comment and if test above the OOM by separating the assign in if into an allocation then if test. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-06 15:11:10 -05:00
Florian Westphal	330966e501	net: make skb_gso_segment error handling more robust skb_gso_segment has three possible return values: 1. a pointer to the first segmented skb 2. an errno value (IS_ERR()) 3. NULL. This can happen when GSO is used for header verification. However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL and would oops when NULL is returned. Note that these call sites should never actually see such a NULL return value; all callers mask out the GSO bits in the feature argument. However, there have been issues with some protocol handlers erronously not respecting the specified feature mask in some cases. It is preferable to get 'have to turn off hw offloading, else slow' reports rather than 'kernel crashes'. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-20 12:38:13 -04:00
Vasily Averin	4062090e3e	ipv4: dst_entry leak in ip_send_unicast_reply() ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork and in case errors in __ip_append_data() nobody frees stolen dst entry Fixes: `2e77d89b2f` ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()") Signed-off-by: Vasily Averin <vvs@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-10-17 15:30:12 -04:00
Pablo Neira Ayuso	1109a90c01	netfilter: use IS_ENABLED(CONFIG_BRIDGE_NETFILTER) In `34666d4` ("netfilter: bridge: move br_netfilter out of the core"), the bridge netfilter code has been modularized. Use IS_ENABLED instead of ifdef to cover the module case. Fixes: `34666d4` ("netfilter: bridge: move br_netfilter out of the core") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-10-02 18:30:54 +02:00
Eric Dumazet	24a2d43d88	ipv4: rename ip_options_echo to __ip_options_echo() ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt Lets break this assumption, but provide a helper to not change all call points. ip_send_unicast_reply() gets a new struct ip_options pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-09-28 16:35:42 -04:00
Willem de Bruijn	09c2d251b7	net-timestamp: add key to disambiguate concurrent datagrams Datagrams timestamped on transmission can coexist in the kernel stack and be reordered in packet scheduling. When reading looped datagrams from the socket error queue it is not always possible to unique correlate looped data with original send() call (for application level retransmits). Even if possible, it may be expensive and complex, requiring packet inspection. Introduce a data-independent ID mechanism to associate timestamps with send calls. Pass an ID alongside the timestamp in field ee_data of sock_extended_err. The ID is a simple 32 bit unsigned int that is associated with the socket and incremented on each send() call for which software tx timestamp generation is enabled. The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to avoid changing ee_data for existing applications that expect it 0. The counter is reset each time the flag is reenabled. Reenabling does not change the ID of already submitted data. It is possible to receive out of order IDs if the timestamp stream is not quiesced first. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-08-05 16:35:54 -07:00
Willem de Bruijn	11878b40ed	net-timestamp: SOCK_RAW and PING timestamping Add SO_TIMESTAMPING to sockets of type PF_INET[6]/SOCK_RAW: Add the necessary sock_tx_timestamp calls to the datapath for RAW sockets (ping sockets already had these calls). Fix the IP output path to pass the timestamp flags on the first fragment also for these sockets. The existing code relies on transhdrlen != 0 to indicate a first fragment. For these sockets, that assumption does not hold. This fixes http://bugzilla.kernel.org/show_bug.cgi?id=77221 Tested SOCK_RAW on IPv4 and IPv6, not PING. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-07-15 16:32:45 -07:00
Eric Dumazet	73f156a6e8	inetpeer: get rid of ip_id_count Ideally, we would need to generate IP ID using a per destination IP generator. linux kernels used inet_peer cache for this purpose, but this had a huge cost on servers disabling MTU discovery. 1) each inet_peer struct consumes 192 bytes 2) inetpeer cache uses a binary tree of inet_peer structs, with a nominal size of ~66000 elements under load. 3) lookups in this tree are hitting a lot of cache lines, as tree depth is about 20. 4) If server deals with many tcp flows, we have a high probability of not finding the inet_peer, allocating a fresh one, inserting it in the tree with same initial ip_id_count, (cf secure_ip_id()) 5) We garbage collect inet_peer aggressively. IP ID generation do not have to be 'perfect' Goal is trying to avoid duplicates in a short period of time, so that reassembly units have a chance to complete reassembly of fragments belonging to one message before receiving other fragments with a recycled ID. We simply use an array of generators, and a Jenkin hash using the dst IP as a key. ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it belongs (it is only used from this file) secure_ip_id() and secure_ipv6_id() no longer are needed. Rename ip_select_ident_more() to ip_select_ident_segs() to avoid unnecessary decrement/increment of the number of segments. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-06-02 11:00:41 -07:00
Lorenzo Colitti	e110861f86	net: add a sysctl to reflect the fwmark on replies Kernel-originated IP packets that have no user socket associated with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.) are emitted with a mark of zero. Add a sysctl to make them have the same mark as the packet they are replying to. This allows an administrator that wishes to do so to use mark-based routing, firewalling, etc. for these replies by marking the original packets inbound. Tested using user-mode linux: - ICMP/ICMPv6 echo replies and errors. - TCP RST packets (IPv4 and IPv6). Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-13 18:35:08 -04:00
WANG Cong	60ff746739	net: rename local_df to ignore_df As suggested by several people, rename local_df to ignore_df, since it means "ignore df bit if it is set". Cc: Maciej Żenczykowski <maze@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-12 14:03:41 -04:00
Florian Westphal	c7ba65d7b6	net: ip: push gso skb forwarding handling down the stack Doing the segmentation in the forward path has one major drawback: When using virtio, we may process gso udp packets coming from host network stack. In that case, netfilter POSTROUTING will see one packet with udp header followed by multiple ip fragments. Delay the segmentation and do it after POSTROUTING invocation to avoid this. Fixes: `fe6cc55f3a` ("net: ip, ipv6: handle gso skbs in forwarding path") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-05-07 15:49:07 -04:00
Eric Dumazet	aad88724c9	ipv4: add a sock pointer to dst->output() path. In the dst->output() path for ipv4, the code assumes the skb it has to transmit is attached to an inet socket, specifically via ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the provider of the packet is an AF_PACKET socket. The dst->output() method gets an additional 'struct sock *sk' parameter. This needs a cascade of changes so that this parameter can be propagated from vxlan to final consumer. Fixes: `8f646c922d` ("vxlan: keep original skb ownership") Reported-by: lucien xin <lucien.xin@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-04-15 13:47:15 -04:00
Eric Dumazet	b0270e9101	ipv4: add a sock pointer to ip_queue_xmit() ip_queue_xmit() assumes the skb it has to transmit is attached to an inet socket. Commit `31c70d5956` ("l2tp: keep original skb ownership") changed l2tp to not change skb ownership and thus broke this assumption. One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(), so that we do not assume skb->sk points to the socket used by l2tp tunnel. Fixes: `31c70d5956` ("l2tp: keep original skb ownership") Reported-by: Zhan Jianyu <nasa4836@gmail.com> Tested-by: Zhan Jianyu <nasa4836@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-04-15 12:58:34 -04:00
David S. Miller	67ddc87f16	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/wireless/ath/ath9k/recv.c drivers/net/wireless/mwifiex/pcie.c net/ipv6/sit.c The SIT driver conflict consists of a bug fix being done by hand in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper was created (netdev_alloc_pcpu_stats()) which takes care of this. The two wireless conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-03-05 20:32:02 -05:00
Hannes Frederic Sowa	1b34657635	ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT IP_PMTUDISC_INTERFACE has a design error: because it does not allow the generation of fragments if the interface mtu is exceeded, it is very hard to make use of this option in already deployed name server software for which I introduced this option. This patch adds yet another new IP_MTU_DISCOVER option to not honor any path mtu information and not accepting new icmp notifications destined for the socket this option is enabled on. But we allow outgoing fragmentation in case the packet size exceeds the outgoing interface mtu. As such this new option can be used as a drop-in replacement for IP_PMTUDISC_DONT, which is currently in use by most name server software making the adoption of this option very smooth and easy. The original advantage of IP_PMTUDISC_INTERFACE is still maintained: ignoring incoming path MTU updates and not honoring discovered path MTUs in the output path. Fixes: `482fc6094a` ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE") Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-02-26 15:51:00 -05:00
Hannes Frederic Sowa	69647ce46a	ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket is attached to the skb (in case of forwarding) or determines the mtu like we do in ip_finish_output, which actually checks if we should branch to ip_fragment. Thus use the same function to determine the mtu here, too. This is important for the introduction of IP_PMTUDISC_OMIT, where we want the packets getting cut in pieces of the size of the outgoing interface mtu. IPv6 already does this correctly. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-02-26 15:51:00 -05:00
Florian Westphal	478b360a47	netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get lots of "TRACE: filter:output:policy:1 IN=..." warnings as several places will leave skb->nf_trace uninitialised. Unlike iptables tracing functionality is not conditional in nftables, so always copy/zero nf_trace setting when nftables is enabled. Move this into __nf_copy() helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-02-17 11:20:12 +01:00
WANG Cong	72c1d3bdd5	ipv4: register igmp_notifier even when !CONFIG_PROC_FS We still need this notifier even when we don't config PROC_FS. It should be rare to have a kernel without PROC_FS, so just for completeness. Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-14 15:03:33 -08:00
Hannes Frederic Sowa	f87c10a8aa	ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-01-13 11:22:54 -08:00
Hannes Frederic Sowa	61e7f09d0f	ipv4: consistent reporting of pmtu data in case of corking We report different pmtu values back on the first write and on further writes on an corked socket. Also don't include the dst.header_len (respectively exthdrlen) as this should already be dealt with by the interface mtu of the outgoing (virtual) interface and policy of that interface should dictate if fragmentation should happen. Instead reduce the pmtu data by IP options as we do for IPv6. Make the same changes for ip_append_data, where we did not care about options or dst.header_len at all. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-22 18:52:09 -05:00
Hannes Frederic Sowa	482fc6094a	ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-11-05 21:52:27 -05:00
Hannes Frederic Sowa	daba287b29	ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and IP_PMTUDISC_PROBE well enough. UFO enabled packet delivery just appends all frags to the cork and hands it over to the network card. So we just deliver non-DF udp fragments (DF-flag may get overwritten by hardware or virtual UFO enabled interface). UDP_CORK does enqueue the data until the cork is disengaged. At this point it sets the correct IP_DF and local_df flags and hands it over to ip_fragment which in this case will generate an icmp error which gets appended to the error socket queue. This is not reflected in the syscall error (of course, if UFO is enabled this also won't happen). Improve this by checking the pmtudisc flags before appending data to the socket and if we still can fit all data in one packet when IP_PMTUDISC_DO or IP_PMTUDISC_PROBE is set, only then proceed. We use (mtu-fragheaderlen) to check for the maximum length because we ensure not to generate a fragment and non-fragmented data does not need to have its length aligned on 64 bit boundaries. Also the passed in ip_options are already aligned correctly. Maybe, we can relax some other checks around ip_fragment. This needs more research. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-29 00:15:22 -04:00
David S. Miller	c3fa32b976	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-23 16:49:34 -04:00
Jiri Pirko	e93b7d748b	ip_output: do skb ufo init for peeked non ufo skb as well Now, if user application does: sendto len<mtu flag MSG_MORE sendto len>mtu flag 0 The skb is not treated as fragmented one because it is not initialized that way. So move the initialization to fix this. introduced by: commit `e89e9cf539` "[IPv4/IPv6]: UFO Scatter-gather approach" Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-19 19:20:52 -04:00
Francesco Fusco	aa66158145	ipv4: processing ancillary IP_TOS or IP_TTL If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out packets with the specified TTL or TOS overriding the socket values specified with the traditional setsockopt(). The struct inet_cork stores the values of TOS, TTL and priority that are passed through the struct ipcm_cookie. If there are user-specified TOS (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are used to override the per-socket values. In case of TOS also the priority is changed accordingly. Two helper functions get_rttos and get_rtconn_flags are defined to take into account the presence of a user specified TOS value when computing RT_TOS and RT_CONN_FLAGS. Signed-off-by: Francesco Fusco <ffusco@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-28 15:21:52 -07:00
Ansis Atteka	703133de33	ip: generate unique IP identificator if local fragmentation is allowed If local fragmentation is allowed, then ip_select_ident() and ip_select_ident_more() need to generate unique IDs to ensure correct defragmentation on the peer. For example, if IPsec (tunnel mode) has to encrypt large skbs that have local_df bit set, then all IP fragments that belonged to different ESP datagrams would have used the same identificator. If one of these IP fragments would get lost or reordered, then peer could possibly stitch together wrong IP fragments that did not belong to the same datagram. This would lead to a packet loss or data corruption. Signed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-19 14:11:15 -04:00
Ansis Atteka	749154aa56	ip: use ip_hdr() in __ip_make_skb() to retrieve IP header skb->data already points to IP header, but for the sake of consistency we can also use ip_hdr() to retrieve it. Signed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-19 14:11:15 -04:00
Hannes Frederic Sowa	0ea9d5e3e0	xfrm: introduce helper for safe determination of mtu skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we always have to make sure we a referring to the correct interpretation of skb->sk. We only depend on header defines to query the mtu, so we don't introduce a new dependency to ipv6 by this change. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2013-08-14 13:09:07 +02:00
Denis Efremov	2fbd967973	ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functions EXPORT_SYMBOL and inline directives are contradictory to each other. The patch fixes this inconsistency. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Denis Efremov <yefremov.denis@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-11 16:12:44 -07:00
Gao feng	f016588861	netfilter: use IS_ENABLE to replace if defined in TRACE target Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2013-04-02 01:03:45 +02:00
Pravin B Shelar	c9af6db4c1	net: Fix possible wrong checksum generation. Patch `cef401de7b` (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 13:30:10 -05:00
Alexander Duyck	fc70fb640b	net: Handle encapsulated offloads before fragmentation or handing to lower dev This change allows the VXLAN to enable Tx checksum offloading even on devices that do not support encapsulated checksum offloads. The advantage to this is that it allows for the lower device to change due to routing table changes without impacting features on the VXLAN itself. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-12-09 00:20:28 -05:00
Julian Anastasov	155e8336c3	ipv4: introduce rt_uses_gateway Add new flag to remember when route is via gateway. We will use it to allow rt_gateway to contain address of directly connected host for the cases when DST_NOCACHE is used or when the NH exception caches per-destination route without DST_NOCACHE flag, i.e. when routes are not used for other destinations. By this way we force the neighbour resolving to work with the routed destination but we can use different address in the packet, feature needed for IPVS-DR where original packet for virtual IP is routed via route to real IP. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-10-08 17:42:36 -04:00
Eric Dumazet	5640f76858	net: use a per task frag allocator We currently use a per socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. Its done to increase probability of coalescing small write() into single segments in skbs still in write queue (not yet sent) But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page Its also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit page allocator more than wanted. This patch adds a per task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure. (up to 32768 bytes per frag, thats order-3 pages on x86) This increases TCP stream performance by 20% on loopback device, but also benefits on other network devices, since 8x less frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with IOMMU enabled. Its possible some SG enabled hardware cant cope with bigger fragments, but their ndo_start_xmit() should already handle this, splitting a fragment in sub fragments, since some arches have PAGE_SIZE=65536 Successfully tested on various ethernet devices. (ixgbe, igb, bnx2x, tg3, mellanox mlx4) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-09-24 16:31:37 -04:00
Patrick McHardy	5f2d04f1f9	ipv4: fix path MTU discovery with connection tracking IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and (in case of forwarded packets) refragments them at POST_ROUTING independent of the IP_DF flag. Refragmentation uses the dst_mtu() of the local route without caring about the original fragment sizes, thereby breaking PMTUD. This patch fixes this by keeping track of the largest received fragment with IP_DF set and generates an ICMP fragmentation required error during refragmentation if that size exceeds the MTU. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net>	2012-08-26 19:13:55 +02:00
Eric Dumazet	a9915a1b52	ipv4: fix ip header ident selection in __ip_make_skb() Christian Casteyde reported a kmemcheck 32-bit read from uninitialized memory in __ip_select_ident(). It turns out that __ip_make_skb() called ip_select_ident() before properly initializing iph->daddr. This is a bug uncovered by commit `1d861aa4b3` (inet: Minimize use of cached route inetpeer.) Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-21 14:51:06 -07:00
Eric Dumazet	b5ec8eeac4	ipv4: fix ip_send_skb() ip_send_skb() can send orphaned skb, so we must pass the net pointer to avoid possible NULL dereference in error path. Bug added by commit `3a7c384ffd` (ipv4: tcp: unicast_sock should not land outside of TCP stack) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-10 14:08:57 -07:00
Eric Dumazet	3a7c384ffd	ipv4: tcp: unicast_sock should not land outside of TCP stack commit `be9f4a44e7` (ipv4: tcp: remove per net tcp_sock) added a selinux regression, reported and bisected by John Stultz selinux_ip_postroute_compat() expect to find a valid sk->sk_security pointer, but this field is NULL for unicast_sock It turns out that unicast_sock are really temporary stuff to be able to reuse part of IP stack (ip_append_data()/ip_push_pending_frames()) Fact is that frames sent by ip_send_unicast_reply() should be orphaned to not fool LSM. Note IPv6 never had this problem, as tcp_v6_send_response() doesnt use a fake socket at all. I'll probably implement tcp_v4_send_response() to remove these unicast_sock in linux-3.7 Reported-by: John Stultz <johnstul@us.ibm.com> Bisected-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Paris <eparis@parisplace.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-09 20:56:08 -07:00
Vasiliy Kulikov	9871f1ad67	ip: fix error handling in ip_finish_output2() __neigh_create() returns either a pointer to struct neighbour or PTR_ERR(). But the caller expects it to return either a pointer or NULL. Replace the NULL check with IS_ERR() check. The bug was introduced in `a263b30936` ("ipv4: Make neigh lookups directly in output packet path."). Signed-off-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-08-06 13:30:01 -07:00
David S. Miller	5e9965c15b	Merge branch 'kill_rtcache' The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. The general flow of this patch series is that first the routing cache is removed. We build a completely new rtable entry every lookup request. Next we make some simplifications due to the fact that removing the routing cache causes several members of struct rtable to become no longer necessary. Then we need to make some amends such that we can legally cache pre-constructed routes in the FIB nexthops. Firstly, we need to invalidate routes which are hit with nexthop exceptions. Secondly we have to change the semantics of rt->rt_gateway such that zero means that the destination is on-link and non-zero otherwise. Now that the preparations are ready, we start caching precomputed routes in the FIB nexthops. Output and input routes need different kinds of care when determining if we can legally do such caching or not. The details are in the commit log messages for those changes. The patch series then winds down with some more struct rtable simplifications and other tidy ups that remove unnecessary overhead. On a SPARC-T3 output route lookups are ~876 cycles. Input route lookups are ~1169 cycles with rpfilter disabled, and about ~1468 cycles with rpfilter enabled. These measurements were taken with the kbench_mod test module in the net_test_tools GIT tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git That GIT tree also includes a udpflood tester tool and stresses route lookups on packet output. For example, on the same SPARC-T3 system we can run: time ./udpflood -l 10000000 10.2.2.11 with routing cache: real 1m21.955s user 0m6.530s sys 1m15.390s without routing cache: real 1m31.678s user 0m6.520s sys 1m25.140s Performance undoubtedly can easily be improved further. For example fib_table_lookup() performs a lot of excessive computations with all the masking and shifting, some of it conditionalized to deal with edge cases. Also, Eric's no-ref optimization for input route lookups can be re-instated for the FIB nexthop caching code path. I would be really pleased if someone would work on that. In fact anyone suitable motivated can just fire up perf on the loading of the test net_test_tools benchmark kernel module. I spend much of my time going: bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2 bash# perf report Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben Hutchings, and others. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 17:04:15 -07:00
Eric Dumazet	0980e56e50	ipv4: tcp: set unicast_sock uc_ttl to -1 Set unicast_sock uc_ttl to -1 so that we select the right ttl, instead of sending packets with a 0 ttl. Bug added in commit `be9f4a44e7` (ipv4: tcp: remove per net tcp_sock) Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-22 12:06:21 -07:00
David S. Miller	f8126f1d51	ipv4: Adjust semantics of rt->rt_gateway. In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>	2012-07-20 13:31:20 -07:00
Eric Dumazet	be9f4a44e7	ipv4: tcp: remove per net tcp_sock tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-19 10:35:30 -07:00
David S. Miller	5110effee8	net: Do delayed neigh confirmation. When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:03:06 -07:00
David S. Miller	a263b30936	ipv4: Make neigh lookups directly in output packet path. Do not use the dst cached neigh, we'll be getting rid of that. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-07-05 01:02:12 -07:00
David S. Miller	70e7341673	ipv4: Show that ip_send_reply() is purely unicast routine. Rename it to ip_send_unicast_reply() and add explicit 'saddr' argument. This removed one of the few users of rt->rt_spec_dst. Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-28 03:21:41 -07:00
Michel Machado	95603e2293	net-next: add dev_loopback_xmit() to avoid duplicate code Add dev_loopback_xmit() in order to deduplicate functions ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c). I was about to reinvent the wheel when I noticed that ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I need and are not IP-only functions, but they were not available to reuse elsewhere. ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I understand that this is harmless, and should be in dev_loopback_xmit(). Signed-off-by: Michel Machado <michel@digirati.com.br> CC: "David S. Miller" <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jpirko@redhat.com> CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-12 18:51:09 -07:00
Eric Dumazet	5d0ba55b64	net: use consume_skb() in place of kfree_skb() Remove some dropwatch/drop_monitor false positives. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-06-04 11:27:40 -04:00
Joe Perches	e87cc4728f	net: Convert net_ratelimit uses to net_<level>_ratelimited Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-05-15 13:45:03 -04:00
David Howells	9ffc93f203	Remove all #inclusions of asm/system.h Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\sinclude\s<asm/system[.]h>.\n!!' `grep -Irl '^#\sinclude\s<asm/system[.]h>' ` Signed-off-by: David Howells <dhowells@redhat.com>	2012-03-28 18:30:03 +01:00
David Miller	2721745501	net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}. To reflect the fact that a refrence is not obtained to the resulting neighbour entry. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Roland Dreier <roland@purestorage.com>	2011-12-05 15:20:19 -05:00
Eric Dumazet	84f9307c5d	ipv4: use a 64bit load/store in output path gcc compiler is smart enough to use a single load/store if we memcpy(dptr, sptr, 8) on x86_64, regardless of CONFIG_CC_OPTIMIZE_FOR_SIZE In IP header, daddr immediately follows saddr, this wont change in the future. We only need to make sure our flowi4 (saddr,daddr) fields wont break the rule. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-12-01 13:28:54 -05:00
Eric Dumazet	66b13d99d9	ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT There is a long standing bug in linux tcp stack, about ACK messages sent on behalf of TIME_WAIT sockets. In the IP header of the ACK message, we choose to reflect TOS field of incoming message, and this might break some setups. Example of things that were broken : - Routing using TOS as a selector - Firewalls - Trafic classification / shaping We now remember in timewait structure the inet tos field and use it in ACK generation, and route lookup. Notes : - We still reflect incoming TOS in RST messages. - We could extend MuraliRaja Muniraju patch to report TOS value in netlink messages for TIME_WAIT sockets. - A patch is needed for IPv6 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-24 03:06:21 -04:00
Eric Dumazet	9e903e0852	net: add skb frag size accessors To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set\|add\|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-10-19 03:10:46 -04:00
Ian Campbell	aff65da0f1	net: ipv4: convert to SKB frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-24 17:52:11 -07:00
Julian Anastasov	d52fbfc9e5	ipv4: use dst with ref during bcast/mcast loopback Make sure skb dst has reference when moving to another context. Currently, I don't see protocols that can hit it when sending broadcasts/multicasts to loopback using noref dsts, so it is just a precaution. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-07 22:52:32 -07:00
Eric Dumazet	f2c31e32b3	net: fix NULL dereferences in check_peer_redir() Gergely Kalman reported crashes in check_peer_redir(). It appears commit `f39925dbde` (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-08-03 03:34:12 -07:00
Bill Sommerfeld	d9be4f7a6f	ipv4: Constrain UFO fragment sizes to multiples of 8 bytes Because the ip fragment offset field counts 8-byte chunks, ip fragments other than the last must contain a multiple of 8 bytes of payload. ip_ufo_append_data wasn't respecting this constraint and, depending on the MTU and ip option sizes, could create malformed non-final fragments. Google-Bug-Id: 5009328 Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-21 21:31:41 -07:00
David S. Miller	69cce1d140	net: Abstract dst->neighbour accesses behind helpers. dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-17 23:11:35 -07:00
David S. Miller	05e3aa0949	net: Create and use new helper, neigh_output(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 17:26:00 -07:00
David S. Miller	fec8292d9c	ipv4: Use calculated 'neigh' instead of re-evaluating dst->neighbour Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-16 14:25:54 -07:00
David S. Miller	f6b72b6217	net: Embed hh_cache inside of struct neighbour. Now that there is a one-to-one correspondance between neighbour and hh_cache entries, we no longer need: 1) dynamic allocation 2) attachment to dst->hh 3) refcounting Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-14 07:53:20 -07:00
David S. Miller	e12fe68ce3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-07-05 23:23:37 -07:00
Steffen Klassert	c146066ab8	ipv4: Don't use ufo handling on later transformed packets We might call ip_ufo_append_data() for packets that will be IPsec transformed later. This function should be used just for real udp packets. So we check for rt->dst.header_len which is only nonzero on IPsec handling and call ip_ufo_append_data() just if rt->dst.header_len is zero. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-07-01 17:33:19 -07:00
Steffen Klassert	353e5c9abd	ipv4: Fix IPsec slowpath fragmentation problem ip_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path). On IPsec the effective mtu is lower because we need to add the protocol headers and trailers later when we do the IPsec transformations. So after the IPsec transformations the packet might be too big, which leads to a slowpath fragmentation then. This patch fixes this by building the packets based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr handling to this. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-27 20:34:26 -07:00
Steffen Klassert	33f99dc7fd	ipv4: Fix packet size calculation in __ip_append_data Git commit `59104f06` (ip: take care of last fragment in ip_append_data) added a check to see if we exceed the mtu when we add trailer_len. However, the mtu is already subtracted by the trailer length when the xfrm transfomation bundles are set up. So IPsec packets with mtu size get fragmented, or if the DF bit is set the packets will not be send even though they match the mtu perfectly fine. This patch actually reverts commit `59104f06`. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-27 20:34:25 -07:00
Paul Gortmaker	56f8a75c17	ip: introduce ip_is_fragment helper inline function There are enough instances of this: iph->frag_off & htons(IP_MF \| IP_OFFSET) that a helper function is probably warranted. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-21 20:33:34 -07:00
Steffen Klassert	96d7303e9c	ipv4: Fix packet size calculation for raw IPsec packets in __ip_append_data We assume that transhdrlen is positive on the first fragment which is wrong for raw packets. So we don't add exthdrlen to the packet size for raw packets. This leads to a reallocation on IPsec because we have not enough headroom on the skb to place the IPsec headers. This patch fixes this by adding exthdrlen to the packet size whenever the send queue of the socket is empty. This issue was introduced with git commit `1470ddf7` (inet: Remove explicit write references to sk/inet in ip_append_data) Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-09 14:49:59 -07:00
David S. Miller	22f728f8f3	ipv4: Always call ip_options_build() after rest of IP header is filled in. This will allow ip_options_build() to reliably look at the values of iph->{daddr,saddr} Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-13 17:21:27 -04:00
David S. Miller	0a5ebb8000	ipv4: Pass explicit daddr arg to ip_send_reply(). This eliminates an access to rt->rt_src. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-10 13:32:46 -07:00
David S. Miller	f5fca60865	ipv4: Pass flow key down into ip_append_*(). This way rt->rt_dst accesses are unnecessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 21:24:07 -07:00
David S. Miller	77968b7824	ipv4: Pass flow keys down into datagram packet building engine. This way ip_output.c no longer needs rt->rt_{src,dst}. We already have these keys sitting, ready and waiting, on the stack or in a socket structure. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 21:24:06 -07:00
David S. Miller	ea4fc0d619	ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit(). Now we can pick it out of the provided flow key. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 15:28:28 -07:00
David S. Miller	d9d8da805d	inet: Pass flowi to ->queue_xmit(). This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 15:28:28 -07:00
David S. Miller	b57ae01a8a	ipv4: Use cork flow in ip_queue_xmit() All invokers of ip_queue_xmit() must make certain that the socket is locked. All of SCTP, TCP, DCCP, and L2TP now make sure this is the case. Therefore we can use the cork flow during output route lookup in ip_queue_xmit() when the socket route check fails. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-08 14:05:14 -07:00
David S. Miller	706527280e	ipv4: Initialize cork->opt using NULL not 0. Noticed by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-06 16:01:15 -07:00
David S. Miller	b80d72261a	ipv4: Initialize on-stack cork more efficiently. ip_setup_cork() explicitly initializes every member of inet_cork except flags, addr, and opt. So we can simply set those three members to zero instead of using a memset() via an empty struct assignment. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2011-05-06 15:37:57 -07:00
David S. Miller	bdc712b4c2	inet: Decrease overhead of on-stack inet_cork. When we fast path datagram sends to avoid locking by putting the inet_cork on the stack we use up lots of space that isn't necessary. This is because inet_cork contains a "struct flowi" which isn't used in these code paths. Split inet_cork to two parts, "inet_cork" and "inet_cork_full". Only the latter of which has the "struct flowi" and is what is stored in inet_sock. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2011-05-06 15:37:57 -07:00
David S. Miller	dd927a2694	ipv4: In ip_build_and_send_pkt() use 'saddr' and 'daddr' args passed in. Instead of rt->rt_{dst,src} The only tricky part is source route option handling. If the source route option is enabled we can't just use plain 'daddr', we have to use opt->opt.faddr. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-04 12:03:30 -07:00
David S. Miller	31e4543db2	ipv4: Make caller provide on-stack flow key to ip_route_output_ports(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-03 20:25:42 -07:00
Eric Dumazet	f6d8bd051c	inet: add RCU protection to inet->opt We lack proper synchronization to manipulate inet->opt ip_options Problem is ip_make_skb() calls ip_setup_cork() and ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options), without any protection against another thread manipulating inet->opt. Another thread can change inet->opt pointer and free old one under us. Use RCU to protect inet->opt (changed to inet->inet_opt). Instead of handling atomic refcounts, just copy ip_options when necessary, to avoid cache line dirtying. We cant insert an rcu_head in struct ip_options since its included in skb->cb[], so this patch is large because I had to introduce a new ip_options_rcu structure. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 13:16:35 -07:00
David S. Miller	1c01a80cfe	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/smsc911x.c	2011-04-11 13:44:25 -07:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
David S. Miller	538de0e01f	ipv4: Use flowi4_init_output() in ip_send_reply() Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-31 04:53:37 -07:00
David S. Miller	9cce96df5b	net: Put fl4_* macros to struct flowi4 and use them again. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:54 -08:00
David S. Miller	9d6ec93801	ipv4: Use flowi4 in public route lookup interfaces. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:48 -08:00
David S. Miller	6281dcc94a	net: Make flowi ports AF dependent. Create two sets of port member accessors, one set prefixed by fl4_* and the other prefixed by fl6_* This will let us to create AF optimal flow instances. It will work because every context in which we access the ports, we have to be fully aware of which AF the flowi is anyways. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:46 -08:00
David S. Miller	1d28f42c1b	net: Put flowi_* prefix on AF independent members of struct flowi I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:44 -08:00
David S. Miller	78fbfd8a65	ipv4: Create and use route lookup helpers. The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:42 -08:00
David S. Miller	b23dd4fe42	ipv4: Make output route lookup return rtable directly. Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-02 14:31:35 -08:00
Herbert Xu	07df5294a7	inet: Replace left-over references to inet->cork The patch to replace inet->cork with cork left out two spots in __ip_append_data that can result in bogus packet construction. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 23:00:58 -08:00
David S. Miller	273447b352	ipv4: Kill can_sleep arg to ip_route_output_flow() This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:27:04 -08:00
David S. Miller	420d44daa7	ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep" Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:19:23 -08:00
Herbert Xu	1c32c5ad6f	inet: Add ip_make_skb and ip_finish_skb This patch adds the helper ip_make_skb which is like ip_append_data and ip_push_pending_frames all rolled into one, except that it does not send the skb produced. The sending part is carried out by ip_send_skb, which the transport protocol can call after it has tweaked the skb. It is meant to be called in cases where corking is not used should have a one-to-one correspondence to sendmsg. This patch also adds the helper ip_finish_skb which is meant to be replace ip_push_pending_frames when corking is required. Previously the protocol stack would peek at the socket write queue and add its header to the first packet. With ip_finish_skb, the protocol stack can directly operate on the final skb instead, just like the non-corking case with ip_make_skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 12:35:03 -08:00
Herbert Xu	1470ddf7f8	inet: Remove explicit write references to sk/inet in ip_append_data In order to allow simultaneous calls to ip_append_data on the same socket, it must not modify any shared state in sk or inet (other than those that are designed to allow that such as atomic counters). This patch abstracts out write references to sk and inet_sk in ip_append_data and its friends so that we may use the underlying code in parallel. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 12:35:02 -08:00
Herbert Xu	5a2ef92023	inet: Remove unused sk_sndmsg_* from UFO UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 12:35:02 -08:00
David S. Miller	323e126f0c	ipv4: Don't pre-seed hoplimit metric. Always go through a new ip4_dst_hoplimit() helper, just like ipv6. This allowed several simplifications: 1) The interim dst_metric_hoplimit() can go as it's no longer userd. 2) The sysctl_ip_default_ttl entry no longer needs to use ipv4_doint_and_flush, since the sysctl is not cached in routing cache metrics any longer. 3) ipv4_doint_and_flush no longer needs to be exported and therefore can be marked static. When ipv4_doint_and_flush_strategy was removed some time ago, the external declaration in ip.h was mistakenly left around so kill that off too. We have to move the sysctl_ip_default_ttl declaration into ipv4's route cache definition header net/route.h, because currently net/ip.h (where the declaration lives now) has a back dependency on net/route.h Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-12 22:08:17 -08:00
David S. Miller	5170ae824d	net: Abstract RTAX_HOPLIMIT metric accesses behind helper. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-12 21:35:57 -08:00
Changli Gao	5811662b15	net: use the macros defined for the members of flowi Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 12:27:45 -08:00
David S. Miller	e40051d134	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/qlcnic/qlcnic_init.c net/ipv4/ip_output.c	2010-09-27 01:03:03 -07:00
Eric Dumazet	59104f0624	ip: take care of last fragment in ip_append_data While investigating a bit, I found ip_fragment() slow path was taken because ip_append_data() provides following layout for a send(MTU + N*(MTU - 20)) syscall : - one skb with 1500 (mtu) bytes - N fragments of 1480 (mtu-20) bytes (before adding IP header) last fragment gets 17 bytes of trail data because of following bit: if (datalen == length + fraggap) alloclen += rt->dst.trailer_len; Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm... another bug ?) In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu, so we take slow path, building another skb chain. In order to avoid taking slow path, we should correct ip_append_data() to make sure last fragment has real trail space, under mtu... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-24 14:41:04 -07:00
Eric Dumazet	3d13008e73	ip: fix truesize mismatch in ip fragmentation Special care should be taken when slow path is hit in ip_fragment() : When walking through frags, we transfert truesize ownership from skb to frags. Then if we hit a slow_path condition, we must undo this or risk uncharging frags->truesize twice, and in the end, having negative socket sk_wmem_alloc counter, or even freeing socket sooner than expected. Many thanks to Nick Bowler, who provided a very clean bug report and test program. Thanks to Jarek for reviewing my first patch and providing a V2 While Nick bisection pointed to commit `2b85a34e91` (net: No more expensive sock_hold()/sock_put() on each tx), underlying bug is older (2.6.12-rc5) A side effect is to extend work done in commit `b2722b1c3a` (ip_fragment: also adjust skb->truesize for packets not owned by a socket) to ipv6 as well. Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com> Tested-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Jarek Poplawski <jarkao2@gmail.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-21 15:05:50 -07:00
Eric Dumazet	ec550d246e	net: ip_append_data() optim Compiler is not smart enough to avoid a conditional branch. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-24 14:45:09 -07:00
David S. Miller	21dc330157	net: Rename skb_has_frags to skb_has_frag_list SKBs can be "fragmented" in two ways, via a page array (called skb_shinfo(skb)->frags[]) and via a list of SKBs (called skb_shinfo(skb)->frag_list). Since skb_has_frags() tests the latter, it's name is confusing since it sounds more like it's testing the former. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-23 00:13:46 -07:00
Oliver Hartkopp	2244d07bfa	net: simplify flags for tx timestamping This patch removes the abstraction introduced by the union skb_shared_tx in the shared skb data. The access of the different union elements at several places led to some confusion about accessing the shared tx_flags e.g. in skb_orphan_try(). http://marc.info/?l=linux-netdev&m=128084897415886&w=2 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-19 00:08:30 -07:00
Changli Gao	c893b8066c	ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice `6c79bf0f24` subtracts PPPOE_SES_HLEN from mtu at the front of ip_fragment(). So the later subtraction should be removed. The MTU of 802.1q is also 1500, so MTU should not be changed. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo> ---- net/ipv4/ip_output.c \| 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-02 17:25:07 -07:00
Eric Dumazet	4bc2f18ba4	net/ipv4: EXPORT_SYMBOL cleanups CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-12 12:57:54 -07:00
George Kadianakis	49085bd7d4	net/ipv4/ip_output.c: Removal of unused variable in ip_fragment() Removal of unused integer variable in ip_fragment(). Signed-off-by: George Kadianakis <desnacked@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-07 15:44:59 -07:00
Eric Dumazet	fe76cda308	ipv4: use skb_dst_copy() in ip_copy_metadata() Avoid touching dst refcount in ip_fragment(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-05 18:50:56 -07:00
David S. Miller	8244132ea8	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/ipv4/ip_output.c	2010-06-23 18:26:27 -07:00
Herbert Xu	26cde9f7e2	udp: Fix bogus UFO packet generation It has been reported that the new UFO software fallback path fails under certain conditions with NFS. I tracked the problem down to the generation of UFO packets that are smaller than the MTU. The software fallback path simply discards these packets. This patch fixes the problem by not generating such packets on the UFO path. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-21 13:57:34 -07:00
Changli Gao	d8d1f30b95	net-next: remove useless union keyword remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-10 23:31:35 -07:00
Eric Dumazet	ab6e3feba1	net: No dst refcounting in ip_queue_xmit() TCP outgoing packets can avoid two atomic ops, and dirtying of previously higly contended cache line using new refdst infrastructure. Note 1: loopback device excluded because of !IFF_XMIT_DST_RELEASE Note 2: UDP packets dsts are built before ip_queue_xmit(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-17 17:18:52 -07:00
Patrick McHardy	1e4b105712	Merge branch 'master' of /repos/git/net-next-2.6 Conflicts: net/bridge/br_device.c net/bridge/br_forward.c Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-05-10 18:39:28 +02:00
David S. Miller	87eb367003	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-6000.c net/core/dev.c	2010-04-21 01:14:25 -07:00
Bart De Schuymer	6c79bf0f24	netfilter: bridge-netfilter: fix refragmenting IP traffic encapsulated in PPPoE traffic The MTU for IP traffic encapsulated inside PPPoE traffic is smaller than the MTU of the Ethernet device (1500). Connection tracking gathers all IP packets and sometimes will refragment them in ip_fragment(). We then need to subtract the length of the encapsulating header from the mtu used in ip_fragment(). The check in br_nf_dev_queue_xmit() which determines if ip_fragment() has to be called is also updated for the PPPoE-encapsulated packets. nf_bridge_copy_header() is also updated to make sure the PPPoE data length field has the correct value. Signed-off-by: Bart De Schuymer <bdschuym@pandora.be> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-04-20 16:22:01 +02:00
Patrick McHardy	6291055465	Merge branch 'master' of /repos/git/net-next-2.6 Conflicts: Documentation/feature-removal-schedule.txt net/ipv6/netfilter/ip6t_REJECT.c net/netfilter/xt_limit.c Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-04-20 16:02:01 +02:00
Jan Engelhardt	cd58bcd978	netfilter: xt_TEE: have cloned packet travel through Xtables too Since Xtables is now reentrant/nestable, the cloned packet can also go through Xtables and be subject to rules itself. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-04-19 16:06:52 +02:00
Jan Engelhardt	e281b19897	netfilter: xtables: inclusion of xt_TEE xt_TEE can be used to clone and reroute a packet. This can for example be used to copy traffic at a router for logging purposes to another dedicated machine. References: http://www.gossamer-threads.com/lists/iptables/devel/68781 Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>	2010-04-19 14:17:47 +02:00
Shan Wei	4e15ed4d93	net: replace ipfragok with skb->local_df As Herbert Xu said: we should be able to simply replace ipfragok with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function) has droped ipfragok and set local_df value properly. The patch kills the ipfragok parameter of .queue_xmit(). Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-15 23:36:37 -07:00
Eric Dumazet	e30b38c298	ip: Fix ip_dev_loopback_xmit() Eric Paris got following trace with a linux-next kernel [ 14.203970] BUG: using smp_processor_id() in preemptible [00000000] code: avahi-daemon/2093 [ 14.204025] caller is netif_rx+0xfa/0x110 [ 14.204035] Call Trace: [ 14.204064] [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110 [ 14.204070] [<ffffffff8142163a>] netif_rx+0xfa/0x110 [ 14.204090] [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0 [ 14.204095] [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0 [ 14.204099] [<ffffffff8145d610>] ip_local_out+0x20/0x30 [ 14.204105] [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0 [ 14.204119] [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400 [ 14.204125] [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790 [ 14.204137] [<ffffffff814891d5>] inet_sendmsg+0x45/0x80 [ 14.204149] [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110 [ 14.204189] [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380 [ 14.204233] [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b While current linux-2.6 kernel doesnt emit this warning, bug is latent and might cause unexpected failures. ip_dev_loopback_xmit() runs in process context, preemption enabled, so must call netif_rx_ni() instead of netif_rx(), to make sure that we process pending software interrupt. Same change for ip6_dev_loopback_xmit() Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-15 14:25:22 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Jan Engelhardt	9bbc768aa9	netfilter: ipv4: use NFPROTO values for NF_HOOK invocation The semantic patch that was used: // <smpl> @@ @@ (NF_HOOK \|NF_HOOK_COND \|nf_hook )( -PF_INET, +NFPROTO_IPV4, ...) // </smpl> Signed-off-by: Jan Engelhardt <jengelh@medozas.de>	2010-03-25 16:00:30 +01:00
Octavian Purdila	7ad6848c7e	ip: fix mc_loop checks for tunnels with multicast outer addresses When we have L3 tunnels with different inner/outer families (i.e. IPV4/IPV6) which use a multicast address as the outer tunnel destination address, multicast packets will be loopbacked back to the sending socket even if IP*_MULTICAST_LOOP is set to disabled. The mc_loop flag is present in the family specific part of the socket (e.g. the IPv4 or IPv4 specific part). setsockopt sets the inner family mc_loop flag. When the packet is pushed through the L3 tunnel it will eventually be processed by the outer family which if different will check the flag in a different part of the socket then it was set. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-01-06 20:37:01 -08:00
David S. Miller	ff9c38bba3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/ht.c	2009-12-01 22:13:38 -08:00
Patrick McHardy	b2722b1c3a	ip_fragment: also adjust skb->truesize for packets not owned by a socket When a large packet gets reassembled by ip_defrag(), the head skb accounts for all the fragments in skb->truesize. If this packet is refragmented again, skb->truesize is not re-adjusted to reflect only the head size since its not owned by a socket. If the head fragment then gets recycled and reused for another received fragment, it might exceed the defragmentation limits due to its large truesize value. skb_recycle_check() explicitly checks for linear skbs, so any recycled skb should reflect its true size in skb->truesize. Change ip_fragment() to also adjust the truesize value of skbs not owned by a socket. Reported-and-tested-by: Ben Menchaca <ben@bigfootnetworks.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-01 15:53:57 -08:00
Joe Perches	9d4fb27db9	net/ipv4: Move && and \|\| to end of previous line On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote: > It should be of the form: > if (x && > y) > > or: > if (x && y) > > Fix patches, rather than complaints, for existing cases where things > do not follow this pattern are certainly welcome. Also collapsed some multiple tabs to single space. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-11-23 10:41:23 -08:00
Eric Dumazet	c720c7e838	inet: rename some inet_sock fields In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-10-18 18:52:53 -07:00
Atis Elsts	914a9ab386	net: Use sk_mark for routing lookup in more places This patch against v2.6.31 adds support for route lookup using sk_mark in some more places. The benefits from this patch are the following. First, SO_MARK option now has effect on UDP sockets too. Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing lookup correctly if TCP sockets with SO_MARK were used. Signed-off-by: Atis Elsts <atis@mikrotik.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2009-10-01 15:16:49 -07:00
Linus Torvalds	d7e9660ad9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits) netxen: update copyright netxen: fix tx timeout recovery netxen: fix file firmware leak netxen: improve pci memory access netxen: change firmware write size tg3: Fix return ring size breakage netxen: build fix for INET=n cdc-phonet: autoconfigure Phonet address Phonet: back-end for autoconfigured addresses Phonet: fix netlink address dump error handling ipv6: Add IFA_F_DADFAILED flag net: Add DEVTYPE support for Ethernet based devices mv643xx_eth.c: remove unused txq_set_wrr() ucc_geth: Fix hangs after switching from full to half duplex ucc_geth: Rearrange some code to avoid forward declarations phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs drivers/net/phy: introduce missing kfree drivers/net/wan: introduce missing kfree net: force bridge module(s) to be GPL Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded ... Fixed up trivial conflicts: - arch/x86/include/asm/socket.h converted to <asm-generic/socket.h> in the x86 tree. The generic header has the same new #define's, so that works out fine. - drivers/net/tun.c fix conflict between `89f56d1e9` ("tun: reuse struct sock fields") that switched over to using 'tun->socket.sk' instead of the redundantly available (and thus removed) 'tun->sk', and `2b980dbd` ("lsm: Add hooks to the TUN driver") which added a new 'tun->sk' use. Noted in 'next' by Stephen Rothwell.	2009-09-14 10:37:28 -07:00
Eric Dumazet	6ce9e7b5fe	ip: Report qdisc packet drops Christoph Lameter pointed out that packet drops at qdisc level where not accounted in SNMP counters. Only if application sets IP_RECVERR, drops are reported to user (-ENOBUFS errors) and SNMP counters updated. IP_RECVERR is used to enable extended reliable error message passing, but these are not needed to update system wide SNMP stats. This patch changes things a bit to allow SNMP counters to be updated, regardless of IP_RECVERR being set or not on the socket. Example after an UDP tx flood # netstat -s ... IP: 1487048 outgoing packets dropped ... Udp: ... SndbufErrors: 1487048 send() syscalls, do however still return an OK status, to not break applications. Note : send() manual page explicitly says for -ENOBUFS error : "The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.) " This is not true for IP_RECVERR enabled sockets : a send() syscall that hit a qdisc drop returns an ENOBUFS error. Many thanks to Christoph, David, and last but not least, Alexey ! Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-09-02 18:05:33 -07:00
Julien TINNES	788d908f28	ipv4: make ip_append_data() handle NULL routing table Add a check in ip_append_data() for NULL *rtp to prevent future bugs in callers from being exploitable. Signed-off-by: Julien Tinnes <julien@cr0.org> Signed-off-by: Tavis Ormandy <taviso@sdf.lonestar.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-08-27 12:23:43 -07:00
Eric Dumazet	e51a67a9c8	net: ip_push_pending_frames() fix After commit `2b85a34e91` (net: No more expensive sock_hold()/sock_put() on each tx) we do not take any more references on sk->sk_refcnt on outgoing packets. I forgot to delete two __sock_put() from ip_push_pending_frames() and ip6_push_pending_frames(). Reported-by: Emil S Tantilov <emils.tantilov@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Emil S Tantilov <emils.tantilov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-07-11 20:26:21 -07:00
Eric Dumazet	2b85a34e91	net: No more expensive sock_hold()/sock_put() on each tx One of the problem with sock memory accounting is it uses a pair of sock_hold()/sock_put() for each transmitted packet. This slows down bidirectional flows because the receive path also needs to take a refcount on socket and might use a different cpu than transmit path or transmit completion path. So these two atomic operations also trigger cache line bounces. We can see this in tx or tx/rx workloads (media gateways for example), where sock_wfree() can be in top five functions in profiles. We use this sock_hold()/sock_put() so that sock freeing is delayed until all tx packets are completed. As we also update sk_wmem_alloc, we could offset sk_wmem_alloc by one unit at init time, until sk_free() is called. Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc) to decrement initial offset and atomicaly check if any packets are in flight. skb_set_owner_w() doesnt call sock_hold() anymore sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc reached 0 to perform the final freeing. Drawback is that a skb->truesize error could lead to unfreeable sockets, or even worse, prematurely calling __sk_free() on a live socket. Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt contention point. 5 % speedup on a UDP transmit workload (depends on number of flows), lowering TX completion cpu usage. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-06-11 02:55:43 -07:00
David S. Miller	d7fcf1a5ca	ipv4: Use frag list abstraction interfaces. Signed-off-by: David S. Miller <davem@davemloft.net>	2009-06-09 00:19:37 -07:00
Eric Dumazet	adf30907d6	net: skb->dst accessors Define three accessors to get/set dst attached to a skb struct dst_entry skb_dst(const struct sk_buff skb) void skb_dst_set(struct sk_buff skb, struct dst_entry dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-06-03 02:51:04 -07:00
Eric Dumazet	511c3f92ad	net: skb->rtable accessor Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb Delete skb->rtable field Setting rtable is not allowed, just set dst instead as rtable is an alias. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-06-03 02:51:02 -07:00
Neil Horman	edf391ff17	snmp: add missing counters for RFC 4293 The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and OutMcastOctets: http://tools.ietf.org/html/rfc4293 But it seems we don't track those in any way that easy to separate from other protocols. This patch adds those missing counters to the stats file. Tested successfully by me With help from Eric Dumazet. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-04-27 02:45:02 -07:00
Patrick Ohly	51f31cabe3	ip: support for TX timestamps on UDP and RAW sockets Instructions for time stamping outgoing packets are take from the socket layer and later copied into the new skb. Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-02-15 22:43:38 -08:00
Eric Dumazet	a21bba9454	net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() We can reduce pressure on dst entry refcount that slowdown UDP transmit path on SMP machines. This pressure is visible on RTP servers when delivering content to mediagateways, especially big ones, handling thousand of streams. Several cpus send UDP frames to the same destination, hence use the same dst entry. This patch makes ip_push_pending_frames() steal the refcount its callers had to take when filling inet->cork.dst. This doesnt avoid all refcounting, but still gives speedups on SMP, on UDP/RAW transmit path. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-24 16:07:50 -08:00
Eric Dumazet	2e77d89b2f	net: avoid a pair of dst_hold()/dst_release() in ip_append_data() We can reduce pressure on dst entry refcount that slowdown UDP transmit path on SMP machines. This pressure is visible on RTP servers when delivering content to mediagateways, especially big ones, handling thousand of streams. Several cpus send UDP frames to the same destination, hence use the same dst entry. This patch makes ip_append_data() eventually steal the refcount its callers had to take on the dst entry. This doesnt avoid all refcounting, but still gives speedups on SMP, on UDP/RAW transmit path Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-24 15:52:46 -08:00
Jianjun Kong	d9319100c1	net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-11-03 00:23:42 -08:00
KOVACS Krisztian	86b08d867d	ipv4: Make Netfilter's ip_route_me_harder() non-local address compatible Netfilter's ip_route_me_harder() tries to re-route packets either generated or re-routed by Netfilter. This patch changes ip_route_me_harder() to handle packets from non-locally-bound sockets with IP_TRANSPARENT set as local and to set the appropriate flowi flags when re-doing the routing lookup. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-01 07:44:42 -07:00
Ilpo Järvinen	547b792cac	net: convert BUG_TRAP to generic WARN_ON Removes legacy reinvent-the-wheel type thing. The generic machinery integrates much better to automated debugging aids such as kerneloops.org (and others), and is unambiguous due to better naming. Non-intuively BUG_TRAP() is actually equal to WARN_ON() rather than BUG_ON() though some might actually be promoted to BUG_ON() but I left that to future. I could make at least one BUILD_BUG_ON conversion. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-25 21:43:18 -07:00
Pavel Emelyanov	5e38e27044	mib: add net to IP_INC_STATS All the callers already have either the net itself, or the place where to get it from. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:19:49 -07:00
Pavel Emelyanov	0388b00426	icmp: add struct net argument to icmp_out_count This routine deals with ICMP statistics, but doesn't have a struct net at hands, so add one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-14 23:05:13 -07:00
Adrian Bunk	0b04082995	net: remove CVS keywords This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-06-11 21:00:38 -07:00
Kostya B	be9164e769	[IPv4] UFO: prevent generation of chained skb destined to UFO device Problem: ip_append_data() could wrongly generate a chained skb for devices which support UFO. When sk_write_queue is not empty (e.g. MSG_MORE), __instead__ of appending data into the next nr_frag of the queued skb, a new chained skb is created. I would normally assume UFO device should get data in nr_frags and not in frag_list. Later the udp4_hwcsum_outgoing() resets csum to NONE and skb_gso_segment() has oops. Proposal: 1. Even length is less than mtu, employ ip_ufo_append_data() and append data to the __existed__ skb in the sk_write_queue. 2. ip_ufo_append_data() is fixed due to a wrong manipulation of peek-ing and later enqueue-ing of the same skb. Now, enqueuing is always performed, because on error the further ip_flush_pending_frames() would release the queued skb. Signed-off-by: Kostya B <bkostya@hotmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-29 22:36:30 -07:00
YOSHIFUJI Hideaki	3b1e0a655f	[NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS. Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-03-26 04:39:55 +09:00
YOSHIFUJI Hideaki	c8cdaf998d	[IPV4,IPV6]: Share cork.rt between IPv4 and IPv6. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>	2008-03-25 10:23:59 +09:00
Denis V. Lunev	cb84663e4d	[NETNS]: Process IP layer in the context of the correct namespace. Replace all the rest of the init_net with a proper net on the IP layer. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-24 15:31:00 -07:00
Eric Dumazet	ee6b967301	[IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts (Anonymous) unions can help us to avoid ugly casts. A common cast it the (struct rtable )skb->dst one. Defining an union like : union { struct dst_entry dst; struct rtable *rtable; }; permits to use skb->rtable in place. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-05 18:30:47 -08:00
Laszlo Attila Toth	4a19ec5800	[NET]: Introducing socket mark socket option. A userspace program may wish to set the mark for each packets its send without using the netfilter MARK target. Changing the mark can be used for mark based routing without netfilter or for packet filtering. It requires CAP_NET_ADMIN capability. Signed-off-by: Laszlo Attila Toth <panther@balabit.hu> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:19 -08:00
Herbert Xu	29ffe1a5c5	[INET]: Prevent out-of-sync truesize on ip_fragment slow path When ip_fragment has to hit the slow path the value of skb->truesize may go out of sync because we would have updated it without changing the packet length. This violates the constraints on truesize. This patch postpones the update of skb->truesize to prevent this. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-31 19:27:07 -08:00
Denis V. Lunev	dde1bc0e6f	[NETNS]: Add namespace for ICMP replying code. All needed API is done, the namespace is available when required from the device on the DST entry from the incoming packet. So, just replace init_net with proper namespace. Other protocols will follow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:11:13 -08:00
Denis V. Lunev	f206351a50	[NETNS]: Add namespace parameter to ip_route_output_key. Needed to propagate it down to the ip_route_output_flow. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:11:07 -08:00
Denis V. Lunev	f1b050bf7a	[NETNS]: Add namespace parameter to ip_route_output_flow. Needed to propagate it down to the __ip_route_output_key. Signed_off_by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:11:06 -08:00
Ilpo Järvinen	a067d9ac39	[NET]: Remove obsolete comment It seems that ip_build_xmit is no longer used in here and ip_append_data is used. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 15:00:45 -08:00
Patrick McHardy	6e23ae2a48	[NETFILTER]: Introduce NF_INET_ hook values The IPv4 and IPv6 hook values are identical, yet some code tries to figure out the "correct" value by looking at the address family. Introduce NF_INET_* values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__ section for userspace compatibility. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:53:55 -08:00
Herbert Xu	c439cb2e4b	[IPV4]: Add ip_local_out Most callers of the LOCAL_OUT chain will set the IP packet length and header checksum before doing so. They also share the same output function dst_output. This patch creates a new function called ip_local_out which does all of that and converts the appropriate users over to it. Apart from removing duplicate code, it will also help in merging the IPsec output path once the same thing is done for IPv6. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-28 14:53:47 -08:00
Herbert Xu	f945fa7ad9	[INET]: Fix truesize setting in ip_append_data As it is ip_append_data only counts page fragments to the skb that allocated it. As such it means that the first skb gets hit with a 4K charge even though it might have only used a fraction of it while all subsequent skb's that use the same page gets away with no charge at all. This bug was exposed by the UDP accounting patch. [ The wmem_alloc bumping needs to be moved with the truesize, noticed by Takahiro Yasui. -DaveM ] Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-23 03:11:43 -08:00
David S. Miller	1e34a11d55	[IPV4]: Add missing skb->truesize increment in ip_append_page(). And as noted by Takahiro Yasui, we thus need to bump the sk->sk_wmem_alloc at this spot as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-01-23 03:11:40 -08:00
Pavel Emelyanov	429f08e950	[IPV4]: Consolidate the ip cork destruction in ip_output.c The ip_push_pending_frames and ip_flush_pending_frames do the same things to flush the sock's cork. Move this into a separate function and save ~80 bytes from the .text Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-11-07 04:08:25 -08:00
Chuck Lever	c2636b4d9e	[NET]: Treat the sign of the result of skb_headroom() consistently In some places, the result of skb_headroom() is compared to an unsigned integer, and in others, the result is compared to a signed integer. Make the comparisons consistent and correct. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-23 21:27:55 -07:00
Patrick McHardy	861d048607	[IPV4]: Uninline netfilter okfns Now that we don't pass double skb pointers to nf_hook_slow anymore, gcc can generate tail calls for some of the netfilter hook okfn invocations, so there is no need to inline the functions anymore. This caused huge code bloat since we ended up with one inlined version and one out-of-line version since we pass the address to nf_hook_slow. Before: text data bss dec hex filename 8997385 1016524 524652 10538561 a0ce41 vmlinux After: text data bss dec hex filename 8994009 1016524 524652 10535185 a0c111 vmlinux ------------------------------------------------------- -3376 All cases have been verified to generate tail-calls with and without netfilter. The okfns in ipmr and xfrm4_input still remain inline because gcc can't generate tail-calls for them. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-15 12:26:35 -07:00
Stephen Hemminger	3b04ddde02	[NET]: Move hardware header operations out of netdevice. Since hardware header operations are part of the protocol class not the device instance, make them into a separate object and save memory. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:52:52 -07:00
David L Stevens	96793b4825	[IPV4]: Add ICMPMsgStats MIB (RFC 4293) Background: RFC 4293 deprecates existing individual, named ICMP type counters to be replaced with the ICMPMsgStatsTable. This table includes entries for both IPv4 and IPv6, and requires counting of all ICMP types, whether or not the machine implements the type. These patches "remove" (but not really) the existing counters, and replace them with the ICMPMsgStats tables for v4 and v6. It includes the named counters in the /proc places they were, but gets the values for them from the new tables. It also counts packets generated from raw socket output (e.g., OutEchoes, MLD queries, RA's from radvd, etc). Changes: 1) create icmpmsg_statistics mib 2) create icmpv6msg_statistics mib 3) modify existing counters to use these 4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types listed by number for easy SNMP parsing 5) modify /proc/net/snmp printing for "Icmp" to get the named data from new counters. Signed-off-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-10-10 16:51:28 -07:00
Jesper Juhl	f49f9967b2	[IPV4]: Clean up duplicate includes in net/ipv4/ This patch cleans up duplicate includes in net/ipv4/ Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-08-13 22:52:02 -07:00
Jozsef Kadlecsik	ba9dda3ab5	[NETFILTER]: x_tables: add TRACE target The TRACE target can be used to follow IP and IPv6 packets through the ruleset. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Patrick NcHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-10 22:17:14 -07:00
Stephen Hemminger	d212f87b06	[NET]: IPV6 checksum offloading in network devices The existing model for checksum offload does not correctly handle devices that can offload IPV4 and IPV6 only. The NETIF_F_HW_CSUM flag implies device can do any arbitrary protocol. This patch: * adds NETIF_F_IPV6_CSUM for those devices * fixes bnx2 and tg3 devices that need it * add NETIF_F_IPV6_CSUM to ipv6 output (incl GSO) * fixes assumptions about NETIF_F_ALL_CSUM in nat * adjusts bridge union of checksumming computation Signed-off-by: David S. Miller <davem@davemloft.net>	2007-07-10 22:15:52 -07:00
Patrick McHardy	f0e48dbfc5	[TCP]: Honour sk_bound_dev_if in tcp_v4_send_ack A time_wait socket inherits sk_bound_dev_if from the original socket, but it is not used when sending ACK packets using ip_send_reply. Fix by passing the oif to ip_send_reply in struct ip_reply_arg and use it for output routing. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-06-07 13:38:51 -07:00
Mitsuru Chinen	80787ebc2b	[IPV4] SNMP: Support OutMcastPkts and OutBcastPkts A transmitted IP multicast datagram should be counted as OutMcastPkts. By the same token, a transmitted IP broadcast datagram should be counted as OutBcastPkts. Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-30 00:58:32 -07:00
John Heffner	628a5c5618	[INET]: Add IP(V6)_PMTUDISC_RPOBE Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces us not to fragment, but does not make use of the kernel path MTU discovery. That is, it allows for user-mode MTU probing (or, packetization-layer path MTU discovery). This is particularly useful for diagnostic utilities, like traceroute/tracepath. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:29:10 -07:00
Arnaldo Carvalho de Melo	d626f62b11	[SK_BUFF]: Introduce skb_copy_from_linear_data{_offset} To clearly state the intent of copying from linear sk_buffs, _offset being a overly long variant but interesting for the sake of saving some bytes. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2007-04-25 22:28:23 -07:00
Arnaldo Carvalho de Melo	b0e380b1d8	[SK_BUFF]: unions of just one member don't get anything done, kill them Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and skb->mac to skb->mac_header, to match the names of the associated helpers (skb[_[re]set]_{transport,network,mac}_header). Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:26:20 -07:00
Arnaldo Carvalho de Melo	cfe1fc7759	[SK_BUFF]: Introduce skb_network_header_len For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len, that is precalculated tho, don't think we need to bloat skb with one more member, so just use this new helper, reducing the number of non-skbuff.h references to the layer headers even more. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:26:19 -07:00
Arnaldo Carvalho de Melo	bff9b61ce3	[SK_BUFF]: Use the helpers to get the layer header pointer Some more cases... Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:26:18 -07:00
Yasuyuki Kozakai	e7ac05f340	[NETFILTER]: nf_conntrack: add nf_copy() to safely copy members in skb This unifies the codes to copy netfilter related datas. Before copying, nf_copy() puts original members in destination skb. Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:55 -07:00
Arnaldo Carvalho de Melo	9c70220b73	[SK_BUFF]: Introduce skb_transport_header(skb) For the places where we need a pointer to the transport header, it is still legal to touch skb->h.raw directly if just adding to, subtracting from or setting it to another layer header. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:31 -07:00
Arnaldo Carvalho de Melo	aa8223c7bb	[SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo	967b05f64e	[SK_BUFF]: Introduce skb_set_transport_header For the cases where the transport header is being set to a offset from skb->data. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:17 -07:00
Arnaldo Carvalho de Melo	badff6d01a	[SK_BUFF]: Introduce skb_reset_transport_header(skb) For the common, open coded 'skb->h.raw = skb->data' operation, so that we can later turn skb->h.raw into a offset, reducing the size of struct sk_buff in 64bit land while possibly keeping it as a pointer on 32bit. This one touches just the most simple cases: skb->h.raw = skb->data; skb->h.raw = {skb_push\|[__]skb_pull}() The next ones will handle the slightly more "complex" cases. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:15 -07:00
Arnaldo Carvalho de Melo	eddc9ec53b	[SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:10 -07:00
Arnaldo Carvalho de Melo	c14d2450cb	[SK_BUFF]: Introduce skb_set_network_header For the cases where the network header is being set to a offset from skb->data. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:25:01 -07:00
Arnaldo Carvalho de Melo	d56f90a7c9	[SK_BUFF]: Introduce skb_network_header() For the places where we need a pointer to the network header, it is still legal to touch skb->nh.raw directly if just adding to, subtracting from or setting it to another layer header. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:59 -07:00
Arnaldo Carvalho de Melo	bbe735e424	[SK_BUFF]: Introduce skb_network_offset() For the quite common 'skb->nh.raw - skb->data' sequence. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:58 -07:00
Arnaldo Carvalho de Melo	8856dfa3e9	[SK_BUFF]: Use skb_reset_network_header after skb_push Some more cases where skb->nh.iph was being set that were converted to using skb_reset_network_header. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:53 -07:00
Arnaldo Carvalho de Melo	2ca9e6f2c2	[SK_BUFF]: Some more skb_put cases converted to skb_reset_network_header Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:51 -07:00
Arnaldo Carvalho de Melo	e2d1bca7e6	[SK_BUFF]: Use skb_reset_network_header in skb_push cases skb_push updates and returns skb->data, so we can just call skb_reset_network_header after the call to skb_push. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:47 -07:00
Arnaldo Carvalho de Melo	c1d2bbe1cd	[SK_BUFF]: Introduce skb_reset_network_header(skb) For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in 64bit land while possibly keeping it as a pointer on 32bit. This one touches just the most simple case, next will handle the slightly more "complex" cases. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:46 -07:00
Arnaldo Carvalho de Melo	459a98ed88	[SK_BUFF]: Introduce skb_reset_mac_header(skb) For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in 64bit land while possibly keeping it as a pointer on 32bit. This one touches just the most simple case, next will handle the slightly more "complex" cases. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:32 -07:00
Stephen Hemminger	132adf5463	[IPV4]: cleanup Add whitespace around keywords. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-04-25 22:24:11 -07:00
Tim Schmielau	cd354f1ae7	[PATCH] remove many unneeded #includes of sched.h After Al Viro (finally) succeeded in removing the sched.h #include in module.h recently, it makes sense again to remove other superfluous sched.h includes. There are quite a lot of files which include it but don't actually need anything defined in there. Presumably these includes were once needed for macros that used to live in sched.h, but moved to other header files in the course of cleaning it up. To ease the pain, this time I did not fiddle with any header files and only removed #includes from .c-files, which tend to cause less trouble. Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha, arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig, allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all configs in arch/arm/configs on arm. I also checked that no new warnings were introduced by the patch (actually, some warnings are removed that were emitted by unnecessarily included header files). Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-02-14 08:09:54 -08:00
YOSHIFUJI Hideaki	e905a9edab	[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2007-02-10 23:19:39 -08:00
David S. Miller	e89862f4c5	[TCP]: Restore SKB socket owner setting in tcp_transmit_skb(). Revert `931731123a` We can't elide the skb_set_owner_w() here because things like certain netfilter targets (such as owner MATCH) need a socket to be set on the SKB for correct operation. Thanks to Jan Engelhardt and other netfilter list members for pointing this out. Signed-off-by: David S. Miller <davem@davemloft.net>	2007-01-26 01:04:55 -08:00
Stephen Hemminger	3644f0cee7	[NET]: Convert hh_lock to seqlock. The hard header cache is in the main output path, so using seqlock instead of reader/writer lock should reduce overhead. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-08 17:19:20 -08:00
Al Viro	a1f8e7f7fb	[PATCH] severing skbuff.h -> highmem.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2006-12-04 02:00:29 -05:00
Al Viro	5084205faf	[NET]: Annotate callers of csum_partial_copy_...() and csum_and_copy...() in net/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:23:33 -08:00
Al Viro	44bb93633f	[NET]: Annotate csum_partial() callers in net/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:23:32 -08:00
Al Viro	d3bc23e7ee	[NET]: Annotate callers of csum_fold() in net/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:23:27 -08:00
Al Viro	714e85be35	[IPV6]: Assorted trivial endianness annotations. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:22:50 -08:00
David S. Miller	931731123a	[TCP]: Don't set SKB owner in tcp_transmit_skb(). The data itself is already charged to the SKB, doing the skb_set_owner_w() just generates a lot of noise and extra atomics we don't really need. Lmbench improvements on lat_tcp are minimal: before: TCP latency using localhost: 23.2701 microseconds TCP latency using localhost: 23.1994 microseconds TCP latency using localhost: 23.2257 microseconds after: TCP latency using localhost: 22.8380 microseconds TCP latency using localhost: 22.9465 microseconds TCP latency using localhost: 22.8462 microseconds Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:21:52 -08:00
Thomas Graf	82e91ffef6	[NET]: Turn nfmark into generic mark nfmark is being used in various subsystems and has become the defacto mark field for all kinds of packets. Therefore it makes sense to rename it to `mark' and remove the dependency on CONFIG_NETFILTER. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-12-02 21:21:38 -08:00
Al Viro	3ca3c68e76	[IPV4]: struct ip_options annotations ->faddr is net-endian; annotated as such, variables inferred to be net-endian annotated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-09-28 18:01:53 -07:00
Al Viro	13d8eaa06a	[IPV4]: ip_build_and_send_pkt() annotations saddr and daddr are net-endian Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-09-28 18:01:19 -07:00
Stephen Hemminger	9bcfcaf5e9	[NETFILTER] bridge: simplify nf_bridge_pad Do some simple optimization on the nf_bridge_pad() function and don't use magic constants. Eliminate a double call and the #ifdef'd code for CONFIG_BRIDGE_NETFILTER. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-09-22 15:18:31 -07:00
Brian Haley	ab32ea5d8a	[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly. Couldn't actually measure any performance increase while testing (.3% I consider noise), but seems like the right thing to do. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-09-22 14:55:03 -07:00
Patrick McHardy	84fa7933a3	[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose checksum still needs to be completed) and CHECKSUM_COMPLETE (for incoming packets, device supplied full checksum). Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-09-22 14:53:53 -07:00
Venkat Yekkirala	beb8d13bed	[MLSXFRM]: Add flow labeling This labels the flows that could utilize IPSec xfrms at the points the flows are defined so that IPSec policy and SAs at the right label can be used. The following protos are currently not handled, but they should continue to be able to use single-labeled IPSec like they currently do. ipmr ip_gre ipip igmp sit sctp ip6_tunnel (IPv6 over IPv6 tunnel device) decnet Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-09-22 14:53:27 -07:00
Wei Dong	0668b47205	[IPV4]: Fix SNMPv2 "ipFragFails" counter error When I tested Linux kernel 2.6.17.7 about statistics "ipFragFails",found that this counter couldn't increase correctly. The criteria is RFC2011: RFC2011 ipFragFails OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The number of IP datagrams that have been discarded because they needed to be fragmented at this entity but could not be, e.g., because their Don't Fragment flag was set." ::= { ip 18 } When I send big IP packet to a router with DF bit set to 1 which need to be fragmented, and router just sends an ICMP error message ICMP_FRAG_NEEDED but no increments for this counter(in the function ip_fragment). Signed-off-by: Wei Dong <weid@nanjing-fnst.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-08-31 15:24:48 -07:00
Herbert Xu	e9fa4f7bd2	[INET]: Use pskb_trim_unique when trimming paged unique skbs The IPv4/IPv6 datagram output path was using skb_trim to trim paged packets because they know that the packet has not been cloned yet (since the packet hasn't been given to anything else in the system). This broke because skb_trim no longer allows paged packets to be trimmed. Paged packets must be given to one of the pskb_trim functions instead. This patch adds a new pskb_trim_unique function to cover the IPv4/IPv6 datagram output path scenario and replaces the corresponding skb_trim calls with it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-08-13 20:12:58 -07:00
Wei Dong	dafee49085	[IPV6]: SNMPv2 "ipv6IfStatsOutFragCreates" counter error When I tested linux kernel 2.6.71.7 about statistics "ipv6IfStatsOutFragCreates", and found that it couldn't increase correctly. The criteria is RFC 2465: ipv6IfStatsOutFragCreates OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The number of output datagram fragments that have been generated as a result of fragmentation at this output interface." ::= { ipv6IfStatsEntry 15 } I think there are two issues in Linux kernel. 1st: RFC2465 specifies the counter is "The number of output datagram fragments...". I think increasing this counter after output a fragment successfully is better. And it should not be increased even though a fragment is created but failed to output. 2nd: If we send a big ICMP/ICMPv6 echo request to a host, and receive ICMP/ICMPv6 echo reply consisted of some fragments. As we know that in Linux kernel first fragmentation occurs in ICMP layer(maybe saying transport layer is better), but this is not the "real" fragmentation,just do some "pre-fragment" -- allocate space for date, and form a frag_list, etc. The "real" fragmentation happens in IP layer -- set offset and MF flag and so on. So I think in "fast path" for ip_fragment/ip6_fragment, if we send a fragment which "pre-fragment" by upper layer we should also increase "ipv6IfStatsOutFragCreates". Signed-off-by: Wei Dong <weid@nanjing-fnst.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-08-02 13:41:21 -07:00
Herbert Xu	89114afd43	[NET] gso: Add skb_is_gso This patch adds the wrapper function skb_is_gso which can be used instead of directly testing skb_shinfo(skb)->gso_size. This makes things a little nicer and allows us to change the primary key for indicating whether an skb is GSO (if we ever want to do that). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-07-08 13:34:32 -07:00
Linus Torvalds	e37a72de84	Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 * master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: [IPV6]: Added GSO support for TCPv6 [NET]: Generalise TSO-specific bits from skb_setup_caps [IPV6]: Added GSO support for TCPv6 [IPV6]: Remove redundant length check on input [NETFILTER]: SCTP conntrack: fix crash triggered by packet without chunks [TG3]: Update version and reldate [TG3]: Add TSO workaround using GSO [TG3]: Turn on hw fix for ASF problems [TG3]: Add rx BD workaround [TG3]: Add tg3_netif_stop() in vlan functions [TCP]: Reset gso_segs if packet is dodgy	2006-06-30 15:40:17 -07:00
Herbert Xu	f83ef8c0b5	[IPV6]: Added GSO support for TCPv6 This patch adds GSO support for IPv6 and TCPv6. This is based on a patch by Ananda Raju <Ananda.Raju@neterion.com>. His original description is: This patch enables TSO over IPv6. Currently Linux network stacks restricts TSO over IPv6 by clearing of the NETIF_F_TSO bit from "dev->features". This patch will remove this restriction. This patch will introduce a new flag NETIF_F_TSO6 which will be used to check whether device supports TSO over IPv6. If device support TSO over IPv6 then we don't clear of NETIF_F_TSO and which will make the TCP layer to create TSO packets. Any device supporting TSO over IPv6 will set NETIF_F_TSO6 flag in "dev->features" along with NETIF_F_TSO. In case when user disables TSO using ethtool, NETIF_F_TSO will get cleared from "dev->features". So even if we have NETIF_F_TSO6 we don't get TSO packets created by TCP layer. SKB_GSO_TCPV4 renamed to SKB_GSO_TCP to make it generic GSO packet. SKB_GSO_UDPV4 renamed to SKB_GSO_UDP as UFO is not a IPv4 feature. UFO is supported over IPv6 also The following table shows there is significant improvement in throughput with normal frames and CPU usage for both normal and jumbo. -------------------------------------------------- \| \| 1500 \| 9600 \| \| ------------------\|-------------------\| \| \| thru CPU \| thru CPU \| -------------------------------------------------- \| TSO OFF \| 2.00 5.5% id \| 5.66 20.0% id \| -------------------------------------------------- \| TSO ON \| 2.63 78.0 id \| 5.67 39.0% id \| -------------------------------------------------- Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-30 14:12:10 -07:00
Jörn Engel	6ab3d5624e	Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>	2006-06-30 19:25:36 +02:00
Herbert Xu	7967168cef	[NET]: Merge TSO/UFO fields in sk_buff Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet. This function has been subsumed by the new gso_type field. This is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just have to and the gso_type field and the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO \| NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 \| SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-23 02:07:29 -07:00
Herbert Xu	8648b3053b	[NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUM The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM identically so we test for them in quite a few places. For the sake of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two. We also test the disjunct of NETIF_F_IP_CSUM and the other two in various places, for that purpose I've added NETIF_F_ALL_CSUM. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 22:06:05 -07:00
James Morris	984bc16cc9	[SECMARK]: Add secmark support to core networking. Add a secmark field to the skbuff structure, to allow security subsystems to place security markings on network packets. This is similar to the nfmark field, except is intended for implementing security policy, rather than than networking policy. This patch was already acked in principle by Dave Miller. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-06-17 21:29:57 -07:00
Zach Brown	3d9dd7564d	[PATCH] ip_output: account for fraggap when checking to add trailer_len During other work I noticed that ip_append_data() seemed to be forgetting to include the frag gap in its calculation of a fragment that consumes the rest of the payload. Herbert confirmed that this was a bug that snuck in during a previous rework. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-04-14 16:04:18 -07:00
Patrick McHardy	2e2f7aefa8	[NETFILTER]: Fix fragmentation issues with bridge netfilter The conntrack code doesn't do re-fragmentation of defragmented packets anymore but relies on fragmentation in the IP layer. Purely bridged packets don't pass through the IP layer, so the bridge netfilter code needs to take care of fragmentation itself. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-04-09 22:25:23 -07:00
Alexey Kuznetsov	1a55d57b10	[TCP]: Do not use inet->id of global tcp_socket when sending RST. The problem is in ip_push_pending_frames(), which uses: if (!df) { __ip_select_ident(iph, &rt->u.dst, 0); } else { iph->id = htons(inet->id++); } instead of ip_select_ident(). Right now I think the code is a nonsense. Most likely, I copied it from old ip_build_xmit(), where it was really special, we had to decide whether to generate unique ID when generating the first (well, the last) fragment. In ip_push_pending_frames() it does not make sense, it should use plain ip_select_ident() instead. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-03-22 14:27:59 -08:00
Patrick McHardy	baa829d892	[IPV4/6]: Fix UFO error propagation When ufo_append_data fails err is uninitialized, but returned back. Strangely gcc doesn't notice it. Coverity #901 and #902 Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-03-12 20:39:40 -08:00
Patrick McHardy	48d5cad87c	[XFRM]: Fix SNAT-related crash in xfrm4_output_finish When a packet matching an IPsec policy is SNATed so it doesn't match any policy anymore it looses its xfrm bundle, which makes xfrm4_output_finish crash because of a NULL pointer dereference. This patch directs these packets to the original output path instead. Since the packets have already passed the POST_ROUTING hook, but need to start at the beginning of the original output path which includes another POST_ROUTING invocation, a flag is added to the IPCB to indicate that the packet was rerouted and doesn't need to pass the POST_ROUTING hook again. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-02-15 15:10:22 -08:00
Patrick McHardy	cfacb0577e	[IPV4]: ip_output.c needs xfrm.h This patch fixes a warning from my IPsec patches: CC net/ipv4/ip_output.o net/ipv4/ip_output.c: In function 'ip_finish_output': net/ipv4/ip_output.c:208: warning: implicit declaration of function 'xfrm4_output_finish' Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-09 14:16:28 -08:00
Adrian Bunk	97dc627fb3	[IPV4]: make ip_fragment() static Since there's no longer any external user of ip_fragment() we can make it static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-07 13:23:39 -08:00
Patrick McHardy	5c901daaea	[NETFILTER]: Redo policy lookups after NAT when neccessary When NAT changes the key used for the xfrm lookup it needs to be done again. If a new policy is returned in POST_ROUTING the packet needs to be passed to xfrm4_output_one manually after all hooks were called because POST_ROUTING is called with fixed okfn (ip_finish_output). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-07 12:57:35 -08:00
Alexey Dobriyan	76ab608d86	[NET]: Endian-annotate struct iphdr And fix trivial warnings that emerged. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-06 13:24:29 -08:00
Patrick McHardy	1bd9bef6f9	[NETFILTER]: Call POST_ROUTING hook before fragmentation Call POST_ROUTING hook before fragmentation to get rid of the okfn use in ip_refrag and save the useless fragmentation/defragmentation step when NAT is used. The patch introduces one user-visible change, the POSTROUTING chain in the mangle table gets entire packets, not fragments, which should simplify use of the MARK and CLASSIFY targets for queueing as a nice side-effect. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-05 12:20:59 -08:00
Herbert Xu	89cee8b1cb	[IPV4]: Safer reassembly Another spin of Herbert Xu's "safer ip reassembly" patch for 2.6.16. (The original patch is here: http://marc.theaimsgroup.com/?l=linux-netdev&m=112281936522415&w=2 and my only contribution is to have tested it.) This patch (optionally) does additional checks before accepting IP fragments, which can greatly reduce the possibility of reassembling fragments which originated from different IP datagrams. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arthur Kepner <akepner@sgi.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2006-01-03 13:10:31 -08:00

... 3 4 5 6 7 ...

473 Commits