Commit graph

29 commits

Author SHA1 Message Date
Christophe Leroy
a1ae431705 powerpc: Use rol32() instead of opencoding in csum_fold()
rol32(x, 16) will do the rotate using rlwinm.

No need to open code using inline assembly.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/794337eff7bb803d2c4e67d9eee635390c4c48fe.1646812553.git.christophe.leroy@csgroup.eu
2022-05-08 22:15:40 +10:00
Christophe Leroy
f206fdd9d4 powerpc: Reduce csum_add() complexity for PPC64
PPC64 does everything in C, gcc is able to skip calculation
when one of the operands in zero.

Move the constant folding in PPC32 part.

This helps GCC and reduces ppc64_defconfig by 170 bytes.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a4ca63dd4c4b09e1906d08fb814af5a41d0f3fcb.1644651363.git.christophe.leroy@csgroup.eu
2022-05-06 00:00:20 +10:00
Christophe Leroy
3af722cb73 powerpc/net: Implement powerpc specific csum_shift() to remove branch
Today's implementation of csum_shift() leads to branching based on
parity of 'offset'

	000002f8 <csum_block_add>:
	     2f8:	70 a5 00 01 	andi.   r5,r5,1
	     2fc:	41 a2 00 08 	beq     304 <csum_block_add+0xc>
	     300:	54 84 c0 3e 	rotlwi  r4,r4,24
	     304:	7c 63 20 14 	addc    r3,r3,r4
	     308:	7c 63 01 94 	addze   r3,r3
	     30c:	4e 80 00 20 	blr

Use first bit of 'offset' directly as input of the rotation instead of
branching.

	000002f8 <csum_block_add>:
	     2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
	     2fc:	20 a5 00 20 	subfic  r5,r5,32
	     300:	5c 84 28 3e 	rotlw   r4,r4,r5
	     304:	7c 63 20 14 	addc    r3,r3,r4
	     308:	7c 63 01 94 	addze   r3,r3
	     30c:	4e 80 00 20 	blr

And change to left shift instead of right shift to skip one more
instruction. This has no impact on the final sum.

	000002f8 <csum_block_add>:
	     2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
	     2fc:	5c 84 28 3e 	rotlw   r4,r4,r5
	     300:	7c 63 20 14 	addc    r3,r3,r4
	     304:	7c 63 01 94 	addze   r3,r3
	     308:	4e 80 00 20 	blr

Seems like only powerpc benefits from a branchless implementation.
Other main architectures like ARM or X86 get better code with
the generic implementation and its branch.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-11 10:57:22 +00:00
Christophe Leroy
4423eff71c powerpc: Force inlining of csum_add()
Commit 328e7e487a ("powerpc: force inlining of csum_partial() to
avoid multiple csum_partial() with GCC10") inlined csum_partial().

Now that csum_partial() is inlined, GCC outlines csum_add() when
called by csum_partial().

c064fb28 <csum_add>:
c064fb28:	7c 63 20 14 	addc    r3,r3,r4
c064fb2c:	7c 63 01 94 	addze   r3,r3
c064fb30:	4e 80 00 20 	blr

c0665fb8 <csum_add>:
c0665fb8:	7c 63 20 14 	addc    r3,r3,r4
c0665fbc:	7c 63 01 94 	addze   r3,r3
c0665fc0:	4e 80 00 20 	blr

c066719c:	7c 9a c0 2e 	lwzx    r4,r26,r24
c06671a0:	38 60 00 00 	li      r3,0
c06671a4:	7f 1a c2 14 	add     r24,r26,r24
c06671a8:	4b ff ee 11 	bl      c0665fb8 <csum_add>
c06671ac:	80 98 00 04 	lwz     r4,4(r24)
c06671b0:	4b ff ee 09 	bl      c0665fb8 <csum_add>
c06671b4:	80 98 00 08 	lwz     r4,8(r24)
c06671b8:	4b ff ee 01 	bl      c0665fb8 <csum_add>
c06671bc:	a0 98 00 0c 	lhz     r4,12(r24)
c06671c0:	4b ff ed f9 	bl      c0665fb8 <csum_add>
c06671c4:	7c 63 18 f8 	not     r3,r3
c06671c8:	81 3f 00 68 	lwz     r9,104(r31)
c06671cc:	81 5f 00 a0 	lwz     r10,160(r31)
c06671d0:	7d 29 18 14 	addc    r9,r9,r3
c06671d4:	7d 29 01 94 	addze   r9,r9
c06671d8:	91 3f 00 68 	stw     r9,104(r31)
c06671dc:	7d 1a 50 50 	subf    r8,r26,r10
c06671e0:	83 01 00 10 	lwz     r24,16(r1)
c06671e4:	83 41 00 18 	lwz     r26,24(r1)

The sum with 0 is useless, should have been skipped.
And there is even one completely unused instance of csum_add().

In file included from ./include/net/checksum.h:22,
                 from ./include/linux/skbuff.h:28,
                 from ./include/linux/icmp.h:16,
                 from net/ipv6/ip6_tunnel.c:23:
./arch/powerpc/include/asm/checksum.h: In function '__ip6_tnl_rcv':
./arch/powerpc/include/asm/checksum.h:94:22: warning: inlining failed in call to 'csum_add': call is unlikely and code size would grow [-Winline]
   94 | static inline __wsum csum_add(__wsum csum, __wsum addend)
      |                      ^~~~~~~~
./arch/powerpc/include/asm/checksum.h:172:31: note: called from here
  172 |                         sum = csum_add(sum, (__force __wsum)*(const u32 *)buff);
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./arch/powerpc/include/asm/checksum.h:94:22: warning: inlining failed in call to 'csum_add': call is unlikely and code size would grow [-Winline]
   94 | static inline __wsum csum_add(__wsum csum, __wsum addend)
      |                      ^~~~~~~~
./arch/powerpc/include/asm/checksum.h:177:31: note: called from here
  177 |                         sum = csum_add(sum, (__force __wsum)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  178 |                                             *(const u32 *)(buff + 4));
      |                                             ~~~~~~~~~~~~~~~~~~~~~~~~~
./arch/powerpc/include/asm/checksum.h:94:22: warning: inlining failed in call to 'csum_add': call is unlikely and code size would grow [-Winline]
   94 | static inline __wsum csum_add(__wsum csum, __wsum addend)
      |                      ^~~~~~~~
./arch/powerpc/include/asm/checksum.h:183:31: note: called from here
  183 |                         sum = csum_add(sum, (__force __wsum)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  184 |                                             *(const u32 *)(buff + 8));
      |                                             ~~~~~~~~~~~~~~~~~~~~~~~~~
./arch/powerpc/include/asm/checksum.h:94:22: warning: inlining failed in call to 'csum_add': call is unlikely and code size would grow [-Winline]
   94 | static inline __wsum csum_add(__wsum csum, __wsum addend)
      |                      ^~~~~~~~
./arch/powerpc/include/asm/checksum.h:186:31: note: called from here
  186 |                         sum = csum_add(sum, (__force __wsum)
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  187 |                                             *(const u16 *)(buff + 12));
      |                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~

Force inlining of csum_add().

     94c:	80 df 00 a0 	lwz     r6,160(r31)
     950:	7d 28 50 2e 	lwzx    r9,r8,r10
     954:	7d 48 52 14 	add     r10,r8,r10
     958:	80 aa 00 04 	lwz     r5,4(r10)
     95c:	80 ff 00 68 	lwz     r7,104(r31)
     960:	7d 29 28 14 	addc    r9,r9,r5
     964:	7d 29 01 94 	addze   r9,r9
     968:	7d 08 30 50 	subf    r8,r8,r6
     96c:	80 aa 00 08 	lwz     r5,8(r10)
     970:	a1 4a 00 0c 	lhz     r10,12(r10)
     974:	7d 29 28 14 	addc    r9,r9,r5
     978:	7d 29 01 94 	addze   r9,r9
     97c:	7d 29 50 14 	addc    r9,r9,r10
     980:	7d 29 01 94 	addze   r9,r9
     984:	7d 29 48 f8 	not     r9,r9
     988:	7c e7 48 14 	addc    r7,r7,r9
     98c:	7c e7 01 94 	addze   r7,r7
     990:	90 ff 00 68 	stw     r7,104(r31)

In the non-inlined version, the first sum with 0 was performed.
Here it is skipped.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f7f4d4e364de6e473da874468b903da6e5d97adc.1620713272.git.christophe.leroy@csgroup.eu
2021-06-16 00:16:47 +10:00
Christophe Leroy
328e7e487a powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
ppc-linux-objdump -d vmlinux | grep -e "<csum_partial>" -e "<__csum_partial>"

With gcc9 I get:

	c0017ef8 <__csum_partial>:
	c00182fc:	4b ff fb fd 	bl      c0017ef8 <__csum_partial>
	c0018478:	4b ff fa 80 	b       c0017ef8 <__csum_partial>
	c03e8458:	4b c2 fa a0 	b       c0017ef8 <__csum_partial>
	c03e8518:	4b c2 f9 e1 	bl      c0017ef8 <__csum_partial>
	c03ef410:	4b c2 8a e9 	bl      c0017ef8 <__csum_partial>
	c03f0b24:	4b c2 73 d5 	bl      c0017ef8 <__csum_partial>
	c04279a4:	4b bf 05 55 	bl      c0017ef8 <__csum_partial>
	c0429820:	4b be e6 d9 	bl      c0017ef8 <__csum_partial>
	c0429944:	4b be e5 b5 	bl      c0017ef8 <__csum_partial>
	c042b478:	4b be ca 81 	bl      c0017ef8 <__csum_partial>
	c042b554:	4b be c9 a5 	bl      c0017ef8 <__csum_partial>
	c045f15c:	4b bb 8d 9d 	bl      c0017ef8 <__csum_partial>
	c0492190:	4b b8 5d 69 	bl      c0017ef8 <__csum_partial>
	c0492310:	4b b8 5b e9 	bl      c0017ef8 <__csum_partial>
	c0495594:	4b b8 29 65 	bl      c0017ef8 <__csum_partial>
	c049c420:	4b b7 ba d9 	bl      c0017ef8 <__csum_partial>
	c049c870:	4b b7 b6 89 	bl      c0017ef8 <__csum_partial>
	c049c930:	4b b7 b5 c9 	bl      c0017ef8 <__csum_partial>
	c04a9ca0:	4b b6 e2 59 	bl      c0017ef8 <__csum_partial>
	c04bdde4:	4b b5 a1 15 	bl      c0017ef8 <__csum_partial>
	c04be480:	4b b5 9a 79 	bl      c0017ef8 <__csum_partial>
	c04be710:	4b b5 97 e9 	bl      c0017ef8 <__csum_partial>
	c04c969c:	4b b4 e8 5d 	bl      c0017ef8 <__csum_partial>
	c04ca2fc:	4b b4 db fd 	bl      c0017ef8 <__csum_partial>
	c04cf5bc:	4b b4 89 3d 	bl      c0017ef8 <__csum_partial>
	c04d0440:	4b b4 7a b9 	bl      c0017ef8 <__csum_partial>

With gcc10 I get:

	c0018d08 <__csum_partial>:
	c0019020 <csum_partial>:
	c0019020:	4b ff fc e8 	b       c0018d08 <__csum_partial>
	c001914c:	4b ff fe d4 	b       c0019020 <csum_partial>
	c0019250:	4b ff fd d1 	bl      c0019020 <csum_partial>
	c03e404c <csum_partial>:
	c03e404c:	4b c3 4c bc 	b       c0018d08 <__csum_partial>
	c03e4050:	4b ff ff fc 	b       c03e404c <csum_partial>
	c03e40fc:	4b ff ff 51 	bl      c03e404c <csum_partial>
	c03e6680:	4b ff d9 cd 	bl      c03e404c <csum_partial>
	c03e68c4:	4b ff d7 89 	bl      c03e404c <csum_partial>
	c03e7934:	4b ff c7 19 	bl      c03e404c <csum_partial>
	c03e7bf8:	4b ff c4 55 	bl      c03e404c <csum_partial>
	c03eb148:	4b ff 8f 05 	bl      c03e404c <csum_partial>
	c03ecf68:	4b c2 bd a1 	bl      c0018d08 <__csum_partial>
	c04275b8 <csum_partial>:
	c04275b8:	4b bf 17 50 	b       c0018d08 <__csum_partial>
	c0427884:	4b ff fd 35 	bl      c04275b8 <csum_partial>
	c0427b18:	4b ff fa a1 	bl      c04275b8 <csum_partial>
	c0427bd8:	4b ff f9 e1 	bl      c04275b8 <csum_partial>
	c0427cd4:	4b ff f8 e5 	bl      c04275b8 <csum_partial>
	c0427e34:	4b ff f7 85 	bl      c04275b8 <csum_partial>
	c045a1c0:	4b bb eb 49 	bl      c0018d08 <__csum_partial>
	c0489464 <csum_partial>:
	c0489464:	4b b8 f8 a4 	b       c0018d08 <__csum_partial>
	c04896b0:	4b ff fd b5 	bl      c0489464 <csum_partial>
	c048982c:	4b ff fc 39 	bl      c0489464 <csum_partial>
	c0490158:	4b b8 8b b1 	bl      c0018d08 <__csum_partial>
	c0492f0c <csum_partial>:
	c0492f0c:	4b b8 5d fc 	b       c0018d08 <__csum_partial>
	c049326c:	4b ff fc a1 	bl      c0492f0c <csum_partial>
	c049333c:	4b ff fb d1 	bl      c0492f0c <csum_partial>
	c0493b18:	4b ff f3 f5 	bl      c0492f0c <csum_partial>
	c0493f50:	4b ff ef bd 	bl      c0492f0c <csum_partial>
	c0493ffc:	4b ff ef 11 	bl      c0492f0c <csum_partial>
	c04a0f78:	4b b7 7d 91 	bl      c0018d08 <__csum_partial>
	c04b3e3c:	4b b6 4e cd 	bl      c0018d08 <__csum_partial>
	c04b40d0 <csum_partial>:
	c04b40d0:	4b b6 4c 38 	b       c0018d08 <__csum_partial>
	c04b4448:	4b ff fc 89 	bl      c04b40d0 <csum_partial>
	c04b46f4:	4b ff f9 dd 	bl      c04b40d0 <csum_partial>
	c04bf448:	4b b5 98 c0 	b       c0018d08 <__csum_partial>
	c04c5264:	4b b5 3a a5 	bl      c0018d08 <__csum_partial>
	c04c61e4:	4b b5 2b 25 	bl      c0018d08 <__csum_partial>

gcc10 defines multiple versions of csum_partial() which are just
an unconditionnal branch to __csum_partial().

To enforce inlining of that branch to __csum_partial(),
mark csum_partial() as __always_inline.

With this patch with gcc10:

	c0018d08 <__csum_partial>:
	c0019148:	4b ff fb c0 	b       c0018d08 <__csum_partial>
	c001924c:	4b ff fa bd 	bl      c0018d08 <__csum_partial>
	c03e40ec:	4b c3 4c 1d 	bl      c0018d08 <__csum_partial>
	c03e4120:	4b c3 4b e8 	b       c0018d08 <__csum_partial>
	c03eb004:	4b c2 dd 05 	bl      c0018d08 <__csum_partial>
	c03ecef4:	4b c2 be 15 	bl      c0018d08 <__csum_partial>
	c0427558:	4b bf 17 b1 	bl      c0018d08 <__csum_partial>
	c04286e4:	4b bf 06 25 	bl      c0018d08 <__csum_partial>
	c0428cd8:	4b bf 00 31 	bl      c0018d08 <__csum_partial>
	c0428d84:	4b be ff 85 	bl      c0018d08 <__csum_partial>
	c045a17c:	4b bb eb 8d 	bl      c0018d08 <__csum_partial>
	c0489450:	4b b8 f8 b9 	bl      c0018d08 <__csum_partial>
	c0491860:	4b b8 74 a9 	bl      c0018d08 <__csum_partial>
	c0492eec:	4b b8 5e 1d 	bl      c0018d08 <__csum_partial>
	c04a0eac:	4b b7 7e 5d 	bl      c0018d08 <__csum_partial>
	c04b3e34:	4b b6 4e d5 	bl      c0018d08 <__csum_partial>
	c04b426c:	4b b6 4a 9d 	bl      c0018d08 <__csum_partial>
	c04b463c:	4b b6 46 cd 	bl      c0018d08 <__csum_partial>
	c04c004c:	4b b5 8c bd 	bl      c0018d08 <__csum_partial>
	c04c0368:	4b b5 89 a1 	bl      c0018d08 <__csum_partial>
	c04c5254:	4b b5 3a b5 	bl      c0018d08 <__csum_partial>
	c04c60d4:	4b b5 2c 35 	bl      c0018d08 <__csum_partial>

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Segher Boessenkool  <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a1d31f84ddb0926813b17fcd5cc7f3fa7b4deac2.1602759123.git.christophe.leroy@csgroup.eu
2020-12-15 22:50:11 +11:00
Al Viro
70d65cd555 ppc: propagate the calling conventions change down to csum_partial_copy_generic()
... and get rid of the pointless fallback in the wrappers.  On error it used
to zero the unwritten area and calculate the csum of the entire thing.  Not
wanting to do it in assembler part had been very reasonable; doing that in
the first place, OTOH...  In case of an error the caller discards the data
we'd copied, along with whatever checksum it might've had.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:22 -04:00
Al Viro
c693cc4676 saner calling conventions for csum_and_copy_..._user()
All callers of these primitives will
	* discard anything we might've copied in case of error
	* ignore the csum value in case of error
	* always pass 0xffffffff as the initial sum, so the
resulting csum value (in case of success, that is) will never be 0.

That suggest the following calling conventions:
	* don't pass err_ptr - just return 0 on error.
	* don't bother with zeroing destination, etc. in case of error
	* don't pass the initial sum - just use 0xffffffff.

This commit does the minimal conversion in the instances of csum_and_copy_...();
the changes of actual asm code behind them are done later in the series.
Note that this asm code is often shared with csum_partial_copy_nocheck();
the difference is that csum_partial_copy_nocheck() passes 0 for initial
sum while csum_and_copy_..._user() pass 0xffffffff.  Fortunately, we are
free to pass 0xffffffff in all cases and subsequent patches will use that
freedom without any special comments.

A part that could be split off: parisc and uml/i386 claimed to have
csum_and_copy_to_user() instances of their own, but those were identical
to the generic one, so we simply drop them.  Not sure if it's worth
a separate commit...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:15 -04:00
Al Viro
cc44c17baf csum_partial_copy_nocheck(): drop the last argument
It's always 0.  Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
Al Viro
6e41c585e3 unify generic instances of csum_partial_copy_nocheck()
quite a few architectures have the same csum_partial_copy_nocheck() -
simply memcpy() the data and then return the csum of the copy.

hexagon, parisc, ia64, s390, um: explicitly spelled out that way.

arc, arm64, csky, h8300, m68k/nommu, microblaze, mips/GENERIC_CSUM, nds32,
nios2, openrisc, riscv, unicore32: end up picking the same thing spelled
out in lib/checksum.h (with varying amounts of perversions along the way).

everybody else (alpha, arm, c6x, m68k/mmu, mips/!GENERIC_CSUM, powerpc,
sh, sparc, x86, xtensa) have non-generic variants.  For all except c6x
the declaration is in their asm/checksum.h.  c6x uses the wrapper
from asm-generic/checksum.h that would normally lead to the lib/checksum.h
instance, but in case of c6x we end up using an asm function from arch/c6x
instead.

Screw that mess - have architectures with private instances define
_HAVE_ARCH_CSUM_AND_COPY in their asm/checksum.h and have the default
one right in net/checksum.h conditional on _HAVE_ARCH_CSUM_AND_COPY
*not* defined.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Christophe Leroy
d065ee93aa powerpc: drop unused GENERIC_CSUM Kconfig item
Commit d4fde568a3 ("powerpc/64: Use optimized checksum routines on
little-endian") converted last powerpc user of GENERIC_CSUM.

This patch does a final cleanup dropping the Kconfig GENERIC_CSUM
option which is always 'n', and associated piece of code in
asm/checksum.h

Fixes: d4fde568a3 ("powerpc/64: Use optimized checksum routines on little-endian")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-02-22 00:10:14 +11:00
Christophe Leroy
e9c4943a10 powerpc: Implement csum_ipv6_magic in assembly
The generic csum_ipv6_magic() generates a pretty bad result

00000000 <csum_ipv6_magic>: (PPC32)
   0:	81 23 00 00 	lwz     r9,0(r3)
   4:	81 03 00 04 	lwz     r8,4(r3)
   8:	7c e7 4a 14 	add     r7,r7,r9
   c:	7d 29 38 10 	subfc   r9,r9,r7
  10:	7d 4a 51 10 	subfe   r10,r10,r10
  14:	7d 27 42 14 	add     r9,r7,r8
  18:	7d 2a 48 50 	subf    r9,r10,r9
  1c:	80 e3 00 08 	lwz     r7,8(r3)
  20:	7d 08 48 10 	subfc   r8,r8,r9
  24:	7d 4a 51 10 	subfe   r10,r10,r10
  28:	7d 29 3a 14 	add     r9,r9,r7
  2c:	81 03 00 0c 	lwz     r8,12(r3)
  30:	7d 2a 48 50 	subf    r9,r10,r9
  34:	7c e7 48 10 	subfc   r7,r7,r9
  38:	7d 4a 51 10 	subfe   r10,r10,r10
  3c:	7d 29 42 14 	add     r9,r9,r8
  40:	7d 2a 48 50 	subf    r9,r10,r9
  44:	80 e4 00 00 	lwz     r7,0(r4)
  48:	7d 08 48 10 	subfc   r8,r8,r9
  4c:	7d 4a 51 10 	subfe   r10,r10,r10
  50:	7d 29 3a 14 	add     r9,r9,r7
  54:	7d 2a 48 50 	subf    r9,r10,r9
  58:	81 04 00 04 	lwz     r8,4(r4)
  5c:	7c e7 48 10 	subfc   r7,r7,r9
  60:	7d 4a 51 10 	subfe   r10,r10,r10
  64:	7d 29 42 14 	add     r9,r9,r8
  68:	7d 2a 48 50 	subf    r9,r10,r9
  6c:	80 e4 00 08 	lwz     r7,8(r4)
  70:	7d 08 48 10 	subfc   r8,r8,r9
  74:	7d 4a 51 10 	subfe   r10,r10,r10
  78:	7d 29 3a 14 	add     r9,r9,r7
  7c:	7d 2a 48 50 	subf    r9,r10,r9
  80:	81 04 00 0c 	lwz     r8,12(r4)
  84:	7c e7 48 10 	subfc   r7,r7,r9
  88:	7d 4a 51 10 	subfe   r10,r10,r10
  8c:	7d 29 42 14 	add     r9,r9,r8
  90:	7d 2a 48 50 	subf    r9,r10,r9
  94:	7d 08 48 10 	subfc   r8,r8,r9
  98:	7d 4a 51 10 	subfe   r10,r10,r10
  9c:	7d 29 2a 14 	add     r9,r9,r5
  a0:	7d 2a 48 50 	subf    r9,r10,r9
  a4:	7c a5 48 10 	subfc   r5,r5,r9
  a8:	7c 63 19 10 	subfe   r3,r3,r3
  ac:	7d 29 32 14 	add     r9,r9,r6
  b0:	7d 23 48 50 	subf    r9,r3,r9
  b4:	7c c6 48 10 	subfc   r6,r6,r9
  b8:	7c 63 19 10 	subfe   r3,r3,r3
  bc:	7c 63 48 50 	subf    r3,r3,r9
  c0:	54 6a 80 3e 	rotlwi  r10,r3,16
  c4:	7c 63 52 14 	add     r3,r3,r10
  c8:	7c 63 18 f8 	not     r3,r3
  cc:	54 63 84 3e 	rlwinm  r3,r3,16,16,31
  d0:	4e 80 00 20 	blr

0000000000000000 <.csum_ipv6_magic>: (PPC64)
   0:	81 23 00 00 	lwz     r9,0(r3)
   4:	80 03 00 04 	lwz     r0,4(r3)
   8:	81 63 00 08 	lwz     r11,8(r3)
   c:	7c e7 4a 14 	add     r7,r7,r9
  10:	7f 89 38 40 	cmplw   cr7,r9,r7
  14:	7d 47 02 14 	add     r10,r7,r0
  18:	7d 30 10 26 	mfocrf  r9,1
  1c:	55 29 f7 fe 	rlwinm  r9,r9,30,31,31
  20:	7d 4a 4a 14 	add     r10,r10,r9
  24:	7f 80 50 40 	cmplw   cr7,r0,r10
  28:	7d 2a 5a 14 	add     r9,r10,r11
  2c:	80 03 00 0c 	lwz     r0,12(r3)
  30:	81 44 00 00 	lwz     r10,0(r4)
  34:	7d 10 10 26 	mfocrf  r8,1
  38:	55 08 f7 fe 	rlwinm  r8,r8,30,31,31
  3c:	7d 29 42 14 	add     r9,r9,r8
  40:	81 04 00 04 	lwz     r8,4(r4)
  44:	7f 8b 48 40 	cmplw   cr7,r11,r9
  48:	7d 29 02 14 	add     r9,r9,r0
  4c:	7d 70 10 26 	mfocrf  r11,1
  50:	55 6b f7 fe 	rlwinm  r11,r11,30,31,31
  54:	7d 29 5a 14 	add     r9,r9,r11
  58:	7f 80 48 40 	cmplw   cr7,r0,r9
  5c:	7d 29 52 14 	add     r9,r9,r10
  60:	7c 10 10 26 	mfocrf  r0,1
  64:	54 00 f7 fe 	rlwinm  r0,r0,30,31,31
  68:	7d 69 02 14 	add     r11,r9,r0
  6c:	7f 8a 58 40 	cmplw   cr7,r10,r11
  70:	7c 0b 42 14 	add     r0,r11,r8
  74:	81 44 00 08 	lwz     r10,8(r4)
  78:	7c f0 10 26 	mfocrf  r7,1
  7c:	54 e7 f7 fe 	rlwinm  r7,r7,30,31,31
  80:	7c 00 3a 14 	add     r0,r0,r7
  84:	7f 88 00 40 	cmplw   cr7,r8,r0
  88:	7d 20 52 14 	add     r9,r0,r10
  8c:	80 04 00 0c 	lwz     r0,12(r4)
  90:	7d 70 10 26 	mfocrf  r11,1
  94:	55 6b f7 fe 	rlwinm  r11,r11,30,31,31
  98:	7d 29 5a 14 	add     r9,r9,r11
  9c:	7f 8a 48 40 	cmplw   cr7,r10,r9
  a0:	7d 29 02 14 	add     r9,r9,r0
  a4:	7d 70 10 26 	mfocrf  r11,1
  a8:	55 6b f7 fe 	rlwinm  r11,r11,30,31,31
  ac:	7d 29 5a 14 	add     r9,r9,r11
  b0:	7f 80 48 40 	cmplw   cr7,r0,r9
  b4:	7d 29 2a 14 	add     r9,r9,r5
  b8:	7c 10 10 26 	mfocrf  r0,1
  bc:	54 00 f7 fe 	rlwinm  r0,r0,30,31,31
  c0:	7d 29 02 14 	add     r9,r9,r0
  c4:	7f 85 48 40 	cmplw   cr7,r5,r9
  c8:	7c 09 32 14 	add     r0,r9,r6
  cc:	7d 50 10 26 	mfocrf  r10,1
  d0:	55 4a f7 fe 	rlwinm  r10,r10,30,31,31
  d4:	7c 00 52 14 	add     r0,r0,r10
  d8:	7f 80 30 40 	cmplw   cr7,r0,r6
  dc:	7d 30 10 26 	mfocrf  r9,1
  e0:	55 29 ef fe 	rlwinm  r9,r9,29,31,31
  e4:	7c 09 02 14 	add     r0,r9,r0
  e8:	54 03 80 3e 	rotlwi  r3,r0,16
  ec:	7c 03 02 14 	add     r0,r3,r0
  f0:	7c 03 00 f8 	not     r3,r0
  f4:	78 63 84 22 	rldicl  r3,r3,48,48
  f8:	4e 80 00 20 	blr

This patch implements it in assembly for both PPC32 and PPC64

Link: https://github.com/linuxppc/linux/issues/9
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-04 00:39:19 +10:00
Christophe Leroy
55a0edf083 powerpc/64: optimises from64to32()
The current implementation of from64to32() gives a poor result:

0000000000000270 <.from64to32>:
 270:	38 00 ff ff 	li      r0,-1
 274:	78 69 00 22 	rldicl  r9,r3,32,32
 278:	78 00 00 20 	clrldi  r0,r0,32
 27c:	7c 60 00 38 	and     r0,r3,r0
 280:	7c 09 02 14 	add     r0,r9,r0
 284:	78 09 00 22 	rldicl  r9,r0,32,32
 288:	7c 00 4a 14 	add     r0,r0,r9
 28c:	78 03 00 20 	clrldi  r3,r0,32
 290:	4e 80 00 20 	blr

This patch modifies from64to32() to operate in the same
spirit as csum_fold()

It swaps the two 32-bit halves of sum then it adds it with the
unswapped sum. If there is a carry from adding the two 32-bit halves,
it will carry from the lower half into the upper half, giving us the
correct sum in the upper half.

The resulting code is:

0000000000000260 <.from64to32>:
 260:	78 60 00 02 	rotldi  r0,r3,32
 264:	7c 60 1a 14 	add     r3,r0,r3
 268:	78 63 00 22 	rldicl  r3,r3,32,32
 26c:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-06-04 00:39:16 +10:00
Christophe Leroy
96f391cf40 Revert "powerpc/64: Fix checksum folding in csum_add()"
This reverts commit 6ad966d730.

That commit was pointless, because csum_add() sums two 32 bits
values, so the sum is 0x1fffffffe at the maximum.
And then when adding upper part (1) and lower part (0xfffffffe),
the result is 0xffffffff which doesn't carry.
Any lower value will not carry either.

And behind the fact that this commit is useless, it also kills the
whole purpose of having an arch specific inline csum_add()
because the resulting code gets even worse than what is obtained
with the generic implementation of csum_add()

0000000000000240 <.csum_add>:
 240:	38 00 ff ff 	li      r0,-1
 244:	7c 84 1a 14 	add     r4,r4,r3
 248:	78 00 00 20 	clrldi  r0,r0,32
 24c:	78 89 00 22 	rldicl  r9,r4,32,32
 250:	7c 80 00 38 	and     r0,r4,r0
 254:	7c 09 02 14 	add     r0,r9,r0
 258:	78 09 00 22 	rldicl  r9,r0,32,32
 25c:	7c 00 4a 14 	add     r0,r0,r9
 260:	78 03 00 20 	clrldi  r3,r0,32
 264:	4e 80 00 20 	blr

In comparison, the generic implementation of csum_add() gives:

0000000000000290 <.csum_add>:
 290:	7c 63 22 14 	add     r3,r3,r4
 294:	7f 83 20 40 	cmplw   cr7,r3,r4
 298:	7c 10 10 26 	mfocrf  r0,1
 29c:	54 00 ef fe 	rlwinm  r0,r0,29,31,31
 2a0:	7c 60 1a 14 	add     r3,r0,r3
 2a4:	78 63 00 20 	clrldi  r3,r3,32
 2a8:	4e 80 00 20 	blr

And the reverted implementation for PPC64 gives:

0000000000000240 <.csum_add>:
 240:	7c 84 1a 14 	add     r4,r4,r3
 244:	78 80 00 22 	rldicl  r0,r4,32,32
 248:	7c 80 22 14 	add     r4,r0,r4
 24c:	78 83 00 20 	clrldi  r3,r4,32
 250:	4e 80 00 20 	blr

Fixes: 6ad966d730 ("powerpc/64: Fix checksum folding in csum_add()")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-18 00:09:04 +10:00
Shile Zhang
6ad966d730 powerpc/64: Fix checksum folding in csum_add()
Paul's patch to fix checksum folding, commit b492f7e4e0 ("powerpc/64:
Fix checksum folding in csum_tcpudp_nofold and ip_fast_csum_nofold")
missed a case in csum_add(). Fix it.

Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-04 23:07:17 +11:00
Paul Mackerras
d4fde568a3 powerpc/64: Use optimized checksum routines on little-endian
Currently we have optimized hand-coded assembly checksum routines for
big-endian 64-bit systems, but for little-endian we use the generic C
routines. This modifies the optimized routines to work for
little-endian. With this, we no longer need to enable
CONFIG_GENERIC_CSUM. This also fixes a couple of comments in
checksum_64.S so they accurately reflect what the associated instruction
does.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Use the more common __BIG_ENDIAN__]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-25 13:34:18 +11:00
Paul Mackerras
b492f7e4e0 powerpc/64: Fix checksum folding in csum_tcpudp_nofold and ip_fast_csum_nofold
These functions compute an IP checksum by computing a 64-bit sum and
folding it to 32 bits (the "nofold" in their names refers to folding
down to 16 bits).  However, doing (u32) (s + (s >> 32)) is not
sufficient to fold a 64-bit sum to 32 bits correctly.  The addition
can produce a carry out from bit 31, which needs to be added in to
the sum to produce the correct result.

To fix this, we copy the from64to32() function from lib/checksum.c
and use that.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-01-25 13:34:18 +11:00
Ivan Vecera
f9d4286b95 arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold
Commit 01cfbad "ipv4: Update parameters for csum_tcpudp_magic to their
original types" changed parameters for csum_tcpudp_magic and
csum_tcpudp_nofold for many platforms but not for PowerPC.

Fixes: 01cfbad "ipv4: Update parameters for csum_tcpudp_magic to their original types"
Cc: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:06:23 -04:00
Christophe Leroy
7e393220b6 powerpc: optimise csum_partial() call when len is constant
csum_partial is often called for small fixed length packets
for which it is suboptimal to use the generic csum_partial()
function.

For instance, in my configuration, I got:
* One place calling it with constant len 4
* Seven places calling it with constant len 8
* Three places calling it with constant len 14
* One place calling it with constant len 20
* One place calling it with constant len 24
* One place calling it with constant len 32

This patch renames csum_partial() to __csum_partial() and
implements csum_partial() as a wrapper inline function which
* uses csum_add() for small 16bits multiple constant length
* uses ip_fast_csum() for other 32bits multiple constant
* uses __csum_partial() in all other cases

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-09 10:44:18 -06:00
Christophe Leroy
5a8847c83c powerpc: simplify csum_add(a, b) in case a or b is constant 0
Simplify csum_add(a, b) in case a or b is constant 0

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-04 23:04:00 -06:00
Christophe Leroy
37e08cad8f powerpc: inline ip_fast_csum()
In several architectures, ip_fast_csum() is inlined
There are functions like ip_send_check() which do nothing
much more than calling ip_fast_csum().
Inlining ip_fast_csum() allows the compiler to optimise better

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[scottwood: whitespace and cast fixes]
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-04 21:49:49 -06:00
Christophe Leroy
03bc8b0fc8 powerpc32: checksum_wrappers_64 becomes checksum_wrappers
The powerpc64 checksum wrapper functions adds csum_and_copy_to_user()
which otherwise is implemented in include/net/checksum.h by using
csum_partial() then copy_to_user()

Those two wrapper fonctions are also applicable to powerpc32 as it is
based on the use of csum_partial_copy_generic() which also
exists on powerpc32

This patch renames arch/powerpc/lib/checksum_wrappers_64.c to
arch/powerpc/lib/checksum_wrappers.c and
makes it non-conditional to CONFIG_WORD_SIZE

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-04 21:47:47 -06:00
Christophe Leroy
11dfbf588a powerpc: mark xer clobbered in csum_add()
addc uses carry so xer is clobbered in csum_add()

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-04 21:47:27 -06:00
LEROY Christophe
501c8de7b0 powerpc: add support for csum_add()
The C version of csum_add() as defined in include/net/checksum.h gives
the following assembly in ppc32:
       0:       7c 04 1a 14     add     r0,r4,r3
       4:       7c 64 00 10     subfc   r3,r4,r0
       8:       7c 63 19 10     subfe   r3,r3,r3
       c:       7c 63 00 50     subf    r3,r3,r0
and the following in ppc64:
   0xc000000000001af8 <+0>:	add     r3,r3,r4
   0xc000000000001afc <+4>:	cmplw   cr7,r3,r4
   0xc000000000001b00 <+8>:	mfcr    r4
   0xc000000000001b04 <+12>:	rlwinm  r4,r4,29,31,31
   0xc000000000001b08 <+16>:	add     r3,r4,r3
   0xc000000000001b0c <+20>:	clrldi  r3,r3,32
   0xc000000000001b10 <+24>:	blr

include/net/checksum.h also offers the possibility to define an arch
specific function.  This patch provides a specific csum_add() inline
function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-08-07 22:59:19 -05:00
LEROY Christophe
92c985f1d7 powerpc: put csum_tcpudp_magic inline
csum_tcpudp_magic() is only a few instructions, and does modify
really few registers. So it is not worth having it as a separate
function and suffer function branching and saving of volatile
registers.

This patch makes it inline by use of the already existing
csum_tcpudp_nofold() function.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
2015-08-07 22:59:19 -05:00
Anton Blanchard
7a332b0c9a powerpc: Use generic checksum code in little endian
We need to fix some endian issues in our checksum code. For now
just enable the generic checksum routines for little endian builds.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-10-11 16:48:39 +11:00
Anton Blanchard
8c77391475 powerpc: Add 64bit csum_and_copy_to_user
This adds the equivalent of csum_and_copy_from_user for the receive side so we
can copy and checksum in one pass. It is modelled on the generic checksum
routine.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-09-02 14:07:30 +10:00
Anton Blanchard
fdd374b62c powerpc: Optimise 64bit csum_partial_copy_generic and add csum_and_copy_from_user
We use the same core loop as the new csum_partial, adding in the
stores and exception handling code. To keep things simple we do all the
exception fixup in csum_and_copy_from_user. This wrapper function is
modelled on the generic checksum code and is careful to always calculate
a complete checksum even if we only copied part of the data to userspace.

To test this I forced checksumming on over loopback and ran socklib (a
simple TCP benchmark). On a POWER6 575 throughput improved by 19% with
this patch. If I forced both the sender and receiver onto the same cpu
(with the hope of shifting the benchmark from being cache bandwidth limited
to cpu limited), adding this patch improved performance by 55%

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-09-02 14:07:30 +10:00
Stephen Rothwell
b8b572e101 powerpc: Move include files to arch/powerpc/include/asm
from include/asm-powerpc.  This is the result of a

mkdir arch/powerpc/include/asm
git mv include/asm-powerpc/* arch/powerpc/include/asm

Followed by a few documentation/comment fixups and a couple of places
where <asm-powepc/...> was being used explicitly.  Of the latter only
one was outside the arch code and it is a driver only built for powerpc.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-08-04 12:02:00 +10:00
Renamed from include/asm-powerpc/checksum.h (Browse further)