115 lines
		
	
	
	
		
			3.9 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
			
		
		
	
	
			115 lines
		
	
	
	
		
			3.9 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
| Copyright 2001 Free Software Foundation, Inc.
 | |
| 
 | |
| This file is part of the GNU MP Library.
 | |
| 
 | |
| The GNU MP Library is free software; you can redistribute it and/or modify
 | |
| it under the terms of the GNU Lesser General Public License as published by
 | |
| the Free Software Foundation; either version 2.1 of the License, or (at your
 | |
| option) any later version.
 | |
| 
 | |
| The GNU MP Library is distributed in the hope that it will be useful, but
 | |
| WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 | |
| or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
 | |
| License for more details.
 | |
| 
 | |
| You should have received a copy of the GNU Lesser General Public License
 | |
| along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
 | |
| the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 | |
| 02110-1301, USA.
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
|                    INTEL PENTIUM-4 MPN SUBROUTINES
 | |
| 
 | |
| 
 | |
| This directory contains mpn functions optimized for Intel Pentium-4.
 | |
| 
 | |
| The mmx subdirectory has routines using MMX instructions, the sse2
 | |
| subdirectory has routines using SSE2 instructions.  All P4s have these, the
 | |
| separate directories are just so configure can omit that code if the
 | |
| assembler doesn't support it.
 | |
| 
 | |
| 
 | |
| STATUS
 | |
| 
 | |
|                                 cycles/limb
 | |
| 
 | |
| 	mpn_add_n/sub_n            4 normal, 6 in-place
 | |
| 
 | |
| 	mpn_mul_1                  4 normal, 6 in-place
 | |
| 	mpn_addmul_1               6
 | |
| 	mpn_submul_1               7
 | |
| 
 | |
| 	mpn_mul_basecase           6 cycles/crossproduct (approx)
 | |
| 
 | |
| 	mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
 | |
|                                    or 7.0 cycles/triangleproduct (approx)
 | |
| 
 | |
| 	mpn_l/rshift               1.75
 | |
| 
 | |
| 
 | |
| 
 | |
| The shifts ought to be able to go at 1.5 c/l, but not much effort has been
 | |
| applied to them yet.
 | |
| 
 | |
| In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
 | |
| calls, suffer from pipeline anomalies associated with write combining and
 | |
| movd reads and writes to the same or nearby locations.  The movq
 | |
| instructions do not trigger the same hardware problems.  Unfortunately,
 | |
| using movq and splitting/combining seems to require too many extra
 | |
| instructions to help.  Perhaps future chip steppings will be better.
 | |
| 
 | |
| 
 | |
| 
 | |
| NOTES
 | |
| 
 | |
| The Pentium-4 pipeline "Netburst", provides for quite a number of surprises.
 | |
| Many traditional x86 instructions run very slowly, requiring use of
 | |
| alterative instructions for acceptable performance.
 | |
| 
 | |
| adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bits
 | |
| within a 64-bit mmx register seems better, though the combination
 | |
| paddq/psrlq when propagating a carry is still a 4 cycle latency.
 | |
| 
 | |
| incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
 | |
| the carry flag is not separately renamed, so incl and decl depend on all
 | |
| previous flags-setting instructions.
 | |
| 
 | |
| shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
 | |
| integer instructions (addl, subl, orl, andl, and some more).  shldl and
 | |
| shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.
 | |
| 
 | |
| movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
 | |
| pxor/por or similar combination at 2 cycles latency can be used instead.
 | |
| The movq however executes in the float unit, thereby saving MMX execution
 | |
| resources.  With the right juggling, data moves shouldn't be on a dependent
 | |
| chain.
 | |
| 
 | |
| L1 is write-through, but the write-combining sounds like it does enough to
 | |
| not require explicit destination prefetching.
 | |
| 
 | |
| xmm registers so far haven't found a use, but not much effort has been
 | |
| expended.  A configure test for whether the operating system knows
 | |
| fxsave/fxrestor will be needed if they're used.
 | |
| 
 | |
| 
 | |
| 
 | |
| REFERENCES
 | |
| 
 | |
| Intel Pentium-4 processor manuals,
 | |
| 
 | |
| 	http://developer.intel.com/design/pentium4/manuals
 | |
| 
 | |
| "Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
 | |
| order number 248966.  Available on-line:
 | |
| 
 | |
| 	http://developer.intel.com/design/pentium4/manuals/248966.htm
 | |
| 
 | |
| 
 | |
| 
 | |
| ----------------
 | |
| Local variables:
 | |
| mode: text
 | |
| fill-column: 76
 | |
| End:
 |