==============================================================================
Cycle counts and throughput for low-level routines in GNU MP as currently
implemented.
A range means that the timing is data-dependent. The slower number of such
an interval is usually the best performance estimate.
The throughput value, measured in Gb/s (gigabits per second), is meaningful
only for comparison between CPUs.
A star before a line means that all values on that line are estimates. A
star before a number means that that number is an estimate. A `p' before a
number means that the code is not complete, but the timing is believed to be
accurate.
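This file does not state how the Gb/s figures are derived from the cycle
counts. The listed values appear consistent with (limb bits x clock rate /
cycles per limb) for the shift and add/sub routines, and with (limb bits
squared x clock rate / cycles per limb) for the multiply routines. The sketch
below works under that assumption; the function name and parameters are
illustrative and are not part of GNU MP.

```python
def throughput_gbps(cycles_per_limb, limb_bits, clock_mhz, multiply=False):
    """Throughput in Gb/s under the ASSUMED formula described above.

    cycles_per_limb -- cycle count from the table (e.g. 4.75)
    limb_bits       -- limb size in bits (32 or 64)
    clock_mhz       -- CPU clock in MHz
    multiply        -- True for mpn_mul_1 / mpn_addmul_1 / mpn_submul_1,
                       where the table appears to count limb_bits**2 bit
                       products per limb
    """
    bits_per_limb_step = limb_bits * limb_bits if multiply else limb_bits
    return bits_per_limb_step * clock_mhz * 1e6 / cycles_per_limb / 1e9


def throughput_range_gbps(cycle_range, limb_bits, clock_mhz, multiply=False):
    """Map a data-dependent cycle range (fast, slow) to (high, low) Gb/s."""
    fast, slow = cycle_range
    return (throughput_gbps(fast, limb_bits, clock_mhz, multiply),
            throughput_gbps(slow, limb_bits, clock_mhz, multiply))


if __name__ == "__main__":
    # EV4 at 200MHz: the table lists 2.7 Gb/s and 20 Gb/s.
    print(round(throughput_gbps(4.75, 64, 200), 2))
    print(round(throughput_gbps(42, 64, 200, multiply=True), 1))
    # Ultra/32 at 182MHz, mpn_mul_1 at 13-27 cycles: table lists 14.3-6.9 Gb/s.
    print([round(x, 1) for x in throughput_range_gbps((13, 27), 32, 182, True)])
```

For a range such as 13-27 cycles, the first cycle count gives the higher
throughput, which matches the order in which the table prints its Gb/s ranges.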
            | mpn_lshift       mpn_add_n        mpn_mul_1        mpn_addmul_1
            | mpn_rshift       mpn_sub_n                         mpn_submul_1
------------+-------------------------------------------------------------------
DEC/Alpha   |
EV4         | 4.75 cycles/64b  7.75 cycles/64b  42 cycles/64b    42 cycles/64b
200MHz      | 2.7 Gb/s         1.65 Gb/s        20 Gb/s          20 Gb/s
EV5 old code| 4.0 cycles/64b   5.5 cycles/64b   18 cycles/64b    18 cycles/64b
267MHz      | 4.27 Gb/s        3.10 Gb/s        61 Gb/s          61 Gb/s
417MHz      | 6.67 Gb/s        4.85 Gb/s        95 Gb/s          95 Gb/s
EV5 tuned   | 3.25 cycles/64b  4.75 cycles/64b
267MHz      | 5.25 Gb/s        3.59 Gb/s        as above
417MHz      | 8.21 Gb/s        5.61 Gb/s
------------+-------------------------------------------------------------------
Sun/SPARC   |
SPARC v7    | 14.0 cycles/32b  8.5 cycles/32b   37-54 cycl/32b   37-54 cycl/32b
SuperSPARC  | 3 cycles/32b     2.5 cycles/32b   8.2 cycles/32b   10.8 cycles/32b
50MHz       | 0.53 Gb/s        0.64 Gb/s        6.2 Gb/s         4.7 Gb/s
**SuperSPARC| tuned addmul and submul will take:                 9.25 cycles/32b
MicroSPARC2 | ?                6.65 cycles/32b  30 cycles/32b    31.5 cycles/32b
110MHz      | ?                0.53 Gb/s        3.75 Gb/s        3.58 Gb/s
SuperSPARC2 | ?                ?                ?                ?
Ultra/32 (4)| 2.5 cycles/32b   6.5 cycles/32b   13-27 cyc/32b    16-30 cyc/32b
182MHz      | 2.33 Gb/s        0.896 Gb/s       14.3-6.9 Gb/s
Ultra/64 (5)| 2.5 cycles/64b   10 cycles/64b    40-70 cyc/64b    46-76 cyc/64b
182MHz      | 4.66 Gb/s        1.16 Gb/s        18.6-11 Gb/s
HalSPARC64  | ?                ?                ?                ?
------------+-------------------------------------------------------------------
SGI/MIPS    |
R3000       | 6 cycles/32b     9.25 cycles/32b  16 cycles/32b    16 cycles/32b
40MHz       | 0.21 Gb/s        0.14 Gb/s        2.56 Gb/s        2.56 Gb/s
R4400/32    | 8.6 cycles/32b   10 cycles/32b    16-18            19-21
200MHz      | 0.74 Gb/s        0.64 Gb/s        13-11 Gb/s       11-9.6 Gb/s
*R4400/64   | 8.6 cycles/64b   10 cycles/64b    22 cycles/64b    22 cycles/64b
*200MHz     | 1.48 Gb/s        1.28 Gb/s        37 Gb/s          37 Gb/s
R4600/32    | 6 cycles/32b     9.25 cycles/32b  15 cycles/32b    19 cycles/32b
134MHz      | 0.71 Gb/s        0.46 Gb/s        9.1 Gb/s         7.2 Gb/s
R4600/64    | 6 cycles/64b     9.25 cycles/64b  ?                ?
134MHz      | 1.4 Gb/s         0.93 Gb/s        ?                ?
R8000/64    | 3 cycles/64b     4.6 cycles/64b   8 cycles/64b     8 cycles/64b
75MHz       | 1.6 Gb/s         1.0 Gb/s         38 Gb/s          38 Gb/s
*R10000/64  | 2 cycles/64b     3 cycles/64b     11 cycles/64b    11 cycles/64b
*200MHz     | 6.4 Gb/s         4.27 Gb/s        74 Gb/s          74 Gb/s
*250MHz     | 8.0 Gb/s         5.33 Gb/s        93 Gb/s          93 Gb/s
------------+-------------------------------------------------------------------
Motorola    |
MC68020     | ?                24 cycles/32b    62 cycles/32b    70 cycles/32b
MC68040     | ?                6 cycles/32b     24 cycles/32b    25 cycles/32b
MC88100     | >5 cycles/32b    4.6 cycles/32b   16/21 cyc/32b    p 18/23 cyc/32b
MC88110 wt  | ?                3.75 cycles/32b  6 cycles/32b     8.5 cyc/32b
*MC88110 wb | ?                2.25 cycles/32b  4 cycles/32b     5 cycles/32b
------------+-------------------------------------------------------------------
HP/PA-RISC  |
PA7000      | 4 cycles/32b     5 cycles/32b     9 cycles/32b     11 cycles/32b
67MHz       | 0.53 Gb/s        0.43 Gb/s        7.6 Gb/s         6.2 Gb/s
PA7100      | 3.25 cycles/32b  4.25 cycles/32b  7 cycles/32b     8 cycles/32b
99MHz       | 0.97 Gb/s        0.75 Gb/s        14 Gb/s          12.8 Gb/s
PA7100LC    | ?                ?                ?                ?
PA7200 (3)  | 3 cycles/32b     4 cycles/32b     7 cycles/32b     6.5 cycles/32b
100MHz      | 1.07 Gb/s        0.80 Gb/s        14 Gb/s          15.8 Gb/s
PA7300LC    | ?                ?                ?                ?
*PA8000     | 3 cycles/64b     4 cycles/64b     7 cycles/64b     6.5 cycles/64b
180MHz      | 3.84 Gb/s        2.88 Gb/s        105 Gb/s         113 Gb/s
------------+-------------------------------------------------------------------
Intel/x86   |
386DX       | 20 cycles/32b    17 cycles/32b    41-70 cycl/32b   50-79 cycl/32b
16.7MHz     | 0.027 Gb/s       0.031 Gb/s       0.42-0.24 Gb/s   0.34-0.22 Gb/s
486DX       | ?                ?                ?                ?
486DX4      | 9.5 cycles/32b   9.25 cycles/32b  17-23 cycl/32b   20-26 cycl/32b
100MHz      | 0.34 Gb/s        0.35 Gb/s        6.0-4.5 Gb/s     5.1-3.9 Gb/s
Pentium     | 2/6 cycles/32b   2.5 cycles/32b   13 cycles/32b    14 cycles/32b
167MHz      | 2.7/0.89 Gb/s    2.1 Gb/s         13.1 Gb/s        12.2 Gb/s
Pentium Pro | 2.5 cycles/32b   3.5 cycles/32b   6 cycles/32b     9 cycles/32b
200MHz      | 2.6 Gb/s         1.8 Gb/s         34 Gb/s          23 Gb/s
------------+-------------------------------------------------------------------
IBM/POWER   |
RIOS 1      | 3 cycles/32b     4 cycles/32b     11.5-12.5 c/32b  14.5/15.5 c/32b
RIOS 2      | 2 cycles/32b     2 cycles/32b     7 cycles/32b     8.5 cycles/32b
------------+-------------------------------------------------------------------
PowerPC     |
PPC601 (1)  | 3 cycles/32b     6 cycles/32b     11-16 cycl/32b   14-19 cycl/32b
PPC601 (2)  | 5 cycles/32b     6 cycles/32b     13-22 cycl/32b   16-25 cycl/32b
67MHz (2)   | 0.43 Gb/s        0.36 Gb/s        5.3-3.0 Gb/s     4.3-2.7 Gb/s
PPC603      | ?                ?                ?                ?
*PPC604     | 2                3                2                3
*167MHz     |                                                    57 Gb/s
PPC620      | ?                ?                ?                ?
------------+-------------------------------------------------------------------
Tege        |
Model 1     | 2 cycles/64b     3 cycles/64b     2 cycles/64b     3 cycles/64b
250MHz      | 8 Gb/s           5.3 Gb/s         500 Gb/s         340 Gb/s
500MHz      | 16 Gb/s          11 Gb/s          1000 Gb/s        680 Gb/s
____________|___________________________________________________________________
(1) Using POWER and PowerPC instructions
(2) Using only PowerPC instructions
(3) Actual timing for shift/add/sub depends on code alignment. PA7000 code
is smaller and therefore often faster on this CPU.
(4) Multiplication routines modified for bogus UltraSPARC early-out
optimization. The smaller operand is put in rs1, not in rs2 as it should
be according to the SPARC architecture manuals.
(5) Preliminary timings, since there is no stable 64-bit environment.
(6) Use mulu.d at least for mpn_lshift. With mak/extu/or, we can only get
to 2 cycles/32b.
=============================================================================
Estimated theoretical asymptotic cycle counts for low-level routines:
            | mpn_lshift       mpn_add_n        mpn_mul_1        mpn_addmul_1
            | mpn_rshift       mpn_sub_n                         mpn_submul_1
------------+-------------------------------------------------------------------
DEC/Alpha   |
EV4         | 3 cycles/64b     5 cycles/64b     42 cycles/64b    42 cycles/64b
EV5         | 3 cycles/64b     4 cycles/64b     18 cycles/64b    18 cycles/64b
------------+-------------------------------------------------------------------
Sun/SPARC   |
SuperSPARC  | 2.5 cycles/32b   2 cycles/32b     8 cycles/32b     9 cycles/32b
------------+-------------------------------------------------------------------
SGI/MIPS    |
R4400/32    | 5 cycles/64b     8 cycles/64b     16 cycles/64b    16 cycles/64b
R4400/64    | 5 cycles/64b     8 cycles/64b     22 cycles/64b    22 cycles/64b
R4600       |
------------+-------------------------------------------------------------------
HP/PA-RISC  |
PA7100      | 3 cycles/32b     4 cycles/32b     6.5 cycles/32b   7.5 cycles/32b
PA7100LC    |
------------+-------------------------------------------------------------------
Motorola    |
MC88110     | 1.5 cyc/32b (6)  1.5 cycles/32b   1.5 cycles/32b   2.25 cycles/32b
------------+-------------------------------------------------------------------
Intel/x86   |
486DX4      |
Pentium P5x | 5 cycles/32b     2 cycles/32b     11.5 cycles/32b  13 cycles/32b
Pentium Pro | 2 cycles/32b     3 cycles/32b     4 cycles/32b     6 cycles/32b
------------+-------------------------------------------------------------------
IBM/POWER   |
RIOS 1      | 3 cycles/32b     4 cycles/32b
RIOS 2      | 1.5 cycles/32b   2 cycles/32b     4.5 cycles/32b   5.5 cycles/32b
------------+-------------------------------------------------------------------
PowerPC     |
PPC601 (1)  | 3 cycles/32b     ?4 cycles/32b
PPC601 (2)  | 4 cycles/32b     ?4 cycles/32b
____________|___________________________________________________________________
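
One way to read the two tables together is to divide the estimated asymptotic
cycle count by the measured one, which gives the fraction of the theoretical
rate that the current code reaches. The sketch below does this for a few
entries copied from the tables above; the `efficiency` helper and the
dictionary layout are illustrative only and are not part of GNU MP.

```python
# (measured cycles, estimated asymptotic cycles) per limb, copied from the
# two tables in this file. Keys are (CPU, routine column).
SAMPLES = {
    ("EV4", "mpn_lshift"): (4.75, 3.0),
    ("EV4", "mpn_add_n"): (7.75, 5.0),
    ("EV4", "mpn_mul_1"): (42.0, 42.0),
    ("Pentium Pro", "mpn_mul_1"): (6.0, 4.0),
    ("Pentium Pro", "mpn_addmul_1"): (9.0, 6.0),
}


def efficiency(measured_cycles, asymptotic_cycles):
    """Fraction of the estimated asymptotic rate reached by the measured code."""
    return asymptotic_cycles / measured_cycles


if __name__ == "__main__":
    for (cpu, routine), (measured, asymptotic) in SAMPLES.items():
        print(f"{cpu:12s} {routine:13s} {efficiency(measured, asymptotic):.0%}")
```

A value of 100% (as for mpn_mul_1 on the EV4) means the measured code already
matches the estimate; lower values show how much headroom the estimate leaves.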