.\" $OpenBSD: pctr.4,v 1.3 2007/10/21 09:50:22 jmc Exp $
.\"
.\" Pentium performance counter driver for OpenBSD.
.\" Copyright 1996 David Mazieres <dm@lcs.mit.edu>.
.\"
.\" Modification and redistribution in source and binary forms is
.\" permitted provided that due credit is given to the author and the
.\" OpenBSD project by leaving this copyright notice intact.
.\"
.Dd $Mdocdate: October 21 2007 $
.Dt PCTR 4 amd64
.Os
.Sh NAME
.Nm pctr
.Nd driver for CPU performance counters
.Sh SYNOPSIS
.Cd "pseudo-device pctr 1"
.Sh DESCRIPTION
The
.Nm
device provides access to the performance counters on AMD and Intel brand
processors, and to the TSC on others.
.Pp
Intel processors have two 40-bit performance counters which can be
programmed to count events such as cache misses, branch target buffer hits,
TLB misses, dual-issues, interrupts, pipeline flushes, and more.
While AMD processors have four 48-bit counters, their precision is decreased
to 40 bits.
.Pp
There is one
.Em ioctl
call to read the status of all counters, and one
.Em ioctl
call to program the function of each counter.
All require the following includes:
.Bd -literal -offset indent
#include <sys/types.h>
#include <machine/cpu.h>
#include <machine/pctr.h>
.Ed
.Pp
The current state of all counters can be read with the
.Dv PCIOCRD
.Em ioctl ,
which takes an argument of type
.Dv "struct pctrst" :
.Bd -literal -offset indent
#define PCTR_NUM	4
struct pctrst {
	u_int pctr_fn[PCTR_NUM];
	pctrval pctr_tsc;
	pctrval pctr_hwc[PCTR_NUM];
	pctrval pctr_idl;
};
.Ed
.Pp
In this structure,
.Em pctr_fn
contains the functions of the counters, as previously set by the
.Dv PCIOCS0 ,
.Dv PCIOCS1 ,
.Dv PCIOCS2
and
.Dv PCIOCS3
ioctls (see below).
.Em pctr_hwc
contains the actual value of the hardware counters.
.Em pctr_tsc
is a free-running, 64-bit cycle counter.
Finally,
.Em pctr_idl
is a 64-bit count of idle-loop iterations.
.Pp
The functions of the counters can be programmed with ioctls
.Dv PCIOCS0 ,
.Dv PCIOCS1 ,
.Dv PCIOCS2
and
.Dv PCIOCS3
which require a writeable file descriptor and take an argument of type
.Dv "unsigned int" . \&
The meaning of this integer is dependent on the particular CPU.
.Ss Time stamp counter
The time stamp counter is available on most AMD and Intel CPUs.
It is set to zero at boot time, and then increments with each cycle.
Because the counter is 64 bits wide, it does not overflow in practice.
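.Pp
To put the width of the counter in perspective, the time it takes a
64-bit cycle counter to wrap can be estimated with a little arithmetic.
The following is only a sketch:
.Fn ex_tsc_wrap_seconds
and
.Fn ex_tsc_wrap_years
are hypothetical helpers, and the constant clock rate passed to them is
an assumption, not something the driver reports.
At an assumed 3 GHz the result is roughly 195 years.
.Bd -literal -offset indent
```c
/*
 * Seconds until a 64-bit cycle counter that started at zero wraps,
 * assuming a constant clock rate of hz cycles per second.
 * Hypothetical helper, not part of the pctr interface.
 */
static double
ex_tsc_wrap_seconds(double hz)
{
	/* The constant is 2^64, which a double represents exactly. */
	return 18446744073709551616.0 / hz;
}

/* The same interval expressed in years of 365.25 days. */
static double
ex_tsc_wrap_years(double hz)
{
	return ex_tsc_wrap_seconds(hz) / (365.25 * 24.0 * 3600.0);
}
```
.Ed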
.Pp
The time stamp counter can be read directly from user-mode using
the
.Fn rdtsc
macro, which returns a 64-bit value of type
.Em pctrval .
The following example illustrates a simple use of
.Fn rdtsc
to measure the execution time of a hypothetical subroutine called
.Fn functionx :
.Bd -literal -offset indent
void
time_functionx(void)
{
	pctrval tsc;

	tsc = rdtsc();
	functionx();
	tsc = rdtsc() - tsc;
	printf("Functionx took %llu cycles.\en", tsc);
}
.Ed
.Pp
The value of the time stamp counter is also returned by the
.Dv PCIOCRD
.Em ioctl ,
so that one can get an exact timestamp on readings of the hardware
event counters.
.Pp
The performance counters can be read directly from user-mode without
invoking the kernel.
The macro
.Fn rdpmc ctr
takes 0, 1, 2 or 3 as an argument to specify a counter, and returns that
counter's 40-bit value (which will be of type
.Em pctrval ) .
This is generally preferable to making a system call as it introduces
less distortion in measurements.
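.Pp
Because the counters are only 40 bits wide, the difference between two
readings has to be taken modulo 2^40.
A minimal sketch follows, assuming the counter wraps back to zero after
reaching its 40-bit maximum and wraps at most once between the two
readings.
.Fn ex_ctr_delta
and the
.Dv EX_
constants are hypothetical names used for illustration only.
.Bd -literal -offset indent
```c
#include <stdint.h>

/* Width of a counter reading as described above (illustrative names). */
#define EX_CTR_BITS	40
#define EX_CTR_MASK	((UINT64_C(1) << EX_CTR_BITS) - 1)

/*
 * Number of events between two readings of the same counter.
 * The unsigned subtraction followed by the mask gives the right
 * answer even if the counter wrapped once between the readings.
 */
static uint64_t
ex_ctr_delta(uint64_t before, uint64_t after)
{
	return (after - before) & EX_CTR_MASK;
}
```
.Ed
.Pp
In real use the two arguments would be values returned by
.Fn rdpmc
before and after the code being measured.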
.Pp
Counter functions supported by these CPUs contain several parts.
The most significant byte (an 8-bit integer shifted left by
.Dv PCTR_CM_SHIFT )
contains a
.Em "counter mask" .
If non-zero, this sets a threshold for the number of times an event
must occur in one cycle for the counter to be incremented.
The
.Em "counter mask"
can therefore be used to count cycles in which an event
occurs at least some number of times.
The next byte contains several flags:
.Bl -tag -width PCTR_EN
.It Dv PCTR_U
Enables counting of events that occur in user mode.
.It Dv PCTR_K
Enables counting of events that occur in kernel mode.
You must set at least one of
.Dv PCTR_K
and
.Dv PCTR_U
to count anything.
.It Dv PCTR_E
Counts edges rather than cycles.
For some functions this allows you
to get an estimate of the number of events rather than the number of
cycles occupied by those events.
.It Dv PCTR_EN
Enable counters.
This bit must be set in the function for counter 0
in order for either of the counters to be enabled.
This bit should probably be set in counter 1 as well.
.It Dv PCTR_I
Inverts the sense of the
.Em "counter mask" . \&
When this bit is set, the counter only increments on cycles in which
there are no
.Em more
events than specified in the
.Em "counter mask" .
.El
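.Pp
The interaction between the
.Em "counter mask"
and
.Dv PCTR_I
can be modelled in plain C.
This is only a sketch of the rules as worded above, not of any
particular CPU.
.Fn ex_count
is a hypothetical helper, a zero counter mask is assumed to count every
event, and the comparison used at the threshold itself should be checked
against the processor manual.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Model of what a counter would accumulate over ncycles cycles,
 * where events[i] is the number of events in cycle i.
 *   cmask == 0       every event is counted
 *   cmask != 0       cycles with at least cmask events are counted
 *   invert != 0      cycles with no more than cmask events are counted
 */
static uint64_t
ex_count(const unsigned *events, size_t ncycles, unsigned cmask,
    int invert)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < ncycles; i++) {
		if (cmask == 0)
			total += events[i];
		else if (invert ? events[i] <= cmask : events[i] >= cmask)
			total++;
	}
	return total;
}
```
.Ed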
.Pp
The next byte (shifted left by
.Dv PCTR_UM_SHIFT )
contains flags specific to the event being counted, also known as the
.Em "unit mask" .
.Pp
For events dealing with the L2 cache, the following flags are valid
on Intel brand processors:
.Bl -tag -width PCTR_UM_M
.It Dv PCTR_UM_M
Count events involving modified cache coherency state lines.
.It Dv PCTR_UM_E
Count events involving exclusive cache coherency state lines.
.It Dv PCTR_UM_S
Count events involving shared cache coherency state lines.
.It Dv PCTR_UM_I
Count events involving invalid cache coherency state lines.
.El
.Pp
To measure all L2 cache activity, all these bits should be set.
They can be set with the macro
.Dv PCTR_UM_MESI
which contains the bitwise or of all of the above.
.Pp
For event types dealing with bus transactions, there is another flag
that can be set in the
.Em "unit mask" :
.Bl -tag -width PCTR_UM_A
.It Dv PCTR_UM_A
Count all appropriate bus events, not just those initiated by the
processor.
.El
.Pp
Events marked
.Em (MESI)
require the
.Dv PCTR_UM_[MESI]
bits in the
.Em "unit mask" . \&
Events marked
.Em (A)
can take the
.Dv PCTR_UM_A
bit.
.Pp
Finally, the least significant byte of the counter function is the
event type to count.
A list of possible event functions can be obtained by running
.Xr pctr 1
with the
.Fl l
option.
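.Pp
The four bytes described above can be combined into a counter function
as sketched below.
The
.Dv EX_
constants are illustrative stand-ins whose values are assumptions that
merely follow the byte layout described here; real code should use the
.Dv PCTR_
definitions from
.In machine/pctr.h .
.Fn ex_pctr_fn
and the accessor functions are hypothetical helpers.
.Bd -literal -offset indent
```c
#include <stdint.h>

/* Assumed values, chosen to match the layout described in the text. */
#define EX_CM_SHIFT	24		/* counter mask: top byte */
#define EX_UM_SHIFT	8		/* unit mask: second lowest byte */
#define EX_U		0x010000u	/* count user mode events */
#define EX_K		0x020000u	/* count kernel mode events */
#define EX_E		0x040000u	/* count edges */
#define EX_EN		0x400000u	/* enable counters */
#define EX_I		0x800000u	/* invert counter mask */

/* Build a counter function from its four parts. */
static uint32_t
ex_pctr_fn(uint8_t event, uint8_t umask, uint32_t flags, uint8_t cmask)
{
	return ((uint32_t)cmask << EX_CM_SHIFT) |
	    (flags & 0x00ff0000u) |
	    ((uint32_t)umask << EX_UM_SHIFT) |
	    event;
}

/* Take a counter function apart again. */
static uint8_t
ex_fn_event(uint32_t fn)
{
	return fn & 0xff;
}

static uint8_t
ex_fn_umask(uint32_t fn)
{
	return (fn >> EX_UM_SHIFT) & 0xff;
}

static uint8_t
ex_fn_cmask(uint32_t fn)
{
	return (fn >> EX_CM_SHIFT) & 0xff;
}
```
.Ed
.Pp
For example, with a made-up event number 0x24 and unit mask 0x0f,
counting in both user and kernel mode with the counters enabled yields
0x00430f24 under these assumed values.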
.Sh FILES
.Bl -tag -width /dev/pctr -compact
.It Pa /dev/pctr
.El
.Sh ERRORS
.Bl -tag -width "[ENODEV]"
.It Bq Er ENODEV
An attempt was made to set the counter functions on a CPU that does
not support counters.
.It Bq Er EINVAL
An invalid counter function was provided as an argument to the
.Dv PCIOCSx
.Em ioctl .
.It Bq Er EPERM
An attempt was made to set the counter functions, but the device was
not open for writing.
.El
.Sh SEE ALSO
.Xr pctr 1 ,
.Xr ioctl 2
.Sh HISTORY
A
.Nm
device first appeared in
.Ox 2.0 .
Support for amd64 architecture appeared in
.Ox 4.2 .
.Sh AUTHORS
.An -nosplit
The
.Nm
device was written by
.An David Mazieres Aq dm@lcs.mit.edu .
Support for amd64 architecture was written by
.An Mike Belopuhov Aq mikeb@openbsd.org .
.Sh BUGS
Not all counter functions are completely accurate.
Some of the functions may not make any sense at all.
Also, be aware that an interrupt may occur between invocations of
.Fn rdpmc
and/or
.Fn rdtsc ,
which can decrease the accuracy of measurements.