.\" $OpenBSD: sosplice.9,v 1.7 2013/07/17 20:21:55 schwarze Exp $
.\"
.\" Copyright (c) 2011-2013 Alexander Bluhm <bluhm@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: July 17 2013 $
.Dt SOSPLICE 9
.Os
.Sh NAME
.Nm sosplice ,
.Nm somove
.Nd splice two sockets for zero-copy data transfer
.Sh SYNOPSIS
.Ft int
.Fn sosplice "struct socket *so" "int fd" "off_t max" "struct timeval *tv"
.Ft int
.Fn somove "struct socket *so" "int wait"
.Sh DESCRIPTION
The function
.Fn sosplice
is used to splice together a source and a drain socket.
The source socket is passed as the
.Fa so
argument;
the file descriptor of the drain is passed in
.Fa fd .
If
.Fa fd
is negative, an existing splicing gets dissolved.
If
.Fa max
is positive, at most that many bytes will get transferred.
If
.Fa tv
is not
.Dv NULL ,
a
.Xr timeout 9
is scheduled to dissolve splicing if no data can be
transferred for the specified period of time.
Socket splicing can be invoked from userland via the
.Xr setsockopt 2
system call at the
.Dv SOL_SOCKET
level with the socket option
.Dv SO_SPLICE .
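.Pp
As a minimal sketch of the userland side, assuming the
.Vt struct splice
layout documented in
.Xr setsockopt 2
(fields
.Fa sp_fd ,
.Fa sp_max
and
.Fa sp_idle ) ,
a source socket could be spliced to a drain as shown here.
The helper name
.Fn splice_sockets
is hypothetical, and the
.Dv SO_SPLICE
guard with its
.Er ENOPROTOOPT
fallback is an assumption that only keeps the fragment compilable on
systems without socket splicing:
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: splice source_fd to drain_fd.
 * max == 0 means no byte limit; idle == NULL means no idle timeout.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
splice_sockets(int source_fd, int drain_fd, off_t max,
    const struct timeval *idle)
{
#ifdef SO_SPLICE
	struct splice sp;

	memset(&sp, 0, sizeof(sp));
	sp.sp_fd = drain_fd;		/* drain socket file descriptor */
	sp.sp_max = max;		/* maximum number of bytes, 0 for none */
	if (idle != NULL)
		sp.sp_idle = *idle;	/* idle timeout */
	return setsockopt(source_fd, SOL_SOCKET, SO_SPLICE, &sp, sizeof(sp));
#else
	/* Assumption: report missing support like an unknown option. */
	(void)source_fd;
	(void)drain_fd;
	(void)max;
	(void)idle;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```
.Ed
.Pp
Calling the helper with a
.Fa max
of 0 and a
.Dv NULL
.Fa idle
pointer would request a splice without byte limit or idle timeout.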
.Pp
Before connecting both sockets, several checks are executed.
See the
.Sx ERRORS
section for possible failures.
The connection between both sockets is implemented by setting these
additional fields in
.Vt struct socket :
.Pp
.Bl -dash -compact -offset indent
.It
.Vt struct socket Fa *so_splice
links from the source to the drain socket.
.It
.Vt struct socket Fa *so_spliceback
links back from the drain to the source socket.
.It
.Vt off_t Fa so_splicelen
counts the number of bytes spliced so far from this socket.
.It
.Vt off_t Fa so_splicemax
specifies the maximum number of bytes to splice from this socket if
non-zero.
.It
.Vt struct timeval Fa so_idletv
specifies the maximum idle time if non-zero.
.It
.Vt struct timeout Fa so_idleto
provides storage for the kernel timeout if idle time is used.
.El
.Pp
After connecting both sockets,
.Fn sosplice
calls
.Fn somove
to transfer the mbufs already in the source receive buffer to the
drain send buffer.
Finally the socket buffer flag
.Dv SB_SPLICE
is set on both socket buffers, to indicate that the protocol layer
has to call
.Fn somove
whenever data or space is available.
.Pp
The function
.Fn somove
transfers data from the source's receive buffer to the drain's send
buffer.
It must be called at
.Xr splsoftnet 9
and
.Fa so
must be a spliced source socket.
It may be necessary to split an mbuf to handle out-of-band data
inline or when the maximum splice length has been reached.
If
.Fa wait
is
.Dv M_WAIT ,
splitting mbufs will always succeed.
For
.Dv M_DONTWAIT
the out-of-band property might get lost or a short splice might
happen.
In the latter case, fewer than the given maximum number of bytes are
transferred and userland has to cope with this.
Note that a short splice cannot happen if
.Fn somove
was called by
.Fn sosplice .
So a second
.Xr setsockopt 2
after a short splice pointing to the same maximum will always
succeed.
.Pp
Before transferring data,
.Fn somove
checks both sockets for errors and that the drain socket is connected.
If the drain cannot send anymore, an
.Er EPIPE
error is set on the source socket.
The data length to move is limited by the optional maximum splice
length and the space in the drain's send socket buffer.
Up to this amount of data is taken out of the source's receive
socket buffer.
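.Pp
The length limit described above can be sketched as a small userland
model.
It is not the kernel code: the function
.Fn splice_move_len
and its parameter names are hypothetical and only illustrate how the
three bounds combine.
.Bd -literal -offset indent
```c
/*
 * Userland model, not kernel code: how many bytes one transfer
 * round may move.  All names are hypothetical.
 *   avail     bytes queued in the source's receive buffer
 *   space     free bytes in the drain's send buffer
 *   splicemax maximum splice length, 0 if none was given
 *   splicelen bytes already spliced from the source
 */
unsigned long
splice_move_len(unsigned long avail, unsigned long space,
    unsigned long splicemax, unsigned long splicelen)
{
	unsigned long len = space;

	if (splicemax != 0) {
		unsigned long left;

		/* Bytes remaining until the maximum is reached. */
		left = splicemax > splicelen ? splicemax - splicelen : 0;
		if (left < len)
			len = left;
	}
	/* Never take more than the receive buffer holds. */
	if (avail < len)
		len = avail;
	return len;
}
```
.Ed
.Pp
With 1000 bytes queued, 400 bytes of drain space and 200 bytes left
until the maximum, this model moves 200 bytes.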
.Pp
For atomic protocols, either one complete packet is taken out, or
nothing is taken at all if:
the packet is bigger than the drain's send buffer size, in which
case the splicing gets aborted with an
.Er EMSGSIZE
error;
the packet does not fit into the drain's current send buffer space,
in which case it is left in the source's receive buffer for later
processing;
or the maximum splice length is located within a packet, in which
case splicing gets dissolved like a short splice.
All address or control mbufs associated with the taken packet are
dropped.
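.Pp
The cases for atomic protocols can be summarised in a userland model.
The enum, the function
.Fn atomic_decide
and its parameters are hypothetical names, and the order in which
overlapping conditions are tested simply follows the text above; it
is an assumption, not a statement about the kernel implementation.
.Bd -literal -offset indent
```c
/* Hypothetical outcome of handling one packet of an atomic protocol. */
enum atomic_action {
	ATOMIC_TAKE,		/* move the complete packet */
	ATOMIC_EMSGSIZE,	/* abort splicing with EMSGSIZE */
	ATOMIC_LATER,		/* leave packet in the receive buffer */
	ATOMIC_SHORT		/* dissolve like a short splice */
};

/*
 * Userland model, not kernel code.
 *   pktlen   length of the next packet in the source
 *   sndsize  total size of the drain's send buffer
 *   sndspace currently free space in the drain's send buffer
 *   has_max  non-zero if a maximum splice length was given
 *   max_left bytes left until that maximum is reached
 */
enum atomic_action
atomic_decide(unsigned long pktlen, unsigned long sndsize,
    unsigned long sndspace, int has_max, unsigned long max_left)
{
	if (pktlen > sndsize)
		return ATOMIC_EMSGSIZE;	/* can never fit */
	if (pktlen > sndspace)
		return ATOMIC_LATER;	/* may fit once space is freed */
	if (has_max && max_left < pktlen)
		return ATOMIC_SHORT;	/* maximum lies within packet */
	return ATOMIC_TAKE;
}
```
.Ed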
.Pp
If the maximum splice length has been reached, an mbuf may get
split for non-atomic protocols.
Otherwise an mbuf is either moved completely to the send buffer or
left in the receive buffer for later processing.
If
.Dv SO_OOBINLINE
is set, out-of-band data will get moved as such
although this might not be reliable.
The data is sent out to the drain socket via the protocol function.
If that fails and the drain socket cannot send anymore, an
.Er EPIPE
error is set on the source socket.
.Pp
For packet-oriented protocols,
.Fn somove
iterates over the next packet queue.
.Pp
If a maximum splice length was specified and at least this amount
of data has been transferred to the drain socket, splicing gets
dissolved.
In this case, an
.Er EFBIG
error is set on the source socket.
Userland can process this error to distinguish the full splice from
a short splice or to react to the completed maximum splice immediately.
If an idle timeout was specified and no data has been transferred
for that period of time, the handler
.Fn soidle
dissolves splicing and sets an
.Er ETIMEDOUT
error on the source socket.
.Pp
The function
.Fn sounsplice
is called to dissolve the socket splicing if the source socket
cannot receive anymore and its receive buffer is empty; or if the
drain socket cannot send anymore; or if the maximum has been reached;
or if an error occurred; or if the idle timeout has fired.
.Pp
If the socket buffer flag
.Dv SB_SPLICE
is set, the functions
.Fn sorwakeup
and
.Fn sowwakeup
will call
.Fn somove
to trigger the transfer when new data or buffer space is available.
While socket splicing is active, any
.Xr read 2
from the source socket will block and the wakeup will not be delivered
to the file descriptor.
A read event or a socket error is signaled to userland after
dissolving.
.Sh RETURN VALUES
.Fn sosplice
returns 0 on success and otherwise the error number.
.Fn somove
returns 0 if socket splicing has been finished and 1 if it continues.
.Sh ERRORS
.Fn sosplice
will succeed unless:
.Bl -tag -width Er
.It Bq Er EBADF
The given file descriptor
.Fa fd
is not an active descriptor.
.It Bq Er EBUSY
The source or the drain socket is already spliced.
.It Bq Er EINVAL
The given maximum value
.Fa max
is negative.
.It Bq Er ENOTCONN
The source socket requires a connection and is neither connected
nor in the process of connecting to a peer.
.It Bq Er ENOTCONN
The drain socket is neither connected nor in the process of connecting
to a peer.
.It Bq Er ENOTSOCK
The given file descriptor
.Fa fd
is not a socket.
.It Bq Er EOPNOTSUPP
The source or the drain socket is a listen socket.
.It Bq Er EPROTONOSUPPORT
The source socket's protocol layer does not have the
.Dv PR_SPLICE
flag set.
Only TCP and UDP socket splicing is supported.
.It Bq Er EPROTONOSUPPORT
The drain socket's protocol does not have the same
.Fa pr_usrreq
function as the source.
.It Bq Er EWOULDBLOCK
The source socket is non-blocking and the receive buffer is already
locked.
.El
.Sh SEE ALSO
.Xr setsockopt 2 ,
.Xr options 4 ,
.Xr timeout 9
.Sh HISTORY
Socket splicing for TCP first appeared in
.Ox 4.9 ;
support for UDP was added in
.Ox 5.3 .
.Sh AUTHORS
.An -nosplit
The idea for socket splicing originally came from
.An Markus Friedl Aq Mt markus@openbsd.org ,
and
.An Alexander Bluhm Aq Mt bluhm@openbsd.org
implemented it.
.An Mike Belopuhov Aq Mt mikeb@openbsd.org
added the timeout feature.