Age | Commit message | Author |
|
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() have to set
addresses and ports within the same critical section as the inpcb
hash table calculation. Also lookup and address selection have to
be protected to avoid bindings and connections that are not unique.
For that in_pcbpickport() and in_pcbbind_locked() expect that the
table mutex is already taken. The functions in_pcblookup_lock(),
in_pcblookup_local_lock(), and in_pcbaddrisavail_lock() grab the
mutex iff the lock parameter is IN_PCBLOCK_GRAB. Otherwise the
parameter is IN_PCBLOCK_HOLD and the lock has to be taken already.
Note that in_pcblookup_lock() and in_pcblookup_local() return an
inp with increased reference iff they take and release the lock.
Otherwise the caller protects the lifetime of the inp.
This gives enough flexibility that in_pcbbind() and in_pcbconnect()
can hold the table mutex when they need it. The public inpcb API
does not change.
OK sashan@ mvs@
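A minimal sketch of the lock parameter pattern, with illustrative names
and a simplified argument list (not the committed diff):

    struct inpcb *
    in_pcblookup_local_lock(struct inpcbtable *table, void *laddr,
        u_int lport, int flags, u_int rtable, int lock)
    {
        struct inpcb *inp;

        if (lock == IN_PCBLOCK_GRAB)
            mtx_enter(&table->inpt_mtx);
        else
            MUTEX_ASSERT_LOCKED(&table->inpt_mtx);

        /* hash lookup runs under the table mutex */
        inp = in_pcbhash_lookup(table, laddr, lport, flags, rtable);
        /* in_pcbhash_lookup() is a hypothetical locked helper */

        if (lock == IN_PCBLOCK_GRAB) {
            if (inp != NULL)
                in_pcbref(inp);    /* caller gets a reference */
            mtx_leave(&table->inpt_mtx);
        }
        return (inp);
    }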
|
|
Since soreceive() runs in parallel for raw sockets, sbappendaddr()
has to be protected by inpcb mutex. This was missing in multicast
forwarding which is running with a combination of shared net lock
and kernel lock. soreceive() uses shared net lock and mutex per
inpcb. Grab mutex before sbappendaddr() in socket_send() and
socket6_send().
panic "receive 1" reported by Jo Geraerts
OK mvs@ claudio@
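Roughly, the forwarding path now appends under the mutex, as in this
sketch (simplified, not the exact diff):

    /* soreceive() runs with shared net lock plus inpcb mutex, so
       sbappendaddr() must be serialized with the same mutex */
    mtx_enter(&inp->inp_mtx);
    ret = sbappendaddr(so, &so->so_rcv, sintosa(&src), m, NULL);
    mtx_leave(&inp->inp_mtx);

    if (ret == 0) {
        m_freem(m);        /* no socket buffer space */
        return;
    }
    sorwakeup(so);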
|
|
During initialization in_pcballoc() sets INP_IPV6 once to avoid
reaching through inp_socket->so_proto->pr_domain->dom_family. Use
this flag consistently.
OK sashan@ mvs@
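For illustration, the difference at a use site (sketch):

    /* before: reach through socket, protocol and domain */
    af = inp->inp_socket->so_proto->pr_domain->dom_family;

    /* after: INP_IPV6 is set once in in_pcballoc() and never changes */
    af = ISSET(inp->inp_flags, INP_IPV6) ? AF_INET6 : AF_INET;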
|
|
protects related data.
ok bluhm
|
|
The inpcb hash table is protected by table->inpt_mtx. The hash is
based on addresses, ports, and routing table. These fields were
not synchronized with the hash. Put writes and hash update into the
same critical section.
Move the updates from ip_ctloutput(), ip6_ctloutput(), syn_cache_get(),
tcp_connect(), udp_disconnect() to dedicated inpcb set functions.
There they use the same table mutex as in_pcbrehash().
in_pcbbind(), in_pcbconnect(), and in6_pcbconnect() need more work
and are not included yet.
OK sashan@ mvs@
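The shape of such a set function, as a sketch with a hypothetical name
(the locking contract of in_pcbrehash() is simplified here):

    void
    in_pcbset_laddr(struct inpcb *inp, struct in_addr laddr, u_int rtableid)
    {
        struct inpcbtable *table = inp->inp_table;

        /* hash input fields and rehash change in one critical section */
        mtx_enter(&table->inpt_mtx);
        inp->inp_laddr = laddr;
        inp->inp_rtableid = rtableid;
        in_pcbrehash(inp);
        mtx_leave(&table->inpt_mtx);
    }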
|
|
The public interface is in_pcbconnect(). It dispatches to
in6_pcbconnect() if necessary. Call the former from tcp_connect()
and udp_connect().
In in6_pcbconnect() the initialization in6a = NULL is not necessary.
in6_pcbselsrc() sets the pointer, but does not read the value.
Pass a constant in6_addr pointer to in6_pcbselsrc() and in6_selectsrc().
It returns a reference to the address of some internal data structure.
We want to be sure that in6_addr is not modified this way. IPv4
in_pcbselsrc() solves this by passing a copy of the address.
OK kn@ sashan@ mvs@
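The dispatch itself is straightforward (sketch; the v4 helper name is
hypothetical):

    int
    in_pcbconnect(struct inpcb *inp, struct mbuf *nam)
    {
    #ifdef INET6
        if (ISSET(inp->inp_flags, INP_IPV6))
            return (in6_pcbconnect(inp, nam));
    #endif
        return (in_pcbconnect_v4(inp, nam));    /* hypothetical */
    }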
|
|
TCP syn_cache_respond() uses inp_seclevel from listening socket as
ip_output() parameter. This was missing for ip6_output().
OK mvs@
|
|
As syn_cache_timer() uses syn cache mutex and exclusive net lock,
it does not need kernel lock.
OK mvs@
|
|
Struct inpcb field inp_socket is initialized in in_pcballoc(). It
is not NULL and never changed.
OK mvs@
|
|
rip6_output() did modify inp_outputopts6 temporarily to provide
different ip6_pktopts to in6_embedscope(). Better pass inp_outputopts6
and inp_moptions6 as separate arguments to in6_embedscope().
Simplify the code that deals with these options in in6_embedscope().
Document inp_moptions and inp_moptions6 as protected by net lock.
OK kn@
|
|
In some cases inp may be NULL, so check that before passing
inp->inp_seclevel to ipsp_spd_lookup() or ip_output().
Missed in previous commit.
|
|
ip_output() received inp as parameter. This is only used to lookup
the IPsec level of the socket. Reasoning about MP locking is much
easier if only relevant data is passed around. Convert ip_output()
to receive constant inp_seclevel as argument and mark it as protected
by net lock.
OK mvs@
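At a call site this reduces to passing the security level, or NULL when
there is no socket (sketch; the argument list is abbreviated and the
exact prototype may differ):

    /* before: the whole inpcb travels into ip_output() */
    error = ip_output(m, opts, &ro, flags, imo, inp);

    /* after: only the net-lock-protected seclevel is passed */
    error = ip_output(m, opts, &ro, flags, imo,
        inp ? &inp->inp_seclevel : NULL);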
|
|
Introduce global TCP SYN cache mutex. Divide the timer function into
parts protected by mutex and sending with netlock. Split the flags
field in dynamic flags protected by mutex and fixed flags set during
initialization. Document whether fields of struct syn_cache are
protected by net lock or mutex.
input and OK sashan@
|
|
OK mvs@ jca@
|
|
For implementing MP safe route lookup, it helps to know which
function parameters are constant. Add some const declarations, so
that the compiler guarantees that sockaddr dst parameter of
rtable_match() does not change.
OK dlg@
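E.g. the prototype gains a const qualifier, a sketch of the intent
(other parameters abbreviated from memory):

    /* the compiler now guarantees the lookup key stays read-only */
    struct rtentry  *rtable_match(unsigned int rtableid,
                        const struct sockaddr *dst, uint32_t *src);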
|
|
Since cheloha@ has implemented timeout processes that do not grab
the kernel lock, start using TIMEOUT_MPSAFE for arptimer().
OK kn@ mvs@
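With the timeout(9) API that is a flag at initialization time (sketch;
the timeout variable name is illustrative):

    #include <sys/timeout.h>

    struct timeout arptimeout;

    timeout_set_flags(&arptimeout, arptimer, &arptimeout,
        KCLOCK_NONE, TIMEOUT_PROC | TIMEOUT_MPSAFE);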
|
|
When receiving a pfkeyv2 SADB_ADD message, a newly created tdb can
fail in tdb_init(), which causes the tdb to not get added to the
global tdb list and to be dereferenced immediately. If a lifetime timeout
triggers on this tdb, it will unconditionally try to remove it from
the list and in the process deref once more than allowed,
causing a one bit corruption in the already freed up slot in the
tdb pool.
We resolve this issue by moving timeout_add() after tdb_init()
just before puttdb(). This means tdbs failing initialization
get discarded immediately as they only hold a single reference.
Valid tdbs get their timeouts activated just before we add them
to the tdb list, meaning the timeout can safely assume they are
linked.
Feedback from mvs@ and millert@
ok mvs@ mbuhl@
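Sketch of the corrected order (error handling simplified, field and
helper names illustrative):

    if (tdb_init(tdb, satype, &ii) != 0) {
        tdb_unref(tdb);        /* single reference, tdb is discarded */
        return (EINVAL);
    }
    /* arm lifetime timeouts only for tdbs that will be on the list */
    timeout_add_sec(&tdb->tdb_timer_tmo, lifetime);
    puttdb(tdb);               /* links the tdb into the global list */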
|
|
Using a scratch buffer makes it possible to take a consistent snapshot of
per-CPU counters without having to allocate memory.
Makes the ddb(4) "show uvmexp" command work in OOM situations.
ok kn@, mvs@, cheloha@
|
|
When called with NULL options, ip_output() and ip6_output() are MP
safe. Convert exclusive to shared net lock in send dispatch.
OK mpi@
|
|
TCP syn cache variable scs_use is basically counting packet insertions
into syn cache. Prefer type long to exclude overflow on fast
machines. Due to counting downwards from a limit, it can become
negative. Copy it out as tcps_sc_uses_left via sysctl, and print
it as signed long long integer.
OK mvs@
|
|
by unexpanding the SYN_CACHE_TIMER_ARM() macro in the timer callback.
OK mvs@
|
|
The syn_cache_reaper() is a hack to serialize timeouts. Unfortunately
it has a race and panics sometimes with pool_do_get: syncache free
list modified. Add a reference counter for timeout and list of syn
cache entries. Currently the list refcount is not strictly necessary
due to exclusive netlock, but will be needed when we continue
unlocking.
Checking timeout_initialized() is not MP friendly, better do proper
initialization during object allocation. Refcount in btrace helps
to find leaks.
bug reported and fix tested by Peter J. Philipp
OK claudio@
|
|
if TDBF_IFACE is set on a tdb, the ipsec stack will pass it to the
sec(4) driver to keep track of instead of wiring it up for security
associations to use.
when sec(4) transmits a packet, it will look up its list of tdbs
to find the right SA to encrypt and send the packet out with.
if an incoming ipsec packet arrives with TDBF_IFACE set, it's passed
to sec(4) to be injected back into the network stack as if it was
received on the sec interface, instead of being reinjected into the
IP stack like normal SA/SPD processing does.
note that this means you do not have to configure tunnel endpoints
on sec(4) interfaces, instead you line the interface unit number
in the ipsec config up with the minor number of the sec(4) interfaces.
the peer IPs used on the SAs are what's used as the traffic endpoints.
support from many including markus@ tobhe@ claudio@ sthen@ patrick@
now is a good time deraadt@
|
|
rather than use ipsec flows (aka, entries in the ipsec security
policy database) to decide which traffic should be encapsulated in
ipsec and sent to a peer, this tweaks security associations (SAs)
so they can refer to a tunnel interface. when traffic is routed
over that tunnel interface, an ipsec SA is looked up and used to
encapsulate traffic before being sent to the peer on the SA. When
traffic is received from a peer using an interface SA, the specified
interface is looked up and the packet is handed to it so it looks
like packets come out of the tunnel.
to support this, SAs get a TDBF_IFACE flag and iface and iface_dir
fields. When TDBF_IFACE is set the iface and dir fields are
considered valid, and the tdb/SA should be used with the tunnel
interface instead of the SPD.
support from many including markus@ tobhe@ claudio@ sthen@ patrick@
now is a good time deraadt@
|
|
Implement vlan-tag parsing in ether_extract_header() and use this
information to adjust the MSS calculation of LRO packets.
pointed out by mbuhl and bluhm
with tweaks from bluhm
ok bluhm@
|
|
When doing LRO (Large Receive Offload), the drivers, currently ix(4)
and lo(4) only, record an upper bound of the size of the original
packets in ph_mss. When sending, either stack or hardware must
chop the packets with TSO (TCP Segmentation Offload) to that size.
That means we have to call tcp_if_output_tso() before ifp->if_output().
Put that logic into if_output_tso() to avoid code duplication. As
TCP packets on the wire do not get larger that way, path MTU discovery
should still work.
tested by and OK jan@
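Conceptually the helper looks like this (sketch, not the committed
function body):

    int
    if_output_tso(struct ifnet *ifp, struct mbuf **mp, struct sockaddr *dst,
        struct rtentry *rt, u_int mtu)
    {
        /* LRO aggregates carry M_TCP_TSO and an upper bound in ph_mss;
           chop them in software or hand them to TSO hardware */
        if (ISSET((*mp)->m_pkthdr.csum_flags, M_TCP_TSO))
            return (tcp_if_output_tso(ifp, mp, dst, rt,
                IFCAP_TSOv4 | IFCAP_TSOv6, mtu));

        return (ifp->if_output(ifp, *mp, dst, rt));
    }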
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK mvs
Feedback OK bluhm
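The refcnt(9) pattern being switched to, for reference (sketch; the
object and its field are illustrative):

    #include <sys/refcnt.h>

    refcnt_init(&obj->obj_refcnt);      /* creator holds one reference */

    refcnt_take(&obj->obj_refcnt);      /* transitions visible to dt(4) */
    if (refcnt_rele(&obj->obj_refcnt))  /* true once the last ref is gone */
        obj_free(obj);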
|
|
After changing tcp now tick to milliseconds, 32 bits will wrap
around after 49 days of uptime. That may be a problem in some
places of our stack. Better use a 64 bit counter.
As timestamp option is 32 bit in TCP protocol, use the lower 32 bit
there. There are casts to 32 bits that should behave correctly.
Start with random 63 bit offset to avoid uptime leakage. 2^63
milliseconds result in 2.9*10^8 years of possible uptime.
OK yasuoka@
|
|
moving pf forward has been a real struggle, and pfsync has been a
constant source of pain. we have been papering over the problems
for a while now, but it reached the point that it needed a fundamental
restructure, which is what this diff is.
the big headliner changes in this diff are:
- pfsync specific locks
this is the whole reason for this diff.
rather than rely on NET_LOCK or KERNEL_LOCK or whatever, pfsync now
has its own locks to protect its internal data structures. this
is important because pfsync runs a bunch of timeouts and tasks to
push pfsync packets out on the wire, or when it's handling requests
generated by incoming pfsync packets, both of which happen outside
pf itself running. having pfsync specific locks around pfsync data
structures makes the mutations of these data structures a lot more
explicit and auditable.
- partitioning
to enable future parallelisation of the network stack, this rewrite
includes support for pfsync to partition states into different "slices".
these slices run independently, ie, the states collected by one slice
are serialised into a separate packet to the states collected and
serialised by another slice.
states are mapped to pfsync slices based on the pf state hash, which
is the same hash that the rest of the network stack and multiq
hardware uses.
- no more pfsync called from netisr
pfsync used to be called from netisr to try and bundle packets, but now
that there's multiple pfsync slices this doesn't make sense. instead it
uses tasks in softnet tqs.
- improved bulk transfer handling
there's shiny new state machines around both the bulk transmit and
receive handling. pfsync used to do horrible things to carp demotion
counters, but now it is very predictable and returns the counters back
where they started.
- better tdb handling
the tdb handling was pretty hairy, but hrvoje has kicked this around
a lot with ipsec and sasyncd and we've found and fixed a bunch of
issues as a result of that testing.
- mpsafe pf state purges
this was committed previously, but because the locks pfsync relied on
weren't clear this just caused a ton of bugs. as part of this diff it's
now reliable, and moves a big chunk of work out from under KERNEL_LOCK,
which in turn improves the responsiveness and throughput of a firewall
even if you're not using pfsync.
there's a bunch of other little changes along the way, but the above are
the big ones.
hrvoje has done performance testing with this diff and notes a big
improvement when pfsync is not in use. performance when pfsync is
enabled is about the same, but i'm hoping the slices mean we can scale
along with pf as it improves.
lots (months) of testing by me and hrvoje on pfsync boxes
tests and ok sashan@
deraadt@ says this is a good time to put it in
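the slice mapping is conceptually just this (sketch, names hypothetical):

    /* states map to slices by the pf state hash, so a given state
       always lands on the same slice */
    unsigned int
    pfsync_slice_idx(uint32_t state_hash, unsigned int nslices)
    {
        return (state_hash % nslices);
    }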
|
|
OK jmatthew@
|
|
If tcplro is activated on lo(4), ignore the MTU with TCP packets.
They are passed along with the information that they have to be
chopped in case they are forwarded later. New netstat(1) counter
shows that software LRO is in effect. The feature is currently
turned off by default.
tested by jan@; OK claudio@ jan@
|
|
Replace hand-rolled reference counting with refcnt_init(9) and hook it up
with a new dt(4) probe.
OK bluhm mvs
|
|
Goal is to run UDP input in parallel. Btrace kstack analysis shows
that SIP hash for PCB lookup is quite expensive. When running in
parallel, there is also lock contention on the PCB table mutex.
It results in better performance to calculate the hash value before
taking the mutex. The hash secret has to be constant as hash
calculation must not depend on values protected by the table mutex.
Do not reseed anymore when hash table gets resized.
Analysis also shows that asserting a rw_lock while holding a mutex
is a bit expensive. Just remove the netlock assert.
OK dlg@ mvs@
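In code the reordering looks roughly like this (sketch; the lookup
helper name is hypothetical):

    /* the hash depends only on the constant secret and the lookup keys,
       so it can be computed before the mutex is taken */
    hash = in_pcbhash(table, rdomain, &faddr, fport, &laddr, lport);

    mtx_enter(&table->inpt_mtx);
    inp = in_pcbhash_lookup(table, hash, rdomain,
        &faddr, fport, &laddr, lport);
    mtx_leave(&table->inpt_mtx);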
|
|
ok bluhm
|
|
Our syn cache did checksum calculation by hand, instead of the
established mechanism in ip output. The software-checksummed counter
increased once per incoming TCP connection.
Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let
in_proto_cksum_out() do the work later. Then hardware checksumming
is used where available. Also remove redundant code. The unhandled
af case is handled in the first switch statement of the function.
tested by Hrvoje Popovski; OK mvs@
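The core of the change is one flag instead of a hand-rolled checksum
(sketch):

    /* in syn_cache_respond(): defer the checksum to the output path */
    m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT;
    /* in_proto_cksum_out() later fills in th_sum, or leaves it to
       the hardware if the interface can checksum TCP */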
|
|
With tweaks from patrick@ and bluhm@.
OK bluhm@
|
|
When sending TCP packets with software TSO to the local address of
a physical interface, the TCP checksum was miscalculated. As the
small MSS is taken from the physical interface, but the large MTU
of the loopback interface is used, large TSO packets are generated,
but sent directly to the loopback interface. There we need the
regular pseudo header checksum and not the modified one without the
packet length.
To avoid this confusion, use the same decision for checksum generation
in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso().
bug reported and tested by robert@ bket@ Hrvoje Popovski
OK claudio@ jan@
|
|
compliance with POSIX/SUS restrictions on <netinet/tcp.h>
ok bluhm@
ports testing and ok sthen@
|
|
layer.
|
|
With a lot of tweaks, improvements and testing from bluhm.
Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.
ok bluhm
|
|
safe. We have many of them, so use a flag instead of pushing the kernel lock
within.
Unlock ip_sysctl(). Still take kernel lock within IPCTL_MRTSTATS case.
It looks like `mrtstat' protection is inconsistent, so keep locking as
it was. Since `mrtstat' are counters, it makes sense to rework them into
per-CPU counters in separate diffs.
Feedback and ok from bluhm@
|
|
This diff introduces separate capabilities for TCP offloading. We split this
into LRO (large receive offloading) and TSO (TCP segmentation offloading).
LRO can be turned on/off via tcprecvoffload option of ifconfig and is not
inherited to sub interfaces.
TSO is inherited by sub interfaces to signal this hardware offloading capability
to the network stack.
With tweaks from bluhm, claudio and dlg
ok bluhm, claudio
|
|
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the ifconfig
flags and pseudo interfaces capabilities will be done separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
|
|
Introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
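Such a helper plausibly looks like this (sketch under assumptions, not
necessarily the committed body):

    void
    in_hdr_cksum_out(struct mbuf *m, struct ifnet *ifp)
    {
        struct ip *ip = mtod(m, struct ip *);

        ip->ip_sum = 0;
        if (ifp != NULL && ISSET(ifp->if_capabilities, IFCAP_CSUM_IPv4))
            SET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
        else
            ip->ip_sum = in_cksum(m, ip->ip_hl << 2);
    }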
|
|
always set together with ARP mutex.
OK mvs@
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec need
the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counter with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
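A sketch of an if_output_ml()-style helper (error handling simplified):

    int
    if_output_ml(struct ifnet *ifp, struct mbuf_list *ml,
        struct sockaddr *dst, struct rtentry *rt)
    {
        struct mbuf *m;
        int error = 0;

        while ((m = ml_dequeue(ml)) != NULL) {
            error = ifp->if_output(ifp, m, dst, rt);
            if (error)
                break;
        }
        if (error)
            ml_purge(ml);    /* drop what could not be sent */

        return (error);
    }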
|
|
|
|
So kernel lock is only needed for changing the route rt_flags. In
arpresolve() protect rt_llinfo lookup and llinfo_arp modification
with arp_mtx. Grab kernel lock for rt_flags reject modification
only when needed.
Tested by Hrvoje Popovski; OK patrick@ kn@
|