Goal is to run UDP input in parallel. Btrace kstack analysis shows
that SIP hash for PCB lookup is quite expensive. When running in
parallel, there is also lock contention on the PCB table mutex.
Calculating the hash value before taking the mutex results in
better performance. The hash secret has to be constant, as the
hash calculation must not depend on values protected by the
table mutex. Do not reseed anymore when the hash table gets
resized.
Analysis also shows that asserting a rw_lock while holding a mutex
is a bit expensive. Just remove the netlock assert.
OK dlg@ mvs@
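A minimal sketch of the ordering, with hypothetical helper names
(pcb_hash(), pcb_hash_lookup()) and an assumed mutex field
inpt_mtx; the point is that the hash input involves no state
protected by the mutex:

struct inpcb *
pcb_lookup_sketch(struct inpcbtable *table, u_int rdomain,
    struct in_addr faddr, u_short fport,
    struct in_addr laddr, u_short lport)
{
	struct inpcb *inp;
	uint64_t hash;

	/* constant secret, no table state: safe outside the mutex */
	hash = pcb_hash(table, rdomain, faddr, fport, laddr, lport);

	mtx_enter(&table->inpt_mtx);
	inp = pcb_hash_lookup(table, hash, rdomain, faddr, fport,
	    laddr, lport);
	mtx_leave(&table->inpt_mtx);

	return (inp);
}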
|
|
ok bluhm
|
|
Our syn cache did checksum calculation by hand, instead of using
the established mechanism in IP output. The software-checksummed
increased once per incoming TCP connection.
Just set the flag M_TCP_CSUM_OUT in syn_cache_respond() and let
in_proto_cksum_out() do the work later. Then hardware checksumming
is used where available. Also remove redundant code. The unhandled
af case is handled in the first switch statement of the function.
tested by Hrvoje Popovski; OK mvs@
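A minimal sketch of the mechanism; M_TCP_CSUM_OUT is the regular
mbuf checksum flag:

	/* mark the SYN,ACK instead of checksumming by hand */
	m->m_pkthdr.csum_flags |= M_TCP_CSUM_OUT;
	/*
	 * later, in_proto_cksum_out() computes th_sum in
	 * software or leaves it to hardware checksum offload
	 */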
|
|
With tweaks from patrick@ and bluhm@.
OK bluhm@
|
|
When sending TCP packets with software TSO to the local address of
a physical interface, the TCP checksum was miscalculated. As the
small MSS is taken from the physical interface, but the large MTU
of the loopback interface is used, large TSO packets are generated,
but sent directly to the loopback interface. There we need the
regular pseudo header checksum and not the modified one without
the packet length.
To avoid this confusion, use the same decision for checksum generation
in in_proto_cksum_out() as for using hardware TSO in tcp_if_output_tso().
bug reported and tested by robert@ bket@ Hrvoje Popovski
OK claudio@ jan@
|
|
compliance with POSIX/SUS restrictions on <netinet/tcp.h>
ok bluhm@
ports testing and ok sthen@
|
|
layer.
|
|
With a lot of tweaks, improvements and testing from bluhm.
Thanks to Hrvoje Popovski from the University of Zagreb for
his great testing effort to make this happen.
ok bluhm
|
|
safe. We have many of them, so use a flag instead of pushing the
kernel lock within.
Unlock ip_sysctl(). Still take the kernel lock within the
IPCTL_MRTSTATS case. It looks like `mrtstat' protection is
inconsistent, so keep locking as it was. Since `mrtstat' are
counters, it makes sense to rework them into per-CPU counters in
separate diffs.
Feedback and ok from bluhm@
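A sketch of the suggested rework under percpu(9), with a
hypothetical counter enum for mrtstat:

#include <sys/percpu.h>

enum mrtstat_counters { mrts_mfc_lookups, mrts_wrong_if,
    mrts_ncounters };

struct cpumem *mrtcounters;

void
mrt_counters_init(void)
{
	mrtcounters = counters_alloc(mrts_ncounters);
}

void
mrt_count_wrong_if(void)
{
	/* per-CPU increment, no mutex or kernel lock needed */
	counters_inc(mrtcounters, mrts_wrong_if);
}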
|
|
This diff introduces separate capabilities for TCP offloading. We split this
into LRO (large receive offloading) and TSO (TCP segmentation offloading).
LRO can be turned on/off via the tcprecvoffload option of
ifconfig(8) and is not inherited by sub-interfaces.
TSO is inherited by sub-interfaces to signal this hardware
offloading capability to the network stack.
With tweaks from bluhm, claudio and dlg
ok bluhm, claudio
|
|
If the driver of a network interface claims to support TSO, do not
chop the packet in software, but pass it down to the interface
layer.
Precalculate parts of the pseudo header checksum, but without the
packet length. The length of all generated smaller packets is not
known yet. Driver and hardware will use the mbuf packet header
field ph_mss to calculate it and update the checksum.
Introduce separate flags IFCAP_TSOv4 and IFCAP_TSOv6 as hardware
might support only one protocol family. The old flag IFXF_TSO is
only relevant for large receive offload. It is misnamed, but keep
that for now.
Note that drivers do not set TSO capabilities yet. Also the
ifconfig flags and pseudo-interface capabilities will be done
separately.
So this commit should not change behavior.
heavily based on the work from jan@; OK sashan@
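A sketch of the handoff, assuming the flag name M_TCP_TSO;
ph_mss is the packet header field named above:

	/* mss computed earlier from route and interface */
	SET(m->m_pkthdr.csum_flags, M_TCP_TSO);
	m->m_pkthdr.ph_mss = mss;
	/*
	 * the pseudo header checksum is preset without the
	 * length; driver and hardware fold in each generated
	 * segment's length and finish the checksum
	 */
	error = ifp->if_output(ifp, m, dst, rt);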
|
|
introduce in_hdr_cksum_out(). It is used like in_proto_cksum_out().
OK claudio@
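A sketch of what such a function might look like, by analogy
with in_proto_cksum_out(); the capability check details are
assumed:

void
in_hdr_cksum_out(struct mbuf *m, struct ifnet *ifp)
{
	struct ip *ip = mtod(m, struct ip *);

	ip->ip_sum = 0;
	if (ifp != NULL &&
	    ISSET(ifp->if_capabilities, IFCAP_CSUM_IPv4))
		SET(m->m_pkthdr.csum_flags, M_IPV4_CSUM_OUT);
	else
		ip->ip_sum = in_cksum(m, ip->ip_hl << 2);
}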
|
|
always set together with ARP mutex.
OK mvs@
|
|
meant as a fallback if network hardware does not support TSO. Driver
support is still work in progress. TCP output generates large
packets. In IP output the packet is chopped to TCP maximum segment
size. This reduces the CPU cycles used by pf. The regular output
could be assisted by hardware later, but pf route-to and IPsec
need the software fallback in general.
For performance comparison or to workaround possible bugs, sysctl
net.inet.tcp.tso=0 disables the feature. netstat -s -p tcp shows
TSO counters with chopped and generated packets.
based on work from jan@
tested by jmc@ jan@ Hrvoje Popovski
OK jan@ claudio@
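A sketch of the fallback path in IP output, with assumed
signatures for tcp_chopper() and if_output_ml():

	if (ISSET(m->m_pkthdr.csum_flags, M_TCP_TSO) &&
	    m->m_pkthdr.len > ifp->if_mtu) {
		struct mbuf_list ml;
		int error;

		/* chop to ph_mss sized packets in software */
		error = tcp_chopper(m, &ml, ifp, m->m_pkthdr.ph_mss);
		if (error)
			return (error);
		return (if_output_ml(ifp, &ml, dst, rt));
	}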
|
|
is passed to ifp->if_output(). The fragment code has its own
checksum calculation and the other paths end in goto bad.
OK claudio@
|
|
if_output_ml() to send mbuf lists to interfaces. This can be used
for TSO, fragments, ARP and ND6. Rename variable fml to ml. In
pf_route6() split the if else block. Put the safety check (hlen +
firstlen < tlen) into ip_fragment(). It makes the code correct in
case the packet is too short to be fragmented. This should not
happen, but other functions also have this logic.
No functional change. OK sashan@
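A minimal sketch of such a function; the real error handling may
differ:

int
if_output_ml(struct ifnet *ifp, struct mbuf_list *ml,
    struct sockaddr *dst, struct rtentry *rt)
{
	struct mbuf *m;
	int error = 0;

	while ((m = ml_dequeue(ml)) != NULL) {
		error = ifp->if_output(ifp, m, dst, rt);
		if (error) {
			/* drop the rest of the list */
			ml_purge(ml);
			break;
		}
	}

	return (error);
}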
|
|
|
|
So kernel lock is only needed for changing the route rt_flags. In
arpresolve() protect rt_llinfo lookup and llinfo_arp modification
with arp_mtx. Grab kernel lock for rt_flags reject modification
only when needed.
Tested by Hrvoje Popovski; OK patrick@ kn@
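A sketch of the resulting locking pattern (assumed shape):

	struct llinfo_arp *la;

	mtx_enter(&arp_mtx);
	la = (struct llinfo_arp *)rt->rt_llinfo;
	if (la != NULL) {
		/* lookup and modification under arp_mtx */
	}
	mtx_leave(&arp_mtx);
	/*
	 * the kernel lock is grabbed separately, only to change
	 * the reject bit in rt_flags
	 */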
|
|
in6.c already has the privilege check as early as possible;
make in.c match.
For unprivileged IPv4 ioctl calls with invalid args, this changes errno from
E* to EPERM.
OK bluhm
|
|
read-only access to netlock protected data, so the radix tree
will not be modified during a spd_table_walk() run.
Also change the netlock assertion within spd_table_add() and
ipsec_delete_policy() to exclusive. These are the correlated
functions which modify the radix tree, so this makes sure that
running spd_table_walk() with the shared netlock is safe.
Feedback and ok by bluhm@
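The assertions boil down to this pattern:

	/*
	 * in spd_table_add() and ipsec_delete_policy(): writers
	 * to the SPD radix tree hold the exclusive netlock, so
	 * spd_table_walk() under the shared netlock cannot race
	 */
	NET_ASSERT_LOCKED_EXCLUSIVE();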
|
|
|
|
|
|
All cases do the same check as a first step, so merge it before
the switch and before grabbing exclusive locks.
OK mvs
|
|
Just like in6_ioctl_get(), read ioctls are safe with the shared net lock to
protect interface addresses and flags.
OK mvs
|
|
Defer sending until after the unlock, reusing `refresh' from a
similar construct.
OK bluhm
|
|
feedback and ok jmc@ miod, ok millert@
|
|
immutable, we don't need to reload it again.
ok bluhm@
|
|
grabs the exclusive netlock and that is sufficient for
in_arpinput() and arpcache().
with kn@; OK mvs@; tested by Hrvoje Popovski
|
|
response. Implement the analogous sysctl
net.inet6.icmp6.nd6_queued for ND6 to reduce places where mbufs
can hide within the kernel.
Atomic operations operate on unsigned int. Make the type of the
total hold queue length consistent.
Use atomic load to read the value for the sysctl. This clarifies
why no lock around sysctl_rdint() is needed.
OK mvs@ kn@
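A sketch of the read side, with assumed names for the sysctl
constant and the counter variable:

	case ICMPV6CTL_ND6_QUEUED:
		/* lock-free: a single atomic load of the counter */
		return (sysctl_rdint(oldp, oldlenp, newp,
		    atomic_load_int(&ln_hold_total)));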
|
|
ND6 held only a single packet. Unify the logic and add an mbuf
hold queue to struct llinfo_nd6. This is MP safe and queue limits
are tracked with atomic operations. New function if_mqoutput()
has common code for ARP and ND6. ln_saddr6 holds the source
address of the requesting packet. That is easier than fiddling
with the mbuf queue in nd6_ns_output().
OK kn@
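A minimal sketch of if_mqoutput(), with an assumed signature:

void
if_mqoutput(struct ifnet *ifp, struct mbuf_queue *mq,
    struct sockaddr *dst, struct rtentry *rt)
{
	struct mbuf_list ml;
	struct mbuf *m;

	/* grab the whole hold queue at once, then send it */
	mq_delist(mq, &ml);
	while ((m = ml_dequeue(&ml)) != NULL)
		ifp->if_output(ifp, m, dst, rt);
}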
|
|
checksum may be wrong. Locally generated packets diverted by pf
out rules may have no checksum due to hardware offloading.
Calculate the checksum in that case.
OK mvs@ sashan@
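A sketch of the fix: resolve the pending checksum flags in
software, since no hardware will see the diverted packet; a NULL
interface forces the software path:

	in_proto_cksum_out(m, NULL);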
|
|
milliseconds, which is the same unit as tcp_now(). However, keep the
unit of sysctl variables in seconds and convert their unit in
tcp_sysctl(). Additionally revert TCPTV_SRTTDFLT back to 3 seconds,
which was mistakenly changed to 1.5 seconds by tcp_timer.h 1.19.
ok claudio
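A sketch of the conversion in tcp_sysctl(), using tcp_keepidle
as an example; the exact variable handling is assumed:

	int secs, error;

	secs = tcp_keepidle / 1000;		/* ms -> s for export */
	error = sysctl_int(oldp, oldlenp, newp, newlen, &secs);
	if (error == 0)
		tcp_keepidle = secs * 1000;	/* s -> ms internally */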
|
|
by using a bad option length. This bug is only reachable if both
the pf IP option check is disabled and IP source routing is
enabled.
reported by @fuzzingrf Erg Noor
OK claudio@ deraadt@
|
|
ok miod@ millert@
|
|
This worked because the global head variable is zero-initialised,
but one must not rely on that.
OK mvs claudio
|
|
public header file. Makes debugging with special kernels easier.
|
|
with tweaks from mvs@, mpi@, dlg@, naddy@ and bluhm@
"go for it" deraadt@
ok naddy@, mvs@
|
|
No functional change.
|
|
No functional changes.
|
|
rwlock(9) acquisition.
Reported-by: syzbot+fbe3acb4886adeef31e0@syzkaller.appspotmail.com
|
|
easily repeatable ASSERT happens seconds after starting compiles over nfs.
|
|
with tweaks from mvs@, mpi@ and dlg@
ok mvs@, dlg@
|
|
receive buffer. As was done for the SS_CANTSENDMORE bit, the
definition is kept as is, but now these bits belong to the
`sb_state' of the receive buffer. `sb_state' is ORed with
`so_state' when socket data is exported to userland.
ok bluhm@
|
|
serialize arpcache() and arpresolve(). In fact, the net stack
already has sleep points, so the rwlock(9) is better here because
we avoid intersection with the rest of the kernel locked paths.
Also this new lock is assumed to be used for route layer
protection instead of the netlock.
Hrvoje Popovski had tested this diff and found no visible performance
impact.
ok bluhm@
|
|
This time, the socket's buffer lock requires solock() to be
held. As a part of the socket buffers standalone locking work,
move the socket state bits which represent the buffers' state to
per-buffer state.
Unlike the previously reverted diff, the SS_CANTSENDMORE
definition is left as is, but it is used only with `sb_state'.
`sb_state' is ORed with the original `so_state' when socket data
is exported to userland, so the ABI is kept as it was.
Inputs from deraadt@.
ok bluhm@
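A sketch of the export path that keeps the ABI, using the field
names from this diff:

	/* userland still sees SS_CANTSENDMORE in so_state */
	short state = so->so_state | so->so_snd.sb_state;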
|
|
listen port is not bound to port 0. With a matching pf divert-to
rule this assumption is no longer true and could crash the
kernel with a kassert. In both pf and the stack drop TCP packets
with destination
port 0 before they can do harm.
OK sashan@ claudio@
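The check itself is roughly a one-liner in both places:

	/*
	 * destination port 0 packets cannot belong to a socket;
	 * drop them before the lookup can trip the assertion
	 */
	if (th->th_dport == 0)
		goto drop;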
|
|
New warning -Warray-parameter is a bit overzealous.
ok millert@ tb@
|
|
The TCP timer is not supposed to run during suspend, but
getnsecuptime() keeps counting, and because of this sessions
with TCP_KEEPALIVE enabled reset after a few hours of sleep.
Problem noticed by mlarkin@, investigation by yasuoka@ additional testing jca@
OK yasuoka@ jca@ cheloha@
|
|
|
|
socket buffers standalone locking work, move the socket state
bits which represent the buffers' state to per-buffer state.
Introduce `sb_state' and turn SS_CANTSENDMORE into
SBS_CANTSENDMORE. This bit will be processed on the `so_snd'
buffer only.
Move the SS_CANTRCVMORE and SS_RCVATMARK bits in a separate diff
to make review easier and exclude possible so_rcv/so_snd
mistypes.
Also, don't adjust the remaining SS_* bits right now.
ok millert@
|