The discovery DHT contains a number of hosts with LAN and loopback IPs.
These get relayed because some implementations do not perform any checks
on the IP.
go-ethereum already prevented relay in most cases because it verifies
that the host actually exists before adding it to the local table. But
this verification causes other issues. We have received several reports
where people's VPSs got shut down by hosting providers because sending
packets to random LAN hosts is indistinguishable from a slow port scan.
The new check prevents sending random packets to LAN by discarding LAN
IPs sent by Internet hosts (and loopback IPs from LAN and Internet
hosts). The new check also blacklists almost all currently registered
special-purpose networks assigned by IANA to avoid inciting random
responses from services in the LAN.
As another precaution against abuse of the DHT, ports below 1024 are now
considered invalid.
On Windows, UDPConn.ReadFrom returns an error for packets larger
than the receive buffer. The error is not marked temporary, causing
our loop to exit when the first oversized packet arrived. The fix
is to treat this particular error as temporary.
Fixes: #1579, #2087
Updates: #2082
nodeDB.querySeeds was not safe for concurrent use but could be called
concurrenty on multiple goroutines in the following case:
- the table was empty
- a timed refresh started
- a lookup was started and initiated refresh
These conditions are unlikely to coincide during normal use, but are
much more likely to occur all at once when the user's machine just woke
from sleep. The root cause of the issue is that querySeeds reused the
same leveldb iterator until it was exhausted.
This commit moves the refresh scheduling logic into its own goroutine
(so only one refresh is ever active) and changes querySeeds to not use
a persistent iterator. The seed node selection is now more random and
ignores nodes that have not been contacted in the last 5 days.
PR #1621 changed Table locking so the mutex is not held while a
contested node is being pinged. If multiple nodes ping the local node
during this time window, multiple ping packets will be sent to the
contested node. The changes in this commit prevent multiple packets by
tracking whether the node is being replaced.
If the timeout fired (even just nanoseconds) before the deadline of the
next pending reply, the timer was not rescheduled. The timer would've
been rescheduled anyway once the next packet was sent, but there were
cases where no next packet could ever be sent due to the locking issue
fixed in the previous commit.
As timing-related bugs go, this issue had been present for a long time
and I could never reproduce it. The test added in this commit did
reproduce the issue on about one out of 15 runs.
Table.mutex was being held while waiting for a reply packet, which
effectively made many parts of the whole stack block on that packet,
including the net_peerCount RPC call.
We don't have a UDP which specifies any messages that will be 4KB. Aside from being implemented for months and a necessity for encryption and piggy-backing packets, 1280bytes is ideal, and, means this TODO can be completed!
Why 1280 bytes?
* It's less than the default MTU for most WAN/LAN networks. That means fewer fragmented datagrams (esp on well-connected networks).
* Fragmented datagrams and dropped packets suck and add latency while OS waits for a dropped fragment to never arrive (blocking readLoop())
* Most of our packets are < 1280 bytes.
* 1280 bytes is minimum datagram size and MTU for IPv6 -- on IPv6, a datagram < 1280bytes will *never* be fragmented.
UDP datagrams are dropped. A lot! And fragmented datagrams are worse. If a datagram has a 30% chance of being dropped, then a fragmented datagram has a 60% chance of being dropped. More importantly, we have signed packets and can't do anything with a packet unless we receive the entire datagram because the signature can't be verified. The same is true when we have encrypted packets.
So the solution here to picking an ideal buffer size for receiving datagrams is a number under 1400bytes. And the lower-bound value for IPv6 of 1280 bytes make's it a non-decision. On IPv4 most ISPs and 3g/4g/let networks have an MTU just over 1400 -- and *never* over 1500. Never -- that means packets over 1500 (in reality: ~1450) bytes are fragmented. And probably dropped a lot.
Just to prove the point, here are pings sending non-fragmented packets over wifi/ISP, and a second set of pings via cell-phone tethering. It's important to note that, if *any* router between my system and the EC2 node has a lower MTU, the message would not go through:
On wifi w/normal ISP:
localhost:Debug $ ping -D -s 1450 52.6.250.242
PING 52.6.250.242 (52.6.250.242): 1450 data bytes
1458 bytes from 52.6.250.242: icmp_seq=0 ttl=42 time=104.831 ms
1458 bytes from 52.6.250.242: icmp_seq=1 ttl=42 time=119.004 ms
^C
--- 52.6.250.242 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 104.831/111.918/119.004/7.087 ms
localhost:Debug $ ping -D -s 1480 52.6.250.242
PING 52.6.250.242 (52.6.250.242): 1480 data bytes
ping: sendto: Message too long
ping: sendto: Message too long
Request timeout for icmp_seq 0
ping: sendto: Message too long
Request timeout for icmp_seq 1
Tethering to O2:
localhost:Debug $ ping -D -s 1480 52.6.250.242
PING 52.6.250.242 (52.6.250.242): 1480 data bytes
ping: sendto: Message too long
ping: sendto: Message too long
Request timeout for icmp_seq 0
^C
--- 52.6.250.242 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
localhost:Debug $ ping -D -s 1450 52.6.250.242
PING 52.6.250.242 (52.6.250.242): 1450 data bytes
1458 bytes from 52.6.250.242: icmp_seq=0 ttl=42 time=107.844 ms
1458 bytes from 52.6.250.242: icmp_seq=1 ttl=42 time=105.127 ms
1458 bytes from 52.6.250.242: icmp_seq=2 ttl=42 time=120.483 ms
1458 bytes from 52.6.250.242: icmp_seq=3 ttl=42 time=102.136 ms