p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
// Copyright 2024 The go-ethereum Authors
|
|
|
|
// This file is part of the go-ethereum library.
|
|
|
|
//
|
|
|
|
// The go-ethereum library is free software: you can redistribute it and/or modify
|
|
|
|
// it under the terms of the GNU Lesser General Public License as published by
|
|
|
|
// the Free Software Foundation, either version 3 of the License, or
|
|
|
|
// (at your option) any later version.
|
|
|
|
//
|
|
|
|
// The go-ethereum library is distributed in the hope that it will be useful,
|
|
|
|
// but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
|
|
// GNU Lesser General Public License for more details.
|
|
|
|
//
|
|
|
|
// You should have received a copy of the GNU Lesser General Public License
|
|
|
|
// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.
|
|
|
|
|
|
|
|
package discover
|
|
|
|
|
|
|
|
import (
|
|
|
|
"fmt"
|
|
|
|
"math"
|
|
|
|
"slices"
|
|
|
|
"time"
|
|
|
|
|
|
|
|
"github.com/ethereum/go-ethereum/common/mclock"
|
|
|
|
"github.com/ethereum/go-ethereum/p2p/enode"
|
|
|
|
)
|
|
|
|
|
|
|
|
const never = mclock.AbsTime(math.MaxInt64)
|
|
|
|
|
2024-05-28 22:30:17 +03:00
|
|
|
const slowRevalidationFactor = 3
|
|
|
|
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
// tableRevalidation implements the node revalidation process.
|
|
|
|
// It tracks all nodes contained in Table, and schedules sending PING to them.
|
|
|
|
type tableRevalidation struct {
|
|
|
|
fast revalidationList
|
|
|
|
slow revalidationList
|
|
|
|
activeReq map[enode.ID]struct{}
|
|
|
|
}
|
|
|
|
|
|
|
|
type revalidationResponse struct {
|
2024-05-29 16:02:26 +03:00
|
|
|
n *tableNode
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
newRecord *enode.Node
|
|
|
|
didRespond bool
|
|
|
|
}
|
|
|
|
|
|
|
|
func (tr *tableRevalidation) init(cfg *Config) {
|
|
|
|
tr.activeReq = make(map[enode.ID]struct{})
|
|
|
|
tr.fast.nextTime = never
|
|
|
|
tr.fast.interval = cfg.PingInterval
|
|
|
|
tr.fast.name = "fast"
|
|
|
|
tr.slow.nextTime = never
|
2024-05-28 22:30:17 +03:00
|
|
|
tr.slow.interval = cfg.PingInterval * slowRevalidationFactor
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
tr.slow.name = "slow"
|
|
|
|
}
|
|
|
|
|
|
|
|
// nodeAdded is called when the table receives a new node.
|
2024-05-29 16:02:26 +03:00
|
|
|
func (tr *tableRevalidation) nodeAdded(tab *Table, n *tableNode) {
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
tr.fast.push(n, tab.cfg.Clock.Now(), &tab.rand)
|
|
|
|
}
|
|
|
|
|
|
|
|
// nodeRemoved is called when a node was removed from the table.
|
2024-05-29 16:02:26 +03:00
|
|
|
func (tr *tableRevalidation) nodeRemoved(n *tableNode) {
|
2024-05-28 19:13:03 +03:00
|
|
|
if n.revalList == nil {
|
|
|
|
panic(fmt.Errorf("removed node %v has nil revalList", n.ID()))
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
2024-05-28 19:13:03 +03:00
|
|
|
n.revalList.remove(n)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
|
2024-05-28 22:30:17 +03:00
|
|
|
// nodeEndpointChanged is called when a change in IP or port is detected.
|
2024-05-29 16:02:26 +03:00
|
|
|
func (tr *tableRevalidation) nodeEndpointChanged(tab *Table, n *tableNode) {
|
2024-05-28 22:30:17 +03:00
|
|
|
n.isValidatedLive = false
|
|
|
|
tr.moveToList(&tr.fast, n, tab.cfg.Clock.Now(), &tab.rand)
|
|
|
|
}
|
|
|
|
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
// run performs node revalidation.
|
|
|
|
// It returns the next time it should be invoked, which is used in the Table main loop
|
|
|
|
// to schedule a timer. However, run can be called at any time.
|
|
|
|
func (tr *tableRevalidation) run(tab *Table, now mclock.AbsTime) (nextTime mclock.AbsTime) {
|
2024-07-31 22:38:23 +03:00
|
|
|
reval := func(list *revalidationList) {
|
|
|
|
if list.nextTime <= now {
|
|
|
|
if n := list.get(now, &tab.rand, tr.activeReq); n != nil {
|
|
|
|
tr.startRequest(tab, n)
|
|
|
|
}
|
|
|
|
// Update nextTime regardless if any requests were started because
|
|
|
|
// current value has passed.
|
|
|
|
list.schedule(now, &tab.rand)
|
|
|
|
}
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
2024-07-31 22:38:23 +03:00
|
|
|
reval(&tr.fast)
|
|
|
|
reval(&tr.slow)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
|
|
|
|
return min(tr.fast.nextTime, tr.slow.nextTime)
|
|
|
|
}
|
|
|
|
|
|
|
|
// startRequest spawns a revalidation request for node n.
|
2024-05-29 16:02:26 +03:00
|
|
|
func (tr *tableRevalidation) startRequest(tab *Table, n *tableNode) {
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
if _, ok := tr.activeReq[n.ID()]; ok {
|
2024-05-28 19:13:03 +03:00
|
|
|
panic(fmt.Errorf("duplicate startRequest (node %v)", n.ID()))
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
tr.activeReq[n.ID()] = struct{}{}
|
2024-05-28 19:13:03 +03:00
|
|
|
resp := revalidationResponse{n: n}
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
|
|
|
|
// Fetch the node while holding lock.
|
|
|
|
tab.mutex.Lock()
|
|
|
|
node := n.Node
|
|
|
|
tab.mutex.Unlock()
|
|
|
|
|
|
|
|
go tab.doRevalidate(resp, node)
|
|
|
|
}
|
|
|
|
|
|
|
|
func (tab *Table) doRevalidate(resp revalidationResponse, node *enode.Node) {
|
|
|
|
// Ping the selected node and wait for a pong response.
|
|
|
|
remoteSeq, err := tab.net.ping(node)
|
|
|
|
resp.didRespond = err == nil
|
|
|
|
|
|
|
|
// Also fetch record if the node replied and returned a higher sequence number.
|
|
|
|
if remoteSeq > node.Seq() {
|
|
|
|
newrec, err := tab.net.RequestENR(node)
|
|
|
|
if err != nil {
|
|
|
|
tab.log.Debug("ENR request failed", "id", node.ID(), "err", err)
|
|
|
|
} else {
|
|
|
|
resp.newRecord = newrec
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
select {
|
|
|
|
case tab.revalResponseCh <- resp:
|
|
|
|
case <-tab.closed:
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// handleResponse processes the result of a revalidation request.
|
|
|
|
func (tr *tableRevalidation) handleResponse(tab *Table, resp revalidationResponse) {
|
2024-05-28 19:13:03 +03:00
|
|
|
var (
|
|
|
|
now = tab.cfg.Clock.Now()
|
|
|
|
n = resp.n
|
|
|
|
b = tab.bucket(n.ID())
|
|
|
|
)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
delete(tr.activeReq, n.ID())
|
|
|
|
|
2024-05-28 19:13:03 +03:00
|
|
|
// If the node was removed from the table while getting checked, we need to stop
|
|
|
|
// processing here to avoid re-adding it.
|
|
|
|
if n.revalList == nil {
|
|
|
|
return
|
|
|
|
}
|
|
|
|
|
|
|
|
// Store potential seeds in database.
|
|
|
|
// This is done via defer to avoid holding Table lock while writing to DB.
|
|
|
|
defer func() {
|
|
|
|
if n.isValidatedLive && n.livenessChecks > 5 {
|
|
|
|
tab.db.UpdateNode(resp.n.Node)
|
|
|
|
}
|
|
|
|
}()
|
|
|
|
|
|
|
|
// Remaining logic needs access to Table internals.
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
tab.mutex.Lock()
|
|
|
|
defer tab.mutex.Unlock()
|
|
|
|
|
|
|
|
if !resp.didRespond {
|
|
|
|
n.livenessChecks /= 3
|
|
|
|
if n.livenessChecks <= 0 {
|
|
|
|
tab.deleteInBucket(b, n.ID())
|
|
|
|
} else {
|
2024-05-28 22:30:17 +03:00
|
|
|
tab.log.Debug("Node revalidation failed", "b", b.index, "id", n.ID(), "checks", n.livenessChecks, "q", n.revalList.name)
|
2024-05-28 19:13:03 +03:00
|
|
|
tr.moveToList(&tr.fast, n, now, &tab.rand)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
return
|
|
|
|
}
|
|
|
|
|
|
|
|
// The node responded.
|
|
|
|
n.livenessChecks++
|
|
|
|
n.isValidatedLive = true
|
2024-05-28 22:30:17 +03:00
|
|
|
tab.log.Debug("Node revalidated", "b", b.index, "id", n.ID(), "checks", n.livenessChecks, "q", n.revalList.name)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
var endpointChanged bool
|
|
|
|
if resp.newRecord != nil {
|
2024-05-28 22:30:17 +03:00
|
|
|
_, endpointChanged = tab.bumpInBucket(b, resp.newRecord, false)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
|
2024-05-28 22:30:17 +03:00
|
|
|
// Node moves to slow list if it passed and hasn't changed.
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
if !endpointChanged {
|
2024-05-28 19:13:03 +03:00
|
|
|
tr.moveToList(&tr.slow, n, now, &tab.rand)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-05-28 19:13:03 +03:00
|
|
|
// moveToList ensures n is in the 'dest' list.
|
2024-05-29 16:02:26 +03:00
|
|
|
func (tr *tableRevalidation) moveToList(dest *revalidationList, n *tableNode, now mclock.AbsTime, rand randomSource) {
|
2024-05-28 19:13:03 +03:00
|
|
|
if n.revalList == dest {
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
return
|
|
|
|
}
|
2024-05-28 19:13:03 +03:00
|
|
|
if n.revalList != nil {
|
|
|
|
n.revalList.remove(n)
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
dest.push(n, now, rand)
|
|
|
|
}
|
|
|
|
|
|
|
|
// revalidationList holds a list nodes and the next revalidation time.
|
|
|
|
type revalidationList struct {
|
2024-05-29 16:02:26 +03:00
|
|
|
nodes []*tableNode
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
nextTime mclock.AbsTime
|
|
|
|
interval time.Duration
|
|
|
|
name string
|
|
|
|
}
|
|
|
|
|
|
|
|
// get returns a random node from the queue. Nodes in the 'exclude' map are not returned.
|
2024-05-29 16:02:26 +03:00
|
|
|
func (list *revalidationList) get(now mclock.AbsTime, rand randomSource, exclude map[enode.ID]struct{}) *tableNode {
|
2024-07-31 22:38:23 +03:00
|
|
|
if len(list.nodes) == 0 {
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
return nil
|
|
|
|
}
|
|
|
|
for i := 0; i < len(list.nodes)*3; i++ {
|
|
|
|
n := list.nodes[rand.Intn(len(list.nodes))]
|
|
|
|
_, excluded := exclude[n.ID()]
|
|
|
|
if !excluded {
|
|
|
|
return n
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return nil
|
|
|
|
}
|
|
|
|
|
|
|
|
func (list *revalidationList) schedule(now mclock.AbsTime, rand randomSource) {
|
|
|
|
list.nextTime = now.Add(time.Duration(rand.Int63n(int64(list.interval))))
|
|
|
|
}
|
|
|
|
|
2024-05-29 16:02:26 +03:00
|
|
|
func (list *revalidationList) push(n *tableNode, now mclock.AbsTime, rand randomSource) {
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
list.nodes = append(list.nodes, n)
|
|
|
|
if list.nextTime == never {
|
|
|
|
list.schedule(now, rand)
|
|
|
|
}
|
2024-05-28 19:13:03 +03:00
|
|
|
n.revalList = list
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
|
2024-05-29 16:02:26 +03:00
|
|
|
func (list *revalidationList) remove(n *tableNode) {
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
i := slices.Index(list.nodes, n)
|
|
|
|
if i == -1 {
|
2024-05-28 19:13:03 +03:00
|
|
|
panic(fmt.Errorf("node %v not found in list", n.ID()))
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|
|
|
|
list.nodes = slices.Delete(list.nodes, i, i+1)
|
|
|
|
if len(list.nodes) == 0 {
|
|
|
|
list.nextTime = never
|
|
|
|
}
|
2024-05-28 19:13:03 +03:00
|
|
|
n.revalList = nil
|
|
|
|
}
|
|
|
|
|
|
|
|
func (list *revalidationList) contains(id enode.ID) bool {
|
2024-05-29 16:02:26 +03:00
|
|
|
return slices.ContainsFunc(list.nodes, func(n *tableNode) bool {
|
2024-05-28 19:13:03 +03:00
|
|
|
return n.ID() == id
|
|
|
|
})
|
p2p/discover: improved node revalidation (#29572)
Node discovery periodically revalidates the nodes in its table by sending PING, checking
if they are still alive. I recently noticed some issues with the implementation of this
process, which can cause strange results such as nodes dropping unexpectedly, certain
nodes not getting revalidated often enough, and bad results being returned to incoming
FINDNODE queries.
In this change, the revalidation process is improved with the following logic:
- We maintain two 'revalidation lists' containing the table nodes, named 'fast' and 'slow'.
- The process chooses random nodes from each list on a randomized interval, the interval being
faster for the 'fast' list, and performs revalidation for the chosen node.
- Whenever a node is newly inserted into the table, it goes into the 'fast' list.
Once validation passes, it transfers to the 'slow' list. If a request fails, or the
node changes endpoint, it transfers back into 'fast'.
- livenessChecks is incremented by one for successful checks. Unlike the old implementation,
we will not drop the node on the first failing check. We instead quickly decay the
livenessChecks give it another chance.
- Order of nodes in bucket doesn't matter anymore.
I am also adding a debug API endpoint to dump the node table content.
Co-authored-by: Martin HS <martin@swende.se>
2024-05-23 15:26:09 +03:00
|
|
|
}
|