web3-proxy/TODO.md at a5e324a6925da4057beba6b0d6e636e8358e1bd1

if web3 proxy gets an http error back, retry another node

refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions

refactor Connections::spawn. have it return a handle that is selecting on those handles?

support websocket clients

we support websockets for the backends already, but we need them for the frontend too
when block subscribers receive blocks, store them in a cache. use this cache instead of querying eth_getBlock
have a /ws endpoint (figure out how to route on / later)
inspect any jsonrpc errors. if it's something like "header not found" or "block with id $x not found", retry on another node (and add a negative score to that server)
- this error seems to happen when we use load balanced rpcs
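
a minimal sketch of the error inspection above, std only. `is_retryable_rpc_error` and `ServerScore` are hypothetical names, not part of the current code, and the matching is plain substring checks on the two messages quoted above. real backends may word these errors differently.

```rust
/// Hypothetical helper: does this JSON-RPC error message look like the
/// backend is missing a block, so the request should go to another server?
fn is_retryable_rpc_error(message: &str) -> bool {
    let msg = message.to_ascii_lowercase();
    msg.contains("header not found")
        || (msg.starts_with("block with id") && msg.ends_with("not found"))
}

/// Hypothetical per-server score. A higher penalty means the server
/// should be picked less often.
#[derive(Debug, Default)]
struct ServerScore {
    penalty: u32,
}

impl ServerScore {
    /// Records the error and returns true if the caller should retry
    /// the request on another server.
    fn record_error(&mut self, message: &str) -> bool {
        let retry = is_retryable_rpc_error(message);
        if retry {
            self.penalty = self.penalty.saturating_add(1);
        }
        retry
    }
}
```

errors like "execution reverted" are not retried here, since another node should give the same answer.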

use redis and redis-cell for rate limits

if we don't cache errors, then in-flight request caching is going to bottleneck

some production configs are occasionally stuck waiting at 100% cpu

they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce it. i just let it run for a while and it happens.
running gdb shows the tokio tungstenite thread is spinning near 100% cpu and none of the rest of the program is proceeding
fixed by https://github.com/gakonst/ethers-rs/pull/1287

improve caching

if the eth_call (or similar) params include a block, we can cache for longer
if the call is something simple like "symbol" or "decimals", cache that too
when we receive a block, we should store it for later eth_getBlockByNumber, eth_blockNumber, and similar calls
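
a rough sketch of the block cache idea, std only. `BlockCache` is a hypothetical type, blocks are stored as already-serialized json strings keyed by block number, and the oldest block is evicted first. a real version would also need to handle reorgs (the same number with a different hash), which this ignores.

```rust
use std::collections::BTreeMap;

/// Hypothetical bounded cache of recent blocks, keyed by block number.
struct BlockCache {
    max_blocks: usize,
    blocks: BTreeMap<u64, String>,
}

impl BlockCache {
    fn new(max_blocks: usize) -> Self {
        Self { max_blocks, blocks: BTreeMap::new() }
    }

    /// Called when a block subscriber receives a new block.
    fn insert(&mut self, number: u64, block_json: String) {
        self.blocks.insert(number, block_json);
        // Evict the lowest block numbers until we are back under the limit.
        while self.blocks.len() > self.max_blocks {
            if let Some(oldest) = self.blocks.keys().next().copied() {
                self.blocks.remove(&oldest);
            }
        }
    }

    /// Could answer eth_blockNumber without a backend round trip.
    fn head_number(&self) -> Option<u64> {
        self.blocks.keys().next_back().copied()
    }

    /// Could answer eth_getBlockByNumber for recent blocks.
    fn get(&self, number: u64) -> Option<&String> {
        self.blocks.get(&number)
    }
}
```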

eth_sendRawTransaction should return the most common result, not the first
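
one possible shape for picking the most common result, std only. `most_common_result` is a hypothetical helper that takes the raw result strings from each backend. breaking ties by the earliest response is an assumption, not something decided above.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns the result that the most backends agreed on.
/// Ties go to whichever result was seen first.
fn most_common_result(results: &[String]) -> Option<&String> {
    // Map each distinct result to (count, index of first appearance).
    let mut counts: HashMap<&String, (usize, usize)> = HashMap::new();
    for (index, result) in results.iter().enumerate() {
        let entry = counts.entry(result).or_insert((0, index));
        entry.0 += 1;
    }

    counts
        .into_iter()
        // Highest count wins; on equal counts the smaller first index wins.
        .max_by(|(_, (count_a, first_a)), (_, (count_b, first_b))| {
            count_a.cmp(count_b).then(first_b.cmp(first_a))
        })
        .map(|(result, _)| result)
}
```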

if chain split detected, don't send transactions?

if a rpc fails to connect at start, retry later instead of skipping it forever

endpoint for health checks. if no synced servers, give a 502 error

move from warp to axum?

proper logging with useful instrumentation

handle websocket disconnect and reconnect

warning if no blocks for too long. maybe reconnect automatically?

if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.

thundering herd problem if we only allow a lag of 0 blocks
we can fix this by only publishing the sorted list once a certain sync limit is reached

tarpit hard_ratelimit at the start, but reject if the incoming request rate is super high?

add the backend server to the header?

the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes?

think more about how multiple rpc tiers should work

we should have a "backup" tier that is only used when the primary tier has no servers or is multiple blocks behind. we don't want the backup tier taking over all the time. only if the primary tier has fallen behind or gone entirely offline
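
a sketch of how the tier choice could look, std only. `Tier`, `choose_tier`, and `max_lag` are hypothetical names. it assumes "behind" means behind the best head block seen on the backup tier, which is only one possible reading of the note above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Primary,
    Backup,
}

/// Hypothetical selection rule. Each slice holds the head block numbers of
/// the connected servers in that tier. The backup tier is only chosen when
/// the primary tier is empty or more than `max_lag` blocks behind it.
fn choose_tier(primary_heads: &[u64], backup_heads: &[u64], max_lag: u64) -> Option<Tier> {
    let primary_best = primary_heads.iter().copied().max();
    let backup_best = backup_heads.iter().copied().max();

    match (primary_best, backup_best) {
        (None, None) => None,
        (Some(_), None) => Some(Tier::Primary),
        (None, Some(_)) => Some(Tier::Backup),
        (Some(primary), Some(backup)) => {
            if backup.saturating_sub(primary) > max_lag {
                Some(Tier::Backup)
            } else {
                Some(Tier::Primary)
            }
        }
    }
}
```

with `max_lag` of 2, a primary at block 100 keeps serving while the backup is at 101, and only loses out once the backup reaches 103 or higher.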

if a request gets a socket timeout, try on another server

maybe always try at least two servers in parallel? and then return the first? or only if the first one doesn't respond very quickly?

incoming rate limiting (by ip or by api key or what?)

measure latency to nodes?

one proxy for multiple chains?

zero downtime deploys

are we using Acquire/Release/AcqRel properly? or do we need other modes?

subscription id should be per connection, not global
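
a small sketch of per-connection subscription ids, std only. `ClientConnection` is a hypothetical struct standing in for whatever state each frontend websocket holds. `Ordering::Relaxed` is used on the assumption that the counter only needs unique values and does not guard any other memory.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-client state. Each connection owns its own counter,
/// so ids start from 0 for every client instead of sharing a global one.
#[derive(Debug, Default)]
struct ClientConnection {
    subscription_counter: AtomicU64,
}

impl ClientConnection {
    /// Returns the next subscription id for this connection only.
    fn next_subscription_id(&self) -> u64 {
        self.subscription_counter.fetch_add(1, Ordering::Relaxed)
    }
}
```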

simple proxy

better locking. when lots of requests come in, we seem to be in the way of block updates

load balance between multiple RPC servers

support more than just ETH

option to disable private rpc and send everything to primary

health check nodes by block height

Dockerfile

docker-compose.yml

after connecting to a server, check that it gives the expected chainId
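
a sketch of the chainId check, std only. `eth_chainId` returns a 0x-prefixed hex string; `parse_chain_id` and `check_chain_id` are hypothetical helpers that work on that string after it has been pulled out of the jsonrpc response.

```rust
/// Parses a 0x-prefixed hex quantity such as "0x1" into a number.
fn parse_chain_id(hex: &str) -> Option<u64> {
    let digits = hex.strip_prefix("0x")?;
    // Reject empty input and anything that is not a plain hex digit.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    u64::from_str_radix(digits, 16).ok()
}

/// Hypothetical startup check: compare the server's reported chain id
/// against the one in our config.
fn check_chain_id(response: &str, expected: u64) -> Result<(), String> {
    match parse_chain_id(response) {
        Some(id) if id == expected => Ok(()),
        Some(id) => Err(format!("expected chain id {expected}, got {id}")),
        None => Err(format!("could not parse chain id from {response:?}")),
    }
}
```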

the ethermine rpc is usually fastest, but it's in the private tier. since we only allow synced rpcs, we are often going to have no rpc available

if no backends, return a 502 instead of delaying?
