# Todo

- [ ] quick requests/second timer until we have real stats
- [x] it works for a few seconds and then gets stuck on something.
  - [x] it's working with one backend node, but multiple breaks. something to do with pending transactions
  - [x] dashmap entry api is easy to deadlock! be careful with it!
- [ ] rpc errors propagate too far. one subscription failing ends the app. isolate the providers more
- [ ] ethers has a transactions_unsorted httprpc method that we should probably use. probably not all rpcs support it, so make it okay for that to fail
- [ ] if web3 proxy gets an http error back, retry another node
- [x] refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions
- [x] refactor Connections::spawn. have it return a handle that is selecting on those handles?
- [x] support websocket clients
  - we support websockets for the backends already, but we need them for the frontend too
  - [ ] when block subscribers receive blocks, store them in a cache. use this cache instead of querying eth_getBlock
  - [x] have a /ws endpoint (figure out how to route on / later)
  - inspect any jsonrpc errors. if it's something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
    - this error seems to happen when we use load balanced rpcs
- [x] use redis and redis-cell for rate limits
- [ ] if we don't cache errors, then in-flight request caching is going to bottleneck 
- [x] some production configs are occasionally stuck waiting at 100% cpu
  - they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
  - even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce. i just let it run for a while and it happens.
  - running gdb shows the tokio tungstenite thread is spinning near 100% cpu and none of the rest of the program is proceeding
  - fixed by https://github.com/gakonst/ethers-rs/pull/1287
- [ ] improve caching
  - [ ] if the eth_call (or similar) params include a block, we can cache for longer
  - [ ] if the call is something simple like "symbol" or "decimals", cache that too
  - [ ] when we receive a block, we should store it for later eth_getBlockByNumber, eth_blockNumber, and similar calls
- [x] eth_sendRawTransaction should return the most common result, not the first
- [ ] if chain split detected, don't send transactions?
- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever
- [ ] endpoint for health checks. if no synced servers, give a 502 error
- [x] move from warp to axum?
- [ ] proper logging with useful instrumentation
- [x] handle websocket disconnect and reconnect
- [ ] warning if no blocks for too long. maybe reconnect automatically?
- [ ] if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.
  - thundering herd problem if we only allow a lag of 0 blocks
  - we can fix this by only `publish`ing the sorted list once a certain sync limit is reached
- [ ] tarpit hard_ratelimit at the start, but reject if the incoming request rate is super high?
- [ ] add the backend server to the header?
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes?
- [ ] think more about how multiple rpc tiers should work
  - we should have a "backup" tier that is only used when the primary tier has no servers or is multiple blocks behind. we don't want the backup tier taking over all the time. only if the primary tier has fallen behind or gone entirely offline
- [ ] if a request gets a socket timeout, try on another server
  - maybe always try at least two servers in parallel? and then return the first? or only if the first one doesn't respond very quickly?
- [ ] incoming rate limiting (by ip or by api key or what?)
- [ ] measure latency to nodes?
- [ ] one proxy for multiple chains?
- [ ] zero downtime deploys
- [ ] are we using Acquire/Release/AcqRel properly? or do we need other modes?
- [ ] subscription id should be per connection, not global
- [ ] emit stats
- [x] simple proxy
- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest, but it's in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
- [x] if no backends, return a 502 instead of delaying?
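
a rough sketch of the "backup" tier rule from the rpc tiers item above: only fail over when the primary tier has no synced servers or has fallen multiple blocks behind. nothing here is from the real codebase. `TierStatus`, `Tier`, `pick_tier`, and the `max_lag` setting are all hypothetical names made up for illustration, and comparing the two tiers' head blocks is just one possible way to decide that the primary is "behind".

```rust
/// Hypothetical summary of one rpc tier. Not a type from the real codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TierStatus {
    /// how many servers in this tier are connected and synced
    pub synced_servers: usize,
    /// highest block number seen by any server in this tier
    pub head_block: u64,
}

/// Which tier should receive requests.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Primary,
    Backup,
}

/// Pick the tier to use, or `None` if neither tier has a synced server.
///
/// `max_lag` is an assumed config value: how many blocks the primary tier
/// may trail the backup tier before we fail over.
pub fn pick_tier(primary: TierStatus, backup: TierStatus, max_lag: u64) -> Option<Tier> {
    let primary_up = primary.synced_servers > 0;
    let backup_up = backup.synced_servers > 0;

    match (primary_up, backup_up) {
        (false, false) => None,
        (true, false) => Some(Tier::Primary),
        (false, true) => Some(Tier::Backup),
        (true, true) => {
            // saturating_sub yields 0 when the primary is ahead of the backup
            let lag = backup.head_block.saturating_sub(primary.head_block);

            if lag > max_lag {
                Some(Tier::Backup)
            } else {
                // the backup tier should not take over for small differences
                Some(Tier::Primary)
            }
        }
    }
}
```

with `max_lag = 1`, a primary tier that is one block behind keeps serving traffic, and the backup tier only takes over at two or more blocks behind or when the primary has no synced servers at all.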