
# Todo

## MVP

- [x] simple proxy
- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] support websocket clients
  - we support websockets for the backends already, but we need them for the frontend too
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest, but it's in the private tier. since we only allow synced rpcs, we often won't have an rpc available
- [x] if no backends, return a 502 instead of delaying?
- [x] move from warp to axum
- [x] handle websocket disconnect and reconnect
- [x] eth_sendRawTransaction should return the most common result, not the first
- [x] use redis and redis-cell for rate limits
- [x] it works for a few seconds and then gets stuck on something.
  - [x] it's working with one backend node, but multiple breaks. something to do with pending transactions
  - [x] dashmap entry api is easy to deadlock! be careful with it!
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes?
- [x] refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions
- [x] refactor Connections::spawn. have it return a handle that is selecting on those handles?
- [x] some production configs are occasionally stuck waiting at 100% cpu
  - they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
  - even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce. i just let it run for a while and it happens.
  - running gdb shows the tokio tungstenite thread is spinning near 100% cpu and none of the rest of the program is proceeding
  - fixed by https://github.com/gakonst/ethers-rs/pull/1287
- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
- [x] if web3 proxy gets an http error back, retry another node
- [x] endpoint for health checks. if no synced servers, give a 502 error
- [x] rpc errors propagate too far. one subscription failing ends the app. isolate the providers more (might already be fixed)
- [x] incoming rate limiting (by ip)
- [x] connection pool for redis
- [x] automatically route to archive server when necessary
  - originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
  - when getting the next server, filtering on "archive" isn't going to work well. need to check inner instead
  - [ ] this works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers. maybe have a "max block data limit"
- [x] if the requested block is ahead of the best block, return without querying any backend servers
- [ ] basic request method stats
- [x] http servers should check block at the very start
- [ ] if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.
  - thundering herd problem if we only allow a lag of 0 blocks
  - we can fix this by only `publish`ing the sorted list once a threshold of total soft limits is passed
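
a rough sketch of the soft limit threshold idea above, assuming each server reports a head block and a soft limit. `ServerStatus`, `consensus_head`, and `min_sum_soft_limit` are made-up names for illustration, not the proxy's actual types:

```rust
/// Hypothetical view of one backend server. Field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ServerStatus {
    pub name: String,
    pub head_block: u64,
    pub soft_limit: u32,
}

/// Returns the newest block that servers with a combined soft limit of at
/// least `min_sum_soft_limit` have reached, plus every server at or past it.
/// Returns `None` when the threshold cannot be met at any block.
pub fn consensus_head(
    servers: &[ServerStatus],
    min_sum_soft_limit: u32,
) -> Option<(u64, Vec<&ServerStatus>)> {
    let mut sorted: Vec<&ServerStatus> = servers.iter().collect();
    // newest head block first
    sorted.sort_by(|a, b| b.head_block.cmp(&a.head_block));

    // walk from newest to oldest, adding capacity until the threshold is met
    let mut sum: u64 = 0;
    let mut block = None;
    for server in &sorted {
        sum += u64::from(server.soft_limit);
        if sum >= u64::from(min_sum_soft_limit) {
            block = Some(server.head_block);
            break;
        }
    }
    let block = block?;

    // everything at or ahead of the chosen block counts as synced
    let synced = sorted
        .into_iter()
        .filter(|server| server.head_block >= block)
        .collect();

    Some((block, synced))
}
```

with a threshold above the fastest server's own soft limit, a single fast server should not be able to move the published head on its own, so it hitting its rate limit would not leave us with zero synced servers.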

## V1

- [ ] refactor so configs can change while running
  - create the app without applying any config to it
  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
- [ ] have a "backup" tier that is only used when the primary tier has no servers or is multiple blocks behind. we don't want the backup tier taking over with the head block if they happen to be fast at that (but overall low/expensive rps). only if the primary tier has fallen behind or gone entirely offline should we go to third parties
- [ ] most things that are cached locally should probably be in shared redis caches
- [ ] stats when forks are resolved (and what chain they were on?)
- [ ] incoming rate limiting (by api key)
- [ ] failsafe. if no blocks or transactions in the last second, warn and reset the connection
- [ ] if we don't cache errors, then in-flight request caching is going to bottleneck 
  - i think now that we retry header not found and similar, caching errors should be fine
- [x] if the eth_call (or similar) params include a block, we can cache for that
- [x] when block subscribers receive blocks, store them in a block_map
- [ ] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
- [x] eth_blockNumber without a backend request
- [ ] eth_getBlockByNumber and similar calls served from the block map
- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever
- [x] inspect any jsonrpc errors. if it's something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
  - this error seems to happen when we use load balanced backend rpcs like pokt and ankr
- [ ] emit stats for successes, retries, failures, with the types of requests, account, chain, rpc
- [ ] handle log subscriptions
- [x] if we send a transaction to private rpcs and then people query it on public rpcs, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
  - [x] send getTransaction rpc requests to the private rpc tier
  - [ ] right now we send too many to the private rpc tier and i think we are being rate limited. change to send serially and weight by soft limit.
- [ ] improved logging with useful instrumentation
- [ ] don't "unwrap" anywhere. give proper errors
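
a minimal sketch of the "retry on another node and add a negative score" item above. `Backend`, `is_retryable`, and `request_with_retry` are hypothetical names, and the `send` closure stands in for the real request. the matched strings are only the two examples from the todo, not a complete list:

```rust
/// Hypothetical backend handle. `score` drops when the server returns an
/// error that looks like it is behind or load balanced inconsistently.
#[derive(Debug)]
pub struct Backend {
    pub name: String,
    pub score: i64,
}

/// True for error messages that usually mean "this node has not seen the
/// block yet", so another node may still answer correctly.
pub fn is_retryable(message: &str) -> bool {
    let m = message.to_ascii_lowercase();
    m.contains("header not found")
        || (m.contains("block with id") && m.contains("not found"))
}

/// Try each backend in order. `send` takes a backend name and returns the
/// response, or the jsonrpc error message on failure.
pub fn request_with_retry<F>(backends: &mut [Backend], mut send: F) -> Result<String, String>
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut last_err = String::from("no backends available");

    for backend in backends.iter_mut() {
        match send(&backend.name) {
            Ok(response) => return Ok(response),
            Err(message) if is_retryable(&message) => {
                // penalize this server and move on to the next one
                backend.score -= 1;
                last_err = message;
            }
            // anything else (reverts, bad params) is returned as-is
            Err(message) => return Err(message),
        }
    }

    Err(last_err)
}
```

errors that are not retryable are returned immediately, so a revert is not sent to every backend.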

new endpoints for users:
- think about where to put this. a separate app might be better, especially so we don't get cloned too easily. open source code could just have a cli tool for managing users
- [ ] GET /user/login/$address
  - returns a JSON string for the user to sign
- [ ] POST /user/login/$address
  - returns a JSON string including the api key
  - sets session cookie
- [ ] GET /user/$address
  - checks for api key in session cookie or header
  - returns a JSON string including user stats
    - balance in USD
    - deposits history (currency, amounts, transaction id)
    - number of requests used (so we can calculate average spending over a month, burn rate for a user etc, something like "Your balance will be depleted in xx days")
    - the email address of a user if they opted in to get contacted via email
    - all the success/retry/fail counts and latencies (but that may be better coming from somewhere else)
- [ ] POST /user/$address
  - opt-in to link an email address
  - checks for api key in session cookie or header
  - allows modifying user settings
- [ ] GET /$api_key
  - proxies to web3 websocket
- [ ] POST /$api_key
  - proxies to web3
- [ ] POST /users/process_transaction
  - checks a transaction to see if it modifies a user's balance. records results in a sql database
  - we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp

in another repo: event subscriber
  - watch for transfer events to our contract and submit them to /payment/$tx_hash
  - also have a command line script that support can run to manually check and submit a transaction

## V2

- [ ] automated soft limit
  - look at average request time for getBlock? i'm not sure how good a proxy that will be for serving eth_call, but it's a start
- [ ] interval for http subscriptions should be based on block time. load from config is easy, but better to query. currently hard coded to 13 seconds
- [ ] more advanced automated soft limit
  - measure average latency of a node's responses and load balance on that
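
one way the latency measurement could look, as a sketch only. it uses an exponentially weighted moving average, which is an assumption on my part (the todo just says "average"). `LatencyEwma`, `fastest`, and `alpha` are made-up names:

```rust
use std::time::Duration;

/// Exponentially weighted moving average of response latency in milliseconds.
/// `alpha` near 1.0 reacts quickly to new samples, near 0.0 smooths more.
pub struct LatencyEwma {
    alpha: f64,
    avg_ms: Option<f64>,
}

impl LatencyEwma {
    pub fn new(alpha: f64) -> Self {
        Self { alpha, avg_ms: None }
    }

    /// Fold one measured response time into the average.
    pub fn record(&mut self, sample: Duration) {
        let ms = sample.as_secs_f64() * 1000.0;
        self.avg_ms = Some(match self.avg_ms {
            // the first sample seeds the average
            None => ms,
            Some(avg) => self.alpha * ms + (1.0 - self.alpha) * avg,
        });
    }

    pub fn average_ms(&self) -> Option<f64> {
        self.avg_ms
    }
}

/// Pick the node with the lowest average latency. Nodes without any
/// samples yet are skipped.
pub fn fastest(nodes: &[(String, LatencyEwma)]) -> Option<&str> {
    let mut best: Option<(&str, f64)> = None;
    for (name, ewma) in nodes {
        if let Some(ms) = ewma.average_ms() {
            if best.map_or(true, |(_, best_ms)| ms < best_ms) {
                best = Some((name.as_str(), ms));
            }
        }
    }
    best.map(|(name, _)| name)
}
```

always picking the single fastest node would probably overload it, so a real version would likely weight by latency instead of taking the minimum.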

## "Maybe some day" and other Miscellaneous Things

- [ ] instead of giving a rate limit error code, delay the connection's response at the start. reject if incoming requests is super high?
- [ ] add the backend server to the header?
- [ ] think more about how multiple rpc tiers should work
  - maybe always try at least two servers in parallel? and then return the first? or only if the first one doesn't respond very quickly? this doubles our request load though.
- [ ] one proxy for multiple chains?
- [ ] zero downtime deploys
- [ ] are we using Acquire/Release/AcqRel properly? or do we need other modes?
- [x] subscription id should be per connection, not global
- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
- [ ] subscribe to pending transactions and build an intelligent gas estimator
- [ ] include private rpcs with regular queries? i don't want to overwhelm them, but they could be good for excess load
- [ ] flashbots specific methods
  - [ ] flashbots protect fast mode or not? probably fast matches most users' needs, but no reverts is nice.
  - [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if it's set
- [ ] if no redis set, but public rate limits are set, exit with an error
- [ ] i saw "WebSocket connection closed unexpectedly" but no auto reconnect. need better logs on these
- [ ] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though it's a wasted connection
- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
  - we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
- [x] document load tests: `docker run --rm --name spam shazow/ethspam --rpc http://$LOCAL_IP:8544 | versus --concurrency=100 --stop-after=10000 http://$LOCAL_IP:8544; docker stop spam`
- [ ] if the call is something simple like "symbol" or "decimals", cache that too. though i think this could bite us.
- [ ] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
- [ ] add a subscription that returns the head block number and hash but nothing else
- [ ] if chain split detected, what should we do? don't send transactions?
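
a small sketch tying together the per-connection subscription id and the atomic ordering question above. `ConnectionState` is a hypothetical name and the hex string format is an assumption. for a counter that only hands out unique ids, `Ordering::Relaxed` is enough, because atomic read-modify-write operations on one variable never return the same value twice. Acquire/Release only matter when other memory has to be published along with the value:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-connection state. Each websocket connection would own
/// one of these, so ids restart for every connection instead of being global.
pub struct ConnectionState {
    next_subscription_id: AtomicU64,
}

impl ConnectionState {
    pub fn new() -> Self {
        Self {
            next_subscription_id: AtomicU64::new(1),
        }
    }

    /// Returns a new id, unique within this connection, as a hex string.
    pub fn next_subscription_id(&self) -> String {
        // Relaxed is fine here: we only need uniqueness of the counter
        // value, not ordering relative to any other memory access.
        let id = self.next_subscription_id.fetch_add(1, Ordering::Relaxed);
        format!("0x{:x}", id)
    }
}
```

two connections created this way each start at `0x1`, so ids only need to be unique per connection.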