web3-proxy/TODO.md

# Todo

## MVP

These are roughly in order of completion

- [x] simple proxy
- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] support websocket clients
  - we support websockets for the backends already, but we need them for the frontend too
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest. but its in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
- [x] if no backends, return a 502 instead of delaying?
- [x] move from warp to axum
- [x] handle websocket disconnect and reconnect
- [x] eth_sendRawTransaction should return the most common result, not the first
- [x] use redis and redis-cell for rate limits
- [x] it works for a few seconds and then gets stuck on something.
  - [x] its working with one backend node, but multiple breaks. something to do with pending transactions
  - [x] dashmap entry api is easy to deadlock! be careful with it!
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes
- [x] refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions
- [x] refactor Connections::spawn. have it return a handle that is selecting on those handles?
- [x] some production configs are occasionally stuck waiting at 100% cpu
  - they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
  - even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce. i just let it run for awhile and it happens.
  - running gdb shows the tokio tungstenite thread is spinning near 100% cpu and none of the rest of the program is proceeding
  - fixed by https://github.com/gakonst/ethers-rs/pull/1287
- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
- [x] if web3 proxy gets an http error back, retry another node
- [x] endpoint for health checks. if no synced servers, give a 502 error
- [x] rpc errors propagate too far. one subscription failing ends the app. isolate the providers more (might already be fixed)
- [x] incoming rate limiting (by ip)
- [x] connection pool for redis
- [x] automatically route to archive server when necessary
  - originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
  - when getting the next server, filtering on "archive" isn't going to work well. need to check inner instead
- [x] if the requested block is ahead of the best block, return without querying any backend servers
- [x] http servers should check block at the very start
- [x] subscription id should be per connection, not global
- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
  - we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
- [x] inspect any jsonrpc errors. if its something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
  - this error seems to happen when we use load balanced backend rpcs like pokt and ankr
- [x] RESPONSE_CACHE_CAP in bytes instead of number of entries
- [x] if we don't cache errors, then in-flight request caching is going to bottleneck 
  - i think now that we retry header not found and similar, caching errors should be fine
- [x] RESPONSE_CACHE_CAP from config
- [x] web3_sha3 rpc command
- [x] test that launches anvil and connects the proxy to it and does some basic queries
  - [x] need to have some sort of shutdown signaling. doesn't need to be graceful at this point, but should be eventually
- [x] if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.
  - thundering herd problem if we only allow a lag of 0 blocks
  - we can improve this by only publishing the synced connections once a threshold of total available soft and hard limits is passed. how can we do this without hammering redis? at least its only once per block per server
  - [x] instead of tracking `pending_synced_connections`, have a mapping of where all connections are individually. then each change, re-check for consensus.
- [x] synced connections swap threshold set to 1 so that it always serves something
- [x] cli tool for creating new users
- [x] incoming rate limiting by api key
- [x] sort forked blocks by total difficulty like geth does
- [x] refactor result type on active handlers to use a cleaner success/error so we can use the try operator
- [x] give users different rate limits looked up from the database 
- [x] Add a "weight" key to the servers. Sort on that after block. keep most requests local
- [x] cache db query results for user data. db is a big bottleneck right now
- [x] allow blocking public requests
- [x] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
- [x] im seeing ethspam occasionally try to query a future block. something must be setting the head block too early
  - [x] we were sorting best block the wrong direction. i flipped a.cmp(b) to b.cmp(a) so that the largest would be first, but then i used 'max_by' which looks at the end of the list
- [x] HTTP GET to the websocket endpoints should redirect instead of giving an ugly error
- [x] load the redirected page from config
- [x] prettier output for create_user command. need the key in hex
- [x] drop redis-cell in favor of a simpler (and faster) implementation. 
  - redis-cell was giving me weird errors and it isn't worth debugging it right now.
- [x] create user script should allow setting the api key
- [x] disable redis persistence in dev
- [x] attach a request id to every web request
- [x] attach user id (not IP!) to each request
- [x] fantom_1    | 2022-08-10T22:19:43.522465Z  WARN web3_proxy::jsonrpc: forwarding error err=missing field `jsonrpc` at line 1 column 60
  - [x] i think the server isn't following the spec. we need a context attached to more errors so we know which one
  - [x] make jsonrpc default to "2.0" (including the custom deserializer that handles the RawValues)
- [x] if the eth_call (or similar) params include a block, we can cache for that
- [x] when block subscribers receive blocks, store them in a block_map
- [x] eth_blockNumber without a backend request
- [x] if we send a transaction to private rpcs and then people query it on public rpcs, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
  - [x] send getTransaction rpc requests to the private rpc tier
- [x] I'm hitting infura rate limits very quickly. I feel like that means something is very inefficient
  - whenever blocks were slow, we started checking as fast as possible
- [x] create user script should allow setting requests per minute
- [x] cache api keys that are not in the database
- [x] improve consensus block selection. Our goal is to find the highest work chain with a block over a minimum threshold of sum_soft_limit.
  - [x] i saw a fork of like 300 blocks. probably just because a node was restarted and had fallen behind. need some checks to ignore things that are far behind. this improvement should fix this problem
  - [x] A new block arrives at a connection.
  - [x] It checks that it isn't the same that it already has (which is a problem with polling nodes)
  - [x] If its new to this node...
    - [x] if the block does not have total work, check our cache. otherwise, query the node
    - [x] save the block num and hash so that http polling doesn't send duplicates
    - [x] send the deduped block through a channel to be handled by the connections grouping.
  - [x] The connections group...
    - [x] input = rpc, new_block
    - [x] adds the block and rpc to its internal maps
      - [x] connection_heads: HashMap<rpc_name, blockhash>
      - [x] block_map: DashMap<blockhash, Arc<Block>>
      - [x] block_num: DashMap<U64, H256>
      - [x] blockchain: DiGraphMap<blockhash, ?>
    - [x] iterate the rpc_map to find the highest_work_block
    - [x] update synced connections
    - [x] send the block through new head_block_sender
  - [x] rewrite cannonical_block to work as long as there are no forks
  - [x] rewrite cannonical_block (again) and related functions to handle forks
    - [x] got a very large number of possible heads here. i think maybe a server was very far out of sync. we should drop servers behind by too much
    eth_1       | 2022-08-10T23:26:06.377129Z  WARN web3_proxy::connections: chain is forked! 261 possible heads. 1/2/5/5 rpcs have 0xd403…3c5d
    eth_1       | 2022-08-10T23:26:08.917603Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
    eth_1       | 2022-08-10T23:26:10.195014Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
    eth_1       | 2022-08-10T23:26:10.195658Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 2/3/5/5 rpcs have 0x0538…bfff
    - [x] todo!("handle equal") and also less and greater
    - [x] "chain is forked" message is wrong. it includes nodes just being on different heights of the same chain. need a smarter check
      - i think there is also a bug because i've seen "server not synced" a couple times
- [x] bug around eth_getBlockByHash sometimes causes tokio to lock up
  - i keep a mapping of blocks so that i can go from hash -> block. it has some consistent hashing it does to split them up across multiple maps each with their own lock. so a lot of the time reads dont block writes because they are in different internal maps. this was fine.
  - but after changing my fork detection logic to use the same rules as erigon, i discovered that when you get blocks from a websocket subscription in erigon and geth, theres a missing field (https://github.com/ledgerwatch/erigon/issues/5190). so i added a query to get the block that includes the missing field.
  - but i did this in a way where i was holding the write lock open while doing the query. the "new" block that has the missing field ends up in the same bucket and it also wants a write lock. oops. entry api has very sharp edges. don't ever await inside a match on DashMap::entry
- [x] requests for "Get transactions receipts" are routed to the private_rpcs and not the balanced_rpcs. do this better.
  - [x] quick fix, send to balanced_rpcs for now. we will just live with errors on new transactions.
  - this was intentional so that recently confirmed transactions go to a server that is more likely to have the tx.
  - but under heavy load, we hit their rate limits. need a "retry_until_success" function that goes to balanced_rpcs. or maybe store in redis the txids that we broadcast privately and use that to route.
- [x] some of the DashMaps grow unbounded! Make/find a "SizedDashMap" that cleans up old rows with some garbage collection task
  - moka is exactly what we need
- [x] if block data limit is 0, say Unknown in Debug output
- [x] basic request method stats (using the user_id and other fields that are in the tracing frame)
- [x] refactor from_anyhow_error to have consistent error codes and http codes. maybe implement the Error trait
- [x] improve rpc weights. i think theres still a potential thundering herd
- [x] improved logging with useful instrumentation
- [x] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
- [x] synced connections swap threshold should come from config
- [x] right now we send too many getTransaction queries to the private rpc tier and we are being rate limited by some of them. change to be serial and weight by hard/soft limit.
- [x] ip blocking gives a 500 and not the proper error code
- [x] need a reconnect that doesn't unwrap
- [x] need a retrying_reconnect that is used everywhere reconnect is. have exponential backoff here
- [x] it looks like our reconnect logic is not always firing. we need to make reconnect more robust!
  - i am pretty sure that this is actually servers that fail to connect on initial setup (maybe the rpcs that are on the wrong chain are just timing out and they aren't set to reconnect?)
- [ ] rewrite rate limiting to have a tiered cache. do not put redis in the hot path
  - instead, we should check a local cache for the current rate limit (+1) and spawn an update to the local cache from redis in the background.
  - when there are a LOT of concurrent requests, we see errors. i thought that was a problem with redis cell, but it happens with my simpler rate limit. now i think the problem is actually with bb8
  - [ ] https://docs.rs/redis/latest/redis/aio/struct.ConnectionManager.html or https://crates.io/crates/deadpool-redis?
  - WARN http_request: redis_rate_limit::errors: redis error err=Response was of incompatible type: "Response type not string compatible." (response was int(500237)) id=01GC6514JWN5PS1NCWJCGJTC94 method=POST
  - maybe even bring back redis-cell
- [ ] web3_proxy_error_count{path = "backend_rpc/request"} is inflated by a bunch of reverts. do not log reverts as warn. 
  - erigon gives `method=eth_call reqid=986147 t=1.151551ms err="execution reverted"`
  - [ ] opt-in debug mode that inspects responses for reverts and saves the request to the database for the user
  - this must be opt-in or spawned since it will slow things down and will make their calls less private
- [ ] chain rolled back 1/1/1 con_head=15510065 (0xa4a3…d2d8) rpc_head=15510065 (0xa4a3…d2d8) rpc=local_erigon_archive
    - include the old head number and block in the log
- [ ] add configurable size limits to all the Caches
- [ ] Ulid instead of Uuid for user keys
  - <https://discord.com/channels/873880840487206962/900758376164757555/1012942974608474142>
  - since users are actively using our service, we will need to support both
- [ ] Ulid instead of Uuid for database ids
  - might have to use Uuid in sea-orm and then convert to Ulid on display
- [ ] Api keys need option to lock to IP, cors header, referer, etc
- [ ] requests per second per api key
- [ ] distribution of methods per api key (eth_call, eth_getLogs, etc.)
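
The best-block sorting bug noted above (flipping `a.cmp(b)` to `b.cmp(a)` and then calling `max_by`) is easy to reintroduce, so here is a minimal sketch of it. `Head`, `best_head_buggy`, and `best_head` are hypothetical names for illustration only, not the proxy's actual types or functions.

```rust
/// Hypothetical stand-in for a head block; only the block number matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Head {
    pub num: u64,
}

/// Buggy: the comparator is reversed, so `max_by` returns the LOWEST block.
/// `max_by` picks the element the comparator considers greatest, and with
/// `b.cmp(a)` the smallest number compares as the greatest.
pub fn best_head_buggy(heads: &[Head]) -> Option<Head> {
    heads.iter().copied().max_by(|a, b| b.num.cmp(&a.num))
}

/// Fixed: natural ordering plus `max_by` returns the highest block.
pub fn best_head(heads: &[Head]) -> Option<Head> {
    heads.iter().copied().max_by(|a, b| a.num.cmp(&b.num))
}
```

Reversing the comparator is only right when paired with something that reads from the front (a descending sort followed by `first()`, or `min_by`). Mixing the two conventions silently selects the wrong end.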

## V1

These are not yet ordered.

- [-] use siwe messages and signatures for sign up and login
- [-] if we request an old block, more servers can handle it than we currently use.
    - [ ] instead of the one list of just heads, store our intermediate mappings (rpcs_by_hash, rpcs_by_num, blocks_by_hash) in SyncedConnections. this shouldn't be too much slower than what we have now
    - [ ] remove the if/else where we optionally route to archive and refactor to require a BlockNumber enum
    - [ ] then check syncedconnections for the blockNum. if num given, use the canonical chain to figure out the winning hash
    - [ ] this means if someone requests a recent but not ancient block, they can use all our servers, even the slower ones. need smart sorting for priority here
- [ ] favicon
  - eth_1       | 2022-09-07T17:10:48.431536Z  WARN web3_proxy::jsonrpc: forwarding error err=nothing to see here
  - use the one on https://staging.llamanodes.com/
- [ ] warn if no servers have transaction subscriptions
    - [ ] if no servers have transaction subscriptions, and a user tries to subscribe, make sure the error is user friendly
- [ ] only allow transaction and full block subscriptions if the user is registered?
- [ ] eth_subscribe logs (https://geth.ethereum.org/docs/rpc/pubsub)
- [ ] make private transactions opt in (its already in the database, but not our code)
- [ ] write a function for receipts that tries balanced_rpcs and only if they all error should it try private relays
  - [ ] automatic retries with a timeout or until all the servers have been tried.
    - i had the websocket die on me in the middle of a long test. only one in-flight request failed because of it. the rest delayed. figure out how to catch these ones since websocket fails sadly seem common
- [ ] nice output when cargo doc is run
- [ ] cache more things locally or in redis
- [ ] stats when forks are resolved (and what chain they were on?)
- [ ] failsafe. if no blocks or transactions in some time, warn and reset the connection
- [ ] emit stats for user's successes, retries, failures, with the types of requests, chain, rpc
- [ ] cli for creating and editing api keys
- [ ] Only subscribe to transactions when someone is listening and if the server has opted in to it
- [ ] When sending eth_sendRawTransaction, retry errors
- [ ] If we need an archive server and no servers in sync, exit immediately with an error instead of waiting 60 seconds
- [ ] 60 second timeout is too short. Maybe do that for free tier and larger timeout for paid. Problem is that some queries can take over 1000 seconds
- [ ] when handling errors from axum parsing the Json...Enum, the errors don't get wrapped in json. i think we need a axum::Layer
- [ ] don't "unwrap" anywhere. give proper errors
- [ ] handle log subscriptions
  - probably as a paid feature
- [ ] exponential backoff when reconnecting a connection
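
A rough sketch of the exponential backoff mentioned in the reconnect items above. Everything here is an assumption for illustration: `backoff_delay` and `retrying_reconnect` are hypothetical helpers, the doubling factor and the cap are arbitrary choices, and the sleep is injected as a closure so the example stays std-only and synchronous. In the real async code the sleep would presumably be something like tokio's timer instead.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Overflow in either the power or the multiplication falls back to the cap.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Calls `connect` until it succeeds or `max_attempts` tries have failed,
/// calling `sleep` with a growing delay between tries.
/// Always makes at least one attempt and returns the last error on failure.
pub fn retrying_reconnect<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of attempts: give the caller the final error instead of unwrapping.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Adding random jitter to the delay would help avoid many connections retrying in lockstep. It is left out here to keep the sketch deterministic.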

new endpoints for users (not totally sure about the exact paths, but these features are all needed):
- [x] GET /u/:api_key
  - proxies to web3 websocket
- [x] POST /u/:api_key
  - proxies to web3
- [ ] GET /user/login/$address
  - returns a JSON string for the user to sign
- [ ] POST /user/login/$address
  - returns a JSON string including the api key
  - sets session cookie
- [ ] GET /user/$address
  - checks for api key in session cookie or header
  - returns a JSON string including user stats
    - balance in USD 
    - deposits history (currency, amounts, transaction id)
    - number of requests used (so we can calculate average spending over a month, burn rate for a user etc, something like "Your balance will be depleted in xx days")
    - the email address of a user if they opted in to be contacted via email
    - all the success/retry/fail counts and latencies (but that may better come from somewhere else)
- [ ] POST /user/$address
  - opt-in link email address
  - checks for api key in session cookie or header
  - allows modifying user settings

## V2

These are not ordered. I think some rows also accidentally got deleted here. Check git history.

- [ ] handle user payments
  - [ ] separate daemon (or users themselves) call POST /users/process_transaction
    - checks a transaction to see if it modifies a user's balance. records results in a sql database
    - we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp
- [ ] refactor so configs can change while running
  - this will probably be a rather large change, but is necessary when we have autoscaling
  - create the app without applying any config to it
  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever (need config hot reloads first)
- [ ] jwt auth so people can easily switch from infura
- [ ] automated soft limit
  - look at average request time for getBlock? i'm not sure how good a proxy that will be for serving eth_call, but its a start
  - https://crates.io/crates/histogram-sampler
- [ ] interval for http subscriptions should be based on block time. load from config is easy, but better to query. currently hard coded to 13 seconds
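
One possible way to "query" the block time for the http polling interval, sketched under assumptions: take the timestamps of a few recent blocks (however they are fetched) and average the gaps, falling back to a configured value when there is not enough data. `estimate_block_time` and `poll_interval` are hypothetical names, and the fallback is whatever the config (or the current hard coded 13 seconds) provides.

```rust
use std::time::Duration;

/// Estimate the average block time from block timestamps (unix seconds, oldest first).
/// Returns `None` with fewer than two blocks or if the timestamps do not increase.
pub fn estimate_block_time(timestamps: &[u64]) -> Option<Duration> {
    let first = *timestamps.first()?;
    let last = *timestamps.last()?;
    let intervals = (timestamps.len() as u64).checked_sub(1)?;
    if intervals == 0 || last <= first {
        return None;
    }
    // Average gap between consecutive blocks over the whole window.
    Some(Duration::from_secs_f64((last - first) as f64 / intervals as f64))
}

/// Polling interval for an http provider: the estimate if available, else `fallback`.
pub fn poll_interval(timestamps: &[u64], fallback: Duration) -> Duration {
    estimate_block_time(timestamps).unwrap_or(fallback)
}
```

Averaging over a window smooths out single slow blocks. Polling somewhat faster than the estimate may still be wanted so new heads are not noticed late.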

in another repo: event subscriber
  - [ ] watch for transfer events to our contract and submit them to /payment/$tx_hash
  - [ ] cli tool that support can run to manually check and submit a transaction

## "Maybe some day" and other Miscellaneous Things

- [ ] tool to revoke bearer tokens that also clears redis
- [ ] eth_getBlockByNumber and similar calls served from the block map
  - will need all Block<TxHash> **and** Block<TransactionReceipt> in caches or fetched efficiently
  - so maybe we don't want this. we can just use the general request cache for these. they will only require 1 request and it means requests won't get in the way as much on writes as new blocks arrive.
  - after looking at my request logs, i think its worth doing this. no point hitting the backends with requests for blocks multiple times. will also help with cache hit rates since we can keep recent blocks in a separate cache
- [ ] Public bsc server got “0” for block data limit (ninicoin)
- [ ] cli tool for resetting api keys
- [ ] cli tool for checking config
- [ ] benchmarks of the different Cache implementations (futures vs dash)
- [ ] Advanced load testing scripts so we can find optimal cost servers 
  - [ ] benchmarks from https://github.com/llamafolio/llamafolio-api/
  - [ ] benchmarks from ethspam and versus
  - [ ] benchmarks from other things
  - [ ] quick script that calls all the curve-api endpoints once and checks for success, then calls wrk to hammer it
    - [ ] https://github.com/curvefi/curve-api
    - [ ] test /api/getGauges method
        - usually times out after vercel's 60 second timeout
        - one time got: Error invalid Json response ""
- [ ] send logs to sentry
- [ ] i think all the async methods in ethers need tracing instrument. something like `cfgif(tracing, tracing::instrument)`
  - if they do that, i think my request_id will show up on their logs
- [ ] page that prints a graphviz dotfile of the blockchain
- [ ] search for all the "TODO" and `todo!(...)` items in the code and move them here
- [ ] instead of giving a rate limit error code, delay the connection's response at the start. reject if incoming requests is super high?
- [ ] add the backend server to the header?
- [ ] have a low-latency option that always tries at least two servers in parallel and then returns the first success?
  - this doubles our request load though. maybe only if the first one doesn't respond very quickly? 
- [ ] zero downtime deploys
- [ ] graceful shutdown. stop taking new requests and don't stop until all outstanding queries are handled
  - https://github.com/tokio-rs/mini-redis/blob/master/src/shutdown.rs
- [ ] are we using Acquire/Release/AcqRel properly? or do we need other modes?
- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
- [ ] subscribe to pending transactions and build an intelligent gas estimator
- [ ] flashbots specific methods
  - [ ] flashbots protect fast mode or not? probably fast matches most user's needs, but no reverts is nice.
  - [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if its set
- [ ] if no redis set, but public rate limits are set, exit with an error
- [ ] i saw "WebSocket connection closed unexpectedly" but no log about reconnecting
  - need better logs on this because afaict it did reconnect
- [ ] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though its a wasted connection
- [ ] better document load tests: docker run --rm --name spam shazow/ethspam --rpc http://$LOCAL_IP:8544 | versus --concurrency=100 --stop-after=10000 http://$LOCAL_IP:8544; docker stop spam
- [ ] if the call is something simple like "symbol" or "decimals", cache that too. though i think this could bite us.
- [ ] add a subscription that returns the head block number and hash but nothing else
- [ ] if chain split detected, what should we do? don't send transactions?
- [ ] archive check works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers. maybe have a "max block data limit"
- [ ] https://docs.rs/derive_builder/latest/derive_builder/
- [ ] Detect orphaned transactions
- [ ] https://crates.io/crates/reqwest-middleware easy retry with exponential back off
  - Though I think we want retries that go to other backends instead
- [ ] Some of the pub things should probably be "pub(crate)"
- [ ] Maybe storing pending txs on receipt in a dashmap is wrong. We want to store in a timer_heap (or similar) when we actually send. This way there's no lock contention until the race is over.
- [ ] Support "safe" block height. It's planned for eth2 but we can kind of do it now by just doing head block num-3
- [ ] Archive check on BSC gave “archive” when it isn’t. and FTM gave 90k for all servers even though they should be archive
- [ ] cache eth_getLogs in a database?
- [ ] stats for "read amplification". how many backend requests do we send compared to frontend requests we received?
- [ ] fully test retrying when "header not found"
  - i saw "header not found" on a simple eth_getCode query to a public load balanced bsc archive node on block 1
- [ ] weird flapping fork could have more useful logs. like, howd we get to 1/1/4 and fork. geth changed its mind 3 times?
  - should we change our code to follow the same consensus rules as geth? our first seen still seems like a reasonable choice
  -  other chains might change all sorts of things about their fork choice rules
    2022-07-22T23:52:18.593956Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0xa906…5bc1 rpc=Web3Connection { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
    2022-07-22T23:52:18.983441Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0x70e8…48e0 rpc=Web3Connection { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
    2022-07-22T23:52:19.350720Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 1/2/4 rpcs have 0x70e8…48e0 rpc=Web3Connection { url: "ws://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
    2022-07-22T23:52:26.041140Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 2/4/4 rpcs have 0x70e8…48e0 rpc=Web3Connection { url: "http://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
  - [ ] threshold should check actual available request limits (if any) instead of just the soft limit
- [ ] foreign key on_update and on_delete
- [ ] database creation timestamps
- [ ] better error handling. we warn too often for validation errors and use the same error code for most every request
- [ ] use &str more instead of String. lifetime annotations get really annoying though
- [ ] tarpit instead of reject requests (unless theres a lot)
- [ ] tune database connection pool size. i think a single web3_proxy currently maxes out our server
- [ ] subscribing to transactions should be configurable per server. listening to paid servers can get expensive
- [ ] archive servers should be lowest priority
- [ ] docker build context is really big. we must be including target or something
- [ ] ip detection needs work so that everything doesnt show up as 172.x.x.x
- [ ] status page leaks our urls which contain secrets. change that to use names
- [ ] PR to add this to sea orm prelude:
  ```
  #[cfg(feature = "with-uuid")]
  pub use uuid::Builder as UuidBuilder;
  ```
- [ ] get to /, when not serving a websocket, should have a simple welcome page. maybe with a button to update your wallet. 
- [ ] rate limit thoughts:
  - if someone subscribes to all pending transactions, how should that count against rate limits
  - when those rate limits are hit, what should happen?
  - missing pending transactions might be okay, but not missing confirmed blocks 
- [ ] for easier errors in the axum code, i think we need to have our own type that wraps anyhow::Result+Error
- [ ] fix ip detection when running in dev
- [ ] double check weight sorting code
- [ ] sea-orm brings in async-std, but we are using tokio. benchmark switching 
- [ ] this query always times out, but erigon can serve it quickly: `curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"debug_traceBlockByNumber","params":["latest"],"id":1}' 127.0.0.1:8544`
  {"jsonrpc":"2.0","id":null,"error":{"code":-32099,"message":"deadline has elapsed"}}
  - [ ] figure out rate limits for private rpcs. eden v1 gives 500 error instead of a code for rate limits
- [ ] https://gitlab.com/moka-labs/tiered-cache-example
- [ ] web3connection3.block(...) might wait forever. be sure to do it safely
- [ ] search for all "todo!"
- [ ] replace all `.context("no servers in sync")` with proper error type
- [ ] when using a bunch of slow public servers, i see "no servers in sync" even when things should be right
  - [ ] i think checking the parents of the heaviest chain works most of the time, but not always
  - maybe iterate connection heads by total weight? i still think we need to include parent hashes
- [ ] i see "No block found" sometimes for a single server's block. Not sure why since reads should happen after writes
- [ ] whats going on here? why is it rolling back? maybe total_difficulty was a LOT higher?
  - 2022-09-05T19:21:39.763630Z  WARN web3_proxy::rpcs::blockchain: chain rolled back 1/6/7 head=15479604 (0xf809…6a2c) rpc=infura_free
  - i wish i had more logs. its possible that 15479605 came immediately after
- [ ] ip blocking logs a warn. we don't need that. a stat at most
- [ ] keep it working without redis and a database

# Todo

## MVP

These are roughly in order of completion

- [x] simple proxy
- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] support websocket clients
  - we support websockets for the backends already, but we need them for the frontend too
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest. but it's in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
- [x] if no backends, return a 502 instead of delaying?
- [x] move from warp to axum
- [x] handle websocket disconnect and reconnect
- [x] eth_sendRawTransaction should return the most common result, not the first
- [x] use redis and redis-cell for rate limits
- [x] it works for a few seconds and then gets stuck on something.
  - [x] it's working with one backend node, but multiple breaks. something to do with pending transactions
  - [x] dashmap entry api is easy to deadlock! be careful with it!
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes
- [x] refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions
- [x] refactor Connections::spawn. have it return a handle that is selecting on those handles?
- [x] some production configs are occasionally stuck waiting at 100% cpu
  - they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
  - even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce. i just let it run for awhile and it happens.
  - running gdb shows the tokio tungstenite thread is spinning near 100% cpu and none of the rest of the program is proceeding
  - fixed by https://github.com/gakonst/ethers-rs/pull/1287
- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
- [x] if web3 proxy gets an http error back, retry another node
- [x] endpoint for health checks. if no synced servers, give a 502 error
- [x] rpc errors propagate too far. one subscription failing ends the app. isolate the providers more (might already be fixed)
- [x] incoming rate limiting (by ip)
- [x] connection pool for redis
- [x] automatically route to archive server when necessary
  - originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
  - when getting the next server, filtering on "archive" isn't going to work well. need to check inner instead
- [x] if the requested block is ahead of the best block, return without querying any backend servers
- [x] http servers should check block at the very start
- [x] subscription id should be per connection, not global
- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
  - we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
- [x] inspect any jsonrpc errors. if its something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
  - this error seems to happen when we use load balanced backend rpcs like pokt and ankr
- [x] RESPONSE_CACHE_CAP in bytes instead of number of entries
- [x] if we don't cache errors, then in-flight request caching is going to bottleneck
  - i think now that we retry header not found and similar, caching errors should be fine
- [x] RESPONSE_CACHE_CAP from config
- [x] web3_sha3 rpc command
- [x] test that launches anvil and connects the proxy to it and does some basic queries
  - [x] need to have some sort of shutdown signaling. doesn't need to be graceful at this point, but should be eventually
- [x] if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.
  - thundering herd problem if we only allow a lag of 0 blocks
  - we can improve this by only publishing the synced connections once a threshold of total available soft and hard limits is passed. how can we do this without hammering redis? at least its only once per block per server
  - [x] instead of tracking `pending_synced_connections`, have a mapping of where all connections are individually. then each change, re-check for consensus.
- [x] synced connections swap threshold set to 1 so that it always serves something
- [x] cli tool for creating new users
- [x] incoming rate limiting by api key
- [x] sort forked blocks by total difficulty like geth does
- [x] refactor result type on active handlers to use a cleaner success/error so we can use the try operator
- [x] give users different rate limits looked up from the database
- [x] Add a "weight" key to the servers. Sort on that after block. keep most requests local
- [x] cache db query results for user data. db is a big bottleneck right now
- [x] allow blocking public requests
- [x] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
- [x] im seeing ethspam occasionally try to query a future block. something must be setting the head block too early
  - [x] we were sorting best block the wrong direction. i flipped a.cmp(b) to b.cmp(a) so that the largest would be first, but then i used 'max_by' which looks at the end of the list
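
The comparator mix-up above is easy to reproduce in isolation. This is a minimal sketch using plain `u64` block numbers, not the proxy's real block type; `best_block` and `best_block_reversed` are hypothetical names. With a reversed comparator, `max_by` returns the element that is greatest under the reversed order, which is the smallest number.

```rust
/// Pick the highest block number using the natural ordering.
pub fn best_block(heads: &[u64]) -> Option<u64> {
    heads.iter().copied().max_by(|a, b| a.cmp(b))
}

/// Same call with the comparator flipped. `max_by` still returns the
/// "greatest" element, but under the reversed order that is the lowest number.
pub fn best_block_reversed(heads: &[u64]) -> Option<u64> {
    heads.iter().copied().max_by(|a, b| b.cmp(a))
}
```

Flipping the comparator is only needed when sorting and taking the first element. With `max_by` (or `max`) the natural ordering is the right one.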
- [x] HTTP GET to the websocket endpoints should redirect instead of giving an ugly error
- [x] load the redirected page from config
- [x] prettier output for create_user command. need the key in hex
- [x] drop redis-cell in favor of a simpler (and faster) implementation.
  - redis-cell was giving me weird errors and it isn't worth debugging it right now.
- [x] create user script should allow setting the api key
- [x] disable redis persistence in dev
- [x] attach a request id to every web request
- [x] attach user id (not IP!) to each request
- [x] fantom_1    | 2022-08-10T22:19:43.522465Z  WARN web3_proxy::jsonrpc: forwarding error err=missing field `jsonrpc` at line 1 column 60
  - [x] i think the server isn't following the spec. we need a context attached to more errors so we know which one
  - [x] make jsonrpc default to "2.0" (including the custom deserializer that handles the RawValues)
- [x] if the eth_call (or similar) params include a block, we can cache for that
- [x] when block subscribers receive blocks, store them in a block_map
- [x] eth_blockNumber without a backend request
- [x] if we send a transaction to private rpcs and then people query it on public rpcs, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
  - [x] send getTransaction rpc requests to the private rpc tier
- [x] I'm hitting infura rate limits very quickly. I feel like that means something is very inefficient
  - whenever blocks were slow, we started checking as fast as possible
- [x] create user script should allow setting requests per minute
- [x] cache api keys that are not in the database
- [x] improve consensus block selection. Our goal is to find the highest work chain with a block over a minimum threshold of sum_soft_limit.
  - [x] i saw a fork of like 300 blocks. probably just because a node was restarted and had fallen behind. need some checks to ignore things that are far behind. this improvement should fix this problem
  - [x] A new block arrives at a connection.
  - [x] It checks that it isn't the same that it already has (which is a problem with polling nodes)
  - [x] If it's new to this node...
    - [x] if the block does not have total work, check our cache. otherwise, query the node
    - [x] save the block num and hash so that http polling doesn't send duplicates
    - [x] send the deduped block through a channel to be handled by the connections grouping.
  - [x] The connections group...
    - [x] input = rpc, new_block
    - [x] adds the block and rpc to its internal maps
      - [x] `connection_heads: HashMap<rpc_name, blockhash>`
      - [x] `block_map: DashMap<blockhash, Arc<Block>>`
      - [x] `block_num: DashMap<U64, H256>`
      - [x] `blockchain: DiGraphMap<blockhash, ?>`
    - [x] iterate the rpc_map to find the highest_work_block
    - [x] update synced connections
    - [x] send the block through new head_block_sender
  - [x] rewrite cannonical_block to work as long as there are no forks
  - [x] rewrite cannonical_block (again) and related functions to handle forks
    - [x] got a very large number of possible heads here. i think maybe a server was very far out of sync. we should drop servers behind by too much
    eth_1       | 2022-08-10T23:26:06.377129Z  WARN web3_proxy::connections: chain is forked! 261 possible heads. 1/2/5/5 rpcs have 0xd403…3c5d
    eth_1       | 2022-08-10T23:26:08.917603Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
    eth_1       | 2022-08-10T23:26:10.195014Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
    eth_1       | 2022-08-10T23:26:10.195658Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 2/3/5/5 rpcs have 0x0538…bfff
    - [x] todo!("handle equal") and also less and greater
    - [x] "chain is forked" message is wrong. it includes nodes just being on different heights of the same chain. need a smarter check
      - i think there is also a bug because i've seen "server not synced" a couple times
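
The selection goal stated above (highest work, but only among heads backed by at least a minimum summed soft limit) can be sketched roughly as follows. This is an illustration only: `RpcHead`, its fields, and `consensus_head` are hypothetical names, hashes are plain strings, and the sketch compares reported heads without walking parent hashes, so it does not handle servers sitting at different heights of the same chain.

```rust
use std::collections::HashMap;

/// What one backend reports as its head. Field names are illustrative.
#[derive(Clone, Debug)]
pub struct RpcHead {
    pub rpc_name: String,
    pub block_hash: String,
    pub total_difficulty: u128,
    pub soft_limit: u32,
}

/// Return the head hash with the most total difficulty among the hashes whose
/// backers' combined soft limit reaches `min_sum_soft_limit`.
pub fn consensus_head(heads: &[RpcHead], min_sum_soft_limit: u32) -> Option<String> {
    // block_hash -> (total_difficulty, summed soft limit of the rpcs on that hash)
    let mut by_hash: HashMap<&str, (u128, u32)> = HashMap::new();
    for head in heads {
        let entry = by_hash
            .entry(head.block_hash.as_str())
            .or_insert((head.total_difficulty, 0));
        entry.1 = entry.1.saturating_add(head.soft_limit);
    }

    by_hash
        .into_iter()
        // Drop hashes that too little capacity agrees on.
        .filter(|(_, (_, sum_soft_limit))| *sum_soft_limit >= min_sum_soft_limit)
        // Highest work wins; the hash is only a deterministic tie-break.
        .max_by_key(|(hash, (total_difficulty, _))| (*total_difficulty, *hash))
        .map(|(hash, _)| hash.to_string())
}
```

A lone server that is ahead of everyone else is ignored until enough soft limit has caught up to its block, which is the "always serve something with enough capacity" trade-off the items above describe.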
- [x] bug around eth_getBlockByHash sometimes causes tokio to lock up
  - i keep a mapping of blocks so that i can go from hash -> block. it has some consistent hashing it does to split them up across multiple maps each with their own lock. so a lot of the time reads dont block writes because they are in different internal maps. this was fine.
  - but after changing my fork detection logic to use the same rules as erigon, i discovered that when you get blocks from a websocket subscription in erigon and geth, theres a missing field (https://github.com/ledgerwatch/erigon/issues/5190). so i added a query to get the block that includes the missing field.
  - but i did this in a way where i was holding the write lock open while doing the query. the "new" block that has the missing field ends up in the same bucket and it also wants a write lock. oops. entry api has very sharp edges. don't ever await inside a match on DashMap::entry
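
The rule from that bug (never hold the map's write lock while doing slow work) can be shown without DashMap or async. The sketch below swaps in a std `Mutex<HashMap>` and a plain closure in place of the backend query; `BlockCache` and `get_or_fetch` are hypothetical names, and the `u64` value stands in for a block.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical block cache keyed by hash. A std `Mutex<HashMap>` stands in
/// for the sharded concurrent map; the locking rule is the same.
pub struct BlockCache {
    blocks: Mutex<HashMap<String, u64>>,
}

impl BlockCache {
    pub fn new() -> Self {
        Self { blocks: Mutex::new(HashMap::new()) }
    }

    /// Return the cached value for `hash`, or call `fetch` to get it.
    /// The lock is released before `fetch` runs, so a slow fetch (or one that
    /// touches this cache again) cannot block on a lock we still hold.
    pub fn get_or_fetch<F>(&self, hash: &str, fetch: F) -> u64
    where
        F: FnOnce() -> u64,
    {
        // 1. Short critical section: look up and copy the value out.
        let cached = {
            let guard = self.blocks.lock().unwrap();
            guard.get(hash).copied()
        }; // guard is dropped here
        if let Some(found) = cached {
            return found;
        }

        // 2. No lock is held while the slow work runs.
        let fetched = fetch();

        // 3. Re-lock briefly to insert. If another caller won the race, keep theirs.
        let mut guard = self.blocks.lock().unwrap();
        let value = *guard.entry(hash.to_string()).or_insert(fetched);
        value
    }
}
```

The cost is that two callers may fetch the same block at once. That is usually cheaper than a deadlock, and in-flight request deduplication can be layered on separately.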
- [x] requests for "Get transactions receipts" are routed to the private_rpcs and not the balanced_rpcs. do this better.
  - [x] quick fix, send to balanced_rpcs for now. we will just live with errors on new transactions.
  - this was intentional so that recently confirmed transactions go to a server that is more likely to have the tx.
  - but under heavy load, we hit their rate limits. need a "retry_until_success" function that goes to balanced_rpcs. or maybe store in redis the txids that we broadcast privately and use that to route.
- [x] some of the DashMaps grow unbounded! Make/find a "SizedDashMap" that cleans up old rows with some garbage collection task
  - moka is exactly what we need
- [x] if block data limit is 0, say Unknown in Debug output
- [x] basic request method stats (using the user_id and other fields that are in the tracing frame)
- [x] refactor from_anyhow_error to have consistent error codes and http codes. maybe implement the Error trait
- [x] improve rpc weights. i think theres still a potential thundering herd
- [x] improved logging with useful instrumentation
- [x] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
- [x] synced connections swap threshold should come from config
- [x] right now we send too many getTransaction queries to the private rpc tier and we are being rate limited by some of them. change to be serial and weight by hard/soft limit.
- [x] ip blocking gives a 500 and not the proper error code
- [x] need a reconnect that doesn't unwrap
- [x] need a retrying_reconnect that is used everywhere reconnect is. have exponential backoff here
- [x] it looks like our reconnect logic is not always firing. we need to make reconnect more robust!
  - i am pretty sure that this is actually servers that fail to connect on initial setup (maybe the rpcs that are on the wrong chain are just timing out and they aren't set to reconnect?)
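
A reconnect that returns errors instead of unwrapping, with exponential backoff between tries, could look roughly like this. It is a sketch under assumptions: the function names reuse `retrying_reconnect` from the item above but the signature is invented, `connect` is any fallible closure, and a blocking `std::thread::sleep` stands in for the async sleep a tokio app would use. No jitter is applied.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Call `connect` until it succeeds or `max_attempts` tries have failed.
/// The last error is returned to the caller instead of panicking.
pub fn retrying_reconnect<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of attempts: hand the error back.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Wait longer after each failure, up to the cap.
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Because the same function handles the first connection and later reconnects, servers that fail during initial setup get retried too, which is the case the last item suspects was being missed.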
- [ ] rewrite rate limiting to have a tiered cache. do not put redis in the hot path
  - instead, we should check a local cache for the current rate limit (+1) and spawn an update to the local cache from redis in the background.
  - when there are a LOT of concurrent requests, we see errors. i thought that was a problem with redis cell, but it happens with my simpler rate limit. now i think the problem is actually with bb8
  - [ ] https://docs.rs/redis/latest/redis/aio/struct.ConnectionManager.html or https://crates.io/crates/deadpool-redis?
  - WARN http_request: redis_rate_limit::errors: redis error err=Response was of incompatible type: "Response type not string compatible." (response was int(500237)) id=01GC6514JWN5PS1NCWJCGJTC94 method=POST
  - maybe even bring back redis-cell
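
The local tier of such a rate limiter might be shaped like the sketch below. Everything here is an assumption for illustration: `LocalRateLimit` and its methods are hypothetical, the window is reset by the caller, and `merge_remote` stands in for whatever background task would read the shared count from redis. No redis calls are shown.

```rust
use std::collections::HashMap;

/// Hypothetical local tier of a rate limiter. The hot path only touches this
/// in-memory map; a background task is assumed to call `merge_remote`.
pub struct LocalRateLimit {
    max_per_window: u64,
    counts: HashMap<String, u64>,
}

impl LocalRateLimit {
    pub fn new(max_per_window: u64) -> Self {
        Self { max_per_window, counts: HashMap::new() }
    }

    /// Hot path: bump the local count (+1) and decide without a network call.
    pub fn check(&mut self, key: &str) -> bool {
        let count = self.counts.entry(key.to_string()).or_insert(0);
        *count += 1;
        *count <= self.max_per_window
    }

    /// Background path: fold in the shared count. Keeping the larger value
    /// lets traffic seen by other proxy instances count against the key.
    pub fn merge_remote(&mut self, key: &str, remote_count: u64) {
        let count = self.counts.entry(key.to_string()).or_insert(0);
        *count = (*count).max(remote_count);
    }

    /// Start a new window.
    pub fn reset(&mut self) {
        self.counts.clear();
    }
}
```

The local answer can briefly lag the shared count, so a key may slightly exceed its limit across instances. That is the price of keeping redis out of the request path.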
- [ ] web3_proxy_error_count{path = "backend_rpc/request"} is inflated by a bunch of reverts. do not log reverts as warn.
  - erigon gives `method=eth_call reqid=986147 t=1.151551ms err="execution reverted"`
  - [ ] opt-in debug mode that inspects responses for reverts and saves the request to the database for the user
  - this must be opt-in or spawned since it will slow things down and will make their calls less private
- [ ] chain rolled back 1/1/1 con_head=15510065 (0xa4a3…d2d8) rpc_head=15510065 (0xa4a3…d2d8) rpc=local_erigon_archive
    - include the old head number and block in the log
- [ ] add configurable size limits to all the Caches
- [ ] Ulid instead of Uuid for user keys
  - <https://discord.com/channels/873880840487206962/900758376164757555/1012942974608474142>
  - since users are actively using our service, we will need to support both
- [ ] Ulid instead of Uuid for database ids
  - might have to use Uuid in sea-orm and then convert to Ulid on display
- [ ] Api keys need option to lock to IP, cors header, referer, etc
- [ ] requests per second per api key
- [ ] distribution of methods per api key (eth_call, eth_getLogs, etc.)

## V1

These are not yet ordered.

- [-] use siwe messages and signatures for sign up and login
- [-] if we request an old block, more servers can handle it than we currently use.
    - [ ] instead of the one list of just heads, store our intermediate mappings (rpcs_by_hash, rpcs_by_num, blocks_by_hash) in SyncedConnections. this shouldn't be too much slower than what we have now
    - [ ] remove the if/else where we optionally route to archive and refactor to require a BlockNumber enum
    - [ ] then check syncedconnections for the blockNum. if num given, use the canonical chain to figure out the winning hash
    - [ ] this means if someone requests a recent but not ancient block, they can use all our servers, even the slower ones. need smart sorting for priority here
- [ ] favicon
  - eth_1       | 2022-09-07T17:10:48.431536Z  WARN web3_proxy::jsonrpc: forwarding error err=nothing to see here
  - use the one on https://staging.llamanodes.com/
- [ ] warn if no servers have transaction subscriptions
    - [ ] if no servers have transaction subscriptions, and a user tries to subscribe, make sure the error is user friendly
- [ ] only allow transaction and full block subscriptions if the user is registered?
- [ ] eth_subscribe logs (https://geth.ethereum.org/docs/rpc/pubsub)
- [ ] make private transactions opt in (it's already in the database, but not our code)
- [ ] write a function for receipts that tries balanced_rpcs and only if they all error should it try private relays
  - [ ] automatic retries with a timeout or until all the servers have been tried.
    - i had the websocket die on me in the middle of a long test. only one in-flight request failed because of it. the rest delayed. figure out how to catch these ones since websocket fails sadly seem common
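
The receipts fallback described above (balanced servers first, private relays only if every balanced server errored) can be sketched as one loop. This is an assumption-heavy illustration: `receipt_with_fallback` is a hypothetical name, servers are identified by plain names, and `query` is a stand-in closure for the real request. Timeouts are left out.

```rust
/// Ask each balanced server in order; only if all of them fail, ask the
/// private relays. On total failure, return every error that was collected.
pub fn receipt_with_fallback<F>(
    balanced: &[&str],
    private: &[&str],
    query: F,
) -> Result<String, Vec<String>>
where
    F: Fn(&str) -> Result<String, String>,
{
    let mut errors = Vec::new();
    // Chaining keeps the order: private relays are reached only after the
    // whole balanced tier has been tried.
    for rpc in balanced.iter().chain(private.iter()).copied() {
        match query(rpc) {
            Ok(receipt) => return Ok(receipt),
            Err(err) => errors.push(err),
        }
    }
    Err(errors)
}
```

Each server is tried at most once, so the loop ends "when all the servers have been tried" without needing a retry counter.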
- [ ] nice output when cargo doc is run
- [ ] cache more things locally or in redis
- [ ] stats when forks are resolved (and what chain they were on?)
- [ ] failsafe. if no blocks or transactions in some time, warn and reset the connection
- [ ] emit stats for user's successes, retries, failures, with the types of requests, chain, rpc
- [ ] cli for creating and editing api keys
- [ ] Only subscribe to transactions when someone is listening and if the server has opted in to it
- [ ] When sending eth_sendRawTransaction, retry errors
- [ ] If we need an archive server and no servers in sync, exit immediately with an error instead of waiting 60 seconds
- [ ] 60 second timeout is too short. Maybe do that for free tier and larger timeout for paid. Problem is that some queries can take over 1000 seconds
- [ ] when handling errors from axum parsing the Json...Enum, the errors don't get wrapped in json. i think we need an axum::Layer
- [ ] don't "unwrap" anywhere. give proper errors
- [ ] handle log subscriptions
  - probably as a paid feature
- [ ] exponential backoff when reconnecting a connection

new endpoints for users (not totally sure about the exact paths, but these features are all needed):

- [x] GET /u/:api_key
  - proxies to web3 websocket
- [x] POST /u/:api_key
  - proxies to web3
- [ ] GET /user/login/$address
  - returns a JSON string for the user to sign
- [ ] POST /user/login/$address
  - returns a JSON string including the api key
  - sets session cookie
- [ ] GET /user/$address
  - checks for api key in session cookie or header
  - returns a JSON string including user stats
    - balance in USD
    - deposits history (currency, amounts, transaction id)
    - number of requests used (so we can calculate average spending over a month, burn rate for a user etc, something like "Your balance will be depleted in xx days")
    - the email address of a user if they opted in to get contacted via email
    - all the success/retry/fail counts and latencies (but that may better come from somewhere else)
- [ ] POST /user/$address
  - opt-in link email address
  - checks for api key in session cookie or header
  - allows modifying user settings
-												clean up todos

											
										
										
											2022-06-21 04:02:49 +03:00
+								## V2
-												dry errors so that rate limits dont log so much

											
										
										
											2022-09-10 03:12:14 +03:00
+								These are not ordered. I think some rows also accidently got deleted here. Check git history.
-												todo cleanup

											
										
										
											2022-09-07 07:47:06 +03:00
-												order most of the todos

											
										
										
											2022-09-12 17:31:57 +03:00
+								- [ ] handle user payments
 								  - [ ] separate daemon (or users themselves) call POST /users/process_transaction
 								    - checks a transaction to see if it modifies a user's balance. records results in a sql database
 								    - we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp
 								- [ ] refactor so configs can change while running
 								  - this will probably be a rather large change, but is necessary when we have autoscaling
 								  - create the app without applying any config to it
 								  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
 								  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
 								- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever (need config hot reloads first)
- [ ] jwt auth so people can easily switch from infura
- [ ] automated soft limit
  - look at average request time for getBlock? i'm not sure how good a proxy that will be for serving eth_call, but its a start
  - https://crates.io/crates/histogram-sampler
- [ ] interval for http subscriptions should be based on block time. load from config is easy, but better to query. currently hard coded to 13 seconds
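
One way the "better to query" option for the http subscription interval could look: average the gaps between the timestamps of a few recent blocks and fall back to the current hard-coded value when there is not enough data. This is only a sketch. The function names `estimate_block_time` and `poll_interval` are hypothetical, and it assumes the caller has already fetched block timestamps (unix seconds, oldest first) from a backend.

```rust
use std::time::Duration;

/// Estimate the average time between blocks from recent block timestamps.
///
/// `timestamps` are unix seconds in ascending block order (an assumption of
/// this sketch). Returns `None` when there are fewer than two blocks or the
/// timestamps do not move forward.
pub fn estimate_block_time(timestamps: &[u64]) -> Option<Duration> {
    let first = *timestamps.first()?;
    let last = *timestamps.last()?;
    // Safe: `first()?` already returned early for an empty slice.
    let gaps = (timestamps.len() - 1) as u64;

    if gaps == 0 || last <= first {
        return None;
    }

    Some(Duration::from_secs_f64((last - first) as f64 / gaps as f64))
}

/// Pick the polling interval, keeping a configured fallback
/// (for example the current hard-coded 13 seconds).
pub fn poll_interval(timestamps: &[u64], fallback: Duration) -> Duration {
    estimate_block_time(timestamps).unwrap_or(fallback)
}
```

Averaging over several blocks smooths out a single slow or fast block; how many blocks to sample is left open here.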

in another repo: event subscriber
  - [ ] watch for transfer events to our contract and submit them to /payment/$tx_hash
  - [ ] cli tool that support can run to manually check and submit a transaction

## "Maybe some day" and other Miscellaneous Things

- [ ] tool to revoke bearer tokens that also clears redis
- [ ] eth_getBlockByNumber and similar calls served from the block map
  - will need all Block<TxHash> **and** Block<TransactionReceipt> in caches or fetched efficiently
  - so maybe we don't want this. we can just use the general request cache for these. they will only require 1 request and it means requests won't get in the way as much on writes as new blocks arrive.
  - after looking at my request logs, i think its worth doing this. no point hitting the backends with requests for blocks multiple times. will also help with cache hit rates since we can keep recent blocks in a separate cache
- [ ] Public bsc server got “0” for block data limit (ninicoin)
- [ ] cli tool for resetting api keys
- [ ] cli tool for checking config
- [ ] benchmarks of the different Cache implementations (futures vs dash)
- [ ] Advanced load testing scripts so we can find optimal cost servers
  - [ ] benchmarks from https://github.com/llamafolio/llamafolio-api/
  - [ ] benchmarks from ethspam and versus
  - [ ] benchmarks from other things
  - [ ] quick script that calls all the curve-api endpoints once and checks for success, then calls wrk to hammer it
    - [ ] https://github.com/curvefi/curve-api
    - [ ] test /api/getGauges method
      - usually times out after vercel's 60 second timeout
      - one time got: Error invalid Json response ""
- [ ] send logs to sentry
- [ ] i think all the async methods in ethers need tracing instrument. something like `cfgif(tracing, tracing::instrument)`
  - if they do that, i think my request_id will show up on their logs
- [ ] page that prints a graphviz dotfile of the blockchain
- [ ] search for all the "TODO" and `todo!(...)` items in the code and move them here
- [ ] instead of giving a rate limit error code, delay the connection's response at the start. reject if incoming requests is super high?
- [ ] add the backend server to the header?
- [ ] have a low-latency option that always tries at least two servers in parallel and then returns the first success?
  - this doubles our request load though. maybe only if the first one doesn't respond very quickly?
- [ ] zero downtime deploys
- [ ] graceful shutdown. stop taking new requests and don't stop until all outstanding queries are handled
  - https://github.com/tokio-rs/mini-redis/blob/master/src/shutdown.rs
- [ ] are we using Acquire/Release/AcqRel properly? or do we need other modes?
- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
- [ ] subscribe to pending transactions and build an intelligent gas estimator
- [ ] flashbots specific methods
  - [ ] flashbots protect fast mode or not? probably fast matches most user's needs, but no reverts is nice.
  - [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if its set
- [ ] if no redis set, but public rate limits are set, exit with an error
- [ ] i saw "WebSocket connection closed unexpectedly" but no log about reconnecting
  - need better logs on this because afaict it did reconnect
- [ ] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though its a wasted connection
- [ ] better document load tests: `docker run --rm --name spam shazow/ethspam --rpc http://$LOCAL_IP:8544 | versus --concurrency=100 --stop-after=10000 http://$LOCAL_IP:8544; docker stop spam`
- [ ] if the call is something simple like "symbol" or "decimals", cache that too. though i think this could bite us.
- [ ] add a subscription that returns the head block number and hash but nothing else
- [ ] if chain split detected, what should we do? don't send transactions?
- [ ] archive check works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers. maybe have a "max block data limit"
- [ ] https://docs.rs/derive_builder/latest/derive_builder/
- [ ] Detect orphaned transactions
- [ ] https://crates.io/crates/reqwest-middleware easy retry with exponential back off
  - Though I think we want retries that go to other backends instead
- [ ] Some of the pub things should probably be "pub(crate)"
- [ ] Maybe storing pending txs on receipt in a dashmap is wrong. We want to store in a timer_heap (or similar) when we actually send. This way there's no lock contention until the race is over.
- [ ] Support "safe" block height. It's planned for eth2, but we can kind of do it now by just using head block num - 3
- [ ] Archive check on BSC gave “archive” when it isn’t. and FTM gave 90k for all servers even though they should be archive
- [ ] cache eth_getLogs in a database?
- [ ] stats for "read amplification". how many backend requests do we send compared to frontend requests we received?
- [ ] fully test retrying when "header not found"
  - i saw "header not found" on a simple eth_getCode query to a public load balanced bsc archive node on block 1
- [ ] weird flapping fork could have more useful logs. like, howd we get to 1/1/4 and fork. geth changed its mind 3 times?
  - should we change our code to follow the same consensus rules as geth? our first seen still seems like a reasonable choice
  - other chains might change all sorts of things about their fork choice rules
  - 2022-07-22T23:52:18.593956Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0xa906…5bc1 rpc=Web3Connection { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
  - 2022-07-22T23:52:18.983441Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0x70e8…48e0 rpc=Web3Connection { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
  - 2022-07-22T23:52:19.350720Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 1/2/4 rpcs have 0x70e8…48e0 rpc=Web3Connection { url: "ws://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
  - 2022-07-22T23:52:26.041140Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 2/4/4 rpcs have 0x70e8…48e0 rpc=Web3Connection { url: "http://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
  - [ ] threshold should check actual available request limits (if any) instead of just the soft limit
- [ ] foreign key on_update and on_delete
- [ ] database creation timestamps
- [ ] better error handling. we warn too often for validation errors and use the same error code for most every request
- [ ] use &str more instead of String. lifetime annotations get really annoying though
- [ ] tarpit instead of reject requests (unless theres a lot)
- [ ] tune database connection pool size. i think a single web3_proxy currently maxes out our server
- [ ] subscribing to transactions should be configurable per server. listening to paid servers can get expensive
- [ ] archive servers should be lowest priority
- [ ] docker build context is really big. we must be including target or something
- [ ] ip detection needs work so that everything doesnt show up as 172.x.x.x
- [ ] status page leaks our urls which contain secrets. change that to use names
- [ ] PR to add this to sea orm prelude:
  ```
  #[cfg(feature = "with-uuid")]
  pub use uuid::Builder as UuidBuilder;
  ```
- [ ] get to /, when not serving a websocket, should have a simple welcome page. maybe with a button to update your wallet.
- [ ] rate limit thoughts:
  - if someone subscribes to all pending transactions, how should that count against rate limits
  - when those rate limits are hit, what should happen?
  - missing pending transactions might be okay, but not missing confirmed blocks
- [ ] for easier errors in the axum code, i think we need to have our own type that wraps anyhow::Result+Error
- [ ] fix ip detection when running in dev
- [ ] double check weight sorting code
- [ ] sea-orm brings in async-std, but we are using tokio. benchmark switching
- [ ] this query always times out, but erigon can serve it quickly: `curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"debug_traceBlockByNumber","params":["latest"],"id":1}' 127.0.0.1:8544`
  - `{"jsonrpc":"2.0","id":null,"error":{"code":-32099,"message":"deadline has elapsed"}}`
  - [ ] figure out rate limits for private rpcs. eden v1 gives 500 error instead of a code for rate limits
- [ ] https://gitlab.com/moka-labs/tiered-cache-example
- [ ] web3connection3.block(...) might wait forever. be sure to do it safely
- [ ] search for all "todo!"
- [ ] replace all `.context("no servers in sync")` with proper error type
- [ ] when using a bunch of slow public servers, i see "no servers in sync" even when things should be right
  - [ ] i think checking the parents of the heaviest chain works most of the time, but not always
  - maybe iterate connection heads by total weight? i still think we need to include parent hashes
- [ ] i see "No block found" sometimes for a single server's block. Not sure why since reads should happen after writes
- [ ] whats going on here? why is it rolling back? maybe total_difficulty was a LOT higher?
  - 2022-09-05T19:21:39.763630Z  WARN web3_proxy::rpcs::blockchain: chain rolled back 1/6/7 head=15479604 (0xf809…6a2c) rpc=infura_free
  - i wish i had more logs. its possible that 15479605 came immediately after
- [ ] ip blocking logs a warn. we don't need that. a stat at most
- [ ] keep it working without redis and a database
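
For the item above about replacing `.context("no servers in sync")` with a proper error type, a typed error could look roughly like the sketch below. Everything here is hypothetical: the enum name `ProxyError`, the `synced`/`total` fields, and the helper `require_synced` are illustration only and are not taken from the web3-proxy code.

```rust
use std::fmt;

/// A hypothetical typed error that callers can match on,
/// instead of comparing against a context string.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProxyError {
    /// None of the configured rpcs are at the consensus head.
    NoServersInSync { synced: usize, total: usize },
}

impl fmt::Display for ProxyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProxyError::NoServersInSync { synced, total } => {
                write!(f, "no servers in sync ({synced}/{total} rpcs)")
            }
        }
    }
}

// Implementing std's Error trait lets the type be used with `?`
// wherever a boxed or anyhow-style error is expected.
impl std::error::Error for ProxyError {}

/// Hypothetical helper: fail with the typed error when nothing is synced.
pub fn require_synced(synced: usize, total: usize) -> Result<usize, ProxyError> {
    if synced == 0 {
        Err(ProxyError::NoServersInSync { synced, total })
    } else {
        Ok(synced)
    }
}
```

With a variant like this, the frontend could map `NoServersInSync` to a specific status code, which also touches the "better error handling" item earlier in this list.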