web3-proxy/TODO.md

# Todo

## MVP

These are roughly in order of completion

- [x] simple proxy
- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] support websocket clients
  - we support websockets for the backends already, but we need them for the frontend too
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest, but it's in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
- [x] if no backends, return a 502 instead of delaying?
- [x] move from warp to axum
- [x] handle websocket disconnect and reconnect
- [x] eth_sendRawTransaction should return the most common result, not the first
- [x] use redis and redis-cell for rate limits
- [x] it works for a few seconds and then gets stuck on something.
  - [x] it's working with one backend node, but multiple breaks. something to do with pending transactions
  - [x] dashmap entry api is easy to deadlock! be careful with it!
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes
- [x] refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions
- [x] refactor Connections::spawn. have it return a handle that is selecting on those handles?
- [x] some production configs are occasionally stuck waiting at 100% cpu
  - they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
  - even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce. i just let it run for a while and it happens.
  - running gdb shows the tokio tungstenite thread is spinning near 100% cpu and none of the rest of the program is proceeding
  - fixed by https://github.com/gakonst/ethers-rs/pull/1287
- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
- [x] if web3 proxy gets an http error back, retry another node
- [x] endpoint for health checks. if no synced servers, give a 502 error
- [x] rpc errors propagate too far. one subscription failing ends the app. isolate the providers more (might already be fixed)
- [x] incoming rate limiting (by ip)
- [x] connection pool for redis
- [x] automatically route to archive server when necessary
  - originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
  - when getting the next server, filtering on "archive" isn't going to work well. need to check inner instead
- [x] if the requested block is ahead of the best block, return without querying any backend servers
- [x] http servers should check block at the very start
- [x] subscription id should be per connection, not global
- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
  - we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
- [x] inspect any jsonrpc errors. if it's something like "header not found" or "block with id $x not found", retry on another node (and add a negative score to that server)
  - this error seems to happen when we use load balanced backend rpcs like pokt and ankr
- [x] RESPONSE_CACHE_CAP in bytes instead of number of entries
- [x] if we don't cache errors, then in-flight request caching is going to bottleneck 
  - i think now that we retry header not found and similar, caching errors should be fine
- [x] RESPONSE_CACHE_CAP from config
- [x] web3_sha3 rpc command
- [x] test that launches anvil and connects the proxy to it and does some basic queries
  - [x] need to have some sort of shutdown signaling. doesn't need to be graceful at this point, but should be eventually
- [x] if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.
  - thundering herd problem if we only allow a lag of 0 blocks
  - we can improve this by only publishing the synced connections once a threshold of total available soft and hard limits is passed. how can we do this without hammering redis? at least it's only once per block per server
  - [x] instead of tracking `pending_synced_connections`, have a mapping of where all connections are individually. then each change, re-check for consensus.
- [x] synced connections swap threshold set to 1 so that it always serves something
- [x] cli tool for creating new users
- [x] incoming rate limiting by api key
- [x] sort forked blocks by total difficulty like geth does
- [x] refactor result type on active handlers to use a cleaner success/error so we can use the try operator
- [x] give users different rate limits looked up from the database 
- [x] Add a "weight" key to the servers. Sort on that after block. keep most requests local
- [x] cache db query results for user data. db is a big bottleneck right now
- [x] allow blocking public requests
- [x] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
- [x] i'm seeing ethspam occasionally try to query a future block. something must be setting the head block too early
  - [x] we were sorting best block the wrong direction. i flipped a.cmp(b) to b.cmp(a) so that the largest would be first, but then i used 'max_by' which looks at the end of the list
- [x] HTTP GET to the websocket endpoints should redirect instead of giving an ugly error
- [x] load the redirected page from config
- [x] prettier output for create_user command. need the key in hex
- [x] drop redis-cell in favor of a simpler (and faster) implementation. 
  - redis-cell was giving me weird errors and it isn't worth debugging it right now.
- [x] create user script should allow setting the api key
- [x] disable redis persistence in dev
- [x] attach a request id to every web request
- [x] attach user id (not IP!) to each request
- [x] fantom_1    | 2022-08-10T22:19:43.522465Z  WARN web3_proxy::jsonrpc: forwarding error err=missing field `jsonrpc` at line 1 column 60
  - [x] i think the server isn't following the spec. we need a context attached to more errors so we know which one
  - [x] make jsonrpc default to "2.0" (including the custom deserializer that handles the RawValues)
- [x] if the eth_call (or similar) params include a block, we can cache for that
- [x] when block subscribers receive blocks, store them in a block_map
- [x] eth_blockNumber without a backend request
- [x] if we send a transaction to private rpcs and then people query it on public rpcs, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
  - [x] send getTransaction rpc requests to the private rpc tier
- [x] I'm hitting infura rate limits very quickly. I feel like that means something is very inefficient
  - whenever blocks were slow, we started checking as fast as possible
- [x] create user script should allow setting requests per minute
- [x] cache api keys that are not in the database
- [x] improve consensus block selection. Our goal is to find the highest work chain with a block over a minimum threshold of sum_soft_limit.
  - [x] i saw a fork of like 300 blocks. probably just because a node was restarted and had fallen behind. need some checks to ignore things that are far behind. this improvement should fix this problem
  - [x] A new block arrives at a connection.
  - [x] It checks that it isn't the same block that it already has (which is a problem with polling nodes)
  - [x] If it's new to this node...
    - [x] if the block does not have total work, check our cache. otherwise, query the node
    - [x] save the block num and hash so that http polling doesn't send duplicates
    - [x] send the deduped block through a channel to be handled by the connections grouping.
  - [x] The connections group...
    - [x] input = rpc, new_block
    - [x] adds the block and rpc to its internal maps
      - [x] connection_heads: HashMap<rpc_name, blockhash>
      - [x] block_map: DashMap<blockhash, Arc<Block>>
      - [x] block_num: DashMap<U64, H256>
      - [x] blockchain: DiGraphMap<blockhash, ?>
    - [x] iterate the rpc_map to find the highest_work_block
    - [x] update synced connections
    - [x] send the block through new head_block_sender
  - [x] rewrite cannonical_block to work as long as there are no forks
  - [x] rewrite cannonical_block (again) and related functions to handle forks
    - [x] got a very large number of possible heads here. i think maybe a server was very far out of sync. we should drop servers behind by too much
    eth_1       | 2022-08-10T23:26:06.377129Z  WARN web3_proxy::connections: chain is forked! 261 possible heads. 1/2/5/5 rpcs have 0xd403…3c5d
    eth_1       | 2022-08-10T23:26:08.917603Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
    eth_1       | 2022-08-10T23:26:10.195014Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
    eth_1       | 2022-08-10T23:26:10.195658Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 2/3/5/5 rpcs have 0x0538…bfff
    - [x] todo!("handle equal") and also less and greater
    - [x] "chain is forked" message is wrong. it includes nodes just being on different heights of the same chain. need a smarter check
      - i think there is also a bug because i've seen "server not synced" a couple times
- [x] bug around eth_getBlockByHash sometimes causes tokio to lock up
  - i keep a mapping of blocks so that i can go from hash -> block. it does some consistent hashing to split them up across multiple maps, each with their own lock. so a lot of the time reads don't block writes because they are in different internal maps. this was fine. but after changing my fork detection logic to use the same rules as erigon, i discovered that when you get blocks from a websocket subscription in erigon and geth, there's a missing field (https://github.com/ledgerwatch/erigon/issues/5190). so i added a query to get the block that includes the missing field.
  - but i did this in a way where i was holding the write lock open while doing the query. the "new" block that has the missing field ends up in the same bucket and it also wants a write lock. oops. entry api has very sharp edges. don't ever await inside a match on DashMap::entry
- [x] requests for "Get transactions receipts" are routed to the private_rpcs and not the balanced_rpcs. do this better.
  - [x] quick fix, send to balanced_rpcs for now. we will just live with errors on new transactions.
  - this was intentional so that recently confirmed transactions go to a server that is more likely to have the tx.
  - but under heavy load, we hit their rate limits. need a "retry_until_success" function that goes to balanced_rpcs. or maybe store in redis the txids that we broadcast privately and use that to route.
- [x] some of the DashMaps grow unbounded! Make/find a "SizedDashMap" that cleans up old rows with some garbage collection task
  - moka is exactly what we need
- [x] if block data limit is 0, say Unknown in Debug output
- [x] basic request method stats (using the user_id and other fields that are in the tracing frame)
- [x] refactor from_anyhow_error to have consistent error codes and http codes. maybe implement the Error trait
- [x] improve rpc weights. i think there's still a potential thundering herd
- [x] improved logging with useful instrumentation
- [x] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
- [x] synced connections swap threshold should come from config
- [x] right now we send too many getTransaction queries to the private rpc tier and we are being rate limited by some of them. change to be serial and weight by hard/soft limit.
- [x] ip blocking gives a 500 and not the proper error code
- [x] need a reconnect that doesn't unwrap
- [x] need a retrying_reconnect that is used everywhere reconnect is. have exponential backoff here
- [x] it looks like our reconnect logic is not always firing. we need to make reconnect more robust!
  - i am pretty sure that this is actually servers that fail to connect on initial setup (maybe the rpcs that are on the wrong chain are just timing out and they aren't set to reconnect?)
- [x] chain rolled back 1/1/1 con_head=15510065 (0xa4a3…d2d8) rpc_head=15510065 (0xa4a3…d2d8) rpc=local_erigon_archive
  - include the old head number and block in the log
- [x] exponential backoff when reconnecting a connection
- [x] once the merge happens, we don't want to use total difficulty and instead just care about the number
- [x] rewrite rate limiting to have a tiered cache. do not put redis in the hot path
  - instead, we should check a local cache for the current rate limit (+1) and spawn an update to the local cache from redis in the background.
  - [x] when there are a LOT of concurrent requests, we see errors. i thought that was a problem with redis cell, but it happens with my simpler rate limit. now i think the problem is actually with bb8
  - https://docs.rs/redis/latest/redis/aio/struct.ConnectionManager.html or https://crates.io/crates/deadpool-redis?
  - WARN http_request: redis_rate_limit::errors: redis error err=Response was of incompatible type: "Response type not string compatible." (response was int(500237)) id=01GC6514JWN5PS1NCWJCGJTC94 method=POST
- [x] web3_proxy_error_count{path = "backend_rpc/request"} is inflated by a bunch of reverts. do not log reverts as warn. 
  - erigon gives `method=eth_call reqid=986147 t=1.151551ms err="execution reverted"`
- [x] database migration to change user_keys.requests_per_minute to bigunsigned (max of 18446744073709551615)
- [x] change user creation script to have a "unlimited requests per minute" flag that sets it to u64::MAX (18446744073709551615)
- [x] in /status, block hashes has a lower count than block numbers. how is that possible?
  - we weren't calling sync. now we are
- [x] opt-in debug mode that inspects responses for reverts and saves the request to the database for the user.
- [x] Api keys need option to lock to IP, cors header, referer, user agent, etc
- [x] /user/logout to clear bearer token and jwt
- [x] bearer tokens should expire
- [x] login endpoint needs its own rate limiter
  - we don't want an rpc request limit of 0 to block logins
  - for security, we want these limits low.
- [x] user login should return the bearer token and the user keys
- [x] use siwe messages and signatures for sign up and login
- [x] check for bearer token on /rpc
- [x] ip blocking logs a warn. we don't need that
- [x] Ulid instead of Uuid for user keys
  - <https://discord.com/channels/873880840487206962/900758376164757555/1012942974608474142>
  - since users are actively using our service, we will need to support both
- [x] get to /, when not serving a websocket, should have a simple welcome page. maybe with a button to update your wallet. 
- [x] instead of giving a rate limit error code, delay the connection's response at the start. reject if incoming requests is super high?
  - [x] did this by checking a key/ip-specific semaphore before checking rate limits
- [x] emit user stat on cache hit
- [x] emit user stat on cache miss
- [x] have migration use tokio instead of async-std
- [x] user create script should allow a description field
- [x] change stats to using the database
- [x] emit user stat on retry
- [x] improve `web3_proxy_cli check_config`
  - print out warnings if important settings are missing
- [x] if unknown config items, error
  - unknown configs are almost always a mistake. usually from me changing config parsing on my side and old fields not being updated to the new way
  - [x] also need to change how we disable rpcs since i was using an unknown field
- [x] [paginate responses](https://www.sea-ql.org/SeaORM/docs/basic-crud/select/#paginate-result)
- [x] graceful shutdown. stop taking new requests and don't stop until all outstanding queries are handled
  - https://github.com/tokio-rs/mini-redis/blob/master/src/shutdown.rs
  - we need this because we need to be sure all the queries are saved in the db. maybe put stuff in Drop
  - need an flume::watch on unflushed stats that we can subscribe to. wait for it to flip to true
- [x] don't use unix timestamps for response_millis since leap seconds will confuse it
- [x] config to allow origins even on the anonymous endpoints
- [x] send logs to sentry
- [x] login should return the user id
- [x] when we show keys, also show the key's id
- [x] add config for concurrent requests from public requests
- [x] new endpoints for users (not totally sure about the exact paths, but these features are all needed):
  - [x] sign in
    - [x] login should include the key id, not just the key ULID
  - [x] sign out
  - [x] GET profile endpoint
  - [x] POST profile endpoint
  - [x] GET stats endpoint
    - [x] display distribution of methods per api key (eth_call, eth_getLogs, etc.) (only with authentication!)
  - [x] get aggregate stats endpoint
    - [x] display requests per second per api key (only with authentication!)
  - [x] POST key endpoint
    - [x] generate a new key from a web endpoint
    - [x] modifying key settings such as private relay, revert logging, ip/origin/etc checks
  - [x] GET logged reverts on an endpoint that **requires authentication**.
- [x] endpoint to list keys without having to sign a message to log in again
- [x] rename user_key to rpc_key
  - [x] in code
  - [x] in database with a migration
- [x] instead of requests_per_minute on every key, have a "user_tier" that gets joined
- [x] document url params with examples
- [x] improve "docs/http routes.txt"
- [x] remove request per minute and concurrency limits from the keys. those are on the user tiers now.
- [x] revertLogs db table should have rpc_key_id on it
- [x] the relation in Relation is wrong now. it is called user_key_id, but points to the rpc key table
- [x] instruments are missing. maybe that is why sentry had broken traces
- [x] description should default to an empty string instead of being nullable
- [x] include if archive query or not in the stats
- [x] fix test not shutting down
- [x] proper authentication on rpc_key_id
  - we have bearer token auth for user_id, but rpc_key_id needs more code
- [x] use rpc_key_id instead of user_id in the redirect
- [x] /status should include the server weights
- [x] improve rate limiting anon ips
- [x] nullable rpc_key_id on revert log
- [x] attach origin to revert_log
    - opt-in origin logging
- [x] test that runs check_config against example.toml
- [x] improve sorting servers by weight. don't force to lower weights, still have a probability that smaller weights might be 
- [x] flamegraphs show 52% of the time to be in tracing. replace with simpler logging
- [x] add optional display name to rpc configs
- [x] add locking around running migrations
- [x] cli tool for checking config
- [x] web3_proxy_cli command should read database settings from config
- [x] cli command to change user_tier by key
- [x] cache the status page for a second
- [x] request accounting for websockets
- [x] database merge scripts
- [x] test that sets up a Web3Rpc and asks "has_block" for old and new blocks
- [x] test that sets up Web3Rpcs with 2 nodes. one behind by several blocks. and see what the "next" server shows as
- [x] ethspam on bsc and polygon gives 1/4 errors. fix whatever is causing this
  - bugfix! we were using the whole connection list instead of just the synced connection list when picking servers. oops!
- [x] actually block unauthenticated requests instead of emitting warning of "allowing without auth during development!"
- [x] smarter reconnection logic
- [x] if a websocket connection hasn't received a new block in a while, do a reconnect or just query the block. it's possible that the node was syncing when the proxy started
- [x] on web3-proxy start, if a node fails to connect, it can hold up listening on 8544
    - need to do all the connections in parallel with spawns
- [x] add block timestamp to the /status page
  - [x] be sure to save the timestamp in a way that our request routing logic can make use of it
- [x] node selection still needs improvements. we still send to syncing nodes if they are close
    - try consensus heads first! only if that is empty should we try others. and we should try them sorted by block height and then randomly chosen from there
- [x] logging of "bad response!" is way too verbose
- [x] i think our "best" server picking is incorrect somehow.
    - we upgraded erigon to a version with a broken websocket
    - that made it clear we still route to the lagged server sometimes. this is bad, but retries keep it from giving users bad data.
- [x] more trace logging
- [x] on ETH, we no longer need total difficulty
- [x] cli for creating and editing a user's first api key
- [x] benchmarks of the different Cache implementations (futures vs dash)
  - futures is better
- [x] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though it's a wasted connection
- [x] subscribing to transactions should be configurable per server. listening to paid servers can get expensive
- [x] status page leaks our urls which contain secrets. change that to use names
- [x] for easier errors in the axum code, i think we need to have our own type that wraps anyhow::Result+Error
- [x] hit counts seem wrong. how are we hitting the backend so much more than the frontend? retries on disconnect don't seem to fit that
  web3_proxy_hit_count{path = "app/proxy_web3_rpc_request"} 857270
  web3_proxy_hit_count{path = "backend_rpc/request"}       1396127
  - this was because backend server ordering was including servers that were still syncing from too long ago
- [x] keep it working without redis and a database
- [x] manually tune database and redis connection pool size
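
A note on the best-block sorting bug above (the comparator flipped to `b.cmp(a)` and then combined with `max_by`). This is only a minimal sketch of the failure and the fix. `Head`, `best_head_buggy`, and `best_head` are made-up names for illustration and are not the proxy's real types or functions.

```rust
/// Hypothetical stand-in for a block header. Only the fields needed here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Head {
    pub number: u64,
    pub hash: &'static str,
}

/// Buggy version: the comparator is reversed so that the largest block
/// would sort first, but `max_by` returns whatever the comparator calls
/// "greatest". With a reversed comparator that is the LOWEST block.
pub fn best_head_buggy(heads: &[Head]) -> Option<&Head> {
    heads.iter().max_by(|a, b| b.number.cmp(&a.number))
}

/// Fixed version: natural ordering plus `max_by` gives the highest block.
pub fn best_head(heads: &[Head]) -> Option<&Head> {
    heads.iter().max_by(|a, b| a.number.cmp(&b.number))
}
```

Either keep the natural ordering with `max_by`, or keep the reversed comparator and take the first element after sorting. Mixing the two is what picked the wrong head.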


## V1

These are not yet ordered. There might be duplicates. We might not actually need all of these.

- [x] cache user stats in redis and with headers
- [x] optional read-only database connection
- [x] put display name into our prod configs
- [x] sometimes when fetching a txid through the proxy it fails, but fetching from the backends works fine
  - check flashprofits logs for examples
  - we were caching too aggressively
- [x] BUG! if sending transactions gets "INTERNAL_ERROR: existing tx with same hash", create a success message
  - we just want to be sure that the server has our tx and in this case, it does.
  - ERROR http_request:request:try_send_all_upstream_servers: web3_proxy::rpcs::request: bad response! err=JsonRpcClientError(JsonRpcError(JsonRpcError { code: -32000, message: "INTERNAL_ERROR: existing tx with same hash", data: None })) method=eth_sendRawTransaction rpc=local_erigon_alpha_archive id=01GF4HV03Y4ZNKQV8DW5NDQ5CG method=POST authorized_request=User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 }) self=Web3Rpcs { conns: {"local_erigon_alpha_archive_ws": Web3Rpc { name: "local_erigon_alpha_archive_ws", blocks: "all", .. }, "local_geth_ws": Web3Rpc { name: "local_geth_ws", blocks: 64, .. }, "local_erigon_alpha_archive": Web3Rpc { name: "local_erigon_alpha_archive", blocks: "all", .. }}, .. } authorized_request=Some(User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 })) request=JsonRpcRequest { id: RawValue(39), method: "eth_sendRawTransaction", .. } request_metadata=Some(RequestMetadata { datetime: 2022-10-11T22:14:57.406829095Z, period_seconds: 60, request_bytes: 633, backend_requests: 0, no_servers: 0, error_response: false, response_bytes: 0, response_millis: 0 }) block_needed=None
- [x] serde collect unknown fields in config instead of crash
- [x] upgrade user tier by address
- [x] all_backend_connections skips syncing servers
- [x] change weight back to tier
- [x] fix multiple origin and referer checks
- [x] ip detection needs work so that everything doesn't show up as 172.x.x.x
  - i think this was done, but am not positive.
- [x] if private txs are disabled, only send transactions to some of our servers. we were DOSing ourselves with transactions and slowing down sync
- [x] retry if we get "the method X is not available"
- [x] remove weight. we don't use it anymore. tiers are what we use now
- [x] make deadlock feature optional
- [x] standalone healthcheck daemon (sentryd)
- [x] status page should show version
- [x] combine the proxy and cli into one bin
- [x] improve rate limiting on websockets
- [x] retry another server if we get a jsonrpc response error about rate limits
- [x] major refactor to only use backup servers when absolutely necessary
- [x] remove allowed lag
- [x] configurable gas buffer. default to the larger of 25k or 25% on polygon to work around erigon bug
- [x] public is 3900, but free is 360. free should be at least 3900 but probably more
- [x] add --max-wait to wait_for_sync
- [x] add automatic compare urls to wait_for_sync
- [x] send panics to pagerduty
- [x] enable lto on release builds
- [x] less logs for backup servers
- [x] use channels instead of arcswap
  - this will let us easily wait for a new head or a new synced connection
- [x] broadcast transactions to more servers
- [x] send sentryd errors to pagerduty
- [x] improve handling of unknown methods
- [x] don't send pagerduty alerts for websocket panics
- [x] improve waiting for sync when rate limited
- [x] improve pager duty errors for smarter deduping
- [x] add create_key cli command
- [x] short lived cache on /health
- [x] cache /status for longer
- [x] sort connections during eth_sendRawTransaction
- [x] block all admin_ rpc commands
- [x] remove the "metered" crate now that we save aggregate queries?
- [x] add archive depth to app config
- [x] use from_block and to_block so that eth_getLogs is routed correctly
- [x] improve eth_sendRawTransaction server selection
- [x] don't cache methods that are usually very large
- [x] use http provider when available
- [x] per-chain rpc rate limits
- [x] canonical block checks giving weird errors. change healthcheck to use block number
    [2023-02-21T02:58:06Z DEBUG web3_proxy::rpcs::request] error response from blastapi! method=eth_getCode params=(0xa9a8760b8333efae8c9c751e6695a11938ae4b90, 0x73a627f588338804e6dc880154728484f7e0373c29057408c6674d75bdc29d12) err=JsonRpcClientError(JsonRpcError(JsonRpcError { code: -32603, message: "hash 73a627f588338804e6dc880154728484f7e0373c29057408c6674d75bdc29d12 is not currently canonical", data: None }))
    [2023-02-21T02:58:06Z DEBUG web3_proxy::rpcs::one] blastapi failed health check query! Error {
            context: "ProviderError from the backend",
            source: JsonRpcClientError(
                JsonRpcError(
                    JsonRpcError {
                        code: -32603,
                        message: "hash 73a627f588338804e6dc880154728484f7e0373c29057408c6674d75bdc29d12 is not currently canonical",
                        data: None,
                    },
                ),
            ),
        }
- [x] add a "failover" tier that is only used if balanced_rpcs has "no servers synced"
  - use this tier (and private tier) to check timestamp on latest block. if we are behind that by more than a few seconds, something is wrong
- [x] cli flag to set prometheus port
- [x] eth_getLogs is going to unsynced nodes because it only checks start block and not the end block
- [x] have multiple providers on each backend rpc. one websocket for newHeads. and then http providers for handling requests
  - erigon only streams the JSON over HTTP. that code isn't enabled for websockets. so this should save memory on the erigon servers
  - i think this also means we don't need to worry about changing the id that the user gives us.
- [x] fix caching getLogs with blockhash
- [x] fix trying to send signed transactions to an empty list of private_rpcs
- [x] improve logging around consensus head.
  - it was "num in best synced tier"/num rpc connected/num rpc known.
  - it should be "num with best head in best synced tier/num with best head in any tier/num rpcs connected/num rpcs known"
- [x] add /debug/:rpckey endpoint that logs requests and responses to kafka
- [x] refactor so configs can change while running
  - this will probably be a rather large change, but is necessary when we have autoscaling
  - create the app without applying any config to it
  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
- [-] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
- [ ] don't use new_head_provider anywhere except new head subscription
- [x] don't use systemtime. use chrono
- [x] graceful shutdown
  - [x] frontend needs to shut down first. this will stop serving requests on /health and so new requests should quickly stop being routed to us
  - [x] when frontend has finished, tell all the other tasks to stop
  - [x] stats buffer needs to flush to both the database and influxdb
- [x] `rpc_accounting` script
- [x] period_datetime should always round to the start of the minute. this will ensure aggregations use as few rows as possible
- [x] weighted random choice should still prioritize non-archive servers
    - maybe shuffle randomly and then sort by (block_limit, random_index)?
    - maybe sum available_requests grouped by archive/non-archive. only limit to non-archive if they have enough?
- [x] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
- [x] add a "backup" tier that is only used if balanced_rpcs has "no servers synced"
  - use this tier to check timestamp on latest block. if we are behind that by more than a few seconds, something is wrong
- [x] `change_user_tier_by_address` script
- [x] emit stats for user's successes, retries, failures, with the types of requests, chain, rpc
- [x] add caching to speed up stat queries
- [x] config parsing is strict right now. this makes it hard to deploy on git push since configs need to change along with it
  - changed to only emit a warning if there is an unknown configuration key
- [x] make the "not synced" error more verbose
- [x] short lived cache on /health
- [x] cache /status for longer
- [x] sort connections during eth_sendRawTransaction
- [x] block all admin_ rpc commands
- [x] add archive depth to app config
- [x] improve "archive_needed" boolean. change to "block_depth"
- [x] keep score of new_head timings for all rpcs
- [x] having the whole block in /status is very verbose. trim it down
- [-] proxy mode for benchmarking all backends
- [-] proxy mode for sending to multiple backends
- [-] let users choose a % of reverts to log (or maybe x/second). someone like curve logging all reverts will be a BIG database very quickly
  - this must be opt-in and spawned in the background since it will slow things down and will make their calls less private
  - [ ] automatic pruning of old revert logs once too many are collected
  - [ ] we currently default to 0.0 and don't expose a way to edit it. we have a database row, but we don't use it
- [-] add configurable size limits to all the Caches
  - instead of configuring each cache with MB sizes, have one value for total memory footprint and then percentages for each cache
  - https://github.com/moka-rs/moka/issues/201
- [ ] all anyhow::Results need to be replaced with FrontendErrorResponse. 
    - [ ] rename FrontendErrorResponse to Web3ProxyError
    - [ ] almost all the anyhows should be Web3ProxyError::BadRequest
    - as is, these errors are seen as 500 errors and so haproxy keeps retrying them
- change premium concurrency limit to be against ip+rpckey
  - then sites like curve.fi don't have to worry about their user count
  - it does mean we will have a harder time capacity planning from the number of keys
- [ ] have the healthcheck get the block over http. if it errors, or doesn't match what the websocket says, something is wrong (likely a deadlock in the websocket code)
- [x] maybe we shouldn't route eth_getLogs to syncing nodes. serving queries slows down sync significantly
  - change the send_best function to only include servers that are at least close to fully synced
- [ ] enable mev protected transactions with either a /protect/ url (instead of /private/) or the database (when on /rpc/)
- [-] have private transactions be enabled by a url setting rather than a setting on the key
- [ ] eth_sendRawTransaction should only forward if the chain_id matches what we are running
- [ ] cli for adding rpc keys to an existing user
- [ ] rename "private" to "mev protected" to avoid confusion about private transactions being public once they are mined
- [ ] allow restricting an rpc key to specific chains
- [ ] writes to request_latency should be handled by a background task so they don't slow down the request
  - maybe we can use https://docs.rs/hdrhistogram/latest/hdrhistogram/sync/struct.SyncHistogram.html
- [ ] keep re-broadcasting transactions until they are confirmed
- [ ] if mev protection is disabled, we should send to *both* balanced_rpcs *and* private_rpcs
- [ ] if mev protection is enabled, we should send to *only* private_rpcs
- [ ] rate limiting/throttling on query_user_stats 
- [ ] web3rpc configs should have a max_concurrent_requests
    - will probably want a tool for calculating a safe value for this. too low and we could kill our performance
- [ ] rename "concurrent" requests to "parallel" requests
- [ ] minimum allowed query_start on query_user_stats
- [ ] if kafka fails to connect at the start, automatically reconnect
- [ ] during shutdown, mark the proxy unhealthy and send unsubscribe responses for any open websocket subscriptions
- [ ] setting request limits to None is broken. it becomes u64::MAX and then the internal deferred rate limiter count overflows when it computes `x*99/100`
- [ ] some chains still use total_difficulty. have total_difficulty be used only if the chain needs it
  - if total difficulty is not on the block and we aren't on ETH, fetch the full block instead of just the header
  - if total difficulty is set and non-zero, use it for consensus instead of just the number
- [ ] query_user_stats cache hit rate
- [ ] need debounce on reconnect. websockets are closing on us and then we reconnect twice. locks on ProviderState need more thought
- [ ] having the whole block in /status is very verbose. trim it down
- [ ] we have our hard rate limiter set up with a period of 60. but most providers have period of 1
- [ ] two servers running will confuse rpc_accounting!
  - it won't happen with users often because they should be sticky to one proxy, but unauthenticated users will definitely hit this
  - one option: we need the insert to be an upsert, but how do we merge histograms?
- [ ] don't use systemtime. use chrono
- [ ] soft limit needs more thought
    - it should be the min of total_sum_soft_limit (from only non-lagged servers) and min_sum_soft_limit
    - otherwise it won't track anything and will just give errors.
    - but if web3 proxy has just started, we should give some time otherwise we will thundering herd the first server that responds
- [ ] connection pool for websockets. use tokio-tungstenite directly. no need for ethers providers since serde_json is enough for us
    - this should also get us closer to being able to do our own streaming json parser where we can 
- [ ] figure out if "could not get block from params" is a problem worth logging
    - maybe it was an ots request?
- [ ] change redirect_rpc_key_url to match the newest url scheme
- [ ] implement filters
- [ ] implement remaining subscriptions
    - would be nice if our subscriptions had better guarantees than geth/erigon do, but maybe simpler to just set up a broadcast channel and proxy all the responses to a backend instead
- [ ] tests should use `test-env-log = "0.2.8"`
- [ ] weighted random choice should still prioritize non-archive servers
    - maybe shuffle randomly and then sort by (block_limit, random_index)?
    - maybe sum available_requests grouped by archive/non-archive. only limit to non-archive if they have enough?
- [ ] some places we call it "accounting" others a "stat". be consistent
- [ ] cli commands to search users by key
- [ ] flamegraphs show 25% of the time to be in moka-housekeeper. tune that
- [ ] config parsing is strict right now. this makes it hard to deploy on git push since configs need to change along with it
- [ ] refactor so configs can change while running
  - this will probably be a rather large change, but is necessary when we have autoscaling
  - create the app without applying any config to it
  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
- [ ] when displaying the user's data, they just see an opaque id for their tier. We should join that data so they see the tier name and limits
- [ ] add indexes to speed up stat queries
- [ ] the public rpc is rate limited by ip and the authenticated rpc is rate limited by key
    - this means if a dapp uses the authenticated RPC on their website, they could get rate limited more easily
- [ ] take an option to set a non-default role when creating a user
- [ ] different prune levels for free tiers
- [ ] have a test that runs ethspam and versus
- [ ] status page show git hash of running version
- [ ] Email confirmation
    - [ ] we'll need a pretty template email that the backend will send.
    - [ ] That will link them to a page on llamanodes.com
    - [ ] There, they click "confirm" (or JavaScript does it for them automatically) to POST to this new endpoint
- [ ] test in the migration repo that sets up a sqlite database that runs up and down 
- [ ] unbounded queues are risky. add limits
- [ ] after running for a while, https://eth-ski.llamanodes.com/status is only at 157 blocks and hashes. i thought they would be near 10k after running for a while
    - adding uptime to the status should help
    - i think this is already in our todo list
- [ ] write a test that uses the cli to create a user and modifies their key
- [ ] Uuid/Ulid instead of big_unsigned for database ids
  - might have to use Uuid in sea-orm and then convert to Ulid on display
  - https://www.kostolansky.sk/posts/how-to-migrate-to-uuid/
- [ ] emit standard deviation?
- [ ] emit global stat on retry
- [ ] emit global stat on no servers synced
- [ ] emit global stat on error (maybe just use sentry, but graphs are handy)
  - if we wait until the error handler to emit the stat, i don't think we have access to the authorized_request
- [ ] endpoint (and cli script) to rotate api key
- [ ] if no bearer token found in redis (likely because it expired), send 401 unauthorized
- [ ] user create script should allow multiple keys per user
- [ ] somehow the proxy thought latest was hours behind. need internal health check that forces reconnect if this happens
- [ ] display concurrent requests per api key (only with authentication!)
- [ ] change "remember me" to last until 4 weeks of no use, rather than 4 weeks since login? that will be a lot more database writes
- [ ] BUG? WARN http_request:request: web3_proxy::block_number: could not get block from params err=unexpected params length id=01GF4HTRKM4JV6NX52XSF9AYMW method=POST authorized_request=User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 })
  - why is it failing to get the block from params when its set to None? That should be the simple case
- [ ] BUG: i think if all backend servers stop, the server doesn't properly reconnect. It appears to stop listening on 8854, but not shut down.
- [ ] if user-specific caches have evictions that aren't from timeouts, log a warning
- [ ] make sure the email address is valid. probably have a "verified" column in the database
- [ ] if invalid user id given, we give a 500. should be a different error code instead
  - WARN http_request: web3_proxy::frontend::errors: anyhow err=UserKey was not a ULID or UUID id=01GER4VBTS0FDHEBR96D1JRDZF method=POST
- [ ] admin-only endpoint for seeing a user's stats for support requests
- [ ] from what i thought, /status should show hashes > numbers!
  - but block numbers count is maxed out (10k)
  - and block hashes count is tiny (83)
  - what is going on? when the server first launches they are in sync
  - [ ] related BUG? WARN web3_proxy::rpcs::blockchain: Missing connection_head_block in block_hashes. Fetching now connection_head_hash=0x4b7a…14b5 conn_name=local_erigon_alpha_archive rpc=local_erigon_alpha_archive
  - i see this a lot more than expected. why is it happening so much? better logs needed
- [ ] after adding semaphores (or maybe something else), CPU load seems a lot higher. investigate
- [ ] proper support for Finalized and Safe block queries
- [ ] admin-only page for viewing user stat pages
- [ ] geth sometimes gives an empty response instead of an error response. figure out a good way to catch this and not serve it
- [ ] GET balance endpoint
- [ ] POST balance endpoint
- [ ] EIP1271 for siwe
- [ ] Limited throughput during high traffic
- [ ] instead of Option<...> in our frontend function signatures, use result and then the try operator so that we get our errors wrapped in json
- [ ] revert logs should have a maximum age and a maximum count to keep the database from being huge
- [ ] user login should also return a jwt (jsonwebtoken rust crate should make it easy)
- [ ] script that looks at config and estimates max memory used by caches
- [ ] favicon
  - eth_1       | 2022-09-07T17:10:48.431536Z  WARN web3_proxy::jsonrpc: forwarding error err=nothing to see here
  - use the one on https://staging.llamanodes.com/
- [ ] warn if no servers have transaction subscriptions
    - [ ] if no servers have transaction subscriptions, and a user tries to subscribe, make sure the error is user friendly
- [ ] only allow transaction and full block subscriptions if the user is registered?
- [ ] eth_subscribe logs (https://geth.ethereum.org/docs/rpc/pubsub)
- [ ] make private transactions opt in (its already in the database, but not our code)
- [ ] write a function for receipts that tries balanced_rpcs and only if they all error should it try private relays
  - [ ] automatic retries with a timeout or until all the servers have been tried.
    - i had the websocket die on me in the middle of a long test. only one in-flight request failed because of it. the rest delayed. figure out how to catch these ones since websocket fails sadly seem common
- [ ] nice output when cargo doc is run
- [ ] cache more things locally or in redis
- [ ] stats when forks are resolved (and what chain they were on?)
- [ ] Only subscribe to transactions when someone is listening and if the server has opted in to it
- [ ] When sending eth_sendRawTransaction, retry errors
- [ ] If we need an archive server and no servers in sync, exit immediately with an error instead of waiting 60 seconds
- [ ] 120 second timeout is too short. Maybe do that for free tier and larger timeout for paid. Problem is that some queries can take over 1000 seconds
- [ ] when handling errors from axum parsing the Json...Enum in the function signature, the errors don't get wrapped in json. i think we need an axum::Layer
- [ ] don't "unwrap" anywhere. give proper errors
- [ ] handle log subscriptions
  - probably as a paid feature
- [ ] relevant erigon changelogs: add pendingTransactionWithBody subscription method (#5675)
- [ ] change_user_tier_by_key should not show the rpc key id. that way our ansible playbook won't expose it
- [ ] make sure all our responses follow the spec: https://www.jsonrpc.org/specification#examples
- [ ] min_sum_soft_limit should be automatic based on the app's average rps plus a buffer.
  - [ ] add a rate counter to the balanced_rpcs
  - [ ] every time a block is found, update min_sum_soft_limit
  - [ ] add a min_sum_soft_limit_safety
      - keeps the automatically calculated limit from going so high that we stop serving requests
  - [ ] add a min_sum_soft_limit_max_wait that advances the consensus block even if mins not met yet
- [ ] a script for load testing a server and calculating its hard and soft limits
- [ ] use https://github.com/dherman/esprit or similar to parse https://github.com/DefiLlama/chainlist/blob/main/constants/extraRpcs.js
- [ ] update example.toml
    - might need to make changes so the influxdb stuff is optional. david said it stopped right after starting
- [ ] i'm seeing a bunch of errors with eth_getLogs.
    - i think maybe my block number rewriting is causing problems. but maybe its just a user doing bad queries
- [ ] Use "is_fresh" instead of our atomic bool
    - moka 0.10 - Add entry and entry_by_ref APIs to sync and future caches (#193):
        They allow users to perform more complex operations on a cache entry. At this point, the following operations (methods) are provided:
            or_default
            or_insert
            or_insert_with
            or_insert_with_if
            or_optionally_insert_with
            or_try_insert_with
        The above methods return Entry type, which provides is_fresh method to check if the value was freshly computed or already existed in the cache.
- [ ] lag message always shows on first response
    - http interval on blastapi lagging by 1!
- [ ] change scoring for rpcs again. "p2c ewma"
  - [ ] weighted random sort: (soft_limit - ewma active requests * num web3_proxy servers)
    - 2. soft_limit
  - [ ] pick 2 servers from the random sort.
    - [ ] exponential weighted moving average for block subscriptions of time behind the first server (works well for ws but not http)
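
For the `x*99/100` overflow in the request-limit item above, one possible fix is to widen the multiplication before dividing. This is only a minimal sketch: `ninety_nine_percent` is a hypothetical helper name, not a function in the codebase, and it assumes the limiter keeps its limit as a `u64` with `None` mapped to `u64::MAX`.

```rust
/// Hypothetical helper: 99% of a limit, computed without overflowing.
/// Assumes "no limit" is represented as `u64::MAX`.
fn ninety_nine_percent(limit: u64) -> u64 {
    // `limit * 99` overflows a u64 once limit > u64::MAX / 99,
    // so do the multiplication in u128 instead.
    // The result is never larger than `limit`, so the cast back is lossless.
    ((limit as u128) * 99 / 100) as u64
}

/// The naive version, kept only to show where the overflow comes from.
/// `checked_mul` returns `None` instead of panicking or wrapping.
fn naive_ninety_nine_percent(limit: u64) -> Option<u64> {
    limit.checked_mul(99).map(|x| x / 100)
}
```

With these, `naive_ninety_nine_percent(u64::MAX)` is `None`, while `ninety_nine_percent(u64::MAX)` still returns a value just below `u64::MAX`. Keeping the limit as an `Option<u64>` and skipping the math for `None` would be another reasonable design.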

## V2

These are not ordered. I think some rows also accidentally got deleted here. Check git history.

- [ ] less Arc (and more pin?). we use arcs on a lot of things where i think a &self should work fine.
- [ ] automatically tune database and redis connection pool size
- [ ] if db is down, keep logins cached longer. at least only new logins will have trouble then
- [ ] handle user payments
  - [ ] separate daemon (or users themselves) call POST /users/process_transaction
    - checks a transaction to see if it modifies a user's balance. records results in a sql database
    - we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp
- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever (need config hot reloads first)
- [ ] jwt auth so people can easily switch from infura
- [ ] automated soft limit
  - look at average request time for getBlock? i'm not sure how good a proxy that will be for serving eth_call, but its a start
  - https://crates.io/crates/histogram-sampler
- [ ] interval for http subscriptions should be based on block time. load from config is easy, but better to query. currently hard coded to 13 seconds
- [ ] check code to keep us from going backwards. maybe that is causing outages
- [ ] min_backup_rpcs separate from min_synced_rpcs

in another repo: event subscriber
  - [ ] watch for transfer events to our contract and submit them to /payment/$tx_hash
  - [ ] cli tool that support can run to manually check and submit a transaction
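
For the V2 item about basing the http subscription interval on block time instead of the hard-coded 13 seconds, a rough sketch of the "better to query" option could look like this. `estimate_poll_interval` is a hypothetical name, and the input is assumed to be the unix timestamps (in seconds, oldest first) of a few recent blocks that were already fetched from a backend.

```rust
use std::time::Duration;

/// Hypothetical helper: estimate how often to poll an http backend for new heads.
/// `timestamps` are block timestamps in unix seconds, oldest first.
/// Falls back to `fallback` when there is not enough data to estimate.
fn estimate_poll_interval(timestamps: &[u64], fallback: Duration) -> Duration {
    if timestamps.len() < 2 {
        return fallback;
    }

    let first = timestamps[0];
    let last = timestamps[timestamps.len() - 1];

    // saturating_sub guards against out-of-order input.
    let span_secs = last.saturating_sub(first);
    let gaps = (timestamps.len() - 1) as u64;

    // Work in milliseconds so chains with sub-second average block times
    // do not round down to zero.
    let avg_millis = span_secs.saturating_mul(1000) / gaps;

    if avg_millis == 0 {
        fallback
    } else {
        Duration::from_millis(avg_millis)
    }
}
```

Four blocks spaced 12 seconds apart give a 12 second interval, and a single block gives the fallback. Averaging over a window is only one choice here; a median would be less sensitive to a single slow block.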

## "Maybe some day" and other Miscellaneous Things

- [ ] tool to revoke bearer tokens that clears redis
- [ ] eth_getBlockByNumber and similar calls served from the block map
  - will need all Block<TxHash> **and** Block<TransactionReceipt> in caches or fetched efficiently
  - so maybe we don't want this. we can just use the general request cache for these. they will only require 1 request and it means requests won't get in the way as much on writes as new blocks arrive.
  - after looking at my request logs, i think its worth doing this. no point hitting the backends with requests for blocks multiple times. will also help with cache hit rates since we can keep recent blocks in a separate cache
- [ ] Public bsc server got “0” for block data limit (ninicoin)
- [ ] cli tool for resetting api keys
- [ ] Advanced load testing scripts so we can find optimal cost servers 
  - [ ] benchmarks from https://github.com/llamafolio/llamafolio-api/
  - [ ] benchmarks from ethspam and versus
  - [ ] benchmarks from other things
  - [ ] quick script that calls all the curve-api endpoints once and checks for success, then calls wrk to hammer it
    - [ ] https://github.com/curvefi/curve-api
    - [ ] test /api/getGauges method
        - usually times out after vercel's 60 second timeout
        - one time got: Error invalid Json response ""
- [ ] page that prints a graphviz dotfile of the blockchain
- [ ] search for all the "TODO" and `todo!(...)` items in the code and move them here
- [ ] add the backend server to the header?
- [ ] have a low-latency option that always tries at least two servers in parallel and then returns the first success?
  - this doubles our request load though. maybe only if the first one doesn't respond very quickly? 
- [ ] zero downtime deploys
- [ ] are we using Acquire/Release/AcqRel properly? or do we need other modes?
- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
- [ ] subscribe to pending transactions and build an intelligent gas estimator
- [ ] flashbots specific methods
  - [ ] flashbots protect fast mode or not? probably fast matches most user's needs, but no reverts is nice.
  - [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if its set
- [ ] if no redis set, but public rate limits are set, exit with an error
- [ ] i saw "WebSocket connection closed unexpectedly" but no log about reconnecting
  - need better logs on this because afaict it did reconnect
- [ ] better document load tests: docker run --rm --name spam shazow/ethspam --rpc http://$LOCAL_IP:8544 | versus --concurrency=100 --stop-after=10000 http://$LOCAL_IP:8544; docker stop spam
- [ ] if the call is something simple like "symbol" or "decimals", cache that too. though i think this could bite us.
- [ ] add a subscription that returns the head block number and hash but nothing else
- [ ] if chain split detected, what should we do? don't send transactions?
- [ ] archive check works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers.
  - [x] configurable block data limit until better checks
- [ ] https://docs.rs/derive_builder/latest/derive_builder/
- [ ] Detect orphaned transactions
- [ ] https://crates.io/crates/reqwest-middleware easy retry with exponential back off
  - Though I think we want retries that go to other backends instead
- [ ] Some of the pub things should probably be "pub(crate)"
- [ ] Maybe storing pending txs on receipt in a dashmap is wrong. We want to store in a timer_heap (or similar) when we actually send. This way there's no lock contention until the race is over.
- [ ] Support "safe" block height. It's planned for eth2 but we can kind of do it now by just doing head block num-3
- [ ] Archive check on BSC gave “archive” when it isn’t. and FTM gave 90k for all servers even though they should be archive
- [ ] cache eth_getLogs in a database?
- [ ] stats for "read amplification". how many backend requests do we send compared to frontend requests we received?
- [ ] fully test retrying when "header not found"
  - i saw "header not found" on a simple eth_getCode query to a public load balanced bsc archive node on block 1
- [ ] weird flapping fork could have more useful logs. like, how'd we get to 1/1/4 and fork. geth changed its mind 3 times?
  - should we change our code to follow the same consensus rules as geth? our first seen still seems like a reasonable choice
  -  other chains might change all sorts of things about their fork choice rules
    2022-07-22T23:52:18.593956Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0xa906…5bc1 rpc=Web3Rpc { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
    2022-07-22T23:52:18.983441Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0x70e8…48e0 rpc=Web3Rpc { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
    2022-07-22T23:52:19.350720Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 1/2/4 rpcs have 0x70e8…48e0 rpc=Web3Rpc { url: "ws://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
    2022-07-22T23:52:26.041140Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 2/4/4 rpcs have 0x70e8…48e0 rpc=Web3Rpc { url: "http://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
  - [ ] threshold should check actual available request limits (if any) instead of just the soft limit
- [ ] foreign key on_update and on_delete
- [ ] database creation timestamps
- [ ] better error handling. we warn too often for validation errors and use the same error code for most every request
- [ ] use &str more instead of String. lifetime annotations get really annoying though
- [ ] tarpit instead of reject requests (unless theres a lot)
- [ ] archive servers should be lowest priority
- [ ] docker build context is really big. we must be including target or something
- [ ] fix ip detection when running in dev
- [ ] PR to add this to sea orm prelude:
  ```
  #[cfg(feature = "with-uuid")]
  pub use uuid::Builder as UuidBuilder;
  ```
- [ ] rate limit thoughts:
  - if someone subscribes to all pending transactions, how should that count against rate limits
  - when those rate limits are hit, what should happen?
  - missing pending transactions might be okay, but not missing confirmed blocks 
- [ ] sea-orm brings in async-std, but we are using tokio. benchmark switching 
- [ ] this query always times out, but erigon can serve it quickly: `curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"debug_traceBlockByNumber","params":["latest"],"id":1}' 127.0.0.1:8544`
  {"jsonrpc":"2.0","id":null,"error":{"code":-32099,"message":"deadline has elapsed"}}
  - [ ] figure out rate limits for private rpcs. eden v1 gives 500 error instead of a code for rate limits
- [ ] https://gitlab.com/moka-labs/tiered-cache-example
- [ ] web3connection3.block(...) might wait forever. be sure to do it safely
- [ ] search for all "todo!"
- [ ] when using a bunch of slow public servers, i see "no servers in sync" even when things should be right
  - maybe iterate connection heads by total weight? i still think we need to include parent hashes
- [ ] i see "No block found" sometimes for a single server's block. Not sure why since reads should happen after writes
- [ ] better handling for offline http servers
  - if we get a connection refused, we should remove the server's block info so it is taken out of rotation
- [ ] how should we handle reverting transactions? they won't confirm for a while after we send them
- [ ] allow configuration of the expiration time of bearer tokens. currently defaults to 4 weeks
- [ ] emit stat when an IP/key goes over rate limits
- [ ] readme command should run create_user commands via docker-compose
- [ ] helper for UUID <-> ULID
- [ ] Wrapping extractors in Result makes them optional and gives you the reason the extraction failed
- [ ] at concurrency 100, ethspam is getting 400 and 422 errors. figure out why. probably something with redis or mysql, but maybe its something else like spawning
- [ ] emit per-key stats for latency of semaphore awaits. if this starts to grow, people will know they are hitting limits and need a higher tier
- [ ] need a status page for your wallet's rpc. show head block information with age
- [ ] replace serde_json::Value with https://lib.rs/crates/ijson (more memory efficient)
- [ ] have a log all option? instead of just reverts, log all request/responses? can be very useful for debugging but would flood our database. maybe better for them to do that on their client side
- [ ] failsafe. if no blocks or transactions in some time, warn and reset the connection
- [ ] having tons of worker threads can actually make us slower if they keep waking to steal work from each other. need benchmarks
- [ ] change the wrk data to log requests and errors to a file
- [ ] if redis is not set and login page is visited, users get a 502. should be 501
- [ ] allow passing the authorization header to the anonymous rpc endpoint
- [ ] sentry profiling
- [ ] support alchemy_minedTransactions
- [ ] debug print of user::Model's address is a big vec of numbers. make that hex somehow
- [ ] make it so you can put a string like "LN arbitrum" into the create_user script, and have it automatically turn it into 0x4c4e20617262697472756d000000000000000000.
  - [ ] if --address not given, use the --description
  - [ ] if it is too long, (the last 4 bytes must be zero), give an error so descriptions like this stand out
- [ ] we need to use docker-compose's proper environment variable handling. because now if someone tries to start dev containers in their prod, remove orphans stops and removes them
- [ ] change invite codes to set the user_tier
- [ ] some cli commands should use the replica if possible
- [ ] some third party rpcs have limits on the size of eth_getLogs. include those limits in server config
- [ ] some internal requests should go through app.proxy_rpc_request so that they get caching!
    - be careful not to make an infinite loop
- [ ] request timeout messages should include the request id
- [ ] have an upgrade tier that queries multiple backends at once. returns on first Ok result, collects errors. if no Ok, find the most common error and then respond with that
- [ ] give public_recent_ips_salt a better, more general, name
- [ ] include tier in the head block logs?
- [ ] i think i use FuturesUnordered when a try_join_all might be better
- [ ] since we are read-heavy on our configs, maybe we should use a cache
  - "using a thread local storage and explicit types" https://docs.rs/arc-swap/latest/arc_swap/cache/struct.Cache.html
- [ ] tests for config reloading
- [ ] use pin instead of arc for a bunch of things?
  - https://fasterthanli.me/articles/pin-and-suffering
- [ ] calculate archive depth automatically based on block_data_limits
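
For the create_user item above that turns a string like "LN arbitrum" into an address, the conversion could be sketched as below. `description_to_address` is a hypothetical name, not an existing function in the scripts. It assumes the description bytes are left-aligned and zero-padded to 20 bytes, and that "the last 4 bytes must be zero" means at most 16 bytes of description.

```rust
/// Hypothetical helper: pack a short description into a 20-byte address string.
/// The description bytes are left-aligned and the rest is zero padding.
/// Returns an error if the description would spill into the last 4 bytes.
fn description_to_address(description: &str) -> Result<String, String> {
    const ADDRESS_LEN: usize = 20;
    const RESERVED_ZERO_BYTES: usize = 4;
    const MAX_DESCRIPTION_LEN: usize = ADDRESS_LEN - RESERVED_ZERO_BYTES;

    let bytes = description.as_bytes();

    if bytes.len() > MAX_DESCRIPTION_LEN {
        return Err(format!(
            "description is {} bytes, but at most {} fit",
            bytes.len(),
            MAX_DESCRIPTION_LEN
        ));
    }

    // Start from all zeros and copy the description into the front.
    let mut address = [0u8; ADDRESS_LEN];
    address[..bytes.len()].copy_from_slice(bytes);

    // Lowercase hex, two digits per byte.
    let hex: String = address.iter().map(|b| format!("{:02x}", b)).collect();

    Ok(format!("0x{}", hex))
}
```

Calling `description_to_address("LN arbitrum")` yields the `0x4c4e20617262697472756d000000000000000000` value from the item, and anything longer than 16 bytes is rejected so that generated addresses stand out.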
-												watch new heads

											
										
										
											2022-04-25 22:14:10 +03:00
+								# Todo
-												clean up todos

											
										
										
											2022-06-21 04:02:49 +03:00
+								## MVP
-												order most of the todos

											
										
										
											2022-09-12 17:31:57 +03:00
+								These are roughly in order of completition
-												clean up todos

											
										
										
											2022-06-21 04:02:49 +03:00
+								- [x] simple proxy
 								- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
 								- [x] load balance between multiple RPC servers
 								- [x] support more than just ETH
 								- [x] option to disable private rpc and send everything to primary
 								- [x] support websocket clients
 								  - we support websockets for the backends already, but we need them for the frontend too
 								- [x] health check nodes by block height
 								- [x] Dockerfile
 								- [x] docker-compose.yml
 								- [x] after connecting to a server, check that it gives the expected chainId
 								- [x] the ethermine rpc is usually fastest. but its in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
 								- [x] if not backends. return a 502 instead of delaying?
 								- [x] move from warp to axum
 								- [x] handle websocket disconnect and reconnect
 								- [x] eth_sendRawTransaction should return the most common result, not the first
 								- [x] use redis and redis-cell for rate limits
-												funnel survive rate limiting

											
										
										
											2022-06-17 01:23:41 +03:00
+								- [x] it works for a few seconds and then gets stuck on something.
 								  - [x] its working with one backend node, but multiple breaks. something to do with pending transactions
 								  - [x] dashmap entry api is easy to deadlock! be careful with it!
-												clean up todos

											
										
										
											2022-06-21 04:02:49 +03:00
+								- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes
-												it works, but we need it to be optional

											
										
										
											2022-06-15 01:02:18 +03:00
+								- [x] refactor Connection::spawn. have it return a handle to the spawned future of it running with block and transaction subscriptions
 								- [x] refactor Connections::spawn. have it return a handle that is selecting on those handles?
 								- [x] some production configs are occasionally stuck waiting at 100% cpu
 								  - they stop processing new blocks. i'm guessing 2 blocks arrive at the same time, but i thought our locks would handle that
 								  - even after removing a bunch of the locks, the deadlock still happens. i can't reliably reproduce. i just let it run for awhile and it happens.
 								  - running gdb shows the tokio tungstenite thread spinning near 100% cpu while none of the rest of the program is proceeding
 								  - fixed by https://github.com/gakonst/ethers-rs/pull/1287
 								- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
 								- [x] if web3 proxy gets an http error back, retry another node
 								- [x] endpoint for health checks. if no synced servers, give a 502 error
 								- [x] rpc errors propagate too far. one subscription failing ends the app. isolate the providers more (might already be fixed)
 								- [x] incoming rate limiting (by ip)
 								- [x] connection pool for redis
 								- [x] automatically route to archive server when necessary
 								  - originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
 								  - when getting the next server, filtering on "archive" isn't going to work well. need to check inner instead
 								- [x] if the requested block is ahead of the best block, return without querying any backend servers
 								- [x] http servers should check block at the very start
 								- [x] subscription id should be per connection, not global
 								- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
 								  - we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
 								- [x] inspect any jsonrpc errors. if its something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
 								  - this error seems to happen when we use load balanced backend rpcs like pokt and ankr
 								- [x] RESPONSE_CACHE_CAP in bytes instead of number of entries
 								- [x] if we don't cache errors, then in-flight request caching is going to bottleneck
 								  - i think now that we retry header not found and similar, caching errors should be fine
 								- [x] RESPONSE_CACHE_CAP from config
 								- [x] web3_sha3 rpc command
 								- [x] test that launches anvil and connects the proxy to it and does some basic queries
 								  - [x] need to have some sort of shutdown signaling. doesn't need to be graceful at this point, but should be eventually
 								- [x] if the fastest server has hit rate limits, we won't be able to serve any traffic until another server is synced.
 								  - thundering herd problem if we only allow a lag of 0 blocks
 								  - we can improve this by only publishing the synced connections once a threshold of total available soft and hard limits is passed. how can we do this without hammering redis? at least its only once per block per server
 								  - [x] instead of tracking `pending_synced_connections`, have a mapping of where all connections are individually. then each change, re-check for consensus.
 								- [x] synced connections swap threshold set to 1 so that it always serves something
 								- [x] cli tool for creating new users
 								- [x] incoming rate limiting by api key
 								- [x] sort forked blocks by total difficulty like geth does
 								- [x] refactor result type on active handlers to use a cleaner success/error so we can use the try operator
 								- [x] give users different rate limits looked up from the database
 								- [x] Add a "weight" key to the servers. Sort on that after block. keep most requests local
 								- [x] cache db query results for user data. db is a big bottleneck right now
 								- [x] allow blocking public requests
 								- [x] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
 								- [x] im seeing ethspam occasionally try to query a future block. something must be setting the head block too early
 								  - [x] we were sorting best block the wrong direction. i flipped a.cmp(b) to b.cmp(a) so that the largest would be first, but then i used 'max_by' which looks at the end of the list
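
A minimal sketch of the reversed-comparator bug described above, using plain block numbers instead of real block structs. The function names `pick_head_buggy` and `pick_head` are hypothetical and are not taken from the proxy's code.

```rust
/// Hypothetical reproduction of the bug: the comparator was flipped so the
/// largest value would sort first, but `max_by` returns the element that
/// compares greatest, so a flipped comparator makes it return the smallest.
fn pick_head_buggy(block_nums: &[u64]) -> Option<u64> {
    block_nums.iter().copied().max_by(|a, b| b.cmp(a))
}

/// With the natural ordering, `max_by` returns the highest block number.
fn pick_head(block_nums: &[u64]) -> Option<u64> {
    block_nums.iter().copied().max_by(|a, b| a.cmp(b))
}
```

With heads at 15, 17 and 16, `pick_head_buggy` yields 15 while `pick_head` yields 17, which matches the symptom of a head block that disagrees with what clients then query.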
 								- [x] HTTP GET to the websocket endpoints should redirect instead of giving an ugly error
 								- [x] load the redirected page from config
 								- [x] prettier output for create_user command. need the key in hex
 								- [x] drop redis-cell in favor of a simpler (and faster) implementation.
 								  - redis-cell was giving me weird errors and it isn't worth debugging it right now.
 								- [x] create user script should allow setting the api key
 								- [x] disable redis persistence in dev
 								- [x] attach a request id to every web request
 								- [x] attach user id (not IP!) to each request
 								- [x] fantom_1    | 2022-08-10T22:19:43.522465Z  WARN web3_proxy::jsonrpc: forwarding error err=missing field `jsonrpc` at line 1 column 60
 								  - [x] i think the server isn't following the spec. we need a context attached to more errors so we know which one
 								  - [x] make jsonrpc default to "2.0" (including the custom deserializer that handles the RawValues)
 								- [x] if the eth_call (or similar) params include a block, we can cache for that
 								- [x] when block subscribers receive blocks, store them in a block_map
 								- [x] eth_blockNumber without a backend request
 								- [x] if we send a transaction to private rpcs and then people query it on public rpcs, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
 								  - [x] send getTransaction rpc requests to the private rpc tier
 								- [x] I'm hitting infura rate limits very quickly. I feel like that means something is very inefficient
 								  - whenever blocks were slow, we started checking as fast as possible
 								- [x] create user script should allow setting requests per minute
 								- [x] cache api keys that are not in the database
 								- [x] improve consensus block selection. Our goal is to find the highest work chain with a block over a minimum threshold of sum_soft_limit.
 								  - [x] i saw a fork of like 300 blocks. probably just because a node was restarted and had fallen behind. need some checks to ignore things that are far behind. this improvement should fix this problem
 								  - [x] A new block arrives at a connection.
 								  - [x] It checks that it isn't the same that it already has (which is a problem with polling nodes)
 								  - [x] If its new to this node...
 								    - [x] if the block does not have total work, check our cache. otherwise, query the node
 								    - [x] save the block num and hash so that http polling doesn't send duplicates
 								    - [x] send the deduped block through a channel to be handled by the connections grouping.
 								  - [x] The connections group...
 								    - [x] input = rpc, new_block
 								    - [x] adds the block and rpc to its internal maps
 								      - [x] connection_heads: HashMap<rpc_name, blockhash>
 								      - [x] block_map: DashMap<blockhash, Arc<Block>>
 								      - [x] block_num: DashMap<U64, H256>
 								      - [x] blockchain: DiGraphMap<blockhash, ?>
 								    - [x] iterate the rpc_map to find the highest_work_block
 								    - [x] update synced connections
 								    - [x] send the block through new head_block_sender
 								  - [x] rewrite cannonical_block to work as long as there are no forks
 								  - [x] rewrite cannonical_block (again) and related functions to handle forks
 								    - [x] got a very large number of possible heads here. i think maybe a server was very far out of sync. we should drop servers behind by too much
 								    eth_1       | 2022-08-10T23:26:06.377129Z  WARN web3_proxy::connections: chain is forked! 261 possible heads. 1/2/5/5 rpcs have 0xd403…3c5d
 								    eth_1       | 2022-08-10T23:26:08.917603Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
 								    eth_1       | 2022-08-10T23:26:10.195014Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 1/2/5/5 rpcs have 0x0538…bfff
 								    eth_1       | 2022-08-10T23:26:10.195658Z  WARN web3_proxy::connections: chain is forked! 262 possible heads. 2/3/5/5 rpcs have 0x0538…bfff
 								    - [x] todo!("handle equal") and also less and greater
 								    - [x] "chain is forked" message is wrong. it includes nodes just being on different heights of the same chain. need a smarter check
 								      - i think there is also a bug because i've seen "server not synced" a couple times
 								- [x] bug around eth_getBlockByHash sometimes causes tokio to lock up
 								  - i keep a mapping of blocks so that i can go from hash -> block. it has some consistent hashing it does to split them up across multiple maps each with their own lock. so a lot of the time reads dont block writes because they are in different internal maps. this was fine. but after changing my fork detection logic to use the same rules as erigon, i discovered that when you get blocks from a websocket subscription in erigon and geth, theres a missing field (https://github.com/ledgerwatch/erigon/issues/5190). so i added a query to get the block that includes the missing field.
 								  - but i did this in a way where i was holding the write lock open while doing the query. the "new" block that has the missing field ends up in the same bucket and it also wants a write lock. oops. entry api has very sharp edges. don't ever await inside a match on DashMap::entry
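
A rough sketch of the lock-ordering lesson above, assuming a plain `std::sync::Mutex<HashMap<..>>` as a stand-in for one internal shard of the block map. This is not DashMap's API and not the proxy's code: `Shard`, `fetch_full_block`, `shard_is_busy`, and `save_block` are hypothetical names. The idea is only that the slow query runs with no guard held, so a second writer landing in the same shard is never stuck behind it.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for one shard of a sharded block map (hash -> block).
type Shard = Mutex<HashMap<u64, String>>;

/// Hypothetical stand-in for the slow backend query that returns the full block.
fn fetch_full_block(hash: u64) -> String {
    format!("block-{hash}-with-missing-field")
}

/// Returns true if the shard is locked right now. Anything that needs this
/// shard while a guard is held elsewhere would have to wait at this point.
fn shard_is_busy(shard: &Shard) -> bool {
    shard.try_lock().is_err()
}

/// Safe ordering: look up under a short-lived guard, release it, do the slow
/// query with no lock held, then lock again only to insert.
fn save_block(shard: &Shard, hash: u64) -> String {
    let cached = {
        let guard = shard.lock().unwrap();
        let found = guard.get(&hash).cloned();
        found
    }; // guard is dropped here, before any slow work

    if let Some(block) = cached {
        return block;
    }

    let block = fetch_full_block(hash);

    let mut guard = shard.lock().unwrap();
    let stored = guard.entry(hash).or_insert(block).clone();
    stored
}
```

In the async version of this pattern the same rule applies: finish with the entry (or guard) before the `.await`, and re-acquire it afterwards.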
 								- [x] requests for "Get transactions receipts" are routed to the private_rpcs and not the balanced_rpcs. do this better.
 								  - [x] quick fix, send to balanced_rpcs for now. we will just live with errors on new transactions.
 								  - this was intentional so that recently confirmed transactions go to a server that is more likely to have the tx.
 								  - but under heavy load, we hit their rate limits. need a "retry_until_success" function that goes to balanced_rpcs. or maybe store in redis the txids that we broadcast privately and use that to route.
 								- [x] some of the DashMaps grow unbounded! Make/find a "SizedDashMap" that cleans up old rows with some garbage collection task
 								  - moka is exactly what we need
 								- [x] if block data limit is 0, say Unknown in Debug output
 								- [x] basic request method stats (using the user_id and other fields that are in the tracing frame)
 								- [x] refactor from_anyhow_error to have consistent error codes and http codes. maybe implement the Error trait
 								- [x] improve rpc weights. i think theres still a potential thundering herd
 								- [x] improved logging with useful instrumentation
 								- [x] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
 								- [x] synced connections swap threshold should come from config
 								- [x] right now we send too many getTransaction queries to the private rpc tier and we are being rate limited by some of them. change to be serial and weight by hard/soft limit.
 								- [x] ip blocking gives a 500 and not the proper error code
 								- [x] need a reconnect that doesn't unwrap
 								- [x] need a retrying_reconnect that is used everywhere reconnect is. have exponential backoff here
 								- [x] it looks like our reconnect logic is not always firing. we need to make reconnect more robust!
 								  - i am pretty sure that this is actually servers that fail to connect on initial setup (maybe the rpcs that are on the wrong chain are just timing out and they aren't set to reconnect?)
 								- [x] chain rolled back 1/1/1 con_head=15510065 (0xa4a3…d2d8) rpc_head=15510065 (0xa4a3…d2d8) rpc=local_erigon_archive
 								  - include the old head number and block in the log
 								- [x] exponential backoff when reconnecting a connection
 								- [x] once the merge happens, we don't want to use total difficulty and instead just care about the number
 								- [x] rewrite rate limiting to have a tiered cache. do not put redis in the hot path
 								  - instead, we should check a local cache for the current rate limit (+1) and spawn an update to the local cache from redis in the background.
 								  - [x] when there are a LOT of concurrent requests, we see errors. i thought that was a problem with redis cell, but it happens with my simpler rate limit. now i think the problem is actually with bb8
 								  - https://docs.rs/redis/latest/redis/aio/struct.ConnectionManager.html or https://crates.io/crates/deadpool-redis?
 								  - WARN http_request: redis_rate_limit::errors: redis error err=Response was of incompatible type: "Response type not string compatible." (response was int(500237)) id=01GC6514JWN5PS1NCWJCGJTC94 method=POST
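
A small sketch of the tiered rate limit idea above, under heavy assumptions: the local tier is a plain in-memory map, and the shared tier (redis in the notes) is represented only by a count passed in from some background task. `LocalRateLimit`, `check`, and `sync_from_shared` are hypothetical names, and window expiry is left out.

```rust
use std::collections::HashMap;

/// Hypothetical local tier: per-key request counts for the current window.
struct LocalRateLimit {
    max_per_window: u64,
    counts: HashMap<String, u64>,
}

impl LocalRateLimit {
    fn new(max_per_window: u64) -> Self {
        Self {
            max_per_window,
            counts: HashMap::new(),
        }
    }

    /// Hot path: bump the local count by one and decide immediately,
    /// without waiting on the shared store.
    fn check(&mut self, key: &str) -> bool {
        let count = self.counts.entry(key.to_string()).or_insert(0);
        *count += 1;
        *count <= self.max_per_window
    }

    /// Background path: fold in the count reported by the shared store so
    /// requests served by other proxy instances are eventually seen here.
    fn sync_from_shared(&mut self, key: &str, shared_count: u64) {
        let count = self.counts.entry(key.to_string()).or_insert(0);
        *count = (*count).max(shared_count);
    }
}
```

The trade-off in a design like this is that a key can briefly exceed its limit across instances until the background sync catches up, in exchange for keeping the shared store off the request path.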
 								- [x] web3_proxy_error_count{path = "backend_rpc/request"} is inflated by a bunch of reverts. do not log reverts as warn.
 								  - erigon gives `method=eth_call reqid=986147 t=1.151551ms err="execution reverted"`
 								- [x] database migration to change user_keys.requests_per_minute to bigunsigned (max of 18446744073709551615)
 								- [x] change user creation script to have a "unlimited requests per minute" flag that sets it to u64::MAX (18446744073709551615)
 								- [x] in /status, block hashes has a lower count than block numbers. how is that possible?
 								  - we weren't calling sync. now we are
 								- [x] opt-in debug mode that inspects responses for reverts and saves the request to the database for the user.
 								- [x] Api keys need option to lock to IP, cors header, referer, user agent, etc
 								- [x] /user/logout to clear bearer token and jwt
 								- [x] bearer tokens should expire
 								- [x] login endpoint needs its own rate limiter
 								  - we don't want an rpc request limit of 0 to block logins
 								  - for security, we want these limits low.
 								- [x] user login should return the bearer token and the user keys
 								- [x] use siwe messages and signatures for sign up and login
 								- [x] check for bearer token on /rpc
 								- [x] ip blocking logs a warn. we don't need that
 								- [x] Ulid instead of Uuid for user keys
 								  - <https://discord.com/channels/873880840487206962/900758376164757555/1012942974608474142>
 								  - since users are actively using our service, we will need to support both
 								- [x] get to /, when not serving a websocket, should have a simple welcome page. maybe with a button to update your wallet.
 								- [x] instead of giving a rate limit error code, delay the connection's response at the start. reject if incoming requests is super high?
 								  - [x] did this by checking a key/ip-specific semaphore before checking rate limits
 								- [x] emit user stat on cache hit
 								- [x] emit user stat on cache miss
 								- [x] have migration use tokio instead of async-std
 								- [x] user create script should allow a description field
 								- [x] change stats to using the database
 								- [x] emit user stat on retry
 								- [x] improve `web3_proxy_cli check_config`
 								  - print out warnings if important settings are missing
 								- [x] if unknown config items, error
 								  - unknown configs are almost always a mistake. usually from me changing config parsing on my side and old fields not being updated to the new way
 								  - [x] also need to change how we disable rpcs since i was using an unknown field
 								- [x] [paginate responses](https://www.sea-ql.org/SeaORM/docs/basic-crud/select/#paginate-result)
 								- [x] graceful shutdown. stop taking new requests and don't stop until all outstanding queries are handled
 								  - https://github.com/tokio-rs/mini-redis/blob/master/src/shutdown.rs
 								  - we need this because we need to be sure all the queries are saved in the db. maybe put stuff in Drop
 								  - need a flume::watch on unflushed stats that we can subscribe to. wait for it to flip to true
 								- [x] don't use unix timestamps for response_millis since leap seconds will confuse it
 								- [x] config to allow origins even on the anonymous endpoints
 								- [x] send logs to sentry
 								- [x] login should return the user id
 								- [x] when we show keys, also show the key's id
 								- [x] add config for concurrent requests from public requests
 								- [x] new endpoints for users (not totally sure about the exact paths, but these features are all needed):
 								  - [x] sign in
 								    - [x] login should include the key id, not just the key ULID
 								  - [x] sign out
 								  - [x] GET profile endpoint
 								  - [x] POST profile endpoint
 								  - [x] GET stats endpoint
 								    - [x] display distribution of methods per api key (eth_call, eth_getLogs, etc.) (only with authentication!)
 								  - [x] get aggregate stats endpoint
 								    - [x] display requests per second per api key (only with authentication!)
 								  - [x] POST key endpoint
 								    - [x] generate a new key from a web endpoint
 								    - [x] modifying key settings such as private relay, revert logging, ip/origin/etc checks
 								  - [x] GET logged reverts on an endpoint that **requires authentication**.
 								- [x] endpoint to list keys without having to sign a message to log in again
 								- [x] rename user_key to rpc_key
 								  - [x] in code
 								  - [x] in database with a migration
 								- [x] instead of requests_per_minute on every key, have a "user_tier" that gets joined
 								- [x] document url params with examples
 								- [x] improve "docs/http routes.txt"
 								- [x] remove request per minute and concurrency limits from the keys. those are on the user tiers now.
 								- [x] revertLogs db table should have rpc_key_id on it
 								- [x] the relation in Relation is wrong now. it is called user_key_id, but points to the rpc key table
 								- [x] instruments are missing. maybe that is why sentry had broken traces
 								- [x] description should default to an empty string instead of being nullable
 								- [x] include if archive query or not in the stats
 								- [x] fix test not shutting down
 								- [x] proper authentication on rpc_key_id
 								  - we have bearer token auth for user_id, but rpc_key_id needs more code
 								- [x] use rpc_key_id instead of user_id in the redirect
 								- [x] /status should include the server weights
 								- [x] improve rate limiting anon ips
 								- [x] nullable rpc_key_id on revert log
 								- [x] attach origin to revert_log
 								    - opt-in origin logging
 								- [x] test that runs check_config against example.toml
 								- [x] improve sorting servers by weight. don't force to lower weights, still have a probability that smaller weights might be
 								- [x] flamegraphs show 52% of the time to be in tracing. replace with simpler logging
 								- [x] add optional display name to rpc configs
-												simple lock around database migrations

											
										
										
											2022-11-14 21:24:52 +03:00
+								- [x] add locking around running migrations
-												optional config in web3_proxy_cli

											
										
										
											2022-11-14 22:35:33 +03:00
+								- [x] cli tool for checking config
 								- [x] web3_proxy_cli command should read database settings from config
-												comments and todos

											
										
										
											2022-11-16 23:18:37 +03:00
+								- [x] cli command to change user_tier by key
 								- [x] cache the status page for a second
-												eth_subscribe rpc_accounting logging

											
										
										
											2022-11-20 01:05:51 +03:00
+								- [x] request accounting for websockets
-												comments/todos

											
										
										
											2022-11-22 08:42:02 +03:00
+								- [x] database merge scripts
-												cargo upgrade and shorten variable names

also begin adding a latency tracker for rpc stats

											
										
										
											2023-02-06 20:55:27 +03:00
+								- [x] test that sets up a Web3Rpc and asks "has_block" for old and new blocks
 								- [x] test that sets up Web3Rpcs with 2 nodes. one behind by several blocks. and see what the "next" server shows as
-												lower log level

											
										
										
											2022-11-24 14:04:10 +03:00
+								- [x] ethspam on bsc and polygon gives 1/4 errors. fix whatever is causing this
 								  - bugfix! we were using the whole connection list instead of just the synced connection list when picking servers. oops!
-												template needs two curly braces

											
										
										
											2022-11-30 00:29:34 +03:00
+								- [x] actually block unauthenticated requests instead of emitting warning of "allowing without auth during development!"
-												upgrade things except axum

											
										
										
											2022-12-06 01:18:47 +03:00
+								- [x] smarter reconnection logic
 								- [x] if a websocket connection hasn't received a new block in a while, do a reconnect or just query the block. its possible that the node was syncing when the proxy started
-												todos

											
										
										
											2022-12-06 01:19:34 +03:00
+								- [x] on web3-proxy start, if a node fails to connect, it can hold up listening on 8544
 								    - need to do all the connections in parallel with spawns
-												todos

											
										
										
											2022-12-06 01:40:32 +03:00
+								- [x] add block timestamp to the /status page
 								  - [x] be sure to save the timestamp in a way that our request routing logic can make use of it
-												another pass at server selection

											
										
										
											2022-12-08 09:54:38 +03:00
+								- [x] node selection still needs improvements. we still send to syncing nodes if they are close
 								    - try consensus heads first! only if that is empty should we try others. and we should try them sorted by block height and then randomly chosen from there
-												polish todos

											
										
										
											2022-12-19 21:57:11 +03:00
+								- [x] logging of "bad response!" is way too verbose
 								- [x] i think our "best" server picking is incorrect somehow.
 								    - we upgraded erigon to a version with a broken websocket
 								    - that made it clear we still route to the lagged server sometimes. this is bad, but retries keep it from giving users bad data.
 								- [x] more trace logging
 								- [x] on ETH, we no longer need total difficulty
 								- [x] cli for creating and editing a user's first api key
 								- [x] benchmarks of the different Cache implementations (futures vs dash)
 								  - futures is better
 								- [x] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though its a wasted connection
 								- [x] subscribing to transactions should be configurable per server. listening to paid servers can get expensive
 								- [x] status page leaks our urls which contain secrets. change that to use names
 								- [x] for easier errors in the axum code, i think we need to have our own type that wraps anyhow::Result+Error
 								- [x] hit counts seem wrong. how are we hitting the backend so much more than the frontend? retries on disconnect don't seem to fit that
 								  web3_proxy_hit_count{path = "app/proxy_web3_rpc_request"} 857270
 								  web3_proxy_hit_count{path = "backend_rpc/request"}       1396127
 								  - this was because backend server ordering was including servers that were still syncing from too long ago
 								- [x] keep it working without redis and a database
 								- [x] manually tune database and redis connection pool size
-												improve docs

											
										
										
											2022-12-16 09:21:19 +03:00
 								## V1
 								These are not yet ordered. There might be duplicates. We might not actually need all of these.
-												add support for optional db replica

also add cleanup of expired login data

											
										
										
											2022-12-16 11:48:24 +03:00
+								- [x] cache user stats in redis and with headers
 								- [x] optional read-only database connection
-												polish todos

											
										
										
											2022-12-19 21:57:11 +03:00
+								- [x] put display name into our prod configs
 								- [x] sometimes when fetching a txid through the proxy it fails, but fetching from the backends works fine
 								  - check flashprofits logs for examples
 								  - we were caching too aggressively
-												improve eth_sendRawTransaction

											
										
										
											2022-12-24 04:32:58 +03:00
+								- [x] BUG! if sending transactions gets "INTERNAL_ERROR: existing tx with same hash", create a success message
 								  - we just want to be sure that the server has our tx and in this case, it does.
-												cargo upgrade and shorten variable names

also begin adding a latency tracker for rpc stats

											
										
										
											2023-02-06 20:55:27 +03:00
+								  - ERROR http_request:request:try_send_all_upstream_servers: web3_proxy::rpcs::request: bad response! err=JsonRpcClientError(JsonRpcError(JsonRpcError { code: -32000, message: "INTERNAL_ERROR: existing tx with same hash", data: None })) method=eth_sendRawTransaction rpc=local_erigon_alpha_archive id=01GF4HV03Y4ZNKQV8DW5NDQ5CG method=POST authorized_request=User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 }) self=Web3Rpcs { conns: {"local_erigon_alpha_archive_ws": Web3Rpc { name: "local_erigon_alpha_archive_ws", blocks: "all", .. }, "local_geth_ws": Web3Rpc { name: "local_geth_ws", blocks: 64, .. }, "local_erigon_alpha_archive": Web3Rpc { name: "local_erigon_alpha_archive", blocks: "all", .. }}, .. } authorized_request=Some(User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 })) request=JsonRpcRequest { id: RawValue(39), method: "eth_sendRawTransaction", .. } request_metadata=Some(RequestMetadata { datetime: 2022-10-11T22:14:57.406829095Z, period_seconds: 60, request_bytes: 633, backend_requests: 0, no_servers: 0, error_response: false, response_bytes: 0, response_millis: 0 }) block_needed=None
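The "existing tx with same hash" handling above can be sketched as a small mapping over the backend result. This is only an illustration: `RpcError` and `normalize_send_raw_tx` are hypothetical names, not types from the codebase, and the sketch assumes the caller already knows the transaction hash to return.

```rust
/// Simplified stand-in for a JSON-RPC error object (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct RpcError {
    code: i64,
    message: String,
}

/// Treat "existing tx with same hash" as success, since the backend already
/// has the transaction. Every other result is passed through unchanged.
fn normalize_send_raw_tx(
    result: Result<String, RpcError>,
    tx_hash: &str,
) -> Result<String, RpcError> {
    match result {
        Err(e) if e.code == -32000 && e.message.contains("existing tx with same hash") => {
            // Assumption: the caller computed tx_hash from the raw transaction.
            Ok(tx_hash.to_string())
        }
        other => other,
    }
}
```

Matching on the message text is brittle, so a real implementation may want to check per-client error formats.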
-												lint

											
										
										
											2022-12-28 19:49:21 +03:00
+								- [x] serde collect unknown fields in config instead of crash
 								- [x] upgrade user tier by address
-												all_backend_connections skips syncing servers

											
										
										
											2023-01-02 21:34:16 +03:00
+								- [x] all_backend_connections skips syncing servers
-												change weight to tier

											
										
										
											2023-01-04 09:37:51 +03:00
+								- [x] change weight back to tier
-												transfer key script

											
										
										
											2023-01-10 04:50:09 +03:00
+								- [x] fix multiple origin and referer checks
-												broadcast txs to less servers

											
										
										
											2023-01-12 01:51:01 +03:00
+								- [x] ip detection needs work so that everything doesnt show up as 172.x.x.x
 								  - i think this was done, but am not positive.
 								- [x] if private txs are disabled, only send transactions to some of our servers. we were DOSing ourselves with transactions and slowing down sync
-												retry if we get the method X is not available

											
										
										
											2023-01-13 09:40:47 +03:00
+								- [x] retry if we get "the method X is not available"
-												remove weight now that we use tiers

											
										
										
											2023-01-14 00:45:48 +03:00
+								- [x] remove weight. we don't use it anymore. tiers are what we use now
-												one bin for everything

											
										
										
											2023-01-18 08:26:10 +03:00
+								- [x] make deadlock feature optional
 								- [x] standalone healthcheck daemon (sentryd)
 								- [x] status page should show version
 								- [x] combine the proxy and cli into one bin
-												improved rate limiting on websockets

											
										
										
											2023-01-19 03:17:43 +03:00
+								- [x] improve rate limiting on websockets
-												major refactor to only use backup servers when absolutely necessary

											
										
										
											2023-01-19 13:13:00 +03:00
+								- [x] retry another server if we get a jsonrpc response error about rate limits
 								- [x] major refactor to only use backup servers when absolutely necessary
-												remove allowed lag

											
										
										
											2023-01-19 14:05:39 +03:00
+								- [x] remove allowed lag
-												configurable gas buffer

											
										
										
											2023-01-20 05:08:53 +03:00
+								- [x] configurable gas buffer. default to the larger of 25k or 25% on polygon to work around erigon bug
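A minimal sketch of the "larger of 25k or 25%" rule, assuming gas estimates fit in a `u64` (the real code likely uses a wider integer type). `gas_with_buffer` and its parameters are hypothetical names chosen for this example.

```rust
/// Add a buffer to a gas estimate: the larger of a flat amount or a percentage
/// of the estimate. Saturates instead of overflowing.
fn gas_with_buffer(estimate: u64, flat_buffer: u64, percent_buffer: u64) -> u64 {
    // Widen to u128 so the intermediate product cannot overflow.
    let percent_amount =
        (estimate as u128 * percent_buffer as u128 / 100).min(u64::MAX as u128) as u64;
    estimate.saturating_add(flat_buffer.max(percent_amount))
}
```

With the defaults from the item, `gas_with_buffer(21_000, 25_000, 25)` uses the flat 25k buffer, while an estimate of 1,000,000 uses the 25% buffer.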
-												clean up todos

											
										
										
											2023-01-20 08:30:48 +03:00
+								- [x] public is 3900, but free is 360. free should be at least 3900 but probably more
-												improve wait_for_sync

											
										
										
											2023-01-21 02:43:16 +03:00
+								- [x] add --max-wait to wait_for_sync
 								- [x] add automatic compare urls to wait_for_sync
-												broadcast transactions to more servers

											
										
										
											2023-01-24 12:58:31 +03:00
+								- [x] send panics to pagerduty
 								- [x] enable lto on release builds
 								- [x] less logs for backup servers
 								- [x] use channels instead of arcswap
 								  - this will let us easily wait for a new head or a new synced connection
 								- [x] broadcast transactions to more servers
-												sentryd to pagerduty

											
										
										
											2023-01-24 14:12:23 +03:00
+								- [x] send sentryd errors to pagerduty
-												don't send pagerduty alerts for websocket panics

											
										
										
											2023-01-24 20:38:12 +03:00
+								- [x] improve handling of unknown methods
 								- [x] don't send pagerduty alerts for websocket panics
-												improve waiting for sync when rate limited

											
										
										
											2023-01-25 07:44:50 +03:00
+								- [x] improve waiting for sync when rate limited
-												upgrade sentry and fix pagerduty features so we do not need openssl

											
										
										
											2023-01-26 01:11:20 +03:00
+								- [x] improve pager duty errors for smarter deduping
-												add create_key cli command

											
										
										
											2023-01-26 04:58:10 +03:00
+								- [x] add create_key cli command
-												improve sort order during eth_sendRawTransaction

											
										
										
											2023-02-03 01:48:23 +03:00
+								- [x] short lived cache on /health
 								- [x] cache /status for longer
 								- [x] sort connections during eth_sendRawTransaction
-												block all admin_ commands

											
										
										
											2023-02-03 21:56:05 +03:00
+								- [x] block all admin_ rpc commands
-												remove metered in favor of influxdb stats

											
										
										
											2023-02-06 05:16:09 +03:00
+								- [x] remove the "metered" crate now that we save aggregate queries?
 								- [x] add archive depth to app config
-												include to_block more places

											
										
										
											2023-02-11 07:45:57 +03:00
+								- [x] use from_block and to_block so that eth_getLogs is routed correctly
 								- [x] improve eth_sendRawTransaction server selection
-												variable rename

											
										
										
											2023-02-12 21:22:20 +03:00
+								- [x] don't cache methods that are usually very large
 								- [x] use http provider when available
-												per-chain rpc rate limits

											
										
										
											2023-02-22 08:10:23 +03:00
+								- [x] per-chain rpc rate limits
-												lots of todos

											
										
										
											2023-02-26 02:07:05 +03:00
+								- [x] canonical block checks giving weird errors. change healthcheck to use block number
 								    [2023-02-21T02:58:06Z DEBUG web3_proxy::rpcs::request] error response from blastapi! method=eth_getCode params=(0xa9a8760b8333efae8c9c751e6695a11938ae4b90, 0x73a627f588338804e6dc880154728484f7e0373c29057408c6674d75bdc29d12) err=JsonRpcClientError(JsonRpcError(JsonRpcError { code: -32603, message: "hash 73a627f588338804e6dc880154728484f7e0373c29057408c6674d75bdc29d12 is not currently canonical", data: None }))
 								    [2023-02-21T02:58:06Z DEBUG web3_proxy::rpcs::one] blastapi failed health check query! Error {
 								            context: "ProviderError from the backend",
 								            source: JsonRpcClientError(
 								                JsonRpcError(
 								                    JsonRpcError {
 								                        code: -32603,
 								                        message: "hash 73a627f588338804e6dc880154728484f7e0373c29057408c6674d75bdc29d12 is not currently canonical",
 								                        data: None,
 								                    },
 								                ),
 								            ),
 								        }
-												cargo upgrade

											
										
										
											2023-02-25 11:47:16 +03:00
+								- [x] add a "failover" tier that is only used if balanced_rpcs has "no servers synced"
 								  - use this tier (and private tier) to check timestamp on latest block. if we are behind that by more than a few seconds, something is wrong
-												lots of todos

											
										
										
											2023-02-26 02:07:05 +03:00
+								- [x] cli flag to set prometheus port
 								- [x] eth_getLogs is going to unsynced nodes because it only checks start block and not the end block
 								- [x] have multiple providers on each backend rpc. one websocket for newHeads. and then http providers for handling requests
 								  - erigon only streams the JSON over HTTP. that code isn't enabled for websockets. so this should save memory on the erigon servers
 								  - i think this also means we don't need to worry about changing the id that the user gives us.
 								- [x] eth_getLogs is going to unsynced nodes because it only checks start block and not the end block
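The eth_getLogs routing fix amounts to checking both ends of the requested range against what a backend can serve. A rough sketch, where `ServerBlocks` and `can_serve_range` are hypothetical names and the oldest/head block numbers are assumed to be known for each server:

```rust
/// Hypothetical summary of which blocks a backend can serve.
struct ServerBlocks {
    /// Oldest block the server still has data for.
    oldest: u64,
    /// Latest block the server has synced.
    head: u64,
}

/// A server qualifies only if it has both the start and the end of the range.
fn can_serve_range(server: &ServerBlocks, from_block: u64, to_block: u64) -> bool {
    from_block <= to_block && server.oldest <= from_block && to_block <= server.head
}
```

Checking only `from_block` would accept a lagging server whose head is still below `to_block`, which is the bug the item describes.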
-												cache getLogs with blockhash

											
										
										
											2023-02-27 09:59:42 +03:00
+								- [x] fix caching getLogs with blockhash
-												todos

											
										
										
											2023-03-02 21:36:44 +03:00
+								- [x] fix trying to send signed transactions to an empty list of private_rpcs
 								- [x] improve logging around consensus head.
 								  - it was "num in best synced tier"/num rpc connected/num rpc known.
 								  - it should be "num with best head in best synced tier/num with best head in any tier/num rpcs connected/num rpcs known"
-												todos

											
										
										
											2023-03-03 22:14:32 +03:00
+								- [x] add /debug/:rpckey endpoint that logs requests and responses to kafka
 								- [x] refactor so configs can change while running
 								  - this will probably be a rather large change, but is necessary when we have autoscaling
 								  - create the app without applying any config to it
 								  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
 								  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
-												lots of todos

											
										
										
											2023-02-26 02:07:05 +03:00
+								- [-] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] don't use new_head_provider anywhere except new head subscription
 								- [x] remove the "metered" crate now that we save aggregate queries?
 								- [x] don't use systemtime. use chrono
 								- [x] graceful shutdown
 								  - [x] frontend needs to shut down first. this will stop serving requests on /health and so new requests should quickly stop being routed to us
 								  - [x] when frontend has finished, tell all the other tasks to stop
 								  - [x] stats buffer needs to flush to both the database and influxdb
 								- [x] `rpc_accounting` script
 								- [x] period_datetime should always round to the start of the minute. this will ensure aggregations use as few rows as possible
 								- [x] weighted random choice should still prioritize non-archive servers
 								    - maybe shuffle randomly and then sort by (block_limit, random_index)?
 								    - maybe sum available_requests grouped by archive/non-archive. only limit to non-archive if they have enough?
 								- [x] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
 								- [x] add a "backup" tier that is only used if balanced_rpcs has "no servers synced"
 								  - use this tier to check timestamp on latest block. if we are behind that by more than a few seconds, something is wrong
 								- [x] `change_user_tier_by_address` script
 								- [x] emit stats for user's successes, retries, failures, with the types of requests, chain, rpc
 								- [x] add caching to speed up stat queries
 								- [x] config parsing is strict right now. this makes it hard to deploy on git push since configs need to change along with it
 								  - changed to only emit a warning if there is an unknown configuration key
 								- [x] make the "not synced" error more verbose
 								- [x] short lived cache on /health
 								- [x] cache /status for longer
 								- [x] sort connections during eth_sendRawTransaction
 								- [x] block all admin_ rpc commands
 								- [x] remove the "metered" crate now that we save aggregate queries?
 								- [x] add archive depth to app config
 								- [x] improve "archive_needed" boolean. change to "block_depth"
 								- [x] keep score of new_head timings for all rpcs
 								- [x] having the whole block in /status is very verbose. trim it down
-												one bin for everything

											
										
										
											2023-01-18 08:26:10 +03:00
+								- [-] proxy mode for benchmarking all backends
 								- [-] proxy mode for sending to multiple backends
-												transfer key script

											
										
										
											2023-01-10 04:50:09 +03:00
+								- [-] let users choose a % of reverts to log (or maybe x/second). someone like curve logging all reverts will be a BIG database very quickly
 								  - this must be opt-in and spawned in the background since it will slow things down and will make their calls less private
-												polish todos

											
										
										
											2022-12-19 21:57:11 +03:00
+								  - [ ] automatic pruning of old revert logs once too many are collected
 								  - [ ] we currently default to 0.0 and don't expose a way to edit it. we have a database row, but we don't use it
 								- [-] add configurable size limits to all the Caches
 								  - instead of configuring each cache with MB sizes, have one value for total memory footprint and then percentages for each cache
 								  - https://github.com/moka-rs/moka/issues/201
-												todos

											
										
										
											2023-03-03 22:14:32 +03:00
+								- [ ] all anyhow::Results need to be replaced with FrontendErrorResponse.
 								    - [ ] rename FrontendErrorResponse to Web3ProxyError
 								    - [ ] almost all the anyhows should be Web3ProxyError::BadRequest
-												add steps for kafka library

											
										
										
											2023-03-04 02:25:44 +03:00
+								    - as is, these errors are seen as 500 errors and so haproxy keeps retrying them
-												todos

											
										
										
											2023-03-03 22:14:32 +03:00
+								- change premium concurrency limit to be against ip+rpckey
-												lots of todos

											
										
										
											2023-02-26 02:07:05 +03:00
+								  - then sites like curve.fi don't have to worry about their user count
 								  - it does mean we will have a harder time capacity planning from the number of keys
 								- [ ] have the healthcheck get the block over http. if it errors, or doesn't match what the websocket says, something is wrong (likely a deadlock in the websocket code)
-												cargo upgrade

											
										
										
											2023-02-25 11:47:16 +03:00
+								- [ ] don't use new_head_provider anywhere except new head subscription
-												todos

											
										
										
											2023-03-04 18:29:01 +03:00
+								- [x] maybe we shouldn't route eth_getLogs to syncing nodes. serving queries slows down sync significantly
-												include to_block more places

											
										
										
											2023-02-11 07:45:57 +03:00
+								  - change the send_best function to only include servers that are at least close to fully synced
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] enable mev protected transactions with either a /protect/ url (instead of /private/) or the database (when on /rpc/)
-												todos

											
										
										
											2023-03-04 18:29:01 +03:00
+								- [-] have private transactions be enabled by a url setting rather than a setting on the key
-												more direct consensus finding code

this hopefully has less bugs. speed isn't super important since this isn't on the host path.

											
										
										
											2023-03-21 21:16:18 +03:00
+								- [ ] eth_sendRawTransaction should only forward if the chain_id matches what we are running
-												polish todos

											
										
										
											2022-12-19 21:57:11 +03:00
+								- [ ] cli for adding rpc keys to an existing user
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] rename "private" to "mev protected" to avoid confusion about private transactions being public once they are mined
 								- [ ] allow restricting an rpc key to specific chains
 								- [ ] writes to request_latency should be handled by a background task so they don't slow down the request
 								  - maybe we can use https://docs.rs/hdrhistogram/latest/hdrhistogram/sync/struct.SyncHistogram.html
 								- [ ] keep re-broadcasting transactions until they are confirmed
 								- [ ] if mev protection is disabled, we should send to *both* balanced_rpcs *and* private_rpcs
 								- [ ] if mev protection is enabled, we should send to *only* private_rpcs
-												query_user_stats caching

											
										
										
											2022-12-16 09:32:58 +03:00
+								- [ ] rate limiting/throttling on query_user_stats
-												todos

											
										
										
											2023-03-02 21:36:44 +03:00
+								- [ ] web3rpc configs should have a max_concurrent_requests
 								    - will probably want a tool for calculating a safe value for this. too low and we could kill our performance
 								- [ ] rename "concurrent" requests to "parallel" requests
-												query_user_stats caching

											
										
										
											2022-12-16 09:32:58 +03:00
+								- [ ] minimum allowed query_start on query_user_stats
-												move more into the spawned task

											
										
										
											2022-12-20 21:54:13 +03:00
+								- [ ] setting request limits to None is broken. it does maxu64 and then internal deferred rate limiter counts try to *99/100
-												add optional kafka feature

											
										
										
											2023-03-03 04:39:50 +03:00
+								- [ ] if kafka fails to connect at the start, automatically reconnect
-												more logging

											
										
										
											2022-12-20 00:53:38 +03:00
+								- [ ] during shutdown, mark the proxy unhealthy and send unsubscribe responses for any open websocket subscriptions
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] setting request limits to None is broken. it does maxu64 and then the internal deferred rate limiter count overflows when it does `x*99/100`
 								- [ ] during shutdown, send unsubscribe responses for any open websocket subscriptions
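The `x*99/100` overflow noted above is easy to reproduce with plain integer arithmetic. A minimal sketch, assuming the limiter scales a `u64` count by 99/100 as the item says; `scale_naive` and `scale_safe` are hypothetical names, not functions from the codebase.

```rust
/// Scale a limit to 99% by multiplying first, as the item describes.
/// Returns None when the multiplication overflows u64.
fn scale_naive(x: u64) -> Option<u64> {
    x.checked_mul(99).map(|v| v / 100)
}

/// One overflow-free alternative: widen to u128 for the intermediate product.
/// The result is never larger than x, so the cast back to u64 is lossless.
fn scale_safe(x: u64) -> u64 {
    ((x as u128) * 99 / 100) as u64
}
```

`scale_naive(u64::MAX)` returns `None`, which is the case hit when a limit of None is mapped to the maximum `u64`. Representing "no limit" as an `Option` rather than a sentinel value would avoid the arithmetic entirely.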
-												polish todos

											
										
										
											2022-12-19 21:57:11 +03:00
+								- [ ] some chains still use total_difficulty. have total_difficulty be used only if the chain needs it
 								  - if total difficulty is not on the block and we aren't on ETH, fetch the full block instead of just the header
 								  - if total difficulty is set and non-zero, use it for consensus instead of just the number
-												query_user_stats caching

											
										
										
											2022-12-16 09:32:58 +03:00
+								- [ ] query_user_stats cache hit rate
-												better sorting of connections

											
										
										
											2023-02-11 07:24:20 +03:00
+								- [ ] need debounce on reconnect. websockets are closing on us and then we reconnect twice. locks on ProviderState need more thought
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] having the whole block in /status is very verbose. trim it down
-												improve responses when blocks are not available

											
										
										
											2023-01-25 09:45:20 +03:00
+								- [ ] we have our hard rate limiter set up with a period of 60. but most providers have period of 1
 								- [ ] two servers running will confuse rpc_accounting!
-												start adding user_export and user_import scripts

											
										
										
											2022-11-22 01:52:47 +03:00
+								  - it won't happen with users often because they should be sticky to one proxy, but unauthenticated users will definitely hit this
 								  - one option: we need the insert to be an upsert, but how do we merge histograms?
-												improve docs

											
										
										
											2022-12-16 09:21:19 +03:00
+								- [ ] don't use systemtime. use chrono
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] soft limit needs more thought
 								    - it should be the min of total_sum_soft_limit (from only non-lagged servers) and min_sum_soft_limit
 								    - otherwise it won't track anything and will just give errors.
 								    - but if web3 proxy has just started, we should give some time otherwise we will thundering herd the first server that responds
-												todos

											
										
										
											2022-12-06 01:19:34 +03:00
+								- [ ] connection pool for websockets. use tokio-tungstenite directly. no need for ethers providers since serde_json is enough for us
 								    - this should also get us closer to being able to do our own streaming json parser where we can
-												lower log level

											
										
										
											2022-11-24 14:04:10 +03:00
+								- [ ] figure out if "could not get block from params" is a problem worth logging
 								    - maybe it was an ots request?
-												stats v2

rebased all my commits and squashed them down to one

											
										
										
											2023-01-26 08:24:09 +03:00
+								- [ ] change redirect_rpc_key_url to match the newest url scheme
-												eth_subscribe rpc_accounting logging

											
										
										
											2022-11-20 01:05:51 +03:00
+								- [ ] implement filters
 								- [ ] implement remaining subscriptions
 								    - would be nice if our subscriptions had better guarantees than geth/erigon do, but maybe simpler to just set up a broadcast channel and proxy all the responses to a backend instead
 								- [ ] tests should use `test-env-log = "0.2.8"`
-												more direct consensus finding code

this hopefully has less bugs. speed isn't super important since this isn't on the host path.

											
										
										
											2023-03-21 21:16:18 +03:00
+								- [ ] eth_sendRawTransaction should only forward if the chain_id matches what we are running
-												eth_subscribe rpc_accounting logging

											
										
										
											2022-11-20 01:05:51 +03:00
+								- [ ] weighted random choice should still prioritize non-archive servers
 								    - maybe shuffle randomly and then sort by (block_limit, random_index)?
 								    - maybe sum available_requests grouped by archive/non-archive. only limit to non-archive if they have enough?
 								- [ ] some places we call it "accounting" others a "stat". be consistent
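For the weighted random choice item above, the "sort by (block_limit, random_index)" idea could look roughly like this. It is a sketch only: `Server` and `order_servers` are hypothetical, and the random index is supplied by the caller (for example from a shuffle) so the example stays free of external crates.

```rust
/// Hypothetical view of a backend server for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Server {
    name: &'static str,
    /// How many historical blocks the server keeps.
    /// u64::MAX stands in for an archive node here.
    block_limit: u64,
}

/// Order servers so that smaller block limits (non-archive) come first,
/// with ties broken by the caller-supplied random index.
fn order_servers(mut servers: Vec<(Server, u64)>) -> Vec<Server> {
    servers.sort_by_key(|(server, random_index)| (server.block_limit, *random_index));
    servers.into_iter().map(|(server, _)| server).collect()
}
```

This always tries non-archive servers first, so it does not yet cover the second idea in the item (only limiting to non-archive servers when they have enough available requests).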
-												comments and todos

											
										
										
											2022-11-16 23:18:37 +03:00
+								- [ ] cli commands to search users by key
-												cut out tracing for now

											
										
										
											2022-11-12 11:24:32 +03:00
+								- [ ] flamegraphs show 25% of the time to be in moka-housekeeper. tune that
-												config todo

											
										
										
											2022-11-04 07:57:16 +03:00
+								- [ ] config parsing is strict right now. this makes it hard to deploy on git push since configs need to change along with it
 								- [ ] refactor so configs can change while running
 								  - this will probably be a rather large change, but is necessary when we have autoscaling
 								  - create the app without applying any config to it
 								  - have a blocking future watching the config file and calling app.apply_config() on first load and on change
 								  - work started on this in the "config_reloads" branch. because of how we pass channels around during spawn, this requires a larger refactor.
 								- [ ] when displaying the user's data, they just see an opaque id for their tier. We should join that data so they see the tier name and limits
 								- [ ] add indexes to speed up stat queries
 								- [ ] the public rpc is rate limited by ip and the authenticated rpc is rate limited by key
 								    - this means if a dapp uses the authenticated RPC on their website, they could get rate limited more easily
 								- [ ] take an option to set a non-default role when creating a user
 								- [ ] different prune levels for free tiers
 								- [ ] have a test that runs ethspam and versus
 								- [ ] status page show git hash of running version
 								- [ ] Email confirmation
 								    - [ ] we'll need a pretty template email that the backend will send.
 								    - [ ] That will link them to a page on llamanodes.com
 								    - [ ] There, they click "confirm" (or JavaScript does it for them automatically) to POST to this new endpoint
 								- [ ] test in the migration repo that sets up a sqlite database that runs up and down
 								- [ ] unbounded queues are risky. add limits
 								- [ ] after running for a while, https://eth-ski.llamanodes.com/status is only at 157 blocks and hashes. i thought they would be near 10k after running for a while
 								    - adding uptime to the status should help
 								    - i think this is already in our todo list
 								- [ ] write a test that uses the cli to create a user and modifies their key
 								- [ ] Uuid/Ulid instead of big_unsigned for database ids
 								  - might have to use Uuid in sea-orm and then convert to Ulid on display
 								  - https://www.kostolansky.sk/posts/how-to-migrate-to-uuid/
 								- [ ] emit standard deviation?
 								- [ ] emit global stat on retry
 								- [ ] emit global stat on no servers synced
 								- [ ] emit global stat on error (maybe just use sentry, but graphs are handy)
 								  - if we wait until the error handler to emit the stat, i don't think we have access to the authorized_request
 								- [ ] endpoint (and cli script) to rotate api key
 								- [ ] if no bearer token found in redis (likely because it expired), send 401 unauthorized
 								- [ ] user create script should allow multiple keys per user
 								- [ ] somehow the proxy thought latest was hours behind. need internal health check that forces reconnect if this happens
 								- [ ] display concurrent requests per api key (only with authentication!)
 								- [ ] change "remember me" to last until 4 weeks of no use, rather than 4 weeks since login? that will be a lot more database writes
 								- [ ] BUG? WARN http_request:request: web3_proxy::block_number: could not get block from params err=unexpected params length id=01GF4HTRKM4JV6NX52XSF9AYMW method=POST authorized_request=User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 })
 								  - why is it failing to get the block from params when its set to None? That should be the simple case
 								- [ ] BUG: i think if all backend servers stop, the server doesn't properly reconnect. It appears to stop listening on 8854, but not shut down.
 								- [ ] if user-specific caches have evictions that aren't from timeouts, log a warning
 								- [ ] make sure the email address is valid. probably have a "verified" column in the database
 								- [ ] if invalid user id given, we give a 500. should be a different error code instead
 								  - WARN http_request: web3_proxy::frontend::errors: anyhow err=UserKey was not a ULID or UUID id=01GER4VBTS0FDHEBR96D1JRDZF method=POST
 								- [ ] admin-only endpoint for seeing a user's stats for support requests
 								- [ ] from what i thought, /status should show hashes > numbers!
 								  - but block numbers count is maxed out (10k)
 								  - and block hashes count is tiny (83)
 								  - what is going on? when the server first launches they are in sync
 								  - [ ] related BUG? WARN web3_proxy::rpcs::blockchain: Missing connection_head_block in block_hashes. Fetching now connection_head_hash=0x4b7a…14b5 conn_name=local_erigon_alpha_archive rpc=local_erigon_alpha_archive
 								  - i see this a lot more than expected. why is it happening so much? better logs needed
 								- [ ] after adding semaphores (or maybe something else), CPU load seems a lot higher. investigate
 								- [ ] proper support for Finalized and Safe block queries
 								- [ ] admin-only page for viewing user stat pages
 								- [ ] geth sometimes gives an empty response instead of an error response. figure out a good way to catch this and not serve it
 								- [ ] GET balance endpoint
 								- [ ] POST balance endpoint
 								- [ ] EIP1271 for siwe
 								- [ ] Limited throughput during high traffic
 								- [ ] instead of Option<...> in our frontend function signatures, use result and then the try operator so that we get our errors wrapped in json
 								- [ ] revert logs should have a maximum age and a maximum count to keep the database from being huge
 								- [ ] user login should also return a jwt (jsonwebtoken rust crate should make it easy)
 								- [ ] script that looks at config and estimates max memory used by caches
 								- [ ] favicon
 								  - eth_1       | 2022-09-07T17:10:48.431536Z  WARN web3_proxy::jsonrpc: forwarding error err=nothing to see here
 								  - use the one on https://staging.llamanodes.com/
 								- [ ] warn if no servers have transaction subscriptions
 								    - [ ] if no servers have transaction subscriptions, and a user tries to subscribe, make sure the error is user friendly
 								- [ ] only allow transaction and full block subscriptions if the user is registered?
 								- [ ] eth_subscribe logs (https://geth.ethereum.org/docs/rpc/pubsub)
 								- [ ] make private transactions opt in (its already in the database, but not our code)
 								- [ ] write a function for receipts that tries balanced_rpcs and only if they all error should it try private relays
 								  - [ ] automatic retries with a timeout or until all the servers have been tried.
 								    - i had the websocket die on me in the middle of a long test. only one in-flight request failed because of it. the rest delayed. figure out how to catch these ones since websocket fails sadly seem common
 								- [ ] nice output when cargo doc is run
 								- [ ] cache more things locally or in redis
 								- [ ] stats when forks are resolved (and what chain they were on?)
 								- [ ] Only subscribe to transactions when someone is listening and if the server has opted in to it
 								- [ ] When sending eth_sendRawTransaction, retry errors
 								- [ ] If we need an archive server and no servers in sync, exit immediately with an error instead of waiting 60 seconds
 								- [ ] 120 second timeout is too short. Maybe do that for free tier and larger timeout for paid. Problem is that some queries can take over 1000 seconds
 								- [ ] when handling errors from axum parsing the Json...Enum in the function signature, the errors don't get wrapped in json. i think we need an axum::Layer
 								- [ ] don't "unwrap" anywhere. give proper errors
 								- [ ] handle log subscriptions
 								  - probably as a paid feature
 								- [ ] relevant erigon changelogs: add pendingTransactionWithBody subscription method (#5675)
 								- [ ] change_user_tier_by_key should not show the rpc key id. that way our ansible playbook won't expose it
 								- [ ] make sure all our responses follow the spec: https://www.jsonrpc.org/specification#examples
 								- [ ] min_sum_soft_limit should be automatic based on the app's average rps plus a buffer.
 								  - [ ] add a rate counter to the balanced_rpcs
 								  - [ ] every time a block is found, update min_sum_soft_limit
 								  - [ ] add a min_sum_soft_limit_safety
 								      - keeps the automatically calculated limit from going so high that we stop serving requests
 								  - [ ] add a min_sum_soft_limit_max_wait that advances the consensus block even if mins not met yet
 								- [ ] a script for load testing a server and calculating its hard and soft limits
 								- [ ] use https://github.com/dherman/esprit or similar to parse https://github.com/DefiLlama/chainlist/blob/main/constants/extraRpcs.js
 								- [ ] update example.toml
 								    - might need to make changes so the influxdb stuff is optional. david said it stopped right after starting
 								- [ ] i'm seeing a bunch of errors with eth_getLogs.
 								    - i think maybe my block number rewriting is causing problems. but maybe its just a user doing bad queries
 								- [ ] Use "is_fresh" instead of our atomic bool
 								    - moka 0.10 - Add entry and entry_by_ref APIs to sync and future caches (#193):
 								        They allow users to perform more complex operations on a cache entry. At this point, the following operations (methods) are provided:
 								            or_default
 								            or_insert
 								            or_insert_with
 								            or_insert_with_if
 								            or_optionally_insert_with
 								            or_try_insert_with
 								        The above methods return Entry type, which provides is_fresh method to check if the value was freshly computed or already existed in the cache.
 								- [ ] lag message always shows on first response
 								    - http interval on blastapi lagging by 1!
 								- [ ] change scoring for rpcs again. "p2c ewma"
 								  - [ ] weighted random sort: (soft_limit - ewma active requests * num web3_proxy servers)
 								    - 2. soft_limit
 								  - [ ] pick 2 servers from the random sort.
 								    - [ ] exponential weighted moving average for block subscriptions of time behind the first server (works well for ws but not http)
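
A minimal sketch of the "p2c ewma" scoring idea from the last item, not the actual web3-proxy code. `ServerScore`, `observe`, `headroom`, and `pick_of_two` are hypothetical names, and the headroom formula is one assumed reading of `soft_limit - ewma active requests * num web3_proxy servers`. Choosing the two candidates (the weighted random sort) is left to the caller.

```rust
/// Hypothetical per-server state. Field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct ServerScore {
    pub name: &'static str,
    pub soft_limit: f64,
    pub ewma_active_requests: f64,
}

impl ServerScore {
    pub fn new(name: &'static str, soft_limit: f64) -> Self {
        Self { name, soft_limit, ewma_active_requests: 0.0 }
    }

    /// Fold one sample into the exponentially weighted moving average.
    /// `alpha` is in (0, 1]; larger values weight recent samples more heavily.
    pub fn observe(&mut self, active_requests: f64, alpha: f64) {
        self.ewma_active_requests =
            alpha * active_requests + (1.0 - alpha) * self.ewma_active_requests;
    }

    /// Assumed reading of the formula in the list above:
    /// soft_limit - (ewma active requests * number of web3_proxy servers).
    pub fn headroom(&self, num_proxies: f64) -> f64 {
        self.soft_limit - self.ewma_active_requests * num_proxies
    }
}

/// Power of two choices: of two candidates, keep the one with more headroom.
/// Ties go to the first candidate.
pub fn pick_of_two<'a>(
    a: &'a ServerScore,
    b: &'a ServerScore,
    num_proxies: f64,
) -> &'a ServerScore {
    if a.headroom(num_proxies) >= b.headroom(num_proxies) {
        a
    } else {
        b
    }
}
```

Comparing only two candidates keeps each pick cheap and avoids every proxy piling onto the single "best" server at once, which is the usual argument for p2c over a full sort.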
 								## V2
 								These are not ordered. I think some rows also accidentally got deleted here. Check git history.
 								- [ ] less Arc (and more pin?). we use arcs on a lot of things where i think a &self should work fine.
 								- [ ] automatically tune database and redis connection pool size
 								- [ ] if db is down, keep logins cached longer. at least only new logins will have trouble then
 								- [ ] handle user payments
 								  - [ ] separate daemon (or users themselves) call POST /users/process_transaction
 								    - checks a transaction to see if it modifies a user's balance. records results in a sql database
 								    - we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp
 								- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever (need config hot reloads first)
 								- [ ] jwt auth so people can easily switch from infura
 								- [ ] automated soft limit
 								  - look at average request time for getBlock? i'm not sure how good a proxy that will be for serving eth_call, but its a start
 								  - https://crates.io/crates/histogram-sampler
 								- [ ] interval for http subscriptions should be based on block time. load from config is easy, but better to query. currently hard coded to 13 seconds
 								- [ ] check code to keep us from going backwards. maybe that is causing outages
 								- [ ] min_backup_rpcs separate from min_synced_rpcs
 								in another repo: event subscriber
 								  - [ ] watch for transfer events to our contract and submit them to /payment/$tx_hash
 								  - [ ] cli tool that support can run to manually check and submit a transaction
 								## "Maybe some day" and other Miscellaneous Things
 								- [ ] tool to revoke bearer tokens that clears redis
 								- [ ] eth_getBlockByNumber and similar calls served from the block map
 								  - will need all Block<TxHash> **and** Block<TransactionReceipt> in caches or fetched efficiently
 								  - so maybe we don't want this. we can just use the general request cache for these. they will only require 1 request and it means requests won't get in the way as much on writes as new blocks arrive.
 								  - after looking at my request logs, i think its worth doing this. no point hitting the backends with requests for blocks multiple times. will also help with cache hit rates since we can keep recent blocks in a separate cache
 								- [ ] Public bsc server got “0” for block data limit (ninicoin)
 								- [ ] cli tool for resetting api keys
 								- [ ] Advanced load testing scripts so we can find optimal cost servers
 								  - [ ] benchmarks from https://github.com/llamafolio/llamafolio-api/
 								  - [ ] benchmarks from ethspam and versus
 								  - [ ] benchmarks from other things
 								  - [ ] quick script that calls all the curve-api endpoints once and checks for success, then calls wrk to hammer it
 								    - [ ] https://github.com/curvefi/curve-api
 								    - [ ] test /api/getGaugesmethod
 								        - usually times out after vercel's 60 second timeout
 								        - one time got: Error invalid Json response ""
 								- [ ] page that prints a graphviz dotfile of the blockchain
 								- [ ] search for all the "TODO" and `todo!(...)` items in the code and move them here
 								- [ ] add the backend server to the header?
 								- [ ] have a low-latency option that always tries at least two servers in parallel and then returns the first success?
 								  - this doubles our request load though. maybe only if the first one doesn't respond very quickly?
 								- [ ] zero downtime deploys
 								- [ ] are we using Acquire/Release/AcqRel properly? or do we need other modes?
 								- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
 								- [ ] subscribe to pending transactions and build an intelligent gas estimator
 								- [ ] flashbots specific methods
 								  - [ ] flashbots protect fast mode or not? probably fast matches most user's needs, but no reverts is nice.
 								  - [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if its set
 								- [ ] if no redis set, but public rate limits are set, exit with an error
 								- [ ] i saw "WebSocket connection closed unexpectedly" but no log about reconnecting
 								  - need better logs on this because afaict it did reconnect
 								- [ ] better document load tests: docker run --rm --name spam shazow/ethspam --rpc http://$LOCAL_IP:8544 | versus --concurrency=100 --stop-after=10000 http://$LOCAL_IP:8544; docker stop spam
 								- [ ] if the call is something simple like "symbol" or "decimals", cache that too. though i think this could bite us.
 								- [ ] add a subscription that returns the head block number and hash but nothing else
 								- [ ] if chain split detected, what should we do? don't send transactions?
 								- [ ] archive check works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers.
 								  - [x] configurable block data limit until better checks
 								- [ ] https://docs.rs/derive_builder/latest/derive_builder/
 								- [ ] Detect orphaned transactions
 								- [ ] https://crates.io/crates/reqwest-middleware easy retry with exponential back off
 								  - Though I think we want retries that go to other backends instead
 								- [ ] Some of the pub things should probably be "pub(crate)"
 								- [ ] Maybe storing pending txs on receipt in a dashmap is wrong. We want to store in a timer_heap (or similar) when we actually send. This way there's no lock contention until the race is over.
 								- [ ] Support "safe" block height. It's planned for eth2 but we can kind of do it now but just doing head block num-3
 								- [ ] Archive check on BSC gave “archive” when it isn’t. and FTM gave 90k for all servers even though they should be archive
 								- [ ] cache eth_getLogs in a database?
 								- [ ] stats for "read amplification". how many backend requests do we send compared to frontend requests we received?
 								- [ ] fully test retrying when "header not found"
 								  - i saw "header not found" on a simple eth_getCode query to a public load balanced bsc archive node on block 1
 								- [ ] weird flapping fork could have more useful logs. like, how'd we get to 1/1/4 and fork. geth changed its mind 3 times?
 								  - should we change our code to follow the same consensus rules as geth? our first seen still seems like a reasonable choice
 								  -  other chains might change all sorts of things about their fork choice rules
 -07-22T23:52:18.593956Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0xa906…5bc1 rpc=Web3Rpc { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
 -07-22T23:52:18.983441Z  WARN block_receiver: web3_proxy::connections: chain is forked! 1 possible heads. 1/1/4 rpcs have 0x70e8…48e0 rpc=Web3Rpc { url: "ws://127.0.0.1:8546", data: 64, .. } new_block_num=15195517
 -07-22T23:52:19.350720Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 1/2/4 rpcs have 0x70e8…48e0 rpc=Web3Rpc { url: "ws://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
 -07-22T23:52:26.041140Z  WARN block_receiver: web3_proxy::connections: chain is forked! 2 possible heads. 2/4/4 rpcs have 0x70e8…48e0 rpc=Web3Rpc { url: "http://127.0.0.1:8549", data: "archive", .. } new_block_num=15195517
 								  - [ ] threshold should check actual available request limits (if any) instead of just the soft limit
 								- [ ] foreign key on_update and on_delete
 								- [ ] database creation timestamps
 								- [ ] better error handling. we warn too often for validation errors and use the same error code for most every request
 								- [ ] use &str more instead of String. lifetime annotations get really annoying though
 								- [ ] tarpit instead of reject requests (unless theres a lot)
 								- [ ] archive servers should be lowest priority
 								- [ ] docker build context is really big. we must be including target or something
 								- [ ] fix ip detection when running in dev
 								- [ ] PR to add this to sea orm prelude:
 								  ```
 								  #[cfg(feature = "with-uuid")]
 								  pub use uuid::Builder as UuidBuilder;
 								  ```
 								- [ ] rate limit thoughts:
 								  - if someone subscribes to all pending transactions, how should that count against rate limits
 								  - when those rate limits are hit, what should happen?
 								  - missing pending transactions might be okay, but not missing confirmed blocks
 								- [ ] sea-orm brings in async-std, but we are using tokio. benchmark switching
 								- [ ] this query always times out, but erigon can serve it quickly: `curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"debug_traceBlockByNumber","params":["latest"],"id":1}' 127.0.0.1:8544`
 								  {"jsonrpc":"2.0","id":null,"error":{"code":-32099,"message":"deadline has elapsed"}}
 								  - [ ] figure out rate limits for private rpcs. eden v1 gives 500 error instead of a code for rate limits
 								- [ ] https://gitlab.com/moka-labs/tiered-cache-example
 								- [ ] web3connection3.block(...) might wait forever. be sure to do it safely
 								- [ ] search for all "todo!"
 								- [ ] when using a bunch of slow public servers, i see "no servers in sync" even when things should be right
 								  - maybe iterate connection heads by total weight? i still think we need to include parent hashes
 								- [ ] i see "No block found" sometimes for a single server's block. Not sure why since reads should happen after writes
 								- [ ] better handling for offline http servers
 								  - if we get a connection refused, we should remove the server's block info so it is taken out of rotation
 								- [ ] how should we handle reverting transactions? they won't confirm for a while after we send them
 								- [ ] allow configuration of the expiration time of bearer tokens. currently defaults to 4 weeks
 								- [ ] emit stat when an IP/key goes over rate limits
 								- [ ] readme command should run create_user commands via docker-compose
 								- [ ] helper for UUID <-> ULID
 								- [ ] Wrapping extractors in Result makes them optional and gives you the reason the extraction failed
 								- [ ] at concurrency 100, ethspam is getting 400 and 422 errors. figure out why. probably something with redis or mysql, but maybe its something else like spawning
 								- [ ] emit per-key stats for latency of semaphore awaits. if this starts to grow, people will know they are hitting limits and need a higher tier
- [ ] need a status page for your wallet's rpc. show head block information with age
- [ ] replace serde_json::Value with https://lib.rs/crates/ijson (more memory efficient)
- [ ] have a log all option? instead of just reverts, log all request/responses? can be very useful for debugging but would flood our database. maybe better for them to do that on their client side
- [ ] failsafe. if no blocks or transactions in some time, warn and reset the connection
- [ ] WARN http_request:request: web3_proxy::block_number: could not get block from params err=unexpected params length id=01GF4HTRKM4JV6NX52XSF9AYMW method=POST authorized_request=User(Some(SqlxMySqlPoolConnection), AuthorizedKey { ip: 10.11.12.15, origin: None, user_key_id: 4, log_revert_chance: 0.0000 })
- [ ] having tons of worker threads can actually make us slower if they keep waking to steal work from each other. need benchmarks
- [ ] change the wrk data to log requests and errors to a file
- [ ] if redis is not set and login page is visited, users get a 502. should be 501
- [ ] allow passing the authorization header to the anonymous rpc endpoint
- [ ] sentry profiling
- [ ] support alchemy_minedTransactions
- [ ] debug print of user::Model's address is a big vec of numbers. make that hex somehow
- [ ] make it so you can put a string like "LN arbitrum" into the create_user script, and have it automatically turn it into 0x4c4e20617262697472756d000000000000000000.
  - [ ] if --address not given, use the --description
  - [ ] if it is too long (the last 4 bytes must be zero), give an error so descriptions like this stand out
- [ ] we need to use docker-compose's proper environment variable handling. because now if someone tries to start dev containers in their prod, remove orphans stops and removes them
- [ ] change invite codes to set the user_tier
- [ ] some cli commands should use the replica if possible
- [ ] some third party rpcs have limits on the size of eth_getLogs. include those limits in server config
- [ ] some internal requests should go through app.proxy_rpc_request so that they get caching!
  - be careful not to make an infinite loop
- [ ] request timeout messages should include the request id
- [ ] have an upgrade tier that queries multiple backends at once. returns on first Ok result, collects errors. if no Ok, find the most common error and then respond with that
- [ ] give public_recent_ips_salt a better, more general, name
- [ ] include tier in the head block logs?
- [ ] i think i use FuturesUnordered when a try_join_all might be better
- [ ] since we are read-heavy on our configs, maybe we should use a cache
  - "using a thread local storage and explicit types" https://docs.rs/arc-swap/latest/arc_swap/cache/struct.Cache.html
- [ ] tests for config reloading
- [ ] use pin instead of arc for a bunch of things?
  - https://fasterthanli.me/articles/pin-and-suffering
- [ ] calculate archive depth automatically based on block_data_limits
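
A rough sketch for the create_user item about turning a description like "LN arbitrum" into an address. It assumes the description's UTF-8 bytes are copied to the front of a 20-byte address and the rest is zero-padded, and that "the last 4 bytes must be zero" means at most 16 bytes of description. `description_to_address` is a hypothetical helper name, not something that exists in the codebase.

```rust
/// Hypothetical helper (not part of web3-proxy): pack a short description
/// into a 20-byte address, left-aligned and zero-padded on the right.
fn description_to_address(description: &str) -> Result<String, String> {
    const ADDRESS_LEN: usize = 20;
    // Assumption: the last 4 bytes stay zero so these addresses stand out.
    const RESERVED_ZERO_BYTES: usize = 4;
    const MAX_DESCRIPTION_LEN: usize = ADDRESS_LEN - RESERVED_ZERO_BYTES;

    let bytes = description.as_bytes();
    if bytes.len() > MAX_DESCRIPTION_LEN {
        return Err(format!(
            "description is {} bytes, but at most {} fit in an address",
            bytes.len(),
            MAX_DESCRIPTION_LEN
        ));
    }

    // Copy the description bytes to the front; the tail stays zero.
    let mut address = [0u8; ADDRESS_LEN];
    address[..bytes.len()].copy_from_slice(bytes);

    // Render as a 0x-prefixed lowercase hex string.
    let hex: String = address.iter().map(|b| format!("{:02x}", b)).collect();
    Ok(format!("0x{}", hex))
}
```

Calling `description_to_address("LN arbitrum")` should give the address from the item above, and a description longer than 16 bytes returns an `Err` instead of being truncated.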