- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] support websocket clients
- we support websockets for the backends already, but we need them for the frontend too
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest. but its in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
- [x] if no backends, return a 502 instead of delaying?
- [x] move from warp to axum
- [x] handle websocket disconnect and reconnect
- [x] eth_sendRawTransaction should return the most common result, not the first
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes
- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
- [x] if web3 proxy gets an http error back, retry another node
- [x] endpoint for health checks. if no synced servers, give a 502 error
- originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
- [x] subscription id should be per connection, not global
- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
- we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
- [x] inspect any jsonrpc errors. if its something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
- this error seems to happen when we use load balanced backend rpcs like pokt and ankr
- we can improve this by only publishing the synced connections once a threshold of total available soft and hard limits is passed. how can we do this without hammering redis? at least its only once per block per server
- [x] instead of tracking `pending_synced_connections`, have a mapping of where all connections are individually. then each change, re-check for consensus.
- [x] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
- [x] im seeing ethspam occasionally try to query a future block. something must be setting the head block too early
- [x] we were sorting best block the wrong direction. i flipped a.cmp(b) to b.cmp(a) so that the largest would be first, but then i used 'max_by' which looks at the end of the list
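
The sorting mistake above is easy to reproduce with plain integers. This is a minimal sketch, not the proxy's actual code: `best_block_buggy` and `best_block` are hypothetical names, and block heights are assumed to be simple `u64` values.

```rust
/// Buggy version: the comparator is reversed (`b.cmp(a)`), but `max_by`
/// still returns whatever the comparator considers greatest, which is
/// now the SMALLEST block height.
fn best_block_buggy(heights: &[u64]) -> Option<u64> {
    heights.iter().copied().max_by(|a, b| b.cmp(a))
}

/// Fixed version: keep the natural ordering and let `max` pick the
/// highest block.
fn best_block(heights: &[u64]) -> Option<u64> {
    heights.iter().copied().max()
}
```

Reversing the comparator and calling `max_by` cancel each other out, so only one of the two should be used.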
- [x] if the eth_call (or similar) params include a block, we can cache for that
- [x] when block subscribers receive blocks, store them in a block_map
- [x] eth_blockNumber without a backend request
- [x] if we send a transaction to private rpcs and then people query it on public rpcs, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
- [x] send getTransaction rpc requests to the private rpc tier
- [x] i saw a fork of like 300 blocks. probably just because a node was restarted and had fallen behind. need some checks to ignore things that are far behind. this improvement should fix this problem
- i keep a mapping of blocks so that i can go from hash -> block. it has some consistent hashing it does to split them up across multiple maps each with their own lock. so a lot of the time reads dont block writes because they are in different internal maps. this was fine. but after changing my fork detection logic to use the same rules as erigon, i discovered that when you get blocks from a websocket subscription in erigon and geth, theres a missing field (https://github.com/ledgerwatch/erigon/issues/5190). so i added a query to get the block that includes the missing field.
- but i did this in a way where i was holding the write lock open while doing the query. the "new" block that has the missing field ends up in the same bucket and it also wants a write lock. oops. entry api has very sharp edges. don't ever await inside a match on DashMap::entry
- this was intentional so that recently confirmed transactions go to a server that is more likely to have the tx.
- but under heavy load, we hit their rate limits. need a "retry_until_success" function that goes to balanced_rpcs. or maybe store in redis the txids that we broadcast privately and use that to route.
- [x] basic request method stats (using the user_id and other fields that are in the tracing frame)
- [x] refactor from_anyhow_error to have consistent error codes and http codes. maybe implement the Error trait
- [x] improve rpc weights. i think theres still a potential thundering herd
- [x] improved logging with useful instrumentation
- [x] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
- [x] synced connections swap threshold should come from config
- [x] right now we send too many getTransaction queries to the private rpc tier and we are being rate limited by some of them. change to be serial and weight by hard/soft limit.
- [x] ip blocking gives a 500 and not the proper error code
- [x] need a reconnect that doesn't unwrap
- [x] need a retrying_reconnect that is used everywhere reconnect is. have exponential backoff here
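
A rough sketch of what a `retrying_reconnect` with exponential backoff could look like. Everything here is an assumption for illustration: the `connect` closure stands in for the real reconnect call, the function signatures are made up, and it blocks with `std::thread::sleep` where the real proxy would presumably use an async sleep.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow/checked_mul avoid overflow on large attempt counts
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Call `connect` until it succeeds or `max_attempts` calls have failed.
/// Returns the number of failed attempts before success, or the last error.
fn retrying_reconnect<E>(
    mut connect: impl FnMut() -> Result<(), E>,
    max_attempts: u32,
    base: Duration,
    max: Duration,
) -> Result<u32, E> {
    let mut attempt = 0;
    loop {
        match connect() {
            Ok(()) => return Ok(attempt),
            // out of attempts: return the error instead of unwrapping
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Capping the delay keeps a long outage from pushing the retry interval out to hours, and returning the last error leaves the "give up or keep going" decision to the caller.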
- [x] it looks like our reconnect logic is not always firing. we need to make reconnect more robust!
- i am pretty sure that this is actually servers that fail to connect on initial setup (maybe the rpcs that are on the wrong chain are just timing out and they aren't set to reconnect?)
- [x] when there are a LOT of concurrent requests, we see errors. i thought that was a problem with redis cell, but it happens with my simpler rate limit. now i think the problem is actually with bb8
- https://docs.rs/redis/latest/redis/aio/struct.ConnectionManager.html or https://crates.io/crates/deadpool-redis?
- WARN http_request: redis_rate_limit::errors: redis error err=Response was of incompatible type: "Response type not string compatible." (response was int(500237)) id=01GC6514JWN5PS1NCWJCGJTC94 method=POST
- [x] if a websocket connection hasn't received a new block in a while, do a reconnect or just query the block. its possible that the node was syncing when the proxy started
- [x] node selection still needs improvements. we still send to syncing nodes if they are close
- try consensus heads first! only if that is empty should we try others. and we should try them sorted by block height and then randomly chosen from there
- [x] logging of "bad response!" is way too verbose
- [x] i think our "best" server picking is incorrect somehow.
- we upgraded erigon to a version with a broken websocket
- that made it clear we still route to the lagged server sometimes. this is bad, but retries keep it from giving users bad data.
- [x] more trace logging
- [x] on ETH, we no longer need total difficulty
- [x] cli for creating and editing a user's first api key
- [x] benchmarks of the different Cache implementations (futures vs dash)
- futures is better
- [x] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though its a wasted connection
- [x] subscribing to transactions should be configurable per server. listening to paid servers can get expensive
- [x] status page leaks our urls which contain secrets. change that to use names
- [x] for easier errors in the axum code, i think we need to have our own type that wraps anyhow::Result+Error
- [x] hit counts seem wrong. how are we hitting the backend so much more than the frontend? retries on disconnect don't seem to fit that
- [-] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
- [ ] don't use new_head_provider anywhere except new head subscription
- [x] remove the "metered" crate now that we save aggregate queries?
- [x] don't use systemtime. use chrono
- [x] graceful shutdown
- [x] frontend needs to shut down first. this will stop serving requests on /health and so new requests should quickly stop being routed to us
- [x] when frontend has finished, tell all the other tasks to stop
- [x] stats buffer needs to flush to both the database and influxdb
- [x] `rpc_accounting` script
- [x] period_datetime should always round to the start of the minute. this will ensure aggregations use as few rows as possible
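
The rounding for `period_datetime` can be sketched on raw unix timestamps. This assumes the value is available as seconds since the epoch (for example from chrono's `timestamp()`); `round_to_minute` is a hypothetical helper name.

```rust
/// Round a unix timestamp (in seconds) down to the start of its minute.
/// `rem_euclid` keeps the result correct for negative timestamps too.
fn round_to_minute(unix_secs: i64) -> i64 {
    unix_secs - unix_secs.rem_euclid(60)
}
```

Every request within the same minute then maps to the same `period_datetime`, so an aggregation can update one row instead of inserting many.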
- [x] weighted random choice should still prioritize non-archive servers
- maybe shuffle randomly and then sort by (block_limit, random_index)?
- maybe sum available_requests grouped by archive/non-archive. only limit to non-archive if they have enough?
- [x] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
- [x] add a "backup" tier that is only used if balanced_rpcs has "no servers synced"
- use this tier to check timestamp on latest block. if we are behind that by more than a few seconds, something is wrong
- [x] `change_user_tier_by_address` script
- [x] emit stats for user's successes, retries, failures, with the types of requests, chain, rpc
- [x] add caching to speed up stat queries
- [x] config parsing is strict right now. this makes it hard to deploy on git push since configs need to change along with it
- changed to only emit a warning if there is an unknown configuration key
- [x] make the "not synced" error more verbose
- [x] short lived cache on /health
- [x] cache /status for longer
- [x] sort connections during eth_sendRawTransaction
- [x] block all admin_ rpc commands
- [x] add archive depth to app config
- [x] improve "archive_needed" boolean. change to "block_depth"
- [x] keep score of new_head timings for all rpcs
- [x] having the whole block in /status is very verbose. trim it down
- then sites like curve.fi don't have to worry about their user count
- it does mean we will have a harder time capacity planning from the number of keys
- [ ] have the healthcheck get the block over http. if it errors, or doesn't match what the websocket says, something is wrong (likely a deadlock in the websocket code)
- [ ] we have our hard rate limiter set up with a period of 60. but most providers have period of 1
- [ ] two servers running will confuse rpc_accounting!
- would be nice if our subscriptions had better guarantees than geth/erigon do, but maybe simpler to just setup a broadcast channel and proxy all the responses to a backend instead
- [ ] the public rpc is rate limited by ip and the authenticated rpc is rate limited by key
- this means if a dapp uses the authenticated RPC on their website, they could get rate limited more easily
- [ ] take an option to set a non-default role when creating a user
- [ ] different prune levels for free tiers
- [ ] have a test that runs ethspam and versus
- [ ] status page show git hash of running version
- [ ] Email confirmation
- [ ] we'll need a pretty template email that the backend will send.
- [ ] That will link them to a page on llamanodes.com
- [ ] There, they click "confirm" (or JavaScript does it for them automatically) to POST to this new endpoint
- [ ] test in the migration repo that sets up a sqlite database that runs up and down
- [ ] unbounded queues are risky. add limits
- [ ] after running for a while, https://eth-ski.llamanodes.com/status is only at 157 blocks and hashes. i thought they would be near 10k after running for a while
- [ ] automatic retries with a timeout or until all the servers have been tried.
- i had the websocket die on me in the middle of a long test. only one in-flight request failed because of it. the rest delayed. figure out how to catch these ones since websocket fails sadly seem common
- [ ] 120 second timeout is too short. Maybe do that for free tier and larger timeout for paid. Problem is that some queries can take over 1000 seconds
- [ ] when handling errors from axum parsing the Json...Enum in the function signature, the errors don't get wrapped in json. i think we need an axum::Layer
- [ ] separate daemon (or users themselves) call POST /users/process_transaction
- checks a transaction to see if it modifies a user's balance. records results in a sql database
- we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp
- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever (need config hot reloads first)
- [ ] eth_getBlockByNumber and similar calls served from the block map
- will need all `Block<TxHash>` **and** `Block<TransactionReceipt>` in caches or fetched efficiently
- so maybe we don't want this. we can just use the general request cache for these. they will only require 1 request and it means requests won't get in the way as much on writes as new blocks arrive.
- after looking at my request logs, i think its worth doing this. no point hitting the backends with requests for blocks multiple times. will also help with cache hit rates since we can keep recent blocks in a separate cache
- [ ] Public bsc server got “0” for block data limit (ninicoin)
- [ ] cli tool for resetting api keys
- [ ] Advanced load testing scripts so we can find optimal cost servers
- [ ] benchmarks from https://github.com/llamafolio/llamafolio-api/
- [ ] benchmarks from ethspam and versus
- [ ] benchmarks from other things
- [ ] quick script that calls all the curve-api endpoints once and checks for success, then calls wrk to hammer it
- [ ] https://github.com/curvefi/curve-api
- [ ] test /api/getGauges method
- usually times out after vercel's 60 second timeout
- one time got: Error invalid Json response ""
- [ ] page that prints a graphviz dotfile of the blockchain
- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
- [ ] flashbots protect fast mode or not? probably fast matches most users' needs, but no reverts is nice.
- [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if its set
- [ ] archive check works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers.
- [x] configurable block data limit until better checks
- [ ] https://crates.io/crates/reqwest-middleware easy retry with exponential back off
- Though I think we want retries that go to other backends instead
- [ ] Some of the pub things should probably be "pub(crate)"
- [ ] Maybe storing pending txs on receipt in a dashmap is wrong. We want to store in a timer_heap (or similar) when we actually send. This way there's no lock contention until the race is over.
- [ ] Support "safe" block height. It's planned for eth2, but we can kind of do it now by just using head block num - 3
- [ ] Archive check on BSC gave “archive” when it isn’t. and FTM gave 90k for all servers even though they should be archive
- [ ] this query always times out, but erigon can serve it quickly: `curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"debug_traceBlockByNumber","params":["latest"],"id":1}' 127.0.0.1:8544`
{"jsonrpc":"2.0","id":null,"error":{"code":-32099,"message":"deadline has elapsed"}}
- [ ] at concurrency 100, ethspam is getting 400 and 422 errors. figure out why. probably something with redis or mysql, but maybe its something else like spawning
- [ ] emit per-key stats for latency of semaphore awaits. if this starts to grow, people will know they are hitting limits and need a higher tier
- [ ] have a log all option? instead of just reverts, log all request/responses? can be very useful for debugging but would flood our database. maybe better for them to do that on their client side
- [ ] failsafe. if no blocks or transactions in some time, warn and reset the connection
- [ ] make it so you can put a string like "LN arbitrum" into the create_user script, and have it automatically turn it into 0x4c4e20617262697472756d000000000000000000.
- [ ] if --address not given, use the --description
- [ ] if it is too long, (the last 4 bytes must be zero), give an error so descriptions like this stand out
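
A possible shape for the description-to-address conversion in the create_user items above. This is a sketch under assumptions: `description_to_address` is a hypothetical name, the description is taken as raw UTF-8/ASCII bytes, and errors are plain strings where the real script would likely use its own error type.

```rust
/// Encode a description as a 20-byte address: the bytes of the description
/// followed by zero padding, hex-encoded with a `0x` prefix.
fn description_to_address(description: &str) -> Result<String, String> {
    const ADDRESS_LEN: usize = 20;
    // the last 4 bytes must stay zero so these addresses stand out
    const MAX_DESCRIPTION_LEN: usize = ADDRESS_LEN - 4;

    let bytes = description.as_bytes();
    if bytes.len() > MAX_DESCRIPTION_LEN {
        return Err(format!(
            "description is {} bytes; at most {} are allowed",
            bytes.len(),
            MAX_DESCRIPTION_LEN
        ));
    }

    // copy the description into the front of a zeroed 20-byte buffer
    let mut padded = [0u8; ADDRESS_LEN];
    padded[..bytes.len()].copy_from_slice(bytes);

    let hex: String = padded.iter().map(|b| format!("{:02x}", b)).collect();
    Ok(format!("0x{}", hex))
}
```

With this sketch, `description_to_address("LN arbitrum")` should produce the address shown in the item above, and anything longer than 16 bytes is rejected.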
- [ ] we need to use docker-compose's proper environment variable handling. because now if someone tries to start dev containers in their prod, remove orphans stops and removes them
- [ ] have an upgrade tier that queries multiple backends at once. returns on first Ok result, collects errors. if no Ok, find the most common error and then respond with that
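
The selection rule for that upgrade tier (and the "most common result" idea used for eth_sendRawTransaction) can be sketched over results that have already been collected. The real version would race the backend requests concurrently; here the function name, the `String` error type, and the "ties go to the first error seen" rule are all assumptions.

```rust
use std::collections::HashMap;

/// Given results from several backends (in arrival order), return the first
/// `Ok`. If every backend failed, return the most common error. Returns
/// `None` only when there were no results at all.
fn first_ok_or_most_common_err<T>(
    results: Vec<Result<T, String>>,
) -> Option<Result<T, String>> {
    let mut error_counts: HashMap<String, usize> = HashMap::new();
    // remember arrival order so ties are broken deterministically
    let mut first_seen: Vec<String> = Vec::new();

    for result in results {
        match result {
            Ok(value) => return Some(Ok(value)),
            Err(err) => {
                let count = error_counts.entry(err.clone()).or_insert(0);
                if *count == 0 {
                    first_seen.push(err);
                }
                *count += 1;
            }
        }
    }

    // pick the error with the highest count; ties go to the one seen first
    let mut best: Option<(String, usize)> = None;
    for err in first_seen {
        let count = error_counts[&err];
        if best.as_ref().map_or(true, |(_, c)| count > *c) {
            best = Some((err, count));
        }
    }
    best.map(|(err, _)| Err(err))
}
```

Returning the most common error rather than the first one keeps a single misbehaving backend from deciding what the user sees.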