- [x] better locking. when lots of requests come in, we seem to be in the way of block updates
- [x] load balance between multiple RPC servers
- [x] support more than just ETH
- [x] option to disable private rpc and send everything to primary
- [x] support websocket clients
- we support websockets for the backends already, but we need them for the frontend too
- [x] health check nodes by block height
- [x] Dockerfile
- [x] docker-compose.yml
- [x] after connecting to a server, check that it gives the expected chainId
- [x] the ethermine rpc is usually fastest. but its in the private tier. since we only allow synced rpcs, we are going to not have an rpc a lot of the time
- [x] if not backends. return a 502 instead of delaying?
- [x] move from warp to axum
- [x] handle websocket disconnect and reconnect
- [x] eth_sendRawTransaction should return the most common result, not the first
- [x] the web3proxyapp object gets cloned for every call. why do we need any arcs inside that? shouldn't they be able to connect to the app's? can we just use static lifetimes
- [x] when sending with private relays, brownie's tx.wait can think the transaction was dropped. smarter retry on eth_getTransactionByHash and eth_getTransactionReceipt (maybe only if we sent the transaction ourselves)
- [x] if web3 proxy gets an http error back, retry another node
- [x] endpoint for health checks. if no synced servers, give a 502 error
- originally, no processing was done to params; they were just serde_json::RawValue. this is probably fastest, but we need to look for "latest" and count elements, so we have to use serde_json::Value
- [x] subscription id should be per connection, not global
- [x] when under load, i'm seeing "http interval lagging!". sometimes it happens when not loaded.
- we were skipping our delay interval when block hash wasn't changed. so if a block was ever slow, the http provider would get the same hash twice and then would try eth_getBlockByNumber a ton of times
- [x] inspect any jsonrpc errors. if its something like "header not found" or "block with id $x not found" retry on another node (and add a negative score to that server)
- this error seems to happen when we use load balanced backend rpcs like pokt and ankr
- we can improve this by only publishing the synced connections once a threshold of total available soft and hard limits is passed. how can we do this without hammering redis? at least its only once per block per server
- [x] instead of tracking `pending_synced_connections`, have a mapping of where all connections are individually. then each change, re-check for consensus.
- [x] Got warning: "WARN subscribe_new_heads:send_block: web3_proxy::connection: unable to get block from https://rpc.ethermine.org: Deserialization Error: expected value at line 1 column 1. Response: error code: 1015". this is cloudflare rate limiting on fetching a block, but this is a private rpc. why is there a block subscription?
- [x] im seeing ethspam occasionally try to query a future block. something must be setting the head block too early
- [x] we were sorting best block the wrong direction. i flipped a.cmp(b) to b.cmp(a) so that the largest would be first, but then i used 'max_by' which looks at the end of the list
- [x] if the eth_call (or similar) params include a block, we can cache for that
- [x] when block subscribers receive blocks, store them in a block_map
- [x] eth_blockNumber without a backend request
- [x] if we send a transaction to private rpcs and then people query it on public rpcs things, some interfaces might think the transaction is dropped (i saw this happen in a brownie script of mine). how should we handle this?
- [x] send getTransaction rpc requests to the private rpc tier
- [x] i saw a fork of like 300 blocks. probably just because a node was restarted and had fallen behind. need some checks to ignore things that are far behind. this improvement should fix this problem
- i keep a mapping of blocks so that i can go from hash -> block. it has some consistent hashing it does to split them up across multiple maps each with their own lock. so a lot of the time reads dont block writes because they are in different internal maps. this was fine. but after changing my fork detection logic to use the same rules as erigon, i discovered that when you get blocks from a websocket subscription in erigon and geth, theres a missing field (https://github.com/ledgerwatch/erigon/issues/5190). so i added a query to get the block that includes the missing field.
- but i did this in a way where i was holding the write lock open while doing the query. the "new" block that has the missing field ends up in the same bucket and it also wants a write lock. oops. entry api has very sharp edges. don't ever await inside a match on DashMap::entry
- this was intentional so that recently confirmed transactions go to a server that is more likely to have the tx.
- but under heavy load, we hit their rate limits. need a "retry_until_success" function that goes to balanced_rpcs. or maybe store in redis the txids that we broadcast privately and use that to route.
- [x] basic request method stats (using the user_id and other fields that are in the tracing frame)
- [x] refactor from_anyhow_error to have consistent error codes and http codes. maybe implement the Error trait
- [x] improve rpc weights. i think theres still a potential thundering herd
- [x] improved logging with useful instrumentation
- [x] right now the block_map is unbounded. move this to redis and do some calculations to be sure about RAM usage
- [x] synced connections swap threshold should come from config
- [x] right now we send too many getTransaction queries to the private rpc tier and i are being rate limited by some of them. change to be serial and weight by hard/soft limit.
- [x] ip blocking gives a 500 and not the proper error code
- [x] need a reconnect that doesn't unwrap
- [x] need a retrying_reconnect that is used everywhere reconnect is. have exponential backoff here
- [x] it looks like our reconnect logic is not always firing. we need to make reconnect more robust!
- i am pretty sure that this is actually servers that fail to connect on initial setup (maybe the rpcs that are on the wrong chain are just timing out and they aren't set to reconnect?)
- [x] when there are a LOT of concurrent requests, we see errors. i thought that was a problem with redis cell, but it happens with my simpler rate limit. now i think the problem is actually with bb8
- https://docs.rs/redis/latest/redis/aio/struct.ConnectionManager.html or https://crates.io/crates/deadpool-redis?
- WARN http_request: redis_rate_limit::errors: redis error err=Response was of incompatible type: "Response type not string compatible." (response was int(500237)) id=01GC6514JWN5PS1NCWJCGJTC94 method=POST
- [ ] eth_subscribe rpc_accounting has everything as cache_hits. should we instead count it as one background request?
- [ ] implement filters
- [ ] implement remaining subscriptions
- would be nice if our subscriptions had better gaurentees than geth/erigon do, but maybe simpler to just setup a broadcast channel and proxy all the respones to a backend instead
- [ ] tests should use `test-env-log = "0.2.8"`
- [ ] weighted random choice should still prioritize non-archive servers
- maybe shuffle randomly and then sort by (block_limit, random_index)?
- maybe sum available_requests grouped by archive/non-archive. only limit to non-archive if they have enough?
- [ ] some places we call it "accounting" others a "stat". be consistent
- [ ] the public rpc is rate limited by ip and the authenticated rpc is rate limit by key
- this means if a dapp uses the authenticated RPC on their website, they could get rate limited more easily
- [ ] add cacheing to speed up stat queries
- [ ] take an option to set a non-default role when creating a user
- [ ] different prune levels for free tiers
- [ ] have a test that runs ethspam and versus
- [ ] status page show git hash of running version
- [ ] Email confirmation
- [ ] we'll need a pretty template email that the backend will send.
- [ ] That will link them to a a page on llamanodes.com
- [ ] There, they click "confirm" (or JavaScript does it for them automatically) to POST to this new endpoint
- [ ] test in the migration repo that sets up a sqlite database that runs up and down
- [ ] unbounded queues are risky. add limits
- [-] let users choose a % to log (or maybe x/second). someone like curve logging all reverts will be a BIG database very quickly
- this must be opt-in or spawned since it will slow things down and will make their calls less private
- [ ] automatic pruning of old revert logs once too many are collected
- [ ] we currently default to 0.0 and don't expose a way to edit it. we have a database row, but we don't use it
- [ ] after running for a while, https://eth-ski.llamanodes.com/status is only at 157 blocks and hashes. i thought they would be near 10k after running for a while
- adding uptime to the status should help
- i think this is already in our todo list
- [ ] improve private transactions. keep re-broadcasting until they are confirmed
- eth_sendRawTransaction should accept "INTERNAL_ERROR: existing tx with same hash" as a successful response. we just want to be sure that the server has our tx and in this case, it does.
- [-] if we request an old block, more servers can handle it than we currently use.
- [ ] instead of the one list of just heads, store our intermediate mappings (rpcs_by_hash, rpcs_by_num, blocks_by_hash) in SyncedConnections. this shouldn't be too much slower than what we have now
- [ ] remove the if/else where we optionally route to archive and refactor to require a BlockNumber enum
- [ ] then check syncedconnections for the blockNum. if num given, use the cannonical chain to figure out the winning hash
- [ ] this means if someone requests a recent but not ancient block, they can use all our servers, even the slower ones. need smart sorting for priority here
- [ ] automatic retries with a timeout or until all the servers have been tried.
- i had the websocket die on me in the middle of a long test. only one in-flight request failed because of it. the rest delayed. figure out how to catch these ones since websocket fails sadly seem common
- [ ] if we subscribe to a server that is syncing, it gives us null block_data_limit. when it catches up, we don't ever send queries to it. we need to recheck block_data_limit
- [ ] separate daemon (or users themselves) call POST /users/process_transaction
- checks a transaction to see if it modifies a user's balance. records results in a sql database
- we will have our own event subscriber watching for "deposit" events, but sometimes events get missed and users might incorrectly "transfer" the tokens directly to an address instead of using the dapp
- [ ] if a rpc fails to connect at start, retry later instead of skipping it forever (need config hot reloads first)
- [ ] eth_getBlockByNumber and similar calls served from the block map
- will need all Block<TxHash>**and** Block<TransactionReceipt> in caches or fetched efficiently
- so maybe we don't want this. we can just use the general request cache for these. they will only require 1 request and it means requests won't get in the way as much on writes as new blocks arrive.
- after looking at my request logs, i think its worth doing this. no point hitting the backends with requests for blocks multiple times. will also help with cache hit rates since we can keep recent blocks in a separate cache
- [ ] Public bsc server got “0” for block data limit (ninicoin)
- [ ] cli tool for resetting api keys
- [ ] benchmarks of the different Cache implementations (futures vs dash)
- [ ] Advanced load testing scripts so we can find optimal cost servers
- [ ] benchmarks from https://github.com/llamafolio/llamafolio-api/
- [ ] benchmarks from ethspam and versus
- [ ] benchmarks from other things
- [ ] quick script that calls all the curve-api endpoints once and checks for success, then calls wrk to hammer it
- [ ] https://github.com/curvefi/curve-api
- [ ] test /api/getGaugesmethod
- usually times out after vercel's 60 second timeout
- one time got: Error invalid Json response ""
- [ ] page that prints a graphviz dotfile of the blockchain
- [ ] use https://github.com/ledgerwatch/interfaces to talk to erigon directly instead of through erigon's rpcdaemon (possible example code which uses ledgerwatch/interfaces: https://github.com/akula-bft/akula/tree/master)
- [ ] flashbots protect fast mode or not? probably fast matches most user's needs, but no reverts is nice.
- [ ] https://docs.flashbots.net/flashbots-auction/searchers/advanced/rpc-endpoint#authentication maybe have per-user keys. or pass their header on if its set
- [ ] if archive servers are added to the rotation while they are still syncing, they might get requests too soon. keep archive servers out of the configs until they are done syncing. full nodes should be fine to add to the configs even while syncing, though its a wasted connection
- [ ] archive check works well for local servers, but public nodes (especially on other chains) seem to give unreliable results. likely because of load balancers. maybe have a "max block data limit"
- [ ] https://crates.io/crates/reqwest-middleware easy retry with exponential back off
- Though I think we want retries that go to other backends instead
- [ ] Some of the pub things should probably be "pub(crate)"
- [ ] Maybe storing pending txs on receipt in a dashmap is wrong. We want to store in a timer_heap (or similar) when we actually send. This way there's no lock contention until the race is over.
- [ ] Support "safe" block height. It's planned for eth2 but we can kind of do it now but just doing head block num-3
- [ ] Archive check on BSC gave “archive” when it isn’t. and FTM gave 90k for all servers even though they should be archive
- [ ] this query always times out, but erigon can serve it quickly: `curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"debug_traceBlockByNumber","params":["latest"],"id":1}' 127.0.0.1:8544' 127.0.0.1:8544`
{"jsonrpc":"2.0","id":null,"error":{"code":-32099,"message":"deadline has elapsed"}}
- [ ] at concurrency 100, ethspam is getting 400 and 422 errors. figure out why. probably something with redis or mysql, but maybe its something else like spawning
- [ ] emit per-key stats for latency of semaphore awaits. if this starts to grow, people will know they are hitting limits and need a higher tier
- [ ] have a log all option? instead of just reverts, log all request/responses? can be very useful for debugging but would flood our database. maybe better for them to do that on their client side
- [ ] failsafe. if no blocks or transactions in some time, warn and reset the connection