This change speeds up trie hashing and all other activities that require
RLP encoding of trie nodes by approximately 20%. The speedup is achieved by
avoiding reflection overhead during node encoding.
The interface type trie.node now contains a method 'encode' that works with
rlp.EncoderBuffer. Management of EncoderBuffers is left to calling code.
trie.hasher, which is pooled to avoid allocations, now maintains an
EncoderBuffer. This means memory resources related to trie node encoding
are tied to the hasher pool.
Co-authored-by: Felix Lange <fjl@twurst.com>
This change makes use of the new code generator rlp/rlpgen to improve the
performance of RLP encoding for Header and StateAccount. It also speeds up
encoding of ReceiptForStorage using the new rlp.EncoderBuffer API.
The change is much less transparent than I wanted it to be, because Header and
StateAccount now have an EncodeRLP method defined with pointer receiver. It
used to be possible to encode non-pointer values of these types, but the new
method prevents that and attempting to encode unadressable values (even if
part of another value) will return an error. The error can be surprising and may
pop up in places that previously didn't expect any errors.
To make things work, I also needed to update all code paths (mostly in unit tests)
that lead to encoding of non-pointer values, and pass a pointer instead.
Benchmark results:
name old time/op new time/op delta
EncodeRLP/legacy-header-8 328ns ± 0% 237ns ± 1% -27.63% (p=0.000 n=8+8)
EncodeRLP/london-header-8 353ns ± 0% 247ns ± 1% -30.06% (p=0.000 n=8+8)
EncodeRLP/receipt-for-storage-8 237ns ± 0% 123ns ± 0% -47.86% (p=0.000 n=8+7)
EncodeRLP/receipt-full-8 297ns ± 0% 301ns ± 1% +1.39% (p=0.000 n=8+8)
name old speed new speed delta
EncodeRLP/legacy-header-8 1.66GB/s ± 0% 2.29GB/s ± 1% +38.19% (p=0.000 n=8+8)
EncodeRLP/london-header-8 1.55GB/s ± 0% 2.22GB/s ± 1% +42.99% (p=0.000 n=8+8)
EncodeRLP/receipt-for-storage-8 38.0MB/s ± 0% 64.8MB/s ± 0% +70.48% (p=0.000 n=8+7)
EncodeRLP/receipt-full-8 910MB/s ± 0% 897MB/s ± 1% -1.37% (p=0.000 n=8+8)
name old alloc/op new alloc/op delta
EncodeRLP/legacy-header-8 0.00B 0.00B ~ (all equal)
EncodeRLP/london-header-8 0.00B 0.00B ~ (all equal)
EncodeRLP/receipt-for-storage-8 64.0B ± 0% 0.0B -100.00% (p=0.000 n=8+8)
EncodeRLP/receipt-full-8 320B ± 0% 320B ± 0% ~ (all equal)
This change adds a code generator tool for creating EncodeRLP method
implementations. The generated methods will behave identically to the
reflect-based encoder, but run faster because there is no reflection overhead.
Package rlp now provides the EncoderBuffer type for incremental encoding. This
is used by generated code, but the new methods can also be useful for
hand-written encoders.
There is also experimental support for generating DecodeRLP, and some new
methods have been added to the existing Stream type to support this. Creating
decoders with rlpgen is not recommended at this time because the generated
methods create very poor error reporting.
More detail about package rlp changes:
* rlp: externalize struct field processing / validation
This adds a new package, rlp/internal/rlpstruct, in preparation for the
RLP encoder generator.
I think the struct field rules are subtle enough to warrant extracting
this into their own package, even though it means that a bunch of
adapter code is needed for converting to/from rlpstruct.Type.
* rlp: add more decoder methods (for rlpgen)
This adds new methods on rlp.Stream:
- Uint64, Uint32, Uint16, Uint8, BigInt
- ReadBytes for decoding into []byte
- MoreDataInList - useful for optional list elements
* rlp: expose encoder buffer (for rlpgen)
This exposes the internal encoder buffer type for use in EncodeRLP
implementations.
The new EncoderBuffer type is a sort-of 'opaque handle' for a pointer to
encBuffer. It is implemented this way to ensure the global encBuffer pool
is handled correctly.
* trie: fix memory leak in trie iterator
In the trie iterator, live nodes are tracked in a stack while iterating.
Popped node states should be explictly set to nil in order to get
garbage-collected.
* trie: fix empty trie iterator
* feature: do trie prefetch on state prefetch
Currently, state prefetch just pre execute the transactions and discard the results.
It is helpful to increase the snapshot cache hit rate.
It would be more helpful, if it can do trie prefetch at the same time, since the it will
preload the trie node and build the trie tree in advance.
This patch is to implement it, by reusing the main trie prefetch and doing finalize after
transaction is executed.
* some code improvements for trie prefetch
** increase pendingSize before dispatch tasks
** use throwaway StateDB for TriePrefetchInAdvance and remove the prefetcherLock
** remove the necessary drain operation in trie prefetch mainloop,
trie prefetcher won't be used after close.
* trie prefetcher for From/To address in advance
We found that trie prefetch could be not fast enough, especially trie prefetch of
the outer big state trie tree.
Instead of do trie prefetch until a transaction is finalized, we could do trie prefetch
in advance. Try to prefetch the trie node of the From/To accounts, since their root hash
are most likely to be changed.
* Parallel TriePrefetch for large trie update.
Currently, we create a subfetch for each account address to do trie prefetch. If the address
has very large state change, trie prefetch could be not fast enough, e.g. a contract modified
lots of KV pair or a large number of account's root hash is changed in a block.
With this commit, there will be children subfetcher created to do trie prefetch in parallell if
the parent subfetch's workload exceed the threshold.
* some improvemnts of parallel trie prefetch implementation
1.childrenLock is removed, since it is not necessary
APIs of triePrefetcher is not thread safe, they should be used sequentially.
A prefetch will be interrupted by trie() or clos(), so we only need mark it as
interrupted and check before call scheduleParallel to avoid the concurrent access to paraChildren
2.rename subfetcher.children to subfetcher.paraChildren
3.use subfetcher.pendingSize to replace totalSize & processedIndex
4.randomly select the start child to avoid always feed the first one
5.increase threshold and capacity to avoid create too many child routine
* fix review comments
** nil check refine
** create a separate routine for From/To prefetch, avoid blocking the cirtical path
* remove the interrupt member
* not create a signer for each transaction
* some changes to triePrefetcher
** remove the abortLoop, move the subfetcher abort operation into mainLoop
since we want to make subfetcher's create & schedule & abort within a loop to
avoid concurrent access locks.
** no wait subfetcher's term signal in abort()
it could speed up the close by closing subfetcher concurrently.
we send stop signnal to all subfetchers in burst and wait their term signal later.
* some coding improve for subfetcher.scheduleParallel
* fix a UT crash of s.prefetcher == nil
* update parallel trie prefetcher configuration
tested with different combination of parallelTriePrefetchThreshold & parallelTriePrefetchCapacity,
found the most efficient configure could be:
parallelTriePrefetchThreshold = 10
parallelTriePrefetchCapacity = 20
* fix review comments: code refine