ethers.js/docs.wrm/api/utils/strings.wrm

_section: Strings @<strings>
A **String** is a representation of human-readable input or output,
which is often taken for granted.

When dealing with blockchains, properly handling human-readable and
human-provided data is important to prevent the loss of funds or assets,
incorrect permissions, etc.
_subsection: Bytes32String @<Bytes32String>
A string in Solidity is length prefixed with its 256-bit (32 byte)
length, which means that even short strings require 2 words (64 bytes)
of storage.
In many cases, we deal with short strings, so instead of prefixing
the string with its length, we can null-terminate it and fit it in a
single word (32 bytes). Since we need only a single byte for the
null termination, we can store strings up to 31 bytes long in a
word.
_note: Note
Strings that are 31 __//bytes//__ long may contain fewer than 31 __//characters//__,
since UTF-8 requires multiple bytes to encode each non-ASCII character.
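The difference between bytes and characters can be checked without any library. The following is a minimal sketch in plain JavaScript, using the built-in ``TextEncoder`` (an assumption about the runtime, not part of ethers), that compares the character count of a string with its UTF-8 byte length:

```javascript
// Plain JavaScript sketch (no ethers needed): character count vs. UTF-8 byte length.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

const ascii = "hello";
const accented = "h\u00e9llo"; // same as "hello" but with U+00E9, which needs two bytes

const asciiByteLength = encoder.encode(ascii).length;
const accentedByteLength = encoder.encode(accented).length;

console.log(ascii.length, asciiByteLength);       // 5 5
console.log(accented.length, accentedByteLength); // 5 6
```

Both strings have five characters, but the second one occupies six bytes, so it is the byte length that counts against the 31 byte limit.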
_property: ethers.utils.parseBytes32String(aBytesLike) => string @<utils-parseBytes32> @SRC<strings>
Returns the decoded string represented by the ``Bytes32`` encoded data.
_property: ethers.utils.formatBytes32String(text) => string<[[DataHexString]]<32>> @<utils-formatBytes32> @SRC<strings>
Returns a ``bytes32`` string representation of //text//. If the
length of //text// exceeds 31 bytes, it will throw an error.
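To make the layout concrete, here is a rough sketch of the null-terminated, single-word encoding described in this subsection. The helpers ``sketchFormatBytes32`` and ``sketchParseBytes32`` are hypothetical names used only for illustration; they are not the ethers implementation, and the exact error messages are assumptions.

```javascript
// Hypothetical helpers illustrating the 32-byte, null-terminated layout.
// These are NOT the ethers functions, only a sketch of the idea.
function sketchFormatBytes32(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the text
  if (bytes.length > 31) {
    throw new Error("text does not fit in 31 bytes");
  }
  const word = new Uint8Array(32); // zero-filled, so the null terminator is already there
  word.set(bytes);
  return "0x" + Array.from(word, (b) => b.toString(16).padStart(2, "0")).join("");
}

function sketchParseBytes32(hex) {
  const body = hex.slice(2); // drop the "0x" prefix
  if (body.length !== 64) {
    throw new Error("expected exactly 32 bytes");
  }
  const word = Uint8Array.from(body.match(/../g), (h) => parseInt(h, 16));
  if (word[31] !== 0) {
    throw new Error("missing null terminator");
  }
  const end = word.indexOf(0); // the first null byte ends the string
  return new TextDecoder().decode(word.subarray(0, end));
}

const encoded = sketchFormatBytes32("hello");
const decoded = sketchParseBytes32(encoded);
console.log(encoded.length, decoded); // 66 hello
```

With ethers itself, the corresponding calls would be ``ethers.utils.formatBytes32String(text)`` and ``ethers.utils.parseBytes32String(aBytesLike)``.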
_subsection: UTF-8 Strings @<strings-utf8>
_property: ethers.utils.toUtf8Bytes(text [ , form = current ] ) => Uint8Array @<utils-toUtf8Bytes> @SRC<strings>
Returns the UTF-8 bytes of //text//, optionally normalizing it using the
[[strings--unicode-normalization-form]] //form//.
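The effect of the optional //form// can be approximated with built-ins. The sketch below uses a hypothetical helper, ``sketchToUtf8Bytes``, which normalizes with ``String.prototype.normalize`` and encodes with ``TextEncoder``; it illustrates the idea and is not the ethers implementation.

```javascript
// Hypothetical helper: normalize (if a form is given), then encode as UTF-8.
function sketchToUtf8Bytes(text, form) {
  const normalized = form ? text.normalize(form) : text; // no form: leave text as-is
  return new TextEncoder().encode(normalized);
}

// The same accented letter, supplied in decomposed and composed forms.
const composedBytes = sketchToUtf8Bytes("e\u0301", "NFC");   // composed into U+00E9
const decomposedBytes = sketchToUtf8Bytes("\u00e9", "NFD");  // split into U+0065 U+0301

console.log(Array.from(composedBytes));   // [ 195, 169 ]
console.log(Array.from(decomposedBytes)); // [ 101, 204, 129 ]
```

The same visible character yields different bytes depending on the form, which is why a fixed form matters before hashing or comparing.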
_property: ethers.utils.toUtf8CodePoints(text [ , form = current ] ) => Array<number> @<utils-toUtf8CodePoints> @SRC<strings>
Returns the Array of codepoints of //text//, optionally normalized using the
[[strings--unicode-normalization-form]] //form//.
_note: Note
This function correctly splits //text// into its codepoints, treating each
surrogate pair as a single codepoint. This should not be confused with
``string.split("")``, which destroys surrogate pairs, splitting between each UTF-16
codeunit instead.
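The difference is easy to see in plain JavaScript. This sketch does not call ethers; it uses ``Array.from``, which iterates a string by codepoint, as a stand-in for codepoint-aware splitting.

```javascript
// "a", an emoji outside the Basic Multilingual Plane (U+1F600), then "b".
const text = "a\u{1F600}b";

// split("") cuts between UTF-16 code units, so the emoji becomes two lone surrogates.
const codeUnits = text.split("");

// Array.from iterates by codepoint, keeping the surrogate pair together.
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

console.log(codeUnits.length);  // 4
console.log(codePoints);        // [ 97, 128512, 98 ]
```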
_property: ethers.utils.toUtf8String(aBytesLike [ , onError = error ] ) => string @<utils-toUtf8String> @SRC<strings>
Returns the string represented by the UTF-8 bytes of //aBytesLike//.
The //onError// argument is a [Custom UTF-8 Error function](strings--error-handling). If not specified,
it defaults to the [error](strings--Utf8Error) function, which throws an error
on **any** UTF-8 error.
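As a rough analogue of this strict default, the built-in ``TextDecoder`` can be put in ``fatal`` mode, where it also throws on invalid input. This sketch only illustrates the behaviour of a strict decoder; it is not how ethers decodes, and which inputs are rejected may differ in detail.

```javascript
// A strict decoder: any invalid UTF-8 sequence raises an error.
const strict = new TextDecoder("utf-8", { fatal: true });

const validBytes = Uint8Array.of(0x68, 0xc3, 0xa9); // "h" followed by U+00E9
const invalidBytes = Uint8Array.of(0x68, 0xff);     // 0xff can never appear in UTF-8

const decodedText = strict.decode(validBytes);

let rejected = false;
try {
  strict.decode(invalidBytes);
} catch (error) {
  rejected = true; // strict decoding refuses the input instead of guessing
}

console.log(decodedText === "h\u00e9", rejected); // true true
```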
_subsection: UnicodeNormalizationForm @<strings--unicode-normalization-form> @SRC<strings/utf8:enum.UnicodeNormalizationForm>
There are several [commonly used forms](link-wiki-unicode-equivalence)
when normalizing Unicode data, which allow strings to be compared or hashed in a stable
way.
_property: ethers.utils.UnicodeNormalizationForm.current
Maintain the current normalization form.
_property: ethers.utils.UnicodeNormalizationForm.NFC
The Composed Normalization Form. This form uses single codepoints
which represent the fully composed character.
For example, the **&eacute;** is a single codepoint, ``0x00e9``.
_property: ethers.utils.UnicodeNormalizationForm.NFD
The Decomposed Normalization Form. This form uses multiple codepoints
(when necessary) to compose a character.
For example, the **&eacute;**
is made up of two codepoints, ``0x0065`` (which is the letter ``"e"``)
and ``0x0301``, a combining diacritic codepoint which
indicates the previous character should have an acute accent.
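The composed and decomposed forms can be compared directly with the built-in ``String.prototype.normalize``. This is a plain JavaScript sketch, independent of ethers:

```javascript
const composed = "\u00e9";    // single codepoint, the NFC form
const decomposed = "e\u0301"; // "e" plus combining acute accent, the NFD form

// The two strings render the same but are not equal as-is.
const equalRaw = composed === decomposed;

// After normalizing both to the same form, they compare equal.
const equalAsNFD = composed.normalize("NFD") === decomposed;
const equalAsNFC = decomposed.normalize("NFC") === composed;

console.log(equalRaw, equalAsNFD, equalAsNFC); // false true true
```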
_property: ethers.utils.UnicodeNormalizationForm.NFKC
The Composed Normalization Form with Compatibility Equivalence. The Compatibility
representation folds characters which have the same syntactic representation
but different semantic meaning.
For example, the Roman Numeral **I**, which has the Unicode
codepoint ``0x2160``, is folded into the capital letter I, ``0x0049``.
_property: ethers.utils.UnicodeNormalizationForm.NFKD
The Decomposed Normalization Form with Compatibility Equivalence.
See NFKC for an example.
_note: Note
Only certain specified characters are folded in Compatibility Equivalence, and thus
it should **not** be considered a method to achieve //any// level of security from
[homoglyph attacks](link-wiki-homoglyph).
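A short sketch with the built-in ``String.prototype.normalize`` shows both the folding and its limits. The Cyrillic letter used here is just one illustrative look-alike, chosen for this example:

```javascript
const romanNumeralOne = "\u2160"; // ROMAN NUMERAL ONE, looks like "I"
const cyrillicA = "\u0430";       // CYRILLIC SMALL LETTER A, looks like Latin "a"

// NFKC folds the Roman numeral into the capital letter I (U+0049)...
const foldedByNFKC = romanNumeralOne.normalize("NFKC") === "I";

// ...while NFC leaves it untouched.
const untouchedByNFC = romanNumeralOne.normalize("NFC") === romanNumeralOne;

// The Cyrillic look-alike is NOT folded, so normalization does not stop homoglyphs.
const homoglyphFolded = cyrillicA.normalize("NFKC") === "a";

console.log(foldedByNFKC, untouchedByNFC, homoglyphFolded); // true true false
```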
_subsection: Custom UTF-8 Error Handling @<strings--error-handling>
When converting UTF-8 bytes to a string, there is the possibility
of invalid byte sequences. Since certain situations may need specific
ways to handle UTF-8 errors, a custom error handling function can be used,
which has the signature:
_property: errorFunction(reason, offset, bytes, output [ , badCodepoint ]) => number
The //reason// is one of the [UTF-8 Error Reasons](strings--error-reasons), //offset// is the index
into //bytes// where the error was first encountered, //output// is the list
of codepoints already processed (and may be modified) and, for certain Error
Reasons, //badCodepoint// is the codepoint that was computed but
rejected because its value is invalid.
This function should return the number of bytes to skip past, keeping in
mind that the byte at //offset// will already be consumed.
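The following is a minimal sketch of a function with this signature. The name ``recordAndReplace``, the ``seen`` log and the placeholder reason string are all hypothetical; the function is called by hand here, with made-up arguments, to show what it would do when a decoder invokes it.

```javascript
// Hypothetical handler following the documented signature; not part of ethers.
const seen = []; // a record of each error, for later inspection

function recordAndReplace(reason, offset, bytes, output, badCodepoint) {
  seen.push({ reason, offset, badCodepoint });
  output.push(0xfffd); // substitute U+FFFD for the invalid sequence

  // Count continuation bytes (0b10xxxxxx) directly after the offending byte.
  // The byte at `offset` itself is already consumed, so it is not counted.
  let skip = 0;
  for (let i = offset + 1; i < bytes.length && (bytes[i] >> 6) === 0x02; i++) {
    skip++;
  }
  return skip;
}

// Manual call standing in for a decoder: "h", a bad byte, two stray
// continuation bytes, then "i". The reason string is a placeholder.
const sampleBytes = Uint8Array.of(0x68, 0xff, 0x80, 0x80, 0x69);
const sampleOutput = [0x68];
const skipped = recordAndReplace("placeholder reason", 1, sampleBytes, sampleOutput);

console.log(skipped, sampleOutput); // 2 [ 104, 65533 ]
```

With ethers, a function like this would be supplied as the //onError// argument of ``ethers.utils.toUtf8String``.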
_heading: UTF-8 Error Reasons @<strings--error-reasons> @SRC<strings/utf8:Utf8ErrorReason>
_property: ethers.utils.Utf8ErrorReason.BAD_PREFIX
A byte was encountered which is invalid to begin a UTF-8 byte
sequence with.
_property: ethers.utils.Utf8ErrorReason.MISSING_CONTINUE
A UTF-8 sequence was begun, but did not have enough continuation
bytes for the sequence. For this error the //offset// is the index
at which a continuation byte was expected.
_property: ethers.utils.Utf8ErrorReason.OUT_OF_RANGE
The computed codepoint is outside the range for valid Unicode
codepoints (i.e. the codepoint is greater than 0x10ffff).
This reason will pass the computed //badCodepoint// into
the custom error function.
_property: ethers.utils.Utf8ErrorReason.OVERLONG
Because UTF-8 uses variable length byte sequences, it is possible
to encode the same character in more than one way.
These [overlong sequences](link-wiki-utf8-overlong)
allow for a non-distinguished string to be formed, which can
impact security, as multiple strings that are otherwise
equal can have different hashes.
Generally, overlong sequences are an attempt to circumvent
some part of security, but in rare cases may be produced by
lazy libraries or used to encode the null terminating
character in a way that is safe to include in a ``char*``.
This reason will pass the computed //badCodepoint// into the
custom error function, which is actually a valid codepoint, just
one that was arrived at through unsafe methods.
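The arithmetic behind an overlong sequence can be worked through by hand. In this sketch, ``decodeTwoByte`` and ``isOverlongTwoByte`` are hypothetical helpers written for illustration; they apply the standard 2-byte UTF-8 layout (``110xxxxx 10yyyyyy``) and are not taken from ethers.

```javascript
// Combine the payload bits of a 2-byte UTF-8 sequence: 110xxxxx 10yyyyyy.
function decodeTwoByte(b0, b1) {
  return ((b0 & 0x1f) << 6) | (b1 & 0x3f);
}

// A 2-byte sequence is only legitimate for codepoints 0x80 and above;
// anything smaller fits in one byte, so encoding it in two is overlong.
function isOverlongTwoByte(codepoint) {
  return codepoint < 0x80;
}

const eAcute = decodeTwoByte(0xc3, 0xa9);       // 0xe9, a legitimate 2-byte sequence
const overlongNull = decodeTwoByte(0xc0, 0x80); // 0x00, normally the single byte 0x00
const overlongA = decodeTwoByte(0xc1, 0x81);    // 0x41 ("A"), normally the single byte 0x41

console.log(isOverlongTwoByte(eAcute));       // false
console.log(isOverlongTwoByte(overlongNull)); // true
console.log(isOverlongTwoByte(overlongA));    // true
```

The bytes ``0xc1 0x81`` and the single byte ``0x41`` both yield the codepoint for "A", which is how two byte strings with different hashes can represent the same text.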
_property: ethers.utils.Utf8ErrorReason.OVERRUN
The string does not have enough bytes remaining for the
length of this sequence.
_property: ethers.utils.Utf8ErrorReason.UNEXPECTED_CONTINUE
This error is similar to BAD_PREFIX, since a continuation byte
cannot begin a valid sequence, but many may wish to process this
differently. However, most developers would want to trap this
and perform the same operation as a BAD_PREFIX.
_property: ethers.utils.Utf8ErrorReason.UTF16_SURROGATE
The computed codepoint represents a value reserved for
UTF-16 surrogate pairs.
This reason will pass the computed surrogate half
//badCodepoint// into the custom error function.
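As an illustration, a 3-byte sequence can decode to a value in the surrogate range ``0xd800`` to ``0xdfff``. The helpers ``decodeThreeByte`` and ``isSurrogate`` below are hypothetical, written only to show the arithmetic for the standard 3-byte layout (``1110xxxx 10yyyyyy 10zzzzzz``).

```javascript
// Combine the payload bits of a 3-byte UTF-8 sequence.
function decodeThreeByte(b0, b1, b2) {
  return ((b0 & 0x0f) << 12) | ((b1 & 0x3f) << 6) | (b2 & 0x3f);
}

// Codepoints in this range are reserved for UTF-16 surrogate halves.
function isSurrogate(codepoint) {
  return codepoint >= 0xd800 && codepoint <= 0xdfff;
}

const euroSign = decodeThreeByte(0xe2, 0x82, 0xac);      // 0x20ac, a valid character
const surrogateHalf = decodeThreeByte(0xed, 0xa0, 0x80); // 0xd800, a lone surrogate half

console.log(isSurrogate(euroSign), isSurrogate(surrogateHalf)); // false true
```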
_heading: Provided UTF-8 Error Handling Functions
There are already several functions available for the most common
situations.
_property: ethers.utils.Utf8ErrorFuncs.error @<strings--Utf8Error> @SRC<strings/utf8:errorFunc>
This will throw an error on **any** error with a UTF-8 sequence, including
invalid prefix bytes, overlong sequences and UTF-16 surrogate pairs.
_property: ethers.utils.Utf8ErrorFuncs.ignore @<strings--Utf8Ignore> @SRC<strings/utf8:ignoreFunc>
This will drop all invalid sequences (by consuming invalid prefix bytes and
any following continuation bytes) from the final string as well as permit
overlong sequences to be converted to their equivalent string.
_property: ethers.utils.Utf8ErrorFuncs.replace @<strings--Utf8Replace> @SRC<strings/utf8:replaceFunc>
This will replace all invalid sequences (by consuming invalid prefix bytes and
any following continuation bytes) with the
[UTF-8 Replacement Character](link-wiki-utf8-replacement)
(i.e. U+FFFD).
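For a feel of what replacement looks like, the built-in ``TextDecoder`` in its default (non-fatal) mode also substitutes U+FFFD for invalid input. This is only an analogue: it does not call ethers, and for longer invalid runs the number of replacement characters it emits may differ from what the function above produces.

```javascript
// Default TextDecoder is lenient: invalid bytes become U+FFFD.
const lenient = new TextDecoder("utf-8");

// "h", one byte that is never valid in UTF-8, then "i".
const damaged = Uint8Array.of(0x68, 0xff, 0x69);

const replaced = lenient.decode(damaged);

// Dropping the replacement characters afterwards roughly mimics "ignore".
const stripped = replaced.replace(/\ufffd/g, "");

console.log(replaced === "h\ufffdi", stripped); // true hi
```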