_section: Strings @<strings>
A **String** is a representation of human-readable input or output,
which is often taken for granted.

When dealing with blockchains, properly handling human-readable and
human-provided data is important to prevent the loss of funds or
assets, incorrect permissions, etc.
_subsection: Bytes32String @<Bytes32String>
A string in Solidity is length-prefixed with its 256-bit (32 byte)
length, which means that even short strings require 2 words (64 bytes)
of storage.

In many cases, we deal with short strings, so instead of prefixing
the string with its length, we can null-terminate it and fit it in a
single word (32 bytes). Since we need only a single byte for the
null termination, we can store strings up to 31 bytes long in a
word.
_note: Note
Strings that are 31 __//bytes//__ long may contain fewer than 31 __//characters//__,
since UTF-8 requires multiple bytes to encode non-ASCII characters.
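
The difference between bytes and characters can be checked with a short standalone sketch. It uses the standard ``TextEncoder`` global rather than this library, and the helper names ``utf8ByteLength`` and ``codepointLength`` are hypothetical, chosen only for illustration.

```javascript
// Number of bytes the text occupies once encoded as UTF-8.
// TextEncoder is a standard global in browsers and Node.js.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Number of Unicode codepoints in the text.
// Array.from iterates by codepoint, so a surrogate pair counts once.
function codepointLength(text) {
  return Array.from(text).length;
}

const ascii = "hello";
const accented = "h\u00e9llo";   // U+00E9 takes 2 bytes in UTF-8
const cjk = "\u4f60\u597d";      // each of these takes 3 bytes in UTF-8

console.log(utf8ByteLength(ascii), codepointLength(ascii));       // 5 5
console.log(utf8ByteLength(accented), codepointLength(accented)); // 6 5
console.log(utf8ByteLength(cjk), codepointLength(cjk));           // 6 2
```

The 31 byte limit therefore applies to the first number in each pair, not the second.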
_property: ethers.utils.parseBytes32String(aBytesLike) => string @<utils-parseBytes32> @SRC<strings>
Returns the decoded string represented by the ``Bytes32`` encoded data.
_property: ethers.utils.formatBytes32String(text) => string<[[DataHexString]]<32>> @<utils-formatBytes32> @SRC<strings>
Returns a ``bytes32`` string representation of //text//. If the
UTF-8 encoded length of //text// exceeds 31 bytes, an error is thrown.
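
To make the null-terminated layout concrete, here is a minimal standalone sketch of the idea. It is not the library's implementation: ``encodeBytes32`` and ``decodeBytes32`` are hypothetical helpers, and the error messages are made up for this example.

```javascript
// Hypothetical helper: encode text as a null-padded 32-byte hex string.
function encodeBytes32(text) {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length > 31) {
    // One byte must remain free for the null terminator.
    throw new Error("text must be at most 31 bytes when UTF-8 encoded");
  }
  const word = new Uint8Array(32); // zero-filled, so the terminator is implicit
  word.set(bytes);
  return "0x" + Array.from(word, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Hypothetical helper: decode the bytes that precede the first null byte.
function decodeBytes32(hex) {
  const body = hex.slice(2);
  if (body.length !== 64) {
    throw new Error("expected exactly 32 bytes");
  }
  const word = Uint8Array.from(body.match(/../g), (pair) => parseInt(pair, 16));
  const end = word.indexOf(0);
  if (end === -1) {
    throw new Error("missing null terminator");
  }
  // fatal: true makes the decoder throw on invalid UTF-8 instead of replacing it.
  return new TextDecoder("utf-8", { fatal: true }).decode(word.subarray(0, end));
}

const encoded = encodeBytes32("Hello World!");
console.log(encoded);
console.log(decodeBytes32(encoded)); // Hello World!
```

Because the word is zero-filled before the text is copied in, any text of 31 bytes or fewer is automatically followed by at least one null byte.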
_subsection: UTF-8 Strings @<strings-utf8>
_property: ethers.utils.toUtf8Bytes(text [ , form = current ] ) => Uint8Array @<utils-toUtf8Bytes> @SRC<strings>
Returns the UTF-8 bytes of //text//, optionally normalizing it using the
[[strings--unicode-normalization-form]] //form//.
_property: ethers.utils.toUtf8CodePoints(text [ , form = current ] ) => Array<number> @<utils-toUtf8CodePoints> @SRC<strings>
Returns the Array of codepoints of //text//, optionally normalized using the
[[strings--unicode-normalization-form]] //form//.
_note: Note
This function splits //text// into its Unicode **codepoints**,
combining each surrogate pair into a single codepoint. A
user-perceived character may still span several codepoints (for
example, a letter followed by a combining accent). This should not be
confused with ``string.split("")``, which destroys surrogate pairs by
splitting between each UTF-16 code unit instead.
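
The difference is easy to see in plain JavaScript. The sketch below does not call the library; it uses string iteration (via ``Array.from``) as a stand-in for codepoint-aware splitting.

```javascript
// "a" followed by U+1F600, which lies outside the Basic Multilingual Plane
// and is therefore stored as a surrogate pair in a JavaScript string.
const text = "a\u{1F600}";

// split("") cuts between UTF-16 code units, breaking the surrogate pair.
const codeUnits = text.split("").map((unit) => unit.charCodeAt(0));

// Iterating the string walks it codepoint by codepoint.
const codepoints = Array.from(text, (char) => char.codePointAt(0));

console.log(codeUnits.map((n) => n.toString(16)));  // [ '61', 'd83d', 'de00' ]
console.log(codepoints.map((n) => n.toString(16))); // [ '61', '1f600' ]
```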
_property: ethers.utils.toUtf8String(aBytesLike [ , onError = error ] ) => string @<utils-toUtf8String> @SRC<strings>
Returns the string represented by the UTF-8 bytes of //aBytesLike//.

The //onError// parameter is a [Custom UTF-8 Error function](strings--error-handling).
If not specified, it defaults to the [error](strings--Utf8Error) function, which
throws an error on **any** UTF-8 error.
_subsection: UnicodeNormalizationForm @<strings--unicode-normalization-form> @SRC<strings/utf8:enum.UnicodeNormalizationForm>
There are several [commonly used forms](link-wiki-unicode-equivalence)
when normalizing Unicode data, which allow strings to be compared or
hashed in a stable way.
_property: ethers.utils.UnicodeNormalizationForm.current
Maintain the current form; no normalization is applied.
_property: ethers.utils.UnicodeNormalizationForm.NFC
The Composed Normalization Form. This form uses a single codepoint
(when one exists) to represent each fully composed character.

For example, the **é** is a single codepoint, ``0x00e9``.
_property: ethers.utils.UnicodeNormalizationForm.NFD
The Decomposed Normalization Form. This form uses multiple codepoints
(when necessary) to compose a character.

For example, the **é** is made up of two codepoints: ``0x0065`` (the
letter ``"e"``) and ``0x0301``, a combining diacritic codepoint which
indicates that the previous character should have an acute accent.
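
The composed and decomposed forms can be compared with the standard ``String.prototype.normalize`` method. This is a standalone sketch that does not go through the library; the helper ``toCodepoints`` is hypothetical.

```javascript
const composed = "\u00e9";    // the single codepoint form
const decomposed = "e\u0301"; // "e" followed by the combining acute accent

// Hypothetical helper: list the codepoints of a string.
const toCodepoints = (text) => Array.from(text, (char) => char.codePointAt(0));

// The two strings render identically but are not equal until normalized.
console.log(composed === decomposed);                  // false
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true

console.log(toCodepoints(composed.normalize("NFD")));   // [ 101, 769 ]
console.log(toCodepoints(decomposed.normalize("NFC"))); // [ 233 ]
```

Normalizing both sides to the same form before comparing or hashing is what makes the result stable.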
_property: ethers.utils.UnicodeNormalizationForm.NFKC
The Composed Normalization Form with Compatibility Equivalence. The
Compatibility representation folds characters which have the same
syntactic representation but different semantic meaning.

For example, the Roman Numeral **I**, which has the Unicode
codepoint ``0x2160``, is folded into the capital letter I, ``0x0049``.
_property: ethers.utils.UnicodeNormalizationForm.NFKD
The Decomposed Normalization Form with Compatibility Equivalence.
See NFKC for an example.
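
The folding performed by the two "K" forms can be observed with the standard ``String.prototype.normalize`` method. This sketch is independent of the library and only illustrates the Unicode behavior.

```javascript
const romanOne = "\u2160"; // ROMAN NUMERAL ONE

// NFC and NFD leave the Roman numeral untouched.
console.log(romanOne.normalize("NFC") === romanOne);  // true

// NFKC and NFKD fold it into the capital letter I (U+0049).
console.log(romanOne.normalize("NFKC") === "I");      // true
console.log(romanOne.normalize("NFKD") === "I");      // true

// The two "K" forms still differ in composition, exactly as NFC and NFD do.
const mixed = "\u2160\u00e9";
console.log(mixed.normalize("NFKC") === "I\u00e9");   // true
console.log(mixed.normalize("NFKD") === "Ie\u0301");  // true
```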
_note: Note
Only certain specified characters are folded in Compatibility Equivalence, and thus
it should **not** be considered a method to achieve //any// level of security from
[homoglyph attacks](link-wiki-homoglyph).
_subsection: Custom UTF-8 Error Handling @<strings--error-handling>
When converting UTF-8 bytes to a string or to codepoints, invalid byte
sequences may be encountered. Since certain situations may need specific
ways to handle UTF-8 errors, a custom error handling function can be used,
which has the signature:
_property: errorFunction(reason, offset, bytes, output [ , badCodepoint ]) => number
The //reason// is one of the [UTF-8 Error Reasons](strings--error-reasons), //offset// is the index
into //bytes// where the error was first encountered, and //output// is the list
of codepoints already processed (which may be modified). For certain Error
Reasons, //badCodepoint// holds the codepoint that was computed but
would be rejected because its value is invalid.

This function should return the number of bytes to skip past, keeping in
mind that the value at //offset// will already be consumed.
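
As a rough illustration of that contract, the sketch below shows one possible handler with the same parameter order. It is an assumption-laden example, not the library's code: ``lenientHandler`` is a hypothetical name, and the ``Reason`` object is a local stand-in whose values are invented here (the real values come from ``ethers.utils.Utf8ErrorReason``).

```javascript
// Hypothetical stand-ins for the library's reason values.
const Reason = Object.freeze({
  BAD_PREFIX: "BAD_PREFIX",
  UNEXPECTED_CONTINUE: "UNEXPECTED_CONTINUE",
  OVERLONG: "OVERLONG",
});

// Hypothetical handler: accept overlong sequences, replace everything else.
function lenientHandler(reason, offset, bytes, output, badCodepoint) {
  if (reason === Reason.OVERLONG) {
    // Emit the codepoint the overlong sequence decoded to; nothing more to skip.
    output.push(badCodepoint);
    return 0;
  }

  // Any other problem becomes the replacement character U+FFFD.
  output.push(0xfffd);

  if (reason === Reason.BAD_PREFIX || reason === Reason.UNEXPECTED_CONTINUE) {
    // The byte at offset is already consumed, so count only the
    // continuation bytes (0b10xxxxxx) that immediately follow it.
    let skip = 0;
    for (let i = offset + 1; i < bytes.length; i++) {
      if (bytes[i] >> 6 !== 0x02) { break; }
      skip++;
    }
    return skip;
  }

  return 0;
}

// Direct call with hand-built arguments, to show the return value.
const seen = [];
const skipped = lenientHandler(Reason.BAD_PREFIX, 0, Uint8Array.of(0xff, 0x80, 0x80, 0x41), seen);
console.log(skipped, seen); // 2 [ 65533 ]
```

The returned count excludes the byte at //offset// itself, which matches the note above that this byte has already been consumed.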
_heading: UTF-8 Error Reasons @<strings--error-reasons> @SRC<strings/utf8:Utf8ErrorReason>
_property: ethers.utils.Utf8ErrorReason.BAD_PREFIX
A byte was encountered which cannot begin a valid UTF-8 byte
sequence.
_property: ethers.utils.Utf8ErrorReason.MISSING_CONTINUE
A UTF-8 sequence was begun, but did not have enough continuation
bytes for the sequence. For this error, the //offset// is the index
at which a continuation byte was expected.
_property: ethers.utils.Utf8ErrorReason.OUT_OF_RANGE
The computed codepoint is outside the range of valid Unicode
codepoints (i.e. the codepoint is greater than 0x10ffff).
This reason will pass the computed //badCodepoint// into
the custom error function.
_property: ethers.utils.Utf8ErrorReason.OVERLONG
Because UTF-8 uses variable-length byte sequences, the same
character can be encoded in more than one way. As a result,
[overlong sequences](link-wiki-utf8-overlong)
allow a non-canonical string to be formed, which can
impact security, as multiple strings that are otherwise
equal can have different hashes.

Generally, overlong sequences are an attempt to circumvent
some part of security, but in rare cases they may be produced by
lazy libraries or used to encode the null terminating
character in a way that is safe to include in a ``char*``.

This reason will pass the computed //badCodepoint// into the
custom error function, which is actually a valid codepoint, just
one that was arrived at through unsafe methods.
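
A small worked example may help. The bytes ``0xc0 0xaf`` are a well-known overlong encoding of ``"/"`` (``0x2f``), which only needs one byte. The sketch below decodes them by hand and then shows a strict decoder rejecting them; it uses the standard ``TextDecoder`` rather than the library, and the variable names are illustrative only.

```javascript
// A two-byte sequence whose payload bits encode a value that fits in one byte.
const overlong = Uint8Array.of(0xc0, 0xaf);

// Naive two-byte decode: 5 payload bits from the prefix, 6 from the continuation.
const naiveCodepoint = ((overlong[0] & 0x1f) << 6) | (overlong[1] & 0x3f);
console.log(naiveCodepoint.toString(16));         // 2f
console.log(String.fromCodePoint(naiveCodepoint)); // /

// The shortest (and only valid) encoding of the same codepoint is one byte.
const canonical = new TextEncoder().encode("/");
console.log(Array.from(canonical));                // [ 47 ]

// A strict decoder refuses the overlong form instead of yielding "/".
let rejected = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(overlong);
} catch (error) {
  rejected = true;
}
console.log(rejected);                             // true
```

Two different byte strings would otherwise decode to the same text while hashing differently, which is the hazard described above.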
_property: ethers.utils.Utf8ErrorReason.OVERRUN
The data does not have enough bytes remaining for the
length of this sequence.
_property: ethers.utils.Utf8ErrorReason.UNEXPECTED_CONTINUE
This error is similar to BAD_PREFIX, since a continuation byte
cannot begin a valid sequence, but it is reported separately for
those who wish to process it differently. Most developers would
want to trap this and perform the same operation as for BAD_PREFIX.
_property: ethers.utils.Utf8ErrorReason.UTF16_SURROGATE
The computed codepoint represents a value reserved for
UTF-16 surrogate pairs.
This reason will pass the computed surrogate half
//badCodepoint// into the custom error function.
_heading: Provided UTF-8 Error Handling Functions
There are already several functions available for the most common
situations.
_property: ethers.utils.Utf8ErrorFuncs.error @<strings--Utf8Error> @SRC<strings/utf8:errorFunc>
This will throw an error on **any** error with a UTF-8 sequence, including
invalid prefix bytes, overlong sequences and UTF-16 surrogate pairs.
_property: ethers.utils.Utf8ErrorFuncs.ignore @<strings--Utf8Ignore> @SRC<strings/utf8:ignoreFunc>
This will drop all invalid sequences (by consuming invalid prefix bytes and
any following continuation bytes) from the final string as well as permit
overlong sequences to be converted to their equivalent string.
_property: ethers.utils.Utf8ErrorFuncs.replace @<strings--Utf8Replace> @SRC<strings/utf8:replaceFunc>
This will replace all invalid sequences (by consuming invalid prefix bytes and
any following continuation bytes) with the
[UTF-8 Replacement Character](link-wiki-utf8-replacement)
(i.e. U+FFFD).
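
For comparison, the standard ``TextDecoder`` offers two broadly similar behaviors: replacing invalid input with U+FFFD by default, and throwing when ``fatal`` is set. The sketch below shows only that standard API as an analogy; it does not call the functions described above, and the exact number of replacement characters emitted for longer invalid runs may differ between decoders.

```javascript
// "A", then a byte that can never appear in UTF-8, then "B".
const invalidBytes = Uint8Array.of(0x41, 0xff, 0x42);

// Default mode: the invalid byte is replaced with U+FFFD.
const replaced = new TextDecoder("utf-8").decode(invalidBytes);
console.log(replaced === "A\ufffdB"); // true

// Fatal mode: any invalid sequence raises an error instead.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalidBytes);
} catch (error) {
  threw = true;
}
console.log(threw); // true
```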