|
|
|
_title: Strings
|
|
|
|
|
|
|
|
_section: Strings
|
|
|
|
|
|
|
|
Utilities for working with ``Bytes32`` encoded strings and UTF-8 data.
|
|
|
|
|
|
|
|
|
|
|
|
_subsection: Bytes32String @<bytes32-string>
|
|
|
|
|
|
|
|
A string in Solidity is length-prefixed with its 256-bit (32 byte)
length, which means that even short strings require 2 words (64 bytes)
of storage.
|
|
|
|
|
|
|
|
In many cases, we deal with short strings, so instead of prefixing
the string with its length, we can null-terminate it and fit it in a
single word (32 bytes). Since we need only a single byte for the
null termination, we can store strings up to 31 bytes long in a
word.
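The packing described above can be sketched in plain JavaScript. The helpers ``packBytes32`` and ``unpackBytes32`` are hypothetical names used only for illustration; they are not the library's implementation, and they assume a runtime that provides the standard ``TextEncoder`` and ``TextDecoder`` globals.

```javascript
// Hypothetical helper (illustration only): packs text into a single
// null-terminated 32-byte word.
function packBytes32(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the text
  if (bytes.length > 31) {
    throw new Error("text exceeds 31 bytes");
  }
  const word = new Uint8Array(32); // zero-filled, so the tail is the terminator
  word.set(bytes);
  return word;
}

// Hypothetical inverse (illustration only): decodes the bytes that come
// before the first null byte.
function unpackBytes32(word) {
  if (word.length !== 32 || word[31] !== 0) {
    throw new Error("not a null-terminated 32-byte word");
  }
  const end = word.indexOf(0);
  return new TextDecoder().decode(word.subarray(0, end));
}
```

Because the last byte is always reserved for the terminator, a 32-byte input is rejected even though it would fit in the word.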
|
|
|
|
|
|
|
|
_note: Note
|
|
|
|
Strings that are 31 __//bytes//__ long may contain fewer than 31 __//characters//__,
since UTF-8 requires multiple bytes to encode each non-ASCII character.
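As a small illustration of the difference between bytes and characters, the following sketch compares the JavaScript string length with the UTF-8 byte length. It uses only the standard ``TextEncoder`` global; the variable names are arbitrary.

```javascript
const encoder = new TextEncoder();

const plain = "hello";         // 5 ASCII characters
const accented = "h\u00e9llo"; // 5 characters, one of them "é"

const plainByteLength = encoder.encode(plain).length;       // one byte per character
const accentedByteLength = encoder.encode(accented).length; // "é" takes two bytes
```

Here ``accented.length`` is still 5, while its UTF-8 encoding occupies 6 bytes.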
|
|
|
|
|
|
|
|
_property: utils.parseBytes32String(aBytesLike) => string @<utils-parsebytes32> @SRC<strings>
|
|
|
|
Returns the decoded string represented by the ``Bytes32`` encoded data.
|
|
|
|
|
|
|
|
_property: utils.formatBytes32String(text) => string<[[datahexstring]]<32>> @<utils-formatbytes32> @SRC<strings>
|
|
|
|
Returns a ``bytes32`` string representation of //text//. If the
UTF-8 encoded length of //text// exceeds 31 bytes, an error is thrown.
|
|
|
|
|
|
|
|
|
|
|
|
_subsection: UTF-8 Strings @<utf8-string>
|
|
|
|
|
|
|
|
_property: utils.toUtf8Bytes(text [ , form = current ] ) => Uint8Array @<utils-toutf8bytes> @SRC<strings>
|
|
|
|
Returns the UTF-8 bytes of //text//, optionally normalizing it using the
[[unicode-normalization-form]] //form//.
|
|
|
|
|
|
|
|
_property: utils.toUtf8CodePoints(text [ , form = current ] ) => Array<number> @<utils-toutf8codepoints> @SRC<strings>
Returns the Array of codepoints of //text//, optionally normalizing it using the
[[unicode-normalization-form]] //form//.
|
|
|
|
|
|
|
|
_note: Note
|
|
|
|
This function correctly splits its input into codepoints, keeping each
surrogate pair together as a single codepoint. This should not be confused with
``string.split("")``, which destroys surrogate pairs, splitting between each UTF-16
code unit instead.
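The difference can be seen with standard JavaScript alone. This sketch does not call the library; it uses string iteration (which walks codepoints) as a stand-in to contrast with ``split("")``.

```javascript
// "a" followed by an emoji that lies outside the Basic Multilingual Plane,
// so it is stored as a surrogate pair (two UTF-16 code units).
const sample = "a\u{1F600}";

// split("") cuts between UTF-16 code units, breaking the surrogate pair.
const codeUnits = sample.split("");

// Iterating the string yields whole codepoints instead.
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));
```

``codeUnits`` has three entries (the emoji is torn into two lone surrogates), while ``codePoints`` has two: ``0x61`` and ``0x1f600``.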
|
|
|
|
|
|
|
|
_property: utils.toUtf8String(aBytesLike [ , ignoreErrors = false ] ) => string @<utils-toutf8string> @SRC<strings>
|
|
|
|
Returns the string represented by the UTF-8 bytes of //aBytesLike//. This will
throw an error for invalid surrogates, overlong sequences or other UTF-8 issues,
unless //ignoreErrors// is ``true``.
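The strict versus lenient distinction resembles the ``fatal`` option of the standard ``TextDecoder``. The sketch below is only an analogy: it does not use the library, and the lenient decoder here substitutes the replacement character U+FFFD for bad bytes, which may differ from how //ignoreErrors// treats them.

```javascript
// "hi" followed by 0xff, a byte that is never valid in UTF-8.
const invalidBytes = new Uint8Array([0x68, 0x69, 0xff]);

// Strict decoding: malformed input raises an error.
let strictError = null;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(invalidBytes);
} catch (error) {
  strictError = error;
}

// Lenient decoding: the bad byte becomes U+FFFD and decoding continues.
const lenientResult = new TextDecoder("utf-8").decode(invalidBytes);
```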
|
|
|
|
|
|
|
|
|
|
|
|
_heading: UnicodeNormalizationForm @<unicode-normalization-form> @SRC<strings/utf8:enum.UnicodeNormalizationForm>
|
|
|
|
|
|
|
|
There are several [commonly used forms](https://en.wikipedia.org/wiki/Unicode_equivalence)
when normalizing UTF-8 data, which allow strings to be compared or hashed in a stable
way.
|
|
|
|
|
|
|
|
_property: utils.UnicodeNormalizationForm.current
|
|
|
|
Maintain the current normalization form.
|
|
|
|
|
|
|
|
_property: utils.UnicodeNormalizationForm.NFC
|
|
|
|
The Composed Normalization Form. This form uses single codepoints
which represent the fully composed character.
|
|
|
|
|
|
|
|
For example, **é** is the single codepoint ``0x00e9``.
|
|
|
|
|
|
|
|
_property: utils.UnicodeNormalizationForm.NFD
|
|
|
|
The Decomposed Normalization Form. This form uses multiple codepoints
(when necessary) to compose a character.
|
|
|
|
|
|
|
|
For example, **é** is made up of two codepoints, ``0x0065`` (which is the
letter ``"e"``) and ``0x0301``, a combining diacritic Unicode codepoint which
indicates the previous character should have an acute accent.
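The two forms can be compared using the standard ``String.prototype.normalize`` method. This is a sketch in plain JavaScript rather than a call into the library; ``codePointsOf`` is a hypothetical helper defined here for illustration.

```javascript
// Hypothetical helper: lists the codepoints of a string.
const codePointsOf = (s) => Array.from(s, (ch) => ch.codePointAt(0));

const composed = "\u00e9".normalize("NFC");   // single codepoint 0x00e9
const decomposed = "\u00e9".normalize("NFD"); // 0x0065 followed by 0x0301

const composedPoints = codePointsOf(composed);
const decomposedPoints = codePointsOf(decomposed);

// The two strings render identically but are not equal until normalized.
const equalRaw = composed === decomposed;
const equalNormalized = composed === decomposed.normalize("NFC");
```

This is why normalizing both sides to the same form matters before comparing or hashing: ``equalRaw`` is ``false`` while ``equalNormalized`` is ``true``.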
|
|
|
|
|
|
|
|
_property: utils.UnicodeNormalizationForm.NFKC
|
|
|
|
The Composed Normalization Form with Compatibility Equivalence. The Compatibility
representation folds characters which have the same visual representation
but different semantic meaning.
|
|
|
|
|
|
|
|
For example, the Roman Numeral **I**, which has the Unicode
codepoint ``0x2160``, is folded into the capital letter I, ``0x0049``.
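The folding of the Roman Numeral can be observed with the standard ``String.prototype.normalize`` method. This sketch uses plain JavaScript rather than the library.

```javascript
const romanNumeralOne = "\u2160"; // ROMAN NUMERAL ONE

// NFC leaves the Roman Numeral untouched.
const afterNFC = romanNumeralOne.normalize("NFC");

// NFKC folds it into the capital letter I (0x0049).
const afterNFKC = romanNumeralOne.normalize("NFKC");
```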
|
|
|
|
|
|
|
|
_property: utils.UnicodeNormalizationForm.NFKD
|
|
|
|
The Decomposed Normalization Form with Compatibility Equivalence.
See NFKC for an example.
|
|
|
|
|
|
|
|
_note: Note
|
|
|
|
Only certain specified characters are folded in Compatibility Equivalence, and thus
it should **not** be considered a method to achieve //any// level of security from
[homoglyph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).
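As one illustration of this limitation, the Cyrillic letter **а** (``0x0430``) looks like the Latin letter **a** (``0x0061``) in many fonts, yet normalization does not fold one into the other. The sketch below uses the standard ``String.prototype.normalize`` method, not the library.

```javascript
const latinA = "\u0061";    // LATIN SMALL LETTER A
const cyrillicA = "\u0430"; // CYRILLIC SMALL LETTER A, a visual look-alike

// Normalization leaves the Cyrillic letter unchanged...
const foldedCyrillic = cyrillicA.normalize("NFKC");

// ...so the two look-alike strings still compare as different.
const stillDifferent = foldedCyrillic !== latinA.normalize("NFKC");
```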
|