Strings
=======

Bytes32String
-------------

A string in Solidity is prefixed with its 256-bit (32 byte) length, which means that even short strings require 2 words (64 bytes) of storage.

In many cases, we deal with short strings, so instead of prefixing the string with its length, we can null-terminate it and fit it in a single word (32 bytes). Since we need only a single byte for the null termination, we can store strings up to 31 bytes long in a word.

#### **Note:**

Strings that are 31 **bytes** long may contain fewer than 31 **characters**, since UTF-8 requires multiple bytes to encode international characters.

#### *utils* . **parseBytes32String** ( aBytesLike ) **=>** *string*

Returns the decoded string represented by the `Bytes32` encoded data.

#### *utils* . **formatBytes32String** ( text ) **=>** *string*

Returns a `bytes32` string representation of *text*. If the length of *text* exceeds 31 bytes, it will throw an error.

UTF-8 Strings
-------------

#### *utils* . **toUtf8Bytes** ( text [ , form=current ] ) **=>** *Uint8Array*

Returns the UTF-8 bytes of *text*, optionally normalizing it using the [UnicodeNormalizationForm](./) *form*.

#### *utils* . **toUtf8CodePoints** ( aBytesLike [ , form=current ] ) **=>** *Array< number >*

Returns the Array of codepoints of *aBytesLike*, optionally normalizing it using the [UnicodeNormalizationForm](./) *form*.

**Note:** This function correctly splits the text into its codepoints, accounting for surrogate pairs. This should not be confused with `string.split("")`, which destroys surrogate pairs by splitting between each UTF-16 code unit instead.

#### *utils* . **toUtf8String** ( aBytesLike [ , ignoreErrors=false ] ) **=>** *string*

Returns the string represented by the UTF-8 bytes of *aBytesLike*.

This will throw an error for invalid surrogates, overlong sequences or other UTF-8 issues, unless *ignoreErrors* is specified.
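
The null-terminated single-word layout described under Bytes32String can be sketched in plain JavaScript. This is only an illustration of the encoding, not the library's implementation: the helper names `formatBytes32StringSketch` and `parseBytes32StringSketch` are hypothetical, and the sketch assumes a runtime that provides the standard `TextEncoder` and `TextDecoder` globals (for example a recent Node.js or a browser).

```javascript
// Hypothetical helpers illustrating the null-terminated 32-byte layout.
// They are NOT the library's functions, only a sketch of the idea.

function formatBytes32StringSketch(text) {
  // Encode to UTF-8 first: the 31-byte limit applies to bytes, not characters.
  const bytes = new TextEncoder().encode(text);
  if (bytes.length > 31) {
    throw new Error("text exceeds 31 bytes and cannot fit in a single word");
  }
  // A zero-filled 32-byte word; the unused tail acts as the null terminator.
  const word = new Uint8Array(32);
  word.set(bytes, 0);
  return "0x" + Array.from(word, (b) => b.toString(16).padStart(2, "0")).join("");
}

function parseBytes32StringSketch(hex) {
  const body = hex.startsWith("0x") ? hex.slice(2) : hex;
  if (!/^[0-9a-fA-F]{64}$/.test(body)) {
    throw new Error("expected exactly 32 bytes of hex data");
  }
  const word = Uint8Array.from(body.match(/../g), (pair) => parseInt(pair, 16));
  if (word[31] !== 0) {
    throw new Error("missing null terminator");
  }
  // Everything before the first zero byte is the UTF-8 payload.
  const length = word.indexOf(0);
  return new TextDecoder().decode(word.subarray(0, length));
}

const encoded = formatBytes32StringSketch("hello");
console.log(encoded);                           // "0x68656c6c6f" followed by 54 zeros
console.log(parseBytes32StringSketch(encoded)); // "hello"
```

Because the limit is measured in bytes, `"\u00e9".repeat(16)` (16 characters, 32 UTF-8 bytes) is rejected by this sketch, while `"\u00e9".repeat(15)` (30 bytes) fits.
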
### UnicodeNormalizationForm

There are several [commonly used forms](https://en.wikipedia.org/wiki/Unicode_equivalence) when normalizing UTF-8 data, which allow strings to be compared or hashed in a stable way.

#### *utils* . *UnicodeNormalizationForm* . **current**

Maintain the current normalization form.

#### *utils* . *UnicodeNormalizationForm* . **NFC**

The Composed Normalization Form. This form uses single codepoints which represent the fully composed character.

For example, the **é** is a single codepoint, `0x00e9`.

#### *utils* . *UnicodeNormalizationForm* . **NFD**

The Decomposed Normalization Form. This form uses multiple codepoints (when necessary) to compose a character.

For example, the **é** is made up of two codepoints, `0x0065` (which is the letter `"e"`) and `0x0301`, a combining diacritic codepoint which indicates the previous character should have an acute accent.

#### *utils* . *UnicodeNormalizationForm* . **NFKC**

The Composed Normalization Form with Compatibility Equivalence. The Compatibility representation folds characters which have the same syntactic representation but different semantic meaning.

For example, the Roman Numeral **I**, which has the Unicode codepoint `0x2160`, is folded into the capital letter I, `0x0049`.

#### *utils* . *UnicodeNormalizationForm* . **NFKD**

The Decomposed Normalization Form with Compatibility Equivalence. See NFKC for an example.

#### **Note:**

Only certain specified characters are folded in Compatibility Equivalence, and thus it should not be considered a method to achieve *any* level of security from [homoglyph attacks](https://en.wikipedia.org/wiki/IDN_homograph_attack).
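
The effect of the normalization forms on codepoints can be illustrated with the built-in `String.prototype.normalize`. This is a minimal sketch and not the library's implementation: the helper `codePointsOf` is a hypothetical name, and passing no form is assumed to mirror the **current** behaviour of leaving the text untouched.

```javascript
// Hypothetical helper: returns the codepoints of `text`, optionally after
// normalizing it with one of "NFC", "NFD", "NFKC" or "NFKD".
function codePointsOf(text, form) {
  const normalized = form ? text.normalize(form) : text;
  // Array.from iterates by codepoint, so surrogate pairs stay intact.
  return Array.from(normalized, (ch) => ch.codePointAt(0));
}

const composed = "\u00e9";    // é as a single codepoint
const decomposed = "e\u0301"; // e followed by a combining acute accent

console.log(composed === decomposed);                                   // false
console.log(composed.normalize("NFD") === decomposed.normalize("NFD")); // true

console.log(codePointsOf(composed, "NFD"));   // [ 101, 769 ]  (0x65, 0x301)
console.log(codePointsOf(decomposed, "NFC")); // [ 233 ]       (0xe9)

// Compatibility folding: ROMAN NUMERAL ONE becomes the capital letter I.
console.log(codePointsOf("\u2160", "NFC"));   // [ 8544 ]      (0x2160, unchanged)
console.log(codePointsOf("\u2160", "NFKC"));  // [ 73 ]        (0x49)

// Codepoints versus UTF-16 code units for a character outside the BMP.
console.log(codePointsOf("\u{1F600}").length); // 1
console.log("\u{1F600}".split("").length);     // 2
```

Normalizing both sides to the same form before comparing or hashing is what makes the two spellings of **é** equal here. The last two lines show why splitting on UTF-16 code units is not a substitute for iterating codepoints.
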