There has been confusion about whether noncharacters are permitted in Unicode text. The new Corrigendum #9: Clarification About Noncharacters makes it clear that noncharacters are permissible even in open interchange, although their intended semantics may not be interpretable in such contexts. The UTF-8, UTF-16, UTF-32 & BOM FAQ has also been updated for clarity, and other informative text about noncharacters, including the Core Specification, will be revised over time.
Background. There are 66 noncharacters permanently reserved for internal use, typically for some sort of internally defined control function or sentinel value. They should be supported by APIs, components, and applications that handle (i.e., either process or pass through) all Unicode strings, such as a text editor or string class. Where an application does make internal use of a noncharacter, it should sanitize input text from unknown sources. The best practice is to replace that particular noncharacter on input with U+FFFD. (The noncharacter should not be simply deleted, since that can cause security problems. For more information, see Section 3.5 Deletion of Code Points in UTR #36, Unicode Security Guidelines.)
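As an illustration of the practice described above, here is a minimal Python sketch. The 66 noncharacters are the 32 code points U+FDD0..U+FDEF plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). The function names `is_noncharacter` and `sanitize_input`, and the choice of U+FDD0 as the internal sentinel in the usage note, are assumptions made for this example and do not come from the corrigendum.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    # 32 noncharacters in the contiguous range U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def sanitize_input(text: str, internal_sentinels: frozenset) -> str:
    """Replace each internally used noncharacter with U+FFFD.

    Only the noncharacters this application reserves for itself are replaced
    (hypothetical policy for this sketch); all other code points, including
    other noncharacters, pass through. Nothing is deleted, so the length of
    the string in code points is preserved.
    """
    return "".join("\uFFFD" if ch in internal_sentinels else ch for ch in text)
```

An application that reserves U+FDD0 as a sentinel might call `sanitize_input(untrusted_text, frozenset({"\uFDD0"}))` on text from unknown sources before processing it. Replacing rather than deleting keeps two fragments that were separated by the noncharacter from joining into a new sequence, which is the kind of security problem UTR #36 describes.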