The "PanCJKV" IVD Collection—Unregistered

News | Adobe Blogs | Dr. Ken Lunde | 2016-02-26 15:25:29

Much of the thinking that I did with regard to this unregistered—but hopefully soon-to-be-registered—IVD (Ideographic Variation Database) collection was done while visiting my parents in South Dakota, with one of the highlights of that trip being a scenic drive through Badlands National Park.

First and foremost, please forget, or at least ignore, most everything that was written in the 2016-02-13 and 2016-02-20 articles (which makes one wonder why I am linking to them, but I digress). Far too many things have changed, and what I present in this article represents the IVD collection that I hope will be registered later this year.

In terms of changes, two more regions, SG (Republic of Singapore) and MY (Malaysia), were added, along with the pseudo-region XK (Kāngxī; note that the order of the characters in this region code is intentional, so that it is a valid user-defined region code). This increases the total number of IVSes (Ideographic Variation Sequences) per character to eleven. The assignment of VSes (Variation Selectors) to regions was also reversed. For Unicode Version 8.0, this means a total of 884,268 IVSes. The latest PanCJKV IVD Collection project on GitHub reflects these changes, and its address is also expected to serve as the URL of the site that describes the collection, as indicated in the IVD_Collections.txt file.
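The 884,268 figure can be checked with a little arithmetic: eleven IVSes for each CJK Unified Ideograph in Unicode Version 8.0. The sketch below is only a sanity check. The block ranges are taken from the Unicode 8.0 code charts rather than from this article, and the names are illustrative:

```python
# Inclusive code point ranges of the CJK Unified Ideographs blocks as assigned
# in Unicode 8.0 (an assumption taken from the code charts, not from the article).
UNIFIED_RANGES_8_0 = [
    (0x4E00, 0x9FD5),    # URO
    (0x3400, 0x4DB5),    # Extension A
    (0x20000, 0x2A6D6),  # Extension B
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
    (0x2B820, 0x2CEA1),  # Extension E
]

# Twelve characters in the CJK Compatibility Ideographs block are actually
# unified ideographs (U+FA0E, U+FA0F, U+FA11, and so on).
UNIFIED_IN_COMPATIBILITY_BLOCK = 12

IVS_PER_CHARACTER = 11  # one per region or pseudo-region

def unified_ideograph_count() -> int:
    """Count CJK Unified Ideographs under the assumptions above."""
    in_blocks = sum(end - start + 1 for start, end in UNIFIED_RANGES_8_0)
    return in_blocks + UNIFIED_IN_COMPATIBILITY_BLOCK

total_characters = unified_ideograph_count()
total_ivses = total_characters * IVS_PER_CHARACTER
print(total_characters, total_ivses)  # 80388 884268
```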

One of the more interesting changes, in my opinion, is the addition of the pseudo-region XK (Kāngxī), which corresponds to forms that follow the style of the legendary Kāngxī dictionary (康熙字典 kāngxī zìdiǎn). I feel that this is important, because there is no region that completely adheres to this style, meaning that there is also no way to specify this style via language tagging. As the architect and developer of the Adobe-branded Source Han Sans and Google-branded Noto Sans CJK typeface families, I have received a non-zero number of requests to follow the Kāngxī dictionary style. While we have no such plans for these two joined-at-the-hip typeface families, there is clear value in supporting such fonts via this IVD collection.

The table below lists the eleven VSes that would be completely consumed by this IVD collection, their code points, and to which region (or pseudo-region) they correspond (using a two-letter region code that is reflected in the sequence identifier):

Variation Selector  Code Point  Region Code
VS246               U+E01E5     VN (Việt Nam)
VS247               U+E01E6     KP (DPRK)
VS248               U+E01E7     KR (ROK)
VS249               U+E01E8     JP (Japan)
VS250               U+E01E9     MY (Malaysia)
VS251               U+E01EA     MO (Macao SAR)
VS252               U+E01EB     HK (Hong Kong SAR)
VS253               U+E01EC     TW (ROC)
VS254               U+E01ED     SG (Republic of Singapore)
VS255               U+E01EE     CN (PRC)
VS256               U+E01EF     XK (Kāngxī), pseudo-region
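The assignments in the table follow a regular pattern: VS17 through VS256 occupy U+E0100 through U+E01EF, so each code point can be derived from the selector number. Below is a minimal Python sketch of that mapping, assuming the region order shown in the table. The names REGION_ORDER, vs_code_point, and ivs_for are illustrative and are not part of the project:

```python
# Region codes in selector order, VS246 through VS256, as listed in the table.
REGION_ORDER = ["VN", "KP", "KR", "JP", "MY", "MO", "HK", "TW", "SG", "CN", "XK"]
FIRST_VS_NUMBER = 246  # the lowest-numbered selector used by the collection

def vs_code_point(vs_number: int) -> int:
    """Return the code point of VSn in the Variation Selectors Supplement."""
    if not 17 <= vs_number <= 256:
        raise ValueError("only VS17 through VS256 are in the supplement block")
    return 0xE0100 + (vs_number - 17)

def ivs_for(base: str, region: str) -> str:
    """Return the base character followed by the selector for the region."""
    vs_number = FIRST_VS_NUMBER + REGION_ORDER.index(region)
    return base + chr(vs_code_point(vs_number))

# The Japanese form of U+66DC is the base character followed by VS249.
print([f"U+{ord(c):04X}" for c in ivs_for("曜", "JP")])  # ['U+66DC', 'U+E01E8']
```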

Sequence identifiers use strings of the form uniHHHH (BMP) or uHHHHH (non-BMP) that correspond to the CJK Unified Ideograph (aka Base Character), followed by an underscore (U+005F), and finally a two-letter region code as shown in the table above. For example, the sequence identifier for the form of 曜 (U+66DC) that is used in Japan is uni66DC_JP.
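As a rough illustration of the naming rule, the following Python sketch builds a sequence identifier from a base character and a region code. The function name sequence_identifier is hypothetical, and the sketch does not validate that the base character is actually a CJK Unified Ideograph:

```python
def sequence_identifier(base: str, region: str) -> str:
    """Build an identifier such as uni66DC_JP from a base character and region."""
    code_point = ord(base)
    if code_point <= 0xFFFF:
        # BMP characters use the "uni" prefix and four hexadecimal digits.
        prefix = f"uni{code_point:04X}"
    else:
        # Non-BMP characters use the "u" prefix and five hexadecimal digits.
        prefix = f"u{code_point:05X}"
    return f"{prefix}_{region}"

print(sequence_identifier("曜", "JP"))          # uni66DC_JP
print(sequence_identifier("\U00020000", "CN"))  # u20000_CN
```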

The image below displays all eleven IVSes that correspond to the four CJK Unified Ideographs 一 (U+4E00), 字 (U+5B57), 骨 (U+9AA8), and 曜 (U+66DC) rendered using the four-region example font implementation, SourceHanSansR04-Regular.otf, with the default region (CN) in blue, supported regions (KR, JP, and TW) in black, and unsupported regions in red (whose glyphs necessarily default to the default region):

This IVD collection will differ from other registered IVD collections in that representative glyphs are not supplied, because they are not necessary. Instead, each IVS corresponds to what is best described as the intended or appropriate representation for a given region, and it is up to each implementation to decide which regions are supported, and which glyphs to supply.

The scope and intent of this IVD collection are best stated as follows: The IVSes for each CJK Unified Ideograph are expected to be displayed according to the conventions and limitations of a particular implementation, in terms of which particular regions are supported and which glyphs are supplied (or default) for a given region. There is no guarantee that characters will display according to the Unicode code charts or according to regional conventions.

UPDATE: Based on some of the comments below, I have changed the region code KX to XK so that it is a valid user-defined one, and have added to the open source project an alternate UVS definition file, SourceHanSansR11_CN_sequences.txt, that aliases the seven unsupported regions to the four supported ones, specifically MY/SG to CN, HK/MO/VN to TW, and KP/XK to KR. The overly colorful image below illustrates the effect of this aliasing:
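The aliasing amounts to a small lookup from each of the eleven region codes to one of the four supported regions. The Python sketch below only illustrates that mapping. It is not the project's script, it does not reproduce the format of the UVS definition file, and the names SUPPORTED_REGIONS, REGION_ALIASES, and resolve_region are hypothetical:

```python
# The four regions for which the example font supplies glyphs.
SUPPORTED_REGIONS = {"CN", "JP", "KR", "TW"}

# Unsupported regions and the supported region each is aliased to,
# as stated in the update: MY/SG to CN, HK/MO/VN to TW, KP/XK to KR.
REGION_ALIASES = {
    "MY": "CN", "SG": "CN",
    "HK": "TW", "MO": "TW", "VN": "TW",
    "KP": "KR", "XK": "KR",
}

def resolve_region(region: str) -> str:
    """Return the supported region whose glyph is used for the given region."""
    if region in SUPPORTED_REGIONS:
        return region
    return REGION_ALIASES[region]  # raises KeyError for an unknown region code

print(resolve_region("HK"))  # TW
print(resolve_region("JP"))  # JP
```

Without aliasing, every unsupported region would fall back to the default region, CN, as in the four-region example described earlier.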

UPDATE #2: To illustrate how an implementation would use the IVD_Sequences.txt file, the image below shows the corresponding lines for the four example characters, along with the result of running the mkpancjkvivs.pl script with appropriate command-line options to create the SourceHanSansR11_CN_sequences.txt UVS definition file (red indicates the CID for the JP glyph, green indicates the CID for the TW glyph, black indicates the CID for the CN glyph, and blue indicates the CID for the KR glyph, though it is actually a variant form of the JP glyph):

UPDATE #3: Both example fonts are now in the repository, and instead of using PanCJKV as the identifier, R04 (four regions) and R11 (eleven regions) are now used, and all of the source files have been updated accordingly.

UPDATE #4: I finished writing a document to be discussed during UTC #147 in May, and it is now posted on the UTC's document register as L2/16-063.
