This document points browser implementers and specification developers to information about how to support typographic features of scripts or writing systems from around the world, and also points to relevant information in specifications, to tests, and to useful articles and papers. It is not exhaustive, and will be added to from time to time.

The information in this document helps to link users and developers so that browsers can better support typographic needs around the world. It is expected that this document will be constantly updated, as new material becomes available or comes to our attention.

To make it easier to track comments, please raise a separate issue or send a separate email for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

Introduction

The W3C and browser implementers need to make sure that the text layout and typographic needs of scripts and languages around the world are built in to technologies such as HTML, CSS, SVG, etc. so that Web pages and eBooks can look and behave as users expect.

To that end, experts in various parts of the world are documenting layout and typographic requirements, as well as gaps between what is needed and what is currently supported in browsers and ebook readers. (See a list of relevant work in this area that is supported by the W3C Internationalization groups.)

This page points browser implementers and specification developers to information about how to support features of scripts or writing systems from around the world, and also points to relevant information in specifications, to tests, and to useful articles and papers. It is not exhaustive, and will be added to from time to time.

The GitHub resources links point to ongoing discussions of three types:

  1. Requests for information point to requests for more information about how one or more scripts work.
  2. Spec issues point to requests to implement a particular script feature in a spec.
  3. Browser bugs point to requests to implement or fix a particular script feature in a browser.

Additional information and references are hereby solicited; please suggest additions, clarifications, corrections, and other improvements using the GitHub issues list.

Text direction

Vertical text

When dealing with vertical lines of text, it's common for content authors to want to mix short horizontal runs of text, such as 2-digit numbers, in a vertical column (tate chu yoko). It's also important to provide appropriate support for text in scripts that are normally only horizontal. Also, there are often special requirements related to the orientation of characters within vertical text.

Bidirectional text

Scripts whose characters are typically written right-to-left, like Arabic, Hebrew, Thaana, and so on, become bidirectional when they include numbers or text from other scripts (such as Latin acronyms). Browsers and applications need to support bidirectionality. This means supporting the Unicode Bidirectional Algorithm, but also different visual locations of line start and end, isolation of embedded strings, correct line alignment, and so forth.
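As a rough illustration of how right-to-left text becomes bidirectional, the following sketch inspects the Unicode bidirectional class of each character using Python's standard `unicodedata` module. It is not an implementation of the Unicode Bidirectional Algorithm; the function name `is_bidirectional` and the choice of classes to test are assumptions made for this example.

```python
import unicodedata

# Strong right-to-left classes: "R" (e.g. Hebrew, Thaana) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}
# Classes that form left-to-right runs: strong "L" letters, plus
# European ("EN") and Arabic ("AN") numbers, which are read left to right.
LTR_RUN_CLASSES = {"L", "EN", "AN"}


def is_bidirectional(text):
    """Return True if text mixes right-to-left letters with
    left-to-right letters or numbers (hypothetical helper)."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & RTL_CLASSES) and bool(classes & LTR_RUN_CLASSES)


if __name__ == "__main__":
    for sample in ["שלום", "שלום HTML", "مرحبا 123", "hello"]:
        print(repr(sample), is_bidirectional(sample))
```

A check like this only detects that mixed-direction runs exist. Ordering the runs, isolating embedded strings, and aligning lines are the job of the layout engine.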

Glyph shaping & positioning

Fonts & font styles

Some scripts require special handling with regard to how font properties are specified and how font resources are loaded dynamically. In some scripts it is common to use different fonts for headings or emphasis, rather than bolding or italicisation. Fallback font families used by browsers (e.g., serif, sans-serif, cursive, etc.) may need to be mapped differently to fonts for different scripts. For example, Khmer has slanted, upright, and round font styles, and Arabic has naskh, nasta'liq, ruq'a, kano, etc., which may need special handling. Special OpenType features may need to be supported.

See also [[[#letterforms]]].

Context-based shaping & positioning

In some scripts, such as Arabic, it may be desirable to allow the content author to control the placement of glyphs such as diacritics, or to control ligation, etc. Languages written with Arabic and Hebrew scripts have particular rules, of course, about when it is appropriate to show or hide diacritics for short vowel sounds. Many complex scripts have rules about how characters combine in syllabic structures, and scripts like Arabic may need controls to indicate where ligatures are wanted or not wanted. In addition, controls (based on Unicode characters or otherwise) may allow the user to control the shaping and positioning of glyphs, for example to compose/decompose conjuncts in Brahmi-derived scripts.

See also the separate section [[[#cursive]]].

Cursive text

In scripts such as Arabic, Mongolian and N'Ko, adjacent characters are joined together in normal printed text. It is important to ensure that those connections can be maintained correctly when characters are forced apart, or when transparency is applied to the text, etc. There are also situations where cursive joining behaviour exists when there is no adjacent character, or where joining needs to be disabled between glyphs. Applying appropriate markup or styling should not break cursive connections.

Letterform slopes, weights, & italics

In CSS, italic and oblique are described as font styles. Non-Latin scripts can add requirements for such styling. For example, oblique styles in Arabic or Hebrew script text may lean to the left. Proper italic glyphs in Cyrillic text can look very different from normal variants, and so synthesising italics can produce poor results. Chinese, Japanese and Korean fonts almost always lack italic or oblique faces, because those are not native typographic traditions. Bold text is similar in usage and in problems to the use of italics. Control and use of font-weight is also relevant to this section.

See also [[[#fonts]]].

Transforming characters

Conversion between lower, upper and title case only applies to a few scripts; most scripts are unicameral. Where it does apply, the rules can vary by language. In other cases, a particular script may require a different type of transform. For example, in Japanese it is important to be able to convert between half-width and full-width presentation forms.
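The width transform mentioned above can be sketched with Python's standard library. The helper names `to_fullwidth` and `to_standard_width` are hypothetical. Note also that NFKC normalization, used here for the reverse direction, folds many compatibility characters besides width variants, so it is a blunt tool shown only for illustration.

```python
import unicodedata

# Full-width forms U+FF01..U+FF5E sit at a fixed offset from ASCII U+0021..U+007E.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text):
    """Map printable ASCII to full-width forms (hypothetical helper)."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            out.append(chr(code + FULLWIDTH_OFFSET))
        elif ch == " ":
            out.append("\u3000")  # ideographic space
        else:
            out.append(ch)
    return "".join(out)


def to_standard_width(text):
    """Fold full-width Latin to ASCII and half-width katakana to
    ordinary katakana, via NFKC (which also folds other characters)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(to_fullwidth("ABC 12"))
    print(to_standard_width("ＡＢＣ１２"))
    print(to_standard_width("ｶﾀｶﾅ"))
    # Case mapping is not always one-to-one: German sharp s expands.
    print("straße".upper())
```

Python's built-in case methods apply the default Unicode mappings only. Language-specific rules, such as the Turkish dotted and dotless i, would need locale-aware tooling that is not shown here.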

Typographic units

Characters & encodings

Most languages are now supported by Unicode, but there are still occasional issues. In particular, there may be issues related to ordering of characters, or competing encodings (as in Myanmar), or standardisation of variation selectors or the encoding model (as in Mongolian).

Grapheme/word segmentation & selection

A browser or application needs to correctly apply functions to the basic units of text, be they characters, character sequences, syllables, or words. Some scripts, such as those used in South and South-East Asia, require clusters of characters to be treated as a single unit for most editing operations. Many other scripts use combining characters such as accents, vowel signs, length markers, etc. that must be kept with the base character they are associated with.
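To make the idea of keeping combining characters with their base concrete, here is a deliberately simplified sketch. It attaches any character whose Unicode general category starts with "M" to the preceding character. This is an assumption for illustration only: real grapheme cluster segmentation (Unicode Standard Annex #29) has many more rules, and the function name `approximate_clusters` is hypothetical.

```python
import unicodedata


def approximate_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: only general categories Mn, Mc and Me are treated as
    extending a cluster. Hangul jamo, emoji sequences, virama-based
    conjuncts and other cases covered by UAX #29 are NOT handled.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # keep the mark with its base character
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    # "e" + combining acute accent, then "a": two user-perceived units.
    print(approximate_clusters("e\u0301a"))
    # Devanagari KA + vowel sign I: one unit for editing operations.
    print(approximate_clusters("\u0915\u093f"))
```

An editor that moves the cursor or deletes by code point rather than by cluster would split these units, which is the failure this section warns about.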

When a user double-clicks on some text, the appropriate units should be selected. In scripts such as Chinese and Thai, 'words' should be selected even though they are not separated by spaces. In scripts such as Tibetan and Ethiopic, the word separator may be a visible character, rather than a space. It is important to understand how they should be treated when a 'word' is highlighted, or when text wraps, etc.

Punctuation & inline features

Phrase & section boundaries

Many scripts use native punctuation marks in addition to or instead of those used in Latin script text. In other cases, such as Greek, common Latin punctuation marks may mean something different from what they mean in English. It may be important to understand what needs to be supported, how these punctuation marks function, and how they interact with other operations applied to the text.

Another aspect of this relates to separation of characters or items in text. For example, French inserts a particular type of space before certain punctuation marks, and the traditional Mongolian script requires special spacing between word stems and certain suffixes.

Other special inline markers may appear when handling abbreviation, ellipsis, and iteration, bracketing information, or demarcating things such as proper nouns, etc.

See also [[[#text_decoration]]], [[[#quotations]]], and [[[#inline_notes]]], which are broken out into separate sections.

Quotations & citations

Quotation marks vary from language to language, not just from script to script. Also, you should expect variations in behavior when quotation marks are nested. Furthermore, the quotation marks used for vertical Japanese text are not the same as those typically used for the same text when horizontally laid out.
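The following sketch shows one way nested quotation marks might be selected by language and nesting depth. The table is an illustrative assumption covering common conventions for three languages only; regional and publisher variations exist, and the names `QUOTE_MARKS` and `quote` are hypothetical.

```python
# Illustrative table: (outer open, outer close, inner open, inner close).
QUOTE_MARKS = {
    "en": ("\u201c", "\u201d", "\u2018", "\u2019"),  # “ ” and ‘ ’
    "de": ("\u201e", "\u201c", "\u201a", "\u2018"),  # „ “ and ‚ ‘
    "ja": ("\u300c", "\u300d", "\u300e", "\u300f"),  # 「 」 and 『 』
}


def quote(text, lang, depth=0):
    """Wrap text in the quotation marks for lang, alternating between
    the outer and inner pair as the nesting depth increases."""
    outer_open, outer_close, inner_open, inner_close = QUOTE_MARKS[lang]
    if depth % 2 == 0:
        return outer_open + text + outer_close
    return inner_open + text + inner_close


if __name__ == "__main__":
    inner = quote("nested", "de", depth=1)
    print(quote("outer " + inner, "de"))
    print(quote(quote("内", "ja", depth=1), "ja"))
```

The table is keyed by language rather than by script because, as noted above, languages sharing a script often use different marks.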

See also [[[#punctuation_etc]]].

Emphasis & highlighting

For many scripts bold and italic are not always appropriate for expressing emphasis or highlighting text, and some scripts have their own unique ways of doing it that involve adding special marks alongside letters or syllables, etc. Other approaches involve substituting a different font, or using quotation marks, brackets, etc. Underlining is also not used for emphasis in many scripts, and may have a different function altogether.

See also [[[#text_decoration]]].

Abbreviation, ellipsis & repetition

How are emphasis and highlighting achieved? If lines or marks are drawn alongside, over or through the text, do they need to be a special distance from the text itself? Is it important to skip characters when underlining, etc? How do things change for vertically set text?

Inline notes & annotations

Ruby is used for phonetic and semantic annotations of East Asian text, including furigana, pinyin and zhuyin fuhao systems. In addition to positioning annotations along the correct side of the base text, there are many fine adjustments of the annotation and base text to support. Warichu is a kind of inline annotation where the note text is two approximately equal lines of half sized text, one above the other, but both within the normal line height.

See also [[[#footnotes_etc]]].

Other text decoration & inline features

This section groups together other text decoration and inline features that don't fall under the previous headings. Some aspects related to the drawing of lines or markers alongside or through text involve local typographic considerations. For example, underlines need to be broken in special ways for some scripts, and the height of underlines, strike-through and overlines may vary depending on the script. For vertical text the placement needs to be to the right or left of the line of text, rather than under or over. A script may call for specialised inline features. An example in Japanese is kumimoji, a way of combining several characters into a single character space. Syriac and Ethiopic identify numbers by drawing lines above them: the line extends to the width of the number. Arabic also has special characters that stretch to the length of certain numbers. There may be other such inline features in these and other scripts.

Data formats & numbers

Relevant here are formats related to number, currency, dates, personal names, addresses, and so forth. Also, some scripts have one or sometimes more sets of their own numeric characters. In some cases, numeric characters represent numbers like 100, or 10,000. Numeric formats can also vary significantly, in terms not only of the separators and negative signs used, but also the groupings used for digits, and sometimes the mechanisms used to distinguish numbers from the text.
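A short sketch of three of the points above, using only Python's standard library: native digit sets, single characters that stand for values such as 100 or 10,000, and digit grouping that differs from groups of three. The function `group_indian` is a hypothetical helper written for this example, not a library API.

```python
import unicodedata


def group_indian(n):
    """Format a non-negative integer with Indian-style grouping:
    the last three digits, then groups of two (hypothetical helper)."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    # int() accepts Unicode decimal digits, not only ASCII 0-9.
    print(int("١٢٣"))   # Arabic-Indic digits
    print(int("१२३"))   # Devanagari digits
    # Some characters carry a numeric value larger than nine.
    print(unicodedata.numeric("百"), unicodedata.numeric("万"))
    print(group_indian(1234567))
```

Parsing native digits is the easy part. Choosing separators, negative signs, and grouping for display depends on locale data that a sketch like this does not include.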

Lines & paragraphs

Line breaking

There are often specific rules about how scripts behave when a line is wrapped. For example, Chinese, Japanese and Korean tend to break a line in the middle of a word (with no hyphenation) – even in Korean, which has spaces between words. Others break lines at syllable boundaries. (See below for hyphenation.)

It is common for certain characters to be forbidden at the start or end of a line, but which characters these are, and which rules are applied when, depends on the script or language. In some cases, such as Japanese, there may be different rules according to the type of content or the user's preference.
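The restriction on line-start and line-end characters can be sketched as a greedy wrapper that allows a break between any two characters, then moves the break earlier when it would violate a rule. The character sets below are a small illustrative subset chosen for this example, not a complete or authoritative list, and the function name `wrap` is hypothetical.

```python
# Illustrative subsets only; real rule sets are larger and vary by
# language, content type and user preference.
NO_LINE_START = set("、。，．）」』！？")
NO_LINE_END = set("（「『")


def wrap(text, width):
    """Break text into lines of at most width characters, avoiding
    forbidden characters at line start and line end where possible."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Pull the break point back while the next line would start with
        # a forbidden character or this line would end with one.
        while end > start + 1 and (
            text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END
        ):
            end -= 1
        # If no allowed break exists, the break after one character is forced.
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


if __name__ == "__main__":
    for line in wrap("これは、テストです。", 3):
        print(line)
```

This sketch always resolves a conflict by carrying characters over to the next line. Other strategies, such as letting punctuation hang past the line end or compressing spacing, are equally valid and are not shown.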

See also [[[#hyphenation]]], which is broken out into a separate section.

Hyphenation

Hyphenation in this sense means indicating that a word has been broken when text wraps at the end of a line (not necessarily with a hyphen character). See [[[#punctuation_etc]]] for information about the use of regular hyphens in text. Some writing systems don't use hyphenation; those that do have particular rules about how it should be applied, and these rules are typically language-specific.

See also [[[#line_breaking]]].

Text alignment & justification

Typographers have come up with various methods for effective full justification – causing the text to completely fill the line, in order to create visual alignment on both edges of a paragraph.

Typographic conventions for full text justification depend on the writing system, the content language, and the calligraphic style of the text. Results also tend to vary based on the capabilities of the layout engine and a given typographer’s preferences for weighing its various detrimental effects on typographic color and readability.

Text spacing

This section concerns spacing adjustments around and between characters on a line that serve purposes other than the full-line justification described in the previous section, although they also affect line layout. Examples follow. Many scripts create emphasis or other effects by moving apart the letters or syllables in a word. (This may even apply in Indic and SE Asian scripts, and in Arabic-based scripts which join up adjacent letters.) Other times, increasing or decreasing the typical space between characters aids readability. Scripts used for Japanese or Chinese may also seek to reduce space between adjacent punctuation, to avoid large gaps. On the other hand, it may be necessary to add a gap around embedded numbers or Latin text in scripts that don't normally use spaces around words. Some scripts prefer to indent the first line of a paragraph, rather than leave vertical gaps between paragraphs. And in some scripts space needs to be carefully controlled before and after certain punctuation marks, such as in French or Thai.

Baselines, line-height, etc.

Browsers and applications must accurately and comprehensively cover requirements for baseline alignment between mixed scripts. For example, Arabic script descenders go far below those of the Latin script, and Armenian characters need to be aligned with ideographic characters in Chinese appropriately with regard to comparative heights and baselines. European, Far Eastern and South Asian scripts tend to use different baselines, which must be aligned correctly. The complexity of characters in a script may affect line height settings. However, some scripts also expect larger inter-line gaps than others, in addition to the line height. This section covers these and other factors related to vertical spacing of lines.

Lists, counters, etc

List numbering in vertical text runs across the page, but may need to be rotated to run horizontally. In a list where items are alternately right-to-left and left-to-right, where does the counter go, and how is the list aligned? The CSS specification describes a set of simple and complex styles for counters to be used in list numbering, chapter heading numbering, etc. It also provides a generic mechanism for content authors to create their own counter styles. One has to consider not only the characters and algorithms to be used (numeric, alphabetic, additive, etc), but also what the separator or other associated marks look like.
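Two of the counter algorithms named above, alphabetic and additive, can be sketched as follows. This is a simplified illustration modelled loosely on the corresponding CSS counter systems, not a conforming implementation (for example, fallback behaviour and zero-weight symbols are omitted). The function names and the `ROMAN` table are assumptions for this example.

```python
def alphabetic(value, symbols):
    """Bijective numbering: a, b, ..., z, aa, ab, ... for value >= 1."""
    if value < 1:
        raise ValueError("alphabetic counters start at 1")
    base = len(symbols)
    out = []
    while value > 0:
        value -= 1  # shift to zero-based so there is no 'zero' digit
        out.append(symbols[value % base])
        value //= base
    return "".join(reversed(out))


def additive(value, weighted_symbols):
    """Build a representation by summing symbol weights.

    weighted_symbols is a list of (weight, symbol) pairs in descending
    weight order.
    """
    if value < 1:
        raise ValueError("this sketch handles positive values only")
    out = []
    for weight, symbol in weighted_symbols:
        count, value = divmod(value, weight)
        out.append(symbol * count)
    if value != 0:
        raise ValueError("value cannot be represented")
    return "".join(out)


# Illustrative weight table for upper-case Roman numerals.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

if __name__ == "__main__":
    latin = "abcdefghijklmnopqrstuvwxyz"
    print(alphabetic(27, latin) + ".")
    print(additive(1994, ROMAN) + ".")
```

The trailing "." in the usage lines stands in for the separator, which, as noted above, also varies between scripts and has to be chosen alongside the algorithm.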

Styling initials

Does the browser or ebook reader correctly handle special styling of the initial letter of a line or paragraph, such as for drop caps?

Layout & pages

General page layout and progression

In paged media for right-to-left scripts or vertically set documents, pages progress from right to left, and the front and back cover are in the opposite locations to, say, an English book. Unlike the general Western approach, the size of the main text block in Japanese pages (called the hanmen) is traditionally established by counting character cells, and margin space is then defined by the remaining space. Columns run across a page in vertically-set pages. The standard page layout for Mongolian is landscape, and horizontal scrolling within a page is much more important than in the West, so default scrollbar positions may need special support.

Other topics that belong here include any local requirements for things such as printer marks, tables of contents and indexes.

See also [[[#grids_tables]]].

Grids & tables

Are there any special considerations related to the layout and design of tables? Due to their essentially monospaced character repertoire, Chinese and Japanese are partial to a grid-based system of layout that has some special requirements.

Footnotes, endnotes, etc

Support for footnotes, endnotes or other necessary annotations of this kind may vary in other cultures. In some cases, a script may use a very idiosyncratic approach to locate the notes or to link to them from the text. (See [[[#inline_notes]]] for purely inline annotations, such as ruby or warichu. This section is more about annotation systems that separate the reference marks and the content of the notes.)

Page headers, footers, etc

These links point to conventions for managing the content that appears outside the main text block, for example page numbering, or the way that running headers and the like are handled.

Forms & user interaction

Where a page allows users to input information in forms or interact with a page, there may be special requirements for the layout and orientation of the input fields and associated labels. There may also be other aspects of user interaction with the page that differ.

Changes Since the Last Published Version

The following changes have been made since the document was last published to the TR space:

See the github commit log for more details.