
Natural Language Specification


Rationale

Specifying the natural (human) language of a PDF enables the following:

  1. Correct interpretation of text by text-to-speech engines.
  2. Correct rendition of Braille, including contractions.
  3. Correct choice of language tools like dictionaries, spellcheckers, and thesauri.
  4. Correct display of captions and audio descriptions for multimedia elements in the user's preferred language.

The PDF Reference

Where Found: How to specify the natural language of a PDF document is described in Section 10.8.1, "Natural Language Specification" (pp. 864–870), of the PDF Reference.

Key Text: From Section 10.8.1:

The natural language used for text in a document is determined in a hierarchical fashion, based on whether an optional Lang entry (PDF 1.4) is present in any of several possible locations. At the highest level, the document’s default language (which applies to both text strings and text within content streams) can be specified by a Lang entry in the document catalog (see Section 3.6.1, “Document Catalog”).

The Lang entry takes a value as defined under "Language Identifiers" (p. 865):

The value of the Lang entry in the document catalog, structure element dictionary, or property list is a text string that specifies the language with a language identifier having the syntax defined in Internet RFC 3066, Tags for the Identification of Languages. This syntax... is also used to identify languages in XML, according to the W3C document Extensible Markup Language (XML) 1.1.... An empty string indicates that the language is unknown.
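As a rough illustration of the syntax referred to above, the following Python sketch classifies a candidate Lang value. It assumes the RFC 3066 grammar (a primary subtag of 1 to 8 letters, followed by hyphen-separated subtags of 1 to 8 letters or digits) and checks syntax only; it does not check that the subtags are registered codes. The function name classify_lang_value is hypothetical and not part of any PDF library.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 ASCII letters, followed by zero or
# more hyphen-separated subtags of 1-8 ASCII letters or digits.
_RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def classify_lang_value(value: str) -> str:
    """Return 'unknown', 'valid', or 'invalid' for a Lang entry value.

    Hypothetical helper: syntax check only, no lookup of registered codes.
    """
    if value == "":
        # Per the quoted text, an empty string means the language is unknown.
        return "unknown"
    if _RFC3066_TAG.fullmatch(value):
        return "valid"
    return "invalid"
```

Under this check, values such as "en-US" and "mul" are syntactically valid, whereas a free-text list such as "en,fr" is not, which is relevant to the multi-language discussion further down this page.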

Hierarchical assignment of language specification

The specification of the natural language is hierarchical and is defined as follows:

  • For elements within the logical structure of the document, in the following order of precedence:
    1. The language identifier of the structure containing the element (parent, grandparent, etc.)
    2. The document language identifier
  • For elements not within the logical structure of the document, by the Lang entry in the property list attached to the marked-content sequence that encloses the element.
  • For Unicode text, the language may also be defined by an escape sequence. (We suggest discouraging this unless it is used to change the language within Alt Text, Actual Text, the Contents property, Bookmarks, etc.)
  • For multimedia elements where the alternative is specified as a multilanguage text array, natural language is defined as the first string in each pair of strings in the array.
  • We should encourage maintaining all elements within the logical structure of the document (with the exception of artifacts) and consequently, the proposed text should reflect that.
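The lookup order for elements within the logical structure can be sketched as follows. This is a minimal illustration, not an implementation against any real PDF library: StructElem is a hypothetical stand-in for a structure element dictionary, and the sketch assumes that an element's own Lang entry, if present, takes precedence over those of its ancestors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructElem:
    """Hypothetical stand-in for a structure element; not a real library type."""
    tag: str
    lang: Optional[str] = None            # None means no Lang entry is present
    parent: Optional["StructElem"] = None


def effective_lang(elem: StructElem, document_lang: Optional[str]) -> Optional[str]:
    """Resolve the language by walking up the structure tree.

    The element itself is checked first (an assumption of this sketch), then
    its parent, grandparent, and so on. If no Lang entry is found, the
    document-level language from the catalog is used.
    """
    node: Optional[StructElem] = elem
    while node is not None:
        if node.lang is not None:
            # An empty string is returned as-is: it explicitly marks "unknown".
            return node.lang
        node = node.parent
    return document_lang
```

For example, a Span tagged "fr" inside a paragraph with no Lang entry resolves to "fr", while the paragraph itself falls through to the document language.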

Multi-language documents

We can conceive of these categories of multi-language documents:

  1. Single dominant language with occasional other language. The dominant language needs to be marked, usually on the root element, with changes in language marked on whatever structural element encompasses them (for example, by adding a span if necessary). There is an intractable problem of deciding exactly when a foreign phrase («je ne sais quoi») is deemed to be part of the dominant language.
  2. Multiple languages of equal standing. Not to be confused with the former case and not a variant of it. A document may contain text (or multimedia, etc.) in more than one language, all of which are equal in standing or importance even if the content in some languages is longer or shorter than in others.
    •  ISO 639-2 provides the mul language tag for multilanguage documents, though the specification states that it is to be used if tagging all the languages in a document is “not practical.” This implies that a PDF with multiple languages of equal standing should have no language encoding on the root element.
    • We also have the option of allowing free text in the language encoding, e.g., lang="en,fr" or lang="en fr es" or lang="{fr-CA,fr-FR,fr-BE},{en-CA,en-US},en,es" or lang="en|es|fr".
  3. Single documents expressible in more than one language. If a document is saved with bilingual or multilingual text and only one language is displayed or rendered, then only that language should be encoded.
  4. Documents with second-language multimedia. We need a way to handle specification of the language of multimedia (e.g., audio, video) if it differs from the rest of the document. This may involve adding structural elements and language-tagging them. There is a further issue of language differences within the multimedia (e.g., English document with multimedia in French with Flemish subtitles).
  5. References to other documents. The HTML hreflang attribute, though almost never used, tells user agents the language of a document the current document links to. This information may or may not be useful in accessible PDF.

Version of the language-tag specification

RFC 3066 explicitly comprises the language-tag specifications of ISO 639-1 and -2. ISO 639-3 is being developed by SIL and will be significantly different. Read literally, the current PDF specification makes it impossible to use ISO 639-3 language codes and remain compliant.

Proposed Text

  1. The document shall specify its natural language through the "Lang" entry of the document catalog. The value of this entry shall conform to the syntax defined in Internet RFC 3066.
  2. Should the document include any change in the natural language, the affected text shall be placed within an appropriate structure element (for example, a span) and the language attribute of this element shall be set to the language of that text.
  3. Should a change in the natural language appear within an artifact, the text shall be placed within its own artifact container and the language identifier shall be specified for this container.
  4. For elements that require the specification of Alternative Text (Alt), Actual (Replacement) Text, expansion of an abbreviation (the "E" attribute), the "Contents" property of an annotation, or a Tooltip of a form field, the language identifier shall be specified through the language attribute of the structure element. Should a change in the natural language occur within any of these elements, an escape sequence could be used to specify the language (assuming Unicode encoding).
  5. The natural language of the Alt attribute for multimedia elements is specified as the first string of the pair of strings in the array.
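To make item 5 concrete, the sketch below selects text from a multi-language text array, modelled here as a flat list of alternating language identifiers and text strings. The function name pick_alternate_text and the fallback policy (exact match, then primary subtag, then a pair with an empty identifier treated as default text) are assumptions made for illustration; the proposed text itself only states that the first string of each pair carries the language.

```python
from typing import Optional, Sequence


def pick_alternate_text(array: Sequence[str], preferred: str) -> Optional[str]:
    """Select text from a flat [lang, text, lang, text, ...] array.

    Hypothetical helper; the matching policy is an assumption of this sketch.
    """
    if len(array) % 2 != 0:
        raise ValueError("multi-language text array must contain pairs of strings")
    # The first string of each pair is the language identifier.
    pairs = list(zip(array[0::2], array[1::2]))
    wanted = preferred.lower()
    # 1. Exact match on the language identifier (case-insensitive).
    for lang, text in pairs:
        if lang.lower() == wanted:
            return text
    # 2. Match on the primary subtag only, e.g. "fr-CA" falls back to "fr".
    primary = wanted.split("-")[0]
    for lang, text in pairs:
        if lang and lang.lower().split("-")[0] == primary:
            return text
    # 3. Assumed convention: a pair with an empty identifier is default text.
    for lang, text in pairs:
        if lang == "":
            return text
    return None
```

A reader whose preferred language is "fr-CA" would receive the "fr" entry if no exact match exists, and a reader whose language is absent from the array would receive the default text, if any.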

Issues to take forward to Implementation 

  1. How do we handle a change in language within Alt, Actual Text, etc., for text encoded using PDFDocEncoding as opposed to Unicode?
  2. Sometimes it is superfluous to mark every change in language, not only for loanwords commonly used in the base language but in cases of change of script (Latin text in a Hebrew document is not going to be Hebrew, but it also isn’t necessarily English).
  3. Multiple-language documents need to be explicitly handled. The HTML approach of one language code at a time does not work for the root element, which, in PDF, should explicitly allow a formulation like lang="en,fr" or lang="en|fr". This was covered in the nonlinguistic section. Do we allow lang="mul" for multiple languages?
  4. ISO 639-1, -2, or -3 language codes?
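Issue 1 concerns the escape mechanism that is available only for Unicode text strings. As a sketch of what a consumer would have to do, the following Python code splits a text string into language runs. It assumes the string has already been decoded into characters and that each escape has the form U+001B, a two-letter language code, an optional two-letter country code, and a closing U+001B; how the codes are stored at the byte level is not addressed here. The name split_by_language is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# ESC, a 2-letter language code, an optional 2-letter country code, ESC.
_LANG_ESCAPE = re.compile(r"\x1b([A-Za-z]{2})([A-Za-z]{2})?\x1b")


def split_by_language(
    text: str, default_lang: Optional[str]
) -> List[Tuple[Optional[str], str]]:
    """Split a decoded text string into (language, run) pairs.

    Text before the first escape gets default_lang, for example the language
    inherited from the enclosing structure element.
    """
    runs: List[Tuple[Optional[str], str]] = []
    lang = default_lang
    pos = 0
    for match in _LANG_ESCAPE.finditer(text):
        if match.start() > pos:
            runs.append((lang, text[pos:match.start()]))
        code, country = match.group(1), match.group(2)
        # Build an identifier such as "en" or "en-GB" from the escape.
        lang = f"{code.lower()}-{country.upper()}" if country else code.lower()
        pos = match.end()
    if pos < len(text):
        runs.append((lang, text[pos:]))
    return runs
```

A string in PDFDocEncoding has no equivalent marker in this sketch, so the whole string would come back as a single run in the default language, which is the gap that issue 1 describes.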

Suggested changes, additions (by those other than the page initiator)

  • Tagging of multimedia files (as annotations), mostly useful in an accessibility context for sign-language video
  • Use of media equivalents for different-language multimedia (typically used for specification of video resolution)

Natural language shall be specified for all linguistic text.

The primary natural language or languages of the delivery unit can be programmatically determined.

The natural language of each foreign passage or phrase in the content can be programmatically determined.