Language Standardization

Language name standardization

Language name standardization is the process by which name of languages are encoded, standardized
and maintained according to some international standards. A well known organization that develop and publish standardization for all technical and nontechnical fields except electrical and electronic engineering fields, is the International Organization for Standardization, known also by the acronym ISO.

ISO 639

Substituting the name of a language with an equivalent code (either 2 or 3 letters long), can be of
big benefits for library catalog or bibliographic purpose, for libraries of information management, to
identify languages in computer systems or also for representing different language versions on web
application on the internet.
ISO 639 standard is currently composed of five parts, each of them maintained by an agency, which
handle the changes, especially adding new codes when needed. Each code in one part means the same
thing in another part, however, not all languages are available in all parts. Hereafter the five part of the ISO 639 standard: ISO 639-1, ISO 639-2, ISO 639-3, ISO 639-4 and ISO 639-5.

IETF BCP 47 Language Tag

IETF BCP 47 language defines standardized codes or tags used to identify human language on the
Internet. The tag structure is standardized by the Internet Engineering Task Force (IETF). BCP
47 standes for Best Current Practice and is a persistent name for a series of RFC that describe
language tag syntax. The last one is RFC 5646, and allows for a number of additional subtags. The
subtags used are managed by the IANA Language Subtag Registry (INA) and can be of various
types, E,g. language–extlang–script–region–.... The figure here-below
shows an example of tags.

Example of tags. Image produced from [this article]\(<https://www.w3.org/International/> articles/language-tags/)

Example of tags. Image produced from this article

Eden AI's Language Standardization handling

As Eden AI aims to make AI easy for developers, users relying on our API should not be concerned
with language constraints raised by each of our suppliers. Users had only to select the language they
want to use when using our services, without being bothered by the way they provide the language
name, and by which standardization or tags to use. Language interpolation between users input and
suppliers constraints is made by our service, relying on the IETF BCP 47 Language Tag shown above.

To better illustrate the problematic, we will present a simplified use case. One technology (we
will consider invoice processing with OCR for the sake of an example) is provided by three different
suppliers: A, B and C, using each a different set of language constraints. Supplier A handle the
processing for the following list of language tags: [en-US, fr-FR, fr-CA]. Supplier B handle
the processing for [it, en-GB, ceb-CB, fr-BE], and supplier C handle the following list of
language tags [fr-FR, it, ar]. The three suppliers are available by our service. A user called
Mike wants to use the invoice processing for an invoice written in English. However, Mike do not know
if the English used in the invoice is an American English, a British English or another type of spoken
English. For the sake of simplicity, Mike will try to process the document by only specifying that
these latter is written in English (en). Our service will handle the matching between Mike selected
language (en in this example), and suppliers provided languages, and this, by choosing the best
language tag that correspond to Mike choice, even if their is no exact matching. Supplier A should
consider the American English (en-US). Supplier B will, in the other hand, consider the British
English (en-GB) as a language of processing, while supplier C will not provide any results as this
latter supported language do not match with Mike request.