Multilingual IT Standard:
Readiness and Opportunities

Thaweesak Koanantakool
Director
National Electronics and Computer Technology Center
NSTDA, Thailand.

(htk@nectec.or.th)

 

Introduction

In this paper, I discuss several issues which may help speed up the initial IT standardization process of a country. The achievements of CICC’s MLIT project and its impact on Asian countries are illustrated.

Special emphasis is placed on the need to separate the main issue of code-points from the side issues of input and output methods and of sorting sequence. Some remarks on the number of code-points are given. For many Asian scripts, it is possible to define small character sets, which gives flexibility in using them in 7-bit, 8-bit and ISO/IEC 10646-1 environments.

Conflict management methods for national harmony, regional harmony and global harmony are also discussed.

Remarkable milestones of MLIT

The first MLIT Symposium in Singapore in 1997 marked a new era in which Southeast Asian countries share their voices on IT standardization. Recognizing the differences in progress among countries, CICC’s MLIT project has been instrumental in taking concrete actions to help less-developed nations get their national standards accepted in the global arena.

Through its remarkable secretariat led by Takayuki K. Sato, several achievements are worth noting. These include: notifying Lao PDR, Cambodia and Nepal in time for them to voice their views in the international standard development process; supporting Myanmar in the successful proposal of its standard code to ISO; notifying Thailand of the reconsideration of its ISO/IEC CD 8859-11 proposal to ISO; assisting the Philippines in adding the missing characters (Peso sign and Ng) to ISO/IEC 10646-1; and probably many other issues [1].

The approach of MLIT was firmly rooted in the prior work of AFSIT-SIG on IT Internationalization, which resulted in the publication “Data Book of Cultural Convention in Asian Countries”, also edited by Sato [2].

There are, however, some remaining challenges. Some of them are discussed in this paper.

Character code-points for information interchange as the primary concern

The history of developing a coded character set for a script in a developing country seems to repeat itself again and again. Before the age of the graphical user interface, or GUI, the definition of a coded character set was always limited by display and printing technologies. However, with a proper definition of “interchange code”, simple and unified code-points within a country can be developed.

In doing so, the national standard body has to keep the issue of a clean “basic” character set for data storage and data communications apart from other “related” issues such as collating sequence, input method and display codes. Often, one needs more display codes than basic interchange codes, simply because a symbol can have more than one appearance on the display and in hard copy. However, all one needs to make the information infrastructure function is a minimum, unique, but sufficient set of code-points in the national standard.
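The distinction between interchange codes and display codes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glyph names (such as "u0E48.high") and the two-variant rule for Thai tone marks are hypothetical, chosen to show one interchange code mapping to more than one display form, and do not describe any particular font or standard.

```python
# Sketch: one interchange code, several display glyphs.
# Assumption: a tone mark is drawn in a "high" position when it follows an
# upper vowel and in a "low" position otherwise. Glyph names are hypothetical.
UPPER_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37")  # mai han-akat, sara i, ii, ue, uee
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")          # mai ek, mai tho, mai tri, mai chattawa


def to_display_glyphs(text):
    """Map a string of interchange codes to a list of display glyph names."""
    glyphs = []
    previous = ""
    for ch in text:
        name = f"u{ord(ch):04X}"
        if ch in TONE_MARKS:
            # The stored code is the same; only the display form differs.
            name += ".high" if previous in UPPER_VOWELS else ".low"
        glyphs.append(name)
        previous = ch
    return glyphs
```

Under these assumptions, the stored text contains a single code for mai ek (U+0E48), while the rendering layer chooses between two glyphs. The interchange standard therefore needs one code-point, not two.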

Apart from display codes, the concern over the sorting (or collating) sequence of characters may also play an important role in blurring the code-point issue. Most Asian languages have vowel and tone-mark or diacritic characters which make straightforward sorting of words a non-trivial job. Since some additional sorting logic must be included in any sorting algorithm for Asian languages anyway, it is not necessary to design the code for the best collating sequence. Moreover, within a particular culture and for a particular script, there is often more than one sorting convention. In Thailand, for example, the sorting convention of the Royal Institute’s dictionary is different from the one used in the telephone directory.
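A minimal sketch of such additional sorting logic, for Thai, is given here in Python. It is a simplification and not the full convention of any dictionary: it assumes only two rules, namely that a leading vowel is ordered after the consonant that follows it, and that tone marks are ignored except as a tie-breaker. The function name thai_sort_key is hypothetical.

```python
# Sketch: a sort key computed on top of the interchange code, so the code-point
# order itself does not have to match the collating sequence.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # sara e, ae, o, ai maimuan, ai maimalai
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")            # mai ek, mai tho, mai tri, mai chattawa


def thai_sort_key(word):
    """Return a (primary, secondary) key for a simplified Thai sort."""
    base = [ch for ch in word if ch not in TONE_MARKS]
    tones = [ch for ch in word if ch in TONE_MARKS]
    i = 0
    while i < len(base) - 1:
        if base[i] in LEADING_VOWELS:
            # Swap the leading vowel with the consonant that follows it.
            base[i], base[i + 1] = base[i + 1], base[i]
            i += 2
        else:
            i += 1
    # Tone marks only decide between words that are otherwise identical.
    return ("".join(base), "".join(tones))


words = ["\u0e44\u0e01\u0e48", "\u0e02\u0e32", "\u0e01\u0e32"]  # kai (chicken), kha (leg), ka (crow)
by_code_point = sorted(words)
by_convention = sorted(words, key=thai_sort_key)
```

Sorting by raw code-point places kai last because its first stored character is a vowel, whereas the key groups it with the other words that begin with the consonant ko kai. A different convention would simply use a different key function over the same interchange code.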

Without clear priorities, the development of a national coded character set can run into many months of wasteful discussion without conclusion. Settling the interchange code should therefore be the top priority within any country.

Referring to Yoshiki Mikami’s remark in his special lecture at MLIT-3 [3], we might recommend that a national standard committee put aside the issues of key sequence design and number of key strokes, as well as the number of glyphs per internal code or the number of internal codes per glyph. Just go for the best interchange code. Each of the three blocks in Mikami’s diagram below has an MLIT working group to take care of its issues.

 

National harmony: How many code-points do we really need?

Additional complications come from the several “extended” character sets used by various implementors in each country. In reaching a national consensus, a middle-way approach might be the solution after peaceful negotiation among the stakeholders in the country. However, the maximum subset of all stakeholders might also be a solution. In any case, there are some concerns about the related international standards for which a country may want to apply.

These concerns include the following thresholds:

In all cases, the character set can be proposed to be part of ISO/IEC 10646-1.

According to reports given at MLIT-4 (October 1999), the Urdu standard of Pakistan and Tai Lanna both have about 140 code-points [4]. As a result, these character sets cannot readily be processed bilingually with English unless the number of “basic” code-points is reduced to no more than 96.
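The 96-character limit can be checked with a small Python sketch. The capacities used here are an assumption on my part about the thresholds meant: they are the graphic areas of the ISO/IEC 2022 code structure, in which a 7-bit graphic set holds 94 characters and the upper half of an 8-bit code, as used by the ISO/IEC 8859 series alongside ASCII, holds 96. The function name is hypothetical.

```python
# Sketch: which code structures can hold a "basic" set of a given size.
# Assumption: capacities follow the ISO/IEC 2022 graphic areas.
SET_SIZES = {
    "7-bit, 94-character set (0x21-0x7E)": 94,
    "8-bit, 96-character upper half (0xA0-0xFF)": 96,
}


def environments_that_fit(code_point_count):
    """List the environments whose capacity covers the given number of code-points."""
    return [name for name, size in SET_SIZES.items() if code_point_count <= size]


print(environments_that_fit(140))  # a set of about 140 code-points fits neither
print(environments_that_fit(87))   # a smaller basic set fits both
```

A set that fits the 96-character upper half can sit next to ASCII in a single 8-bit code, which is what makes it bilingual with English without any code-switching mechanism.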

An intrinsic bilingual capability in the character set of an Asian language used to be a very critical issue for PC users prior to the release of Microsoft Windows ’95. However, with the use of Unicode in Windows ’95 and newer versions, it is possible to run multilingual scripts on the OS. On the Macintosh platform, multilingual scripts have been supported since the introduction of MacOS 6 many years ago.
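To illustrate how a small 8-bit national set carries over into a Unicode (ISO/IEC 10646) environment, the following Python sketch converts TIS-620 bytes to Unicode code-points by a constant offset and compares the result with Python's built-in tis_620 codec. The byte ranges and the helper name are my own reading of the Thai code table and should be taken as assumptions of this sketch rather than a statement of the standard.

```python
# Sketch: a compact 8-bit set maps onto its ISO/IEC 10646 block by a fixed offset.
# Assumption: Thai characters occupy 0xA1-0xDA and 0xDF-0xFB in TIS-620,
# and U+0E01 onwards in the same relative order in Unicode.
OFFSET = 0x0E01 - 0xA1  # distance between the 8-bit and the Unicode position


def tis620_byte_to_unicode(byte_value):
    """Convert one TIS-620 byte in the Thai range to a Unicode character."""
    if 0xA1 <= byte_value <= 0xDA or 0xDF <= byte_value <= 0xFB:
        return chr(byte_value + OFFSET)
    raise ValueError(f"0x{byte_value:02X} is not a Thai character in this sketch")


# Cross-check against the codec shipped with Python.
thai_bytes = list(range(0xA1, 0xDB)) + list(range(0xDF, 0xFC))
matches_codec = all(
    tis620_byte_to_unicode(b) == bytes([b]).decode("tis_620") for b in thai_bytes
)
```

Because the mapping is a simple offset, the same basic set serves both the 8-bit bilingual environment and the multilingual one without any redesign of the code-points.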

Conflict resolution among neighbouring countries sharing the same script: a regional harmony?

How many Asian languages are shared by more than one country? To my understanding, there are a few of these languages. For example: Devanagari (India, Nepal, Bhutan), Tamil (India, Sri Lanka, Nepal, Pakistan, Bangladesh), Urdu (India, Pakistan), Tai Lanna (Lao PDR, Thailand), Viet Thai (Vietnam, Thailand), Tai Ahom (Thailand, India), Dehong (Thailand, China), Tai Lue (Thailand, China), Pali, Sanskrit, etc.

I believe that regional harmony is an important key to the successful proposal of regional standards of a particular script to the ISO process. However, we must also recognise the differences in the normal use of one common script in different countries. On the other hand, if these countries can work together, unnecessary repeated efforts in improving the character sets and/or identifying any important missing characters can be very much reduced.

Even in the case of total disagreement, it is still beneficial if the participating members can voice their reasons for having different character sets for very similar scripts in ISO/IEC 10646-1. If such differences of opinion are supported by a regional conference such as MLIT, it is more likely that the consensus will also be acceptable to ISO.

Experts vs voice of the users of the language: the global harmony?

Under some circumstances, a particular script may be registered with the IS body without the awareness of the society where the script is mainly used. If there is such a conflict, MLIT can play a very important role as the regional authority for resolving it. This is probably the case for Khmer (Cambodia), Lao PDR and Myanmar. A similar case also arose for Thai (Thailand) and Sinhala (Sri Lanka) in the early versions of the Unicode draft.

It is important to open a dialog with those experts who were kind enough to start the international standard for a particular script, and to make sure that they can become true supporters of the national standards.

Through the Internet and actual participation in JTC1/SC2/WG2 meetings, a country can voice its requirements.

Modern input and output devices: the sky’s the limit.

There are several challenging research projects in each country/culture regarding modern input and output devices. These research programs are beneficial to a country in general, and have commercial value for any company that develops them. These concepts can be built upon the “basic” character set, which represents the basic building block of the script. Here are some examples of modern input and output devices:

INPUT
Optical character recognition (OCR)
voice recognition (for the blind)
pen-input (handwriting)
single-switch input (for disabled persons)
spelling insensitive input (using soundex)
phonetic keyboard
word-prediction
from HTML or XML file in other script

OUTPUT
text-to-speech
screen reader (for the blind)

Multilingual/Multimedia and Natural language Processing Outlook

With the Internet as a prime booster of global communication through email and the WWW, the issues of MLIT are now appreciated by many more people than a few years ago. The general trend towards E-Commerce in the global economy will not only drive multilingual IT into practical use, but also turn modern input/output devices into tools for truly multilingual multimedia (ML/MM) communication.

Such an ML/MM situation requires a good and stable interchange code for each script of a language, since there will be additional complexities arising from machine translation options. With a fast convergence of interchange code standardization, multilingual translation is feasible.

The processing services that build on such a code include: spell checkers, word segmenters, sorting, search by meaning (thesaurus), search across languages (dictionary), text-to-phonetic conversion, search by sound (soundex), etc.
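As an example of one such service, search by sound, here is a sketch of the classic Soundex code in Python. It uses the well-known English letter classes, so it is only an illustration of the technique: a version for an Asian script would need its own phonetic classes defined over that script's interchange code, which is not attempted here.

```python
# Sketch: classic (English) Soundex, shown only to illustrate search by sound.
CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of a name written in Latin letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != previous:
            result.append(code)
        previous = code  # a vowel resets the previous code
    return "".join(result)[:4].ljust(4, "0")
```

Names that sound alike receive the same code, so soundex("Robert") and soundex("Rupert") both give "R163". An index keyed on such codes lets a user find a word without knowing its exact spelling.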

Second-generation unification: Tai family of scripts.

If we consider the CJK unification of the 90’s to be a “first-generation” unification effort, then the attempt to unify Asian scripts will be the second generation. Unification of Tai languages might become a reality, following the vision of MLIT-3.

Following the initial research reported by Thawee Swangpanyangkoon, it is basically feasible to embark upon deeper research into the code-point design of some 13 Tai languages. The idea of unifying these character codes, according to Viet Trung Ngo [5], is not to merge them into one compact code set. Instead, we may set out a common mission among the countries using Tai languages to agree upon a common “framework” which would help in designing the best code-points for many Tai languages together.

The goals of Tai script unification may be a bit different from the CJK experience, in the sense that each Tai character set might be registered with ISO/IEC 10646-1 as an individual set. However, the sets would be submitted as a unified proposal from the Tai countries together, as a block of multiple character sets.

The participating members of a Tai unification working group may consist of Thailand, China, India, Myanmar, Vietnam, Lao PDR and Cambodia at the minimum. Contributions from Japanese experts will also be necessary. It is possible that Thailand may offer itself as the focal point for the activities, subject to funding availability.

Concluding remarks

It has been an exciting three years for me with MLIT, experiencing the changing climate of I18N and L10N processes in the Asian countries. The MLIT project of CICC has achieved remarkable success. The activities of MLIT are probably still rising towards their peak. It would be disappointing if such momentum received less support in future from the main source of funding (MITI of Japan).

The Internet and E-Commerce are probably the major driving forces in the development of multilingual IT standards in the member countries. In my experience, the POSIX and Locale concepts are also appreciated through the availability of free POSIX-compliant operating systems such as FreeBSD and Linux. These OS’s are ideal laboratories for I18N workers.

I wholeheartedly agree with the need to voice Asian countries’ concerns about the appropriateness of their scripts in the international standard. Through the MLIT project, the superb support from T. K. Sato has proved very valuable and is well appreciated. Many developing countries need to develop long-term and sustainable standards activities by creating an IT standards team and by providing the national IT standard documents online. In addition, in order to gain acceptance from industry and possible funding, there should be research projects related to natural language processing of the national language.

Last, it is imperative for a country to understand the importance of having its national standard registered with an effective IS body such as ISO, ECMA or IANA. Without such awareness on the part of its top IT planners, a country is placed at a disadvantage and risks being left out of the modern digital economy.

It is exciting to see the work of the three working groups in MLIT, with some fruitful results coming soon, as well as the success of Myanmar and the progress of Cambodia, Lao PDR and others in registering their national standards with ISO/IEC 10646-1.

References

[1] Takayuki K. Sato, MLIT-Project activity report, Proceedings of MLIT-3, CICC, March 1999, pp. 33-39. http://www.cicc.or.jp/homepage/mlit/

[2] Takayuki K. Sato (Editor), Data Book of Cultural Convention in Asian Countries, CICC AFSIT-SIG on IT Internationalization.

[3] Yoshiki Mikami, Proposer issues for discussion: Towards multilingual information processing, Proceedings of MLIT-3, CICC, March 1999, pp. 145-158.

[4] Khaver Zia and Theppitak Karoonboonyanan, private discussion.

[5] Viet Trung Ngo, private discussion.

About the Author

Dr. Thaweesak Koanantakool received his Bachelor’s degree and Ph.D. from the Department of Electrical Engineering, Imperial College of Science and Technology, London University. He joined Prince of Songkla University in 1981 to teach in EE and moved to Thammasat University in 1983 as the Deputy Director of the Information Processing Institute for Education and Development.

Thaweesak was part of the early effort in Thailand to unify the various character sets in 1984, with the result being TIS-620/2529-1986. Later, he joined the National Electronics and Computer Technology Center (NECTEC) to promote input and output method standards for Thailand. The work was better known as Wor-Tor-Tor, which became the country’s de-facto standard.

In 1992, after Wor-Tor-Tor, he decided to start the Internet in Thailand to assist the standard development process. He has been engaged in many network activities since then. He is a co-founder of Thailand’s first Internet Service Provider, Internet Thailand Company, as well as Thailand’s EDI service provider, TradeSiam Company. He is also the project director of The Royal Golden Jubilee Network Project and the SchoolNet Project.

He has been the Director of NECTEC since July 1998.