The SEAlang Library

http://purl.org/sealang

University of Wisconsin-Madison
Center for Southeast Asian Studies
207 Ingraham Hall
1155 Observatory Drive
Madison, WI 53706-1397
Phone: (608) 263-1755

Project Director:
Prof. Robert J. Bickner
rbickner@wisc.edu

Project Co-Director:
Doug Cooper
doug@th.net



Dollar Allocation Year 1: $108,000

Project Overview:
Modern Southeast Asia (SEA) includes eleven states, conventionally divided between the mainland countries Vietnam, Cambodia, Laos, Thailand, and Myanmar (population roughly 210 million), and the insular countries Malaysia, Indonesia, Brunei, Singapore, East Timor, and the Philippines. Southeast Asia is a global crossroads of singular geopolitical importance: an astonishing 30% of the world's trade goods, including half its oil, transit the central Malacca Straits.

Unfortunately, we have little capacity for information access or language reference in this important and often tumultuous region. The mainland countries are represented least: data and dictionaries that use the Indic-derived scripts of Burma, Laos, Thailand, Cambodia, and the minority Mon, Karen, and Shan states, and Vietnam's Roman-derived, Chinese-influenced Quốc Ngữ script, are largely unavailable in the U.S., and there has been little software development beyond basic tools for text input and output.

The SEAlang Library is a technically innovative plan to build core lexical and software resources for all Southeast Asian languages, starting with the difficult non-Roman scripts used in the five mainland countries. These will provide immediate access to a wide range of reference and research functionality, and will help enable the development of more advanced lexical resources and information retrieval tools in the future.

The SEAlang Library directly addresses TICFIA's goals:

  • it produces the innovative technology required for information access and language instruction for writing systems that use non-Roman scripts;
  • it has a regional focus, helping us understand the history of a strategically important area that suffers from deep political and cultural divides;
  • it meets long-range teaching and scholarly needs, building software, reference, and research resources that will be of value for decades;
  • it has broad value for the Southeast Asian Studies and heritage communities, and will enable research and development in a wide variety of specialties.

The SEAlang Library will provide:

DICTIONARIES: we will prepare digital bilingual dictionaries with Web-based user interfaces, drawn from the best available scholarly print reference works – often difficult to obtain from U.S. libraries – for the national languages Burmese, Lao, Thai, Khmer, and Vietnamese, and for the major ethnic minority languages Mon, Karen, and Shan. Supplementary dictionaries will be added for languages (especially Lao and Khmer) that have undergone significant orthographic change in recent years. All dictionaries will be extended, where possible, by specialized lexicons of newly minted words, and all will incorporate the approximate search software tools described below.

The SEAlang Library will add extensive XML metatagging to the source dictionaries to support data mining applications. All SEAlang Library dictionaries will be accessible via approximate search software, and can be used both interactively and as program-accessible Web resources. Queries may search for original orthography, transliteration, phonetic transcription, rhyme, part-of-speech, etymology, English content of definitions, and so on – if it's in the dictionary, it will have a specific search function. In the future, we anticipate that these bilingual dictionaries will be supplemented by monolingual references as well.
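As a rough illustration of field-specific queries over XML-tagged entries, the sketch below parses a tiny made-up dictionary fragment and searches one field at a time. The element names (`entry`, `orth`, `translit`, `pos`, `def`), the sample entries, and the `search` helper are all assumptions made for this example; they are not the project's actual markup or interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names and sample values are illustrative only.
SAMPLE = """
<dictionary lang="tha">
  <entry>
    <orth>น้ำ</orth>
    <translit>nam</translit>
    <pos>noun</pos>
    <def>water; liquid</def>
  </entry>
  <entry>
    <orth>กิน</orth>
    <translit>kin</translit>
    <pos>verb</pos>
    <def>to eat; to consume</def>
  </entry>
</dictionary>
"""


def search(root, field, text):
    """Return entries whose <field> element contains `text` (case-insensitive)."""
    needle = text.lower()
    hits = []
    for entry in root.iter("entry"):
        value = entry.findtext(field, default="")
        if needle in value.lower():
            # Flatten the entry into a plain dict: tag name -> text content.
            hits.append({child.tag: child.text for child in entry})
    return hits


root = ET.fromstring(SAMPLE)
print(search(root, "pos", "verb"))   # entries tagged as verbs
print(search(root, "def", "water"))  # entries whose English definition mentions "water"
```

Because each kind of information sits in its own tagged field, the same helper can serve orthography, transliteration, part-of-speech, or definition queries simply by changing the `field` argument.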

TEXT CORPORA: we will build monolingual and aligned bitext corpora. Used to study collocation and usage, and to support data-driven language learning, text corpora are necessary precursors to more advanced translation tools and to monolingual and cross-language information retrieval tools. Teachers and students can query the corpora from the SEAlang Library dictionaries (thus, a definition can be accompanied by common collocational contexts) or from other texts to supply authentic usage examples; lexicographers can find collocations or extract non-dictionary entries; and software developers and linguists alike can analyze word frequency or distributional data.

The SEAlang Library will provide substantial (up to tens of millions of words) monolingual corpora for the majority languages, along with the largest feasible (hundreds of thousands of words for Thai and Vietnamese, and fewer for the others) aligned two-language corpora, drawn from both on-line resources and on-the-ground publishing contacts.
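The kind of word-frequency and collocation data mentioned above can be sketched in a few lines. This is a minimal example under stated assumptions: the corpus is a toy, already-segmented English sample, adjacent-word (bigram) counts stand in for collocation analysis, and the function names are invented for illustration.

```python
from collections import Counter

# Toy corpus, already split into words. Unsegmented Southeast Asian text
# would first need to pass through a segmenter.
corpus = [
    ["the", "river", "runs", "to", "the", "sea"],
    ["the", "river", "floods", "in", "the", "rainy", "season"],
]


def word_frequencies(sentences):
    """Count how often each word occurs across all sentences."""
    return Counter(tok for sent in sentences for tok in sent)


def bigram_counts(sentences):
    """Count adjacent word pairs; pairs never span a sentence boundary."""
    return Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))


freq = word_frequencies(corpus)
bigrams = bigram_counts(corpus)
print(freq.most_common(2))          # the two most frequent words
print(bigrams[("the", "river")])    # how often "river" directly follows "the"
```

Raw counts like these are only a starting point; a fuller analysis would normally weight pairs by an association measure rather than by frequency alone.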

SOFTWARE: we will build information access tools for Southeast Asian scripts, including basic tools for segmentation and transliteration, conversion between font encodings, text harvesting and indexing, and statistical analysis. The user applications listed below will be linked to dictionaries, texts, and text corpora to help fulfill the promise of regional information access.

segmentation: the SEA-Seg utility will use dictionary data to segment vernacular texts into individual words.

cataloging: the SEA-Cat cataloging utility will help generate segmented, Library of Congress / ALA transliteration from vernacular orthographies.

query building: the SEA-Search searching utility will do the reverse, building local-orthography queries from transliteration or transcription.

reading & comprehension: the SEA-Read tool will display texts in local orthography, segmented or reformatted (e.g. transliterated or transcribed) where possible. It will be linked to the dictionaries and corpora.

text-as-image delivery: the SEA-See download utility will generate image data on the fly when Unicode and/or appropriate rendering software (a problem for Burmese and Khmer Unicode) are unavailable.

approximate dictionary search: utilities will provide intelligent heuristic-based approximate search capabilities for original orthography, transliteration, and/or phonetic transcription as available. These will dramatically improve the ease-of-use of the dictionaries.

corpus building and analysis: utilities will help collect, clean, and build the text corpora, align the bitext corpora, allow retrieval from very large unsegmented monolingual corpora, and analyze the corpora for data on word frequency and collocation.

transliteration, transcription, & conversion: utilities will orthographically transliterate or phonetically transcribe texts, and, where possible, convert search items and older 'legacy' texts between Unicode and the still widely used traditional font encodings.
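Two of the utilities described above, dictionary-driven segmentation and approximate search, can be sketched as follows. This is an illustration under stated assumptions, not the SEA-Seg or search software itself: the four-word Thai lexicon and its romanizations are toy values, segmentation uses a simple greedy longest-match strategy, and Python's standard `difflib` string similarity stands in for the heuristic matching the project describes.

```python
from difflib import get_close_matches

# Toy lexicon keyed by Thai orthography: (romanization, gloss).
# Values are illustrative only, not drawn from any SEAlang dictionary.
LEXICON = {
    "ฉัน": ("chan", "I"),
    "กิน": ("kin", "eat"),
    "ข้าว": ("khao", "rice"),
    "ข้าวเหนียว": ("khao niao", "sticky rice"),
}
MAX_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Greedy longest-match segmentation of unspaced text against LEXICON."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon word matches.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Index from romanization back to original orthography.
BY_TRANSLIT = {translit: orth for orth, (translit, _gloss) in LEXICON.items()}


def approx_lookup(query, cutoff=0.6):
    """Return headwords whose romanization is similar to `query`."""
    matches = get_close_matches(query.lower(), list(BY_TRANSLIT), n=3, cutoff=cutoff)
    return [BY_TRANSLIT[m] for m in matches]


print(segment("ฉันกินข้าวเหนียว"))   # splits the sentence into lexicon words
print(approx_lookup("kao"))          # tolerates a misspelled romanization
```

Greedy longest match is easy to follow but can mis-segment ambiguous strings, and plain edit-style similarity ignores how particular scripts are typically misspelled; both are placeholders for the language-specific methods a production tool would need.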

The SEAlang Library is a long-awaited addition to the national digital infrastructure being built with the support of a variety of U.S. Department of Education Title VI programs. It will enable:

  • pedagogy and new teaching, learning, and translation tools for less-commonly taught languages;
  • scholarly inquiry in linguistics, history, lexicography/etymology, and Southeast Asia area studies;
  • scientific research in computational linguistics and cross-language information retrieval; and
  • language reference all but unavailable to 1.8 million Americans of mainland Southeast Asian heritage who can typically speak – but not read, or consult reference materials in – their heritage languages.

Designed and hosted by MATRIX