OMOD Documentation
The Orthographic Mapping Ondemand Datasystem (OMOD) is a framework for mapping orthographic representations (written characters) to their phonetic realizations (pronunciations) across the world's languages, dialects, and written sources. Just as Glottocodes provide unique identifiers for languoids (languages, dialects, and language varieties), OMOD provides a systematic way to document orthographic-phonetic relationships at any level of granularity.
- Standardized JSON schema for orthographic-phonetic mappings
- Context-aware mappings (initial, medial, final positions)
- Support for multiple writing systems per languoid
- Granular orthosets - from national standards to single manuscripts
- Flexible languoid system - languages, dialects, continuums, and varieties
- IPA (International Phonetic Alphabet) based pronunciations
- Source documentation for scholarly attribution
- Extensible framework for new languages, dialects, and sources
Core Concepts
Orthogon
An orthogon is the fundamental unit in OMOD - a single mapping between a written form and its pronunciation. Each orthogon contains:
{
"glyphon": "ழ்",
"interphon": "ɻ",
"context": ["medial", "final"],
"examples": [
{
"word": "தமிழ்",
"pronunciation": "t̪amiɻ",
"meaning": "Tamil"
}
]
}
Glyphon
A glyphon is the orthographic unit - the written character or character sequence. This can be:
- A single character: a, க, ع
- A diacritic combination: á, கா, عَ
- A digraph: ng, ch, gh
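Because a glyphon may be a character sequence, code that handles glyphons generally needs to compare whole strings rather than single characters. The sketch below is illustrative and not part of OMOD: it counts the Unicode code points in a glyphon and normalizes it to NFC so that precomposed and decomposed spellings compare equal. The helper names `codePointCount` and `normalizeGlyphon` are assumptions made for this example.

```javascript
// Hypothetical helpers (not part of OMOD) for working with glyphon strings.

// Count Unicode code points rather than UTF-16 code units.
function codePointCount(glyphon) {
  return Array.from(glyphon).length;
}

// Normalize to NFC so "a" + U+0301 and precomposed "á" compare equal.
function normalizeGlyphon(glyphon) {
  return glyphon.normalize('NFC');
}

const zha = '\u0BB4\u0BCD'; // ழ் = TAMIL LETTER LLLA + TAMIL SIGN VIRAMA
console.log(codePointCount(zha)); // 2
console.log(normalizeGlyphon('a\u0301') === '\u00E1'); // true
```

A visually single glyphon such as ழ் is therefore two code points, which matters for any lookup that walks text one character at a time.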
Interphon
An interphon is the phonetic representation in IPA notation. For tonal languages, this includes tone markers:
- Simple consonant: /k/
- Vowel with tone: /a˧˥/ (rising tone)
- Complex sound: /ʈʂ/ (retroflex affricate)
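When tone is written with IPA tone letters at the end of an interphon, the segmental part and the tone can be separated mechanically. The following is a minimal sketch under that assumption; `splitTone` is a hypothetical helper, and it only recognizes the five tone letters U+02E5 to U+02E9 in trailing position.

```javascript
// Hypothetical helper: split trailing IPA tone letters from an interphon.
// U+02E5..U+02E9 are the five tone letters (extra-high down to extra-low).
const TONE_LETTERS = /[\u02E5-\u02E9]+$/;

function splitTone(interphon) {
  const match = interphon.match(TONE_LETTERS);
  if (!match) return { segments: interphon, tone: null };
  return {
    segments: interphon.slice(0, match.index),
    tone: match[0],
  };
}

console.log(splitTone('a\u02E7\u02E5')); // { segments: 'a', tone: '˧˥' }
console.log(splitTone('k'));             // { segments: 'k', tone: null }
```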
Orthoset
An orthoset is a collection of orthogons representing a specific writing system, orthographic variant, or even conventions from a single source. Orthosets can be extremely granular - they might represent:
- A standardized national orthography
- Regional spelling variations
- Historical spelling conventions from a specific era
- Orthographic choices found in a single manuscript or publication
- Digital adaptations (like chat alphabets)
- Individual author's spelling preferences
Examples of orthoset identifiers:
- arb-msa - Contemporary Arabic script used in formal contexts
- arb-chat - Latin letters and numerals representing Arabic sounds in digital communication
- viet-quocngu - Latin-based script with tone diacritics
Orthosets can be as specific as needed. For example, you might have separate orthosets for:
- tam-sangam-300bce - Tamil orthography in Sangam literature (300 BCE)
- tam-pallava-600ce - Pallava dynasty inscriptions (600 CE)
- tam-modern-tn-gov - Tamil Nadu government standard (2020s)
- tam-modern-sri-lanka - Sri Lankan Tamil conventions
Languoid
A languoid represents any linguistic variety - not just "languages" but also dialects, sociolects, chronolects, or even language continuums. Following Glottolog's approach, languoids acknowledge that linguistic boundaries are often fuzzy and politically charged. A languoid might represent:
- A standardized national language
- A regional dialect with distinct phonology
- A historical stage of a language
- A sociolect (social variety)
- A point on a dialect continuum
- A contact variety or creole
{
"languoid": "Tamil",
"endonym": "தமிழ்",
"glottologCode": "tami1289",
"iso639_3": "tam",
"region": "South Asia",
"population": "75000000"
}
Languoids can represent various levels of linguistic granularity:
- Macro-language: arab1395 - Arabic macrolanguage
- Regional variety: egyp1253 - Egyptian Arabic
- Urban dialect: cair1241 - Cairene Arabic
- Historical stage: clas1252 - Classical Arabic
- Contact variety: malt1254 - Maltese (Arabic-influenced)
- Continuum point: nort3208 - Northern Vietnamese
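The identifiers in this list follow the Glottocode shape: four lowercase alphanumeric characters followed by four digits. A quick shape check can catch typos before data is committed. The sketch below uses a hypothetical helper name, `isWellFormedGlottocode`, and checks the format only; it does not confirm that a code exists in Glottolog.

```javascript
// Shape check only: four [a-z0-9] characters followed by four digits.
const GLOTTOCODE = /^[a-z0-9]{4}[0-9]{4}$/;

// Hypothetical helper (not part of OMOD).
function isWellFormedGlottocode(code) {
  return typeof code === 'string' && GLOTTOCODE.test(code);
}

console.log(isWellFormedGlottocode('arab1395')); // true
console.log(isWellFormedGlottocode('arab139'));  // false (only three digits)
```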
Data Schema
OMOD uses JSON Schema for validation and consistency across all language implementations.
Orthogon Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "OMOD Orthogon",
"type": "object",
"properties": {
"glyphon": {
"type": "string",
"description": "The written form/character(s)"
},
"interphon": {
"type": "string",
"description": "IPA phonetic representation"
},
"context": {
"type": "array",
"items": {
"enum": ["initial", "medial", "final", "standalone", "allophonic"]
},
"description": "Positions where this mapping applies"
},
"examples": {
"type": "array",
"items": {
"type": "object",
"properties": {
"word": { "type": "string" },
"pronunciation": { "type": "string" },
"meaning": { "type": "string" }
},
"required": ["word", "pronunciation"]
}
}
},
"required": ["glyphon", "interphon"]
}
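For quick checks where a full JSON Schema validator is not available, the main rules of the schema above can be mirrored by hand. The sketch below is an approximation, not a replacement for schema validation: `checkOrthogon` is a hypothetical helper that covers only the required fields, the context enum, and the required fields of each example.

```javascript
// Minimal hand-written check mirroring the schema's "required" and "enum"
// rules. A sketch only; a JSON Schema validator covers the full schema.
const CONTEXTS = ['initial', 'medial', 'final', 'standalone', 'allophonic'];

function checkOrthogon(o) {
  if (o === null || typeof o !== 'object' || Array.isArray(o)) {
    return ['orthogon must be an object'];
  }
  const errors = [];
  for (const key of ['glyphon', 'interphon']) {
    if (typeof o[key] !== 'string') errors.push(`${key} must be a string`);
  }
  if (o.context !== undefined) {
    if (!Array.isArray(o.context)) {
      errors.push('context must be an array');
    } else {
      for (const c of o.context) {
        if (!CONTEXTS.includes(c)) errors.push(`unknown context: ${c}`);
      }
    }
  }
  if (o.examples !== undefined) {
    if (!Array.isArray(o.examples)) {
      errors.push('examples must be an array');
    } else {
      o.examples.forEach((ex, i) => {
        for (const key of ['word', 'pronunciation']) {
          if (typeof ex?.[key] !== 'string') {
            errors.push(`examples[${i}].${key} must be a string`);
          }
        }
      });
    }
  }
  return errors;
}

console.log(checkOrthogon({ glyphon: 'r', interphon: 'z', context: ['initial'] }));
// []
console.log(checkOrthogon({ glyphon: 'r', context: ['middle'] }));
// [ 'interphon must be a string', 'unknown context: middle' ]
```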
Context Values
Context | Description | Example |
---|---|---|
initial | Beginning of a word/syllable | Vietnamese ng in "người" |
medial | Middle of a word | Tamil ழ் in "வாழ்க்கை" |
final | End of a word/syllable | Arabic ه in "الله" |
standalone | Character by itself | Tamil அ as "the letter a" |
allophonic | Varies by phonetic environment | English t aspirated vs unaspirated |
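The positional values in this table can be derived from where a glyphon occurs in a word. The sketch below assumes word-level positions only (it does not segment syllables) and cannot detect `allophonic`, which depends on phonetic environment rather than position. `positionContext` is a hypothetical helper.

```javascript
// Hypothetical helper: classify where a glyphon occurrence sits in a word.
// `start` is a UTF-16 index, as returned by String.prototype.indexOf.
function positionContext(word, glyphon, start) {
  if (glyphon.length === 0 || !word.startsWith(glyphon, start)) return null;
  const atStart = start === 0;
  const atEnd = start + glyphon.length === word.length;
  if (atStart && atEnd) return 'standalone';
  if (atStart) return 'initial';
  if (atEnd) return 'final';
  return 'medial';
}

const tamil = '\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD'; // தமிழ்
const zha = '\u0BB4\u0BCD';                     // ழ்
console.log(positionContext(tamil, zha, tamil.indexOf(zha))); // final
console.log(positionContext('thing', 'th', 0));               // initial
console.log(positionContext('thing', 'i', 2));                // medial
```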
Languoid Examples
Tamil (தமிழ்)
Tamil orthography demonstrates complex mappings, including the retroflex sounds characteristic of Dravidian languages:
- 247 orthogons including vowels, consonants, and combinations
- Retroflex consonants: ட் /ʈ/, ண் /ɳ/, ழ் /ɻ/, ள் /ɭ/
- Context-sensitive pronunciations
- Brahmic abugida structure
Example: Retroflex Approximant
{
"glyphon": "ழ்",
"interphon": "ɻ",
"context": ["medial", "final"],
"examples": [
{ "word": "தமிழ்", "pronunciation": "t̪amiɻ", "meaning": "Tamil" },
{ "word": "வாழ்க்கை", "pronunciation": "ʋaːɻkkai̯", "meaning": "life" }
]
}
Arabic (العربية)
Arabic demonstrates multiple orthosets including traditional script and digital adaptations:
Traditional Arabic Script
{
"glyphon": "ع",
"interphon": "ʕ",
"context": ["initial", "medial", "final"],
"examples": [
{ "word": "عرب", "pronunciation": "ʕarab", "meaning": "Arabs" }
]
}
Arabic Chat Alphabet
{
"orthoset": "arb-chat",
"glyphon": "3",
"interphon": "ʕ",
"context": ["initial", "medial", "final"],
"examples": [
{ "word": "3arab", "pronunciation": "ʕarab", "meaning": "Arabs (chat spelling)" }
]
}
The Arabic chat alphabet uses numerals to represent sounds that have no equivalent letter in the Latin alphabet:
- 2 = ء (hamza)
- 3 = ع (ain)
- 5 = خ (kha)
- 7 = ح (ha)
- 9 = ص (sad)
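The digit correspondences above can be held in a small lookup table. The sketch below is illustrative only: `chatDigitLetters` is a hypothetical helper that reports which Arabic letters the digits in a chat-alphabet word stand for, and it ignores the Latin letters, whose mappings are not listed here.

```javascript
// Digit-to-letter table taken from the correspondences listed above.
const CHAT_DIGITS = new Map([
  ['2', '\u0621'], // hamza
  ['3', '\u0639'], // ain
  ['5', '\u062E'], // kha
  ['7', '\u062D'], // ha
  ['9', '\u0635'], // sad
]);

// Hypothetical helper: list the Arabic letters that the digits in a
// chat-alphabet word stand for. Latin letters are skipped.
function chatDigitLetters(word) {
  return Array.from(word)
    .filter(ch => CHAT_DIGITS.has(ch))
    .map(ch => ({ digit: ch, letter: CHAT_DIGITS.get(ch) }));
}

console.log(chatDigitLetters('3arab')); // [ { digit: '3', letter: 'ع' } ]
```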
Vietnamese (Tiếng Việt)
Vietnamese showcases a tonal language with systematic tone marking:
Tonal System
Tone | Name | Diacritic | IPA Tone | Example |
---|---|---|---|---|
Level | ngang | a | ˧˧ (33) | ma "ghost" |
Rising | sắc | á | ˧˥ (35) | má "mother" |
Falling | huyền | à | ˨˩ (21) | mà "but" |
Dipping-rising | hỏi | ả | ˧˩˧ (313) | mả "tomb" |
Rising glottalized | ngã | ã | ˧ˀ˥ (3ˀ5) | mã "code" |
Low glottalized | nặng | ạ | ˨˧ (23) | mạ "rice seedling" |
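Because each tone other than ngang is written with a distinct diacritic, the tone of a written syllable can be read off its combining marks. The sketch below is one possible approach, assuming the input is a single syllable; `toneOf` is a hypothetical helper, not part of OMOD.

```javascript
// Combining marks that carry tone in Quốc Ngữ, per the table above.
const TONE_MARKS = new Map([
  ['\u0301', 'sắc'],   // acute
  ['\u0300', 'huyền'], // grave
  ['\u0309', 'hỏi'],   // hook above
  ['\u0303', 'ngã'],   // tilde
  ['\u0323', 'nặng'],  // dot below
]);

// Hypothetical helper: name the tone of a single written syllable.
function toneOf(syllable) {
  // NFD separates base letters from combining marks, so vowel-quality
  // marks (circumflex, breve, horn) do not hide the tone mark.
  for (const ch of syllable.normalize('NFD')) {
    if (TONE_MARKS.has(ch)) return TONE_MARKS.get(ch);
  }
  return 'ngang'; // no tone mark: level tone
}

console.log(toneOf('má'));    // sắc
console.log(toneOf('người')); // huyền
console.log(toneOf('ma'));    // ngang
```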
Special Characters
{
"glyphon": "đ",
"interphon": "ɗ",
"context": ["initial"],
"examples": [
{ "word": "đúng", "pronunciation": "ɗuŋ˧˥", "meaning": "correct" }
]
}
The same glyphon can have different pronunciations across languoids. For example, Vietnamese "r" varies by region:
// Northern Vietnamese (Hanoi)
{
"orthoset": "viet-north-hanoi",
"glyphon": "r",
"interphon": "z"
}
// Southern Vietnamese (Saigon)
{
"orthoset": "viet-south-saigon",
"glyphon": "r",
"interphon": "ʐ"
}
// Central Vietnamese (Hue)
{
"orthoset": "viet-central-hue",
"glyphon": "r",
"interphon": "ʐ" // but with different phonetic realization
}
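Regional entries like these can be compared side by side once they are loaded into one array. The sketch below assumes each orthogon carries an `orthoset` field, as in the examples above; `interphonsByOrthoset` is a hypothetical helper.

```javascript
// Sample orthogons copied from the regional examples above.
const variants = [
  { orthoset: 'viet-north-hanoi', glyphon: 'r', interphon: 'z' },
  { orthoset: 'viet-south-saigon', glyphon: 'r', interphon: 'ʐ' },
  { orthoset: 'viet-central-hue', glyphon: 'r', interphon: 'ʐ' },
];

// Hypothetical helper: map each orthoset id to its interphon for one glyphon.
function interphonsByOrthoset(orthogons, glyphon) {
  const result = {};
  for (const o of orthogons) {
    if (o.glyphon === glyphon) result[o.orthoset] = o.interphon;
  }
  return result;
}

console.log(interphonsByOrthoset(variants, 'r')['viet-north-hanoi']); // z
```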
Implementation Guide
File Structure
Each languoid implementation follows this structure:
omod-[languoid]/
├── omod-[languoid]-data.json # Main orthogon mappings
├── omod-[languoid]-languoids.json # Language metadata
├── omod-[languoid]-orthosets.json # Writing system variants
├── omod-[languoid]-schema.json # Validation schema
└── README.md # Languoid-specific notes
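The naming convention in this layout is regular enough to generate. The sketch below builds the expected file paths for a languoid folder; `expectedFiles` is a hypothetical helper, and the folder name passed in (for example `tamil`) is assumed to be whatever replaces `[languoid]`.

```javascript
// Hypothetical helper: list the files the layout above expects for a languoid.
function expectedFiles(languoid) {
  const dir = `omod-${languoid}`;
  const jsonFiles = ['data', 'languoids', 'orthosets', 'schema']
    .map(kind => `${dir}/omod-${languoid}-${kind}.json`);
  return [...jsonFiles, `${dir}/README.md`];
}

console.log(expectedFiles('tamil')[0]); // omod-tamil/omod-tamil-data.json
```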
Data Format Example
Languoid File
{
"languoid": "Vietnamese",
"endonym": "Tiếng Việt",
"glottologCode": "viet1251",
"iso639_3": "vie",
"region": "Southeast Asia",
"population": "100400000"
}
Orthoset File
[
{
"id": "viet-quocngu-2025",
"languoid": "viet1251",
"name": "Vietnamese Quốc Ngữ (2025)",
"description": "Modern Latin-based Vietnamese orthography with tone diacritics",
"script": "Latin",
"created": "2025-01-01"
}
]
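In these two files, the orthoset's `languoid` field holds the same value as the languoid's `glottologCode`, which suggests a simple join between them. The sketch below relies on that reading of the examples; `resolveOrthosets` and the `languoidRecord` property are hypothetical names.

```javascript
// Sample records abbreviated from the two files above.
const languoids = [
  { languoid: 'Vietnamese', glottologCode: 'viet1251', iso639_3: 'vie' },
];
const orthosets = [
  { id: 'viet-quocngu-2025', languoid: 'viet1251', script: 'Latin' },
];

// Hypothetical helper: attach languoid metadata to each orthoset, assuming
// an orthoset's "languoid" field holds the languoid's "glottologCode".
function resolveOrthosets(orthosets, languoids) {
  const byCode = new Map(languoids.map(l => [l.glottologCode, l]));
  return orthosets.map(s => ({
    ...s,
    languoidRecord: byCode.get(s.languoid) ?? null,
  }));
}

console.log(resolveOrthosets(orthosets, languoids)[0].languoidRecord.languoid);
// Vietnamese
```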
Best Practices
- Use standard IPA notation - Ensure consistency across languages
- Provide context - Specify where each mapping applies
- Include examples - Real words help users understand usage
- Document edge cases - Note dialectal variations or exceptions
- Cite your sources - Especially important for orthosets from specific texts or authors
- Be granular when needed - Don't hesitate to create specific orthosets for unique sources
- Validate with schema - Use JSON Schema validation before commits
When creating an orthoset from a specific source, include metadata about:
- Publication title and author
- Year of publication
- Page numbers (if relevant)
- Any editorial decisions or normalizations made
- Dialectal or regional context
{
"id": "sencoten-elliott-1984",
"name": "SENĆOŦEN as used in 'Saanich Placenames'",
"source": {
"title": "Saanich Placenames",
"author": "Dave Elliott Sr.",
"year": 1984,
"publisher": "First Edition",
"notes": "Orthography developed by PENÁĆ"
}
}
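A contributor could check source metadata like this for gaps before submitting. The sketch below treats `title`, `author`, and `year` as the fields to look for, which is an assumption drawn from the list and example above rather than a rule stated by OMOD; `missingSourceFields` is a hypothetical helper.

```javascript
// Hypothetical checklist based on the metadata items listed above.
const RECOMMENDED_SOURCE_FIELDS = ['title', 'author', 'year'];

// Return the recommended source fields that are absent or empty.
function missingSourceFields(orthoset) {
  const source = orthoset.source ?? {};
  return RECOMMENDED_SOURCE_FIELDS.filter(
    field => source[field] === undefined || source[field] === ''
  );
}

console.log(missingSourceFields({
  id: 'sencoten-elliott-1984',
  source: { title: 'Saanich Placenames', author: 'Dave Elliott Sr.', year: 1984 },
})); // []
console.log(missingSourceFields({ id: 'example-without-source' }));
// [ 'title', 'author', 'year' ]
```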
Contributing
OMOD is an open project welcoming contributions for new languages and improvements to existing mappings.
How to Contribute
- Fork the repository on GitHub
- Create a languoid folder following the naming convention
- Implement required files:
  - Data file with orthogon mappings
  - Languoid metadata
  - Orthoset definitions
  - Schema (can extend base schema)
- Validate your data against the schema
- Submit a pull request with a description of changes
Contribution Guidelines
- Use accurate IPA transcriptions
- Cite linguistic sources where applicable
- Include native speaker verification when possible
- Follow existing formatting conventions
- Add comprehensive examples
Priority Languages
We're particularly interested in:
- Languages with complex orthographies
- Endangered languages needing documentation
- Languages with multiple writing systems
- Historical orthographies and manuscript traditions
- Dialectal variations within language families
- Contact varieties and creoles
- Source-specific orthographies from important texts
When contributing, consider creating separate orthosets for:
- Regional variations: Northern vs Southern pronunciations
- Historical periods: Medieval vs Modern spellings
- Literary sources: Specific authors or manuscripts
- Official vs colloquial: Government standards vs street usage
- Digital adaptations: SMS spelling, chat alphabets
This granularity helps researchers track orthographic evolution and variation.
API Reference
OMOD data can be accessed programmatically through various methods:
Direct JSON Access
// Fetch Tamil orthogons
const response = await fetch('/omod-tamil/omod-tamil-data.json');
const tamilData = await response.json();
// Find specific mapping
const zha = tamilData.find(o => o.glyphon === 'ழ்');
console.log(zha.interphon); // Output: ɻ
Query Functions
// Get all orthogons for a specific context
// (context is optional in the schema, so orthogons without it are skipped)
function getByContext(data, context) {
  return data.filter(o => o.context?.includes(context));
}
// Get mappings by script type
function getByScript(orthosets, script) {
return orthosets.filter(s => s.script === script);
}
// Get orthosets from a specific source
function getBySource(orthosets, author, year) {
return orthosets.filter(s =>
s.source?.author === author &&
s.source?.year === year
);
}
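// Find dialectal variations
// Orthogons reference an orthoset id, and each orthoset names its languoid
// by Glottocode, so the languoid is resolved through the orthoset.
function getDialectalVariants(data, glyphon, orthosets, languoids) {
  return data.filter(o => o.glyphon === glyphon)
    .map(o => {
      const orthoset = orthosets.find(s => s.id === o.orthoset);
      return {
        orthoset: o.orthoset,
        languoid: languoids.find(l => l.glottologCode === orthoset?.languoid),
        pronunciation: o.interphon
      };
    });
}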
Validation
// Validate orthogon against schema
import Ajv from 'ajv';
import schema from './omod-orthogon.schema.json';
const ajv = new Ajv();
const validate = ajv.compile(schema);
const valid = validate(orthogonData);
if (!valid) console.log(validate.errors);
Integration Examples
Text-to-Speech
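// Convert Tamil text to IPA for TTS
// Glyphons such as "ழ்" span several code points, so take the longest
// matching glyphon at each position rather than mapping single characters.
function tamilToIPA(text, mappings) {
  const byLength = mappings
    .filter(m => m.glyphon.length > 0)
    .sort((a, b) => b.glyphon.length - a.glyphon.length);
  let ipa = '';
  for (let i = 0; i < text.length; ) {
    const mapping = byLength.find(m => text.startsWith(m.glyphon, i));
    // Unmapped text passes through one code point at a time
    const chunk = mapping ? mapping.glyphon : String.fromCodePoint(text.codePointAt(i));
    ipa += mapping ? mapping.interphon : chunk;
    i += chunk.length;
  }
  return ipa;
}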
Language Learning
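// Generate pronunciation guide
// Segments the word by longest matching glyphon so that multi-character
// glyphons (digraphs, consonant plus diacritic) stay together.
function getPronunciationGuide(word, mappings) {
  const byLength = mappings
    .filter(m => m.glyphon.length > 0)
    .sort((a, b) => b.glyphon.length - a.glyphon.length);
  const guide = [];
  for (let i = 0; i < word.length; ) {
    const m = byLength.find(m => word.startsWith(m.glyphon, i));
    const char = m ? m.glyphon : String.fromCodePoint(word.codePointAt(i));
    guide.push({
      char: char,
      ipa: m?.interphon || '?',
      examples: m?.examples || []
    });
    i += char.length;
  }
  return guide;
}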
OMOD Documentation v0.7 • Last updated: May 2025
Schema Version: Draft-07 • Languages: Tamil, Arabic, Vietnamese
Supporting languoids from macro-languages to village dialects
Orthosets from national standards to individual manuscripts
© 2025 Squirrel Ridge Observatory Centre 🐿️