OMOD Documentation

The Orthographic Mapping Ondemand Datasystem (OMOD) is a framework for mapping orthographic representations (written characters) to their phonetic realizations (pronunciations) across the world's languages, dialects, and written sources. Just as Glottolog's Glottocodes provide unique identifiers for languoids (languages, dialects, and other language varieties), OMOD provides a systematic way to document orthographic-phonetic relationships at any level of granularity.

Key Features

Standardized JSON schema for orthographic-phonetic mappings
Context-aware mappings (initial, medial, final positions)
Support for multiple writing systems per languoid
Granular orthosets - from national standards to single manuscripts
Flexible languoid system - languages, dialects, continuums, and varieties
IPA (International Phonetic Alphabet) based pronunciations
Source documentation for scholarly attribution
Extensible framework for new languages, dialects, and sources

Core Concepts

Orthogon

An orthogon is the fundamental unit in OMOD - a single mapping between a written form and its pronunciation. Each orthogon contains:

{
    "glyphon": "ழ்",
    "interphon": "ɻ",
    "context": ["medial", "final"],
    "examples": [
        { 
            "word": "தமிழ்", 
            "pronunciation": "t̪amiɻ", 
            "meaning": "Tamil" 
        }
    ]
}

Glyphon

A glyphon is the orthographic unit - the written character or character sequence. This can be:

A single character: a, க, ع
A diacritic combination: á, கா, عَ
A digraph: ng, ch, gh
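
Because a diacritic combination can be stored either precomposed or as a base character plus combining marks, two visually identical glyphons may not compare equal as raw strings. The sketch below shows one way to guard against that. It assumes glyphons are normalized to NFC before comparison, which is a choice made for this example and not something OMOD prescribes; `sameGlyphon` and `codePointCount` are hypothetical helper names.

```javascript
// Hypothetical helpers; NFC normalization is an assumption, not an OMOD rule.

// Compare two glyphons after Unicode normalization, so that a precomposed
// "á" (U+00E1) equals "a" followed by a combining acute accent (U+0301).
function sameGlyphon(a, b) {
    return a.normalize('NFC') === b.normalize('NFC');
}

// Count Unicode code points (not UTF-16 code units) in a glyphon.
function codePointCount(glyphon) {
    return Array.from(glyphon).length;
}

console.log(sameGlyphon('\u00E1', 'a\u0301'));  // true
console.log(codePointCount('\u0BB4\u0BCD'));    // 2 (Tamil letter + pulli)
console.log(codePointCount('ng'));              // 2 (digraph)
```

A single glyphon can therefore span several code points, which matters whenever text is matched against glyphons one character at a time.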

Interphon

An interphon is the phonetic representation in IPA notation. For tonal languages, this includes tone markers:

Simple consonant: /k/
Vowel with tone: /a˧˥/ (rising tone)
Complex sound: /ʈʂ/ (retroflex affricate)
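
When tone is written with IPA tone letters, it can be useful to separate the segmental part of an interphon from its tone. The following is a minimal sketch, assuming the tone letters (U+02E5 to U+02E9) always appear at the end of the string; `splitTone` is a hypothetical helper, not part of OMOD.

```javascript
// Hypothetical helper: split an interphon into its segments and its trailing
// IPA tone letters (U+02E5 to U+02E9). Assumes tone letters come last.
function splitTone(interphon) {
    const match = interphon.match(/^(.*?)([\u02E5-\u02E9]*)$/su);
    return { segments: match[1], tone: match[2] };
}

console.log(splitTone('a\u02E7\u02E5')); // { segments: 'a', tone: '˧˥' }
console.log(splitTone('k'));             // { segments: 'k', tone: '' }
```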

Orthoset

An orthoset is a collection of orthogons representing a specific writing system, orthographic variant, or even conventions from a single source. Orthosets can be extremely granular - they might represent:

A standardized national orthography
Regional spelling variations
Historical spelling conventions from a specific era
Orthographic choices found in a single manuscript or publication
Digital adaptations (like chat alphabets)
Individual author's spelling preferences

Orthoset	Description	ID
Modern Standard Arabic	Contemporary Arabic script used in formal contexts	arb-msa
Arabic Chat Alphabet	Latin letters and numerals representing Arabic sounds in digital communication	arb-chat
Vietnamese Quốc Ngữ	Latin-based script with tone diacritics	viet-quocngu

Orthoset Granularity

Orthosets can be as specific as needed. For example, you might have separate orthosets for:

tam-sangam-300bce - Tamil orthography in Sangam literature (300 BCE)
tam-pallava-600ce - Pallava dynasty inscriptions (600 CE)
tam-modern-tn-gov - Tamil Nadu government standard (2020s)
tam-modern-sri-lanka - Sri Lankan Tamil conventions
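
The example IDs above all begin with the same short prefix (`tam-`). Assuming that convention holds, and the source does not spell out a formal ID grammar, related orthosets can be grouped with a few lines of code. `groupByPrefix` is a hypothetical helper for illustration.

```javascript
// Assumption: the text before the first hyphen identifies the languoid.
// This mirrors the example IDs; OMOD does not define the ID format here.
function groupByPrefix(orthosetIds) {
    const groups = {};
    for (const id of orthosetIds) {
        const prefix = id.split('-')[0];
        if (!groups[prefix]) groups[prefix] = [];
        groups[prefix].push(id);
    }
    return groups;
}

const grouped = groupByPrefix([
    'tam-sangam-300bce',
    'tam-pallava-600ce',
    'tam-modern-tn-gov',
    'arb-msa',
]);
console.log(grouped.tam.length); // 3
console.log(grouped.arb);        // [ 'arb-msa' ]
```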

Languoid

A languoid represents any linguistic variety - not just "languages" but also dialects, sociolects, chronolects, or even language continuums. Following Glottolog's approach, languoids acknowledge that linguistic boundaries are often fuzzy and politically charged. A languoid might represent:

A standardized national language
A regional dialect with distinct phonology
A historical stage of a language
A sociolect (social variety)
A point on a dialect continuum
A contact variety or creole

{
    "languoid": "Tamil",
    "endonym": "தமிழ்",
    "glottologCode": "tami1289",
    "iso639_3": "tam",
    "region": "South Asia",
    "population": "75000000"
}
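
Identifier typos are easy to miss in languoid records. The sketch below checks only the shape of the two codes: a Glottocode is four lowercase alphanumeric characters followed by four digits, and an ISO 639-3 code is three lowercase letters. It does not confirm that a code actually exists in Glottolog or ISO 639-3, and `checkLanguoid` is a hypothetical helper.

```javascript
// Shape checks only; these do not look the codes up in any registry.
const GLOTTOCODE = /^[a-z0-9]{4}[0-9]{4}$/;
const ISO_639_3 = /^[a-z]{3}$/;

// Return the names of fields whose values do not have the expected shape.
function checkLanguoid(record) {
    const problems = [];
    if (!GLOTTOCODE.test(record.glottologCode ?? '')) problems.push('glottologCode');
    if (!ISO_639_3.test(record.iso639_3 ?? '')) problems.push('iso639_3');
    return problems;
}

console.log(checkLanguoid({ glottologCode: 'malt1254', iso639_3: 'mlt' })); // []
console.log(checkLanguoid({ glottologCode: 'mlt1254', iso639_3: 'mlt' }));  // [ 'glottologCode' ]
```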

Languoid Examples

Languoids can represent various levels of linguistic granularity:

Macro-language: arab1395 - Arabic macrolanguage
Regional variety: egyp1253 - Egyptian Arabic
Urban dialect: cair1241 - Cairene Arabic
Historical stage: clas1252 - Classical Arabic
Contact variety: malt1254 - Maltese (Arabic-influenced)
Continuum point: nort3208 - Northern Vietnamese

Data Schema

OMOD uses JSON Schema for validation and consistency across all language implementations.

Orthogon Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "OMOD Orthogon",
    "type": "object",
    "properties": {
        "glyphon": { 
            "type": "string",
            "description": "The written form/character(s)"
        },
        "interphon": { 
            "type": "string",
            "description": "IPA phonetic representation" 
        },
        "context": {
            "type": "array",
            "items": {
                "enum": ["initial", "medial", "final", "standalone", "allophonic"]
            },
            "description": "Positions where this mapping applies"
        },
        "examples": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "word": { "type": "string" },
                    "pronunciation": { "type": "string" },
                    "meaning": { "type": "string" }
                },
                "required": ["word", "pronunciation"]
            }
        }
    },
    "required": ["glyphon", "interphon"]
}
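
For quick checks where a full JSON Schema validator is not available, the constraints in the schema above can be approximated by hand. This is a minimal sketch, not a JSON Schema implementation: it covers only the required fields, the `context` enum, and the required example fields. `checkOrthogon` is a hypothetical name.

```javascript
// A hand-written approximation of the orthogon schema; not a JSON Schema engine.
const CONTEXTS = ['initial', 'medial', 'final', 'standalone', 'allophonic'];

function checkOrthogon(o) {
    const errors = [];
    // "glyphon" and "interphon" are required strings.
    for (const key of ['glyphon', 'interphon']) {
        if (typeof o[key] !== 'string') errors.push(`${key} must be a string`);
    }
    // "context" is optional; when present, every item must be a known value.
    if (o.context !== undefined) {
        if (!Array.isArray(o.context)) {
            errors.push('context must be an array');
        } else {
            for (const c of o.context) {
                if (!CONTEXTS.includes(c)) errors.push(`unknown context: ${c}`);
            }
        }
    }
    // "examples" is optional; each example needs "word" and "pronunciation".
    if (o.examples !== undefined) {
        if (!Array.isArray(o.examples)) {
            errors.push('examples must be an array');
        } else {
            o.examples.forEach((ex, i) => {
                for (const key of ['word', 'pronunciation']) {
                    if (typeof ex?.[key] !== 'string') {
                        errors.push(`examples[${i}].${key} must be a string`);
                    }
                }
            });
        }
    }
    return errors;
}

console.log(checkOrthogon({ glyphon: 'ng', interphon: 'ŋ', context: ['initial'] })); // []
console.log(checkOrthogon({ glyphon: 'ng', context: ['middle'] }));
// [ 'interphon must be a string', 'unknown context: middle' ]
```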

Context Values

Context	Description	Example
`initial`	Beginning of a word/syllable	Vietnamese ng in "người"
`medial`	Middle of a word	Tamil ழ் in "வாழ்க்கை"
`final`	End of a word/syllable	Arabic ه in "الله"
`standalone`	Character by itself	Tamil அ as "the letter a"
`allophonic`	Varies by phonetic environment	English t aspirated vs unaspirated
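
The positional values in the table can be derived mechanically from an example word. The sketch below works at word level only: it ignores syllable boundaries and cannot detect the `allophonic` value, which depends on phonetic environment. `positionsOf` is a hypothetical helper.

```javascript
// Hypothetical helper: report where a glyphon occurs in a word, using the
// positional context names from the table. Word-level only.
function positionsOf(glyphon, word) {
    if (glyphon === '') return [];
    if (word === glyphon) return ['standalone'];
    const found = new Set();
    let i = word.indexOf(glyphon);
    while (i !== -1) {
        if (i === 0) found.add('initial');
        else if (i + glyphon.length === word.length) found.add('final');
        else found.add('medial');
        i = word.indexOf(glyphon, i + 1);
    }
    // Return the positions in a fixed, readable order.
    return ['initial', 'medial', 'final'].filter(p => found.has(p));
}

console.log(positionsOf('ng', 'người')); // [ 'initial' ]
console.log(positionsOf('ng', 'đúng'));  // [ 'final' ]
```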

Languoid Examples

Tamil (தமிழ்)

Tamil orthography demonstrates complex mappings, including the retroflex sounds characteristic of Dravidian languages:

Tamil Features

247 orthogons including vowels, consonants, and combinations
Retroflex consonants: ட் /ʈ/, ண் /ɳ/, ழ் /ɻ/, ள் /ɭ/
Context-sensitive pronunciations
Brahmic abugida structure

Example: Retroflex Approximant

{
    "glyphon": "ழ்",
    "interphon": "ɻ",
    "context": ["medial", "final"],
    "examples": [
        { "word": "தமிழ்", "pronunciation": "t̪amiɻ", "meaning": "Tamil" },
        { "word": "வாழ்க்கை", "pronunciation": "ʋaːɻkkai̯", "meaning": "life" }
    ]
}

Arabic (العربية)

Arabic demonstrates multiple orthosets including traditional script and digital adaptations:

Traditional Arabic Script

{
    "glyphon": "ع",
    "interphon": "ʕ",
    "context": ["initial", "medial", "final"],
    "examples": [
        { "word": "عرب", "pronunciation": "ʕarab", "meaning": "Arabs" }
    ]
}

Arabic Chat Alphabet

{
    "orthoset": "arb-chat",
    "glyphon": "3",
    "interphon": "ʕ",
    "context": ["initial", "medial", "final"],
    "examples": [
        { "word": "3arab", "pronunciation": "ʕarab", "meaning": "Arabs (chat spelling)" }
    ]
}

Note on Arabic Numerals

The Arabic chat alphabet uses numerals to represent sounds that have no equivalent letter in the Latin alphabet: 2 = ء (hamza), 3 = ع (ain), 5 = خ (kha), 7 = ح (ha), 9 = ص (sad)
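
The numeral conventions in the note can be held in a small lookup table. This sketch uses only the five pairs listed above and simply reports which numerals occur in a chat-alphabet word; real chat spelling varies by region and writer, so treat the table as illustrative. `chatNumeralsIn` is a hypothetical helper.

```javascript
// Numeral-to-letter pairs exactly as listed in the note above.
const CHAT_NUMERALS = new Map([
    ['2', '\u0621'], // hamza
    ['3', '\u0639'], // ain
    ['5', '\u062E'], // kha
    ['7', '\u062D'], // ha
    ['9', '\u0635'], // sad
]);

// List the chat numerals found in a word together with their Arabic letters.
function chatNumeralsIn(word) {
    return Array.from(word)
        .filter(ch => CHAT_NUMERALS.has(ch))
        .map(ch => ({ numeral: ch, letter: CHAT_NUMERALS.get(ch) }));
}

console.log(chatNumeralsIn('3arab'));
// one entry: numeral '3', letter ain (U+0639)
```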

Vietnamese (Tiếng Việt)

Vietnamese showcases a tonal language with systematic tone marking:

Tonal System

Tone	Name	Diacritic	IPA Tone	Example
Level	ngang	a	˧˧ (33)	ma "ghost"
Rising	sắc	á	˧˥ (35)	má "mother"
Falling	huyền	à	˨˩ (21)	mà "but"
Dipping-rising	hỏi	ả	˧˩˧ (313)	mả "tomb"
Rising glottalized	ngã	ã	˧ˀ˥ (3ˀ5)	mã "code"
Low glottalized	nặng	ạ	˧ˀ˨ʔ (3ˀ2ʔ)	mạ "rice seedling"
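
Because each tone other than ngang is written with its own diacritic, the tone of a syllable can be read off its spelling. The sketch below decomposes the syllable with Unicode NFD and looks for one of the five combining tone marks; vowel-quality marks such as the breve, circumflex, and horn are ignored. `toneOf` is a hypothetical helper and handles one syllable at a time.

```javascript
// Combining marks used for the five marked tones of Quốc Ngữ.
const TONE_MARKS = new Map([
    ['\u0301', 'sắc'],   // combining acute accent
    ['\u0300', 'huyền'], // combining grave accent
    ['\u0309', 'hỏi'],   // combining hook above
    ['\u0303', 'ngã'],   // combining tilde
    ['\u0323', 'nặng'],  // combining dot below
]);

// Return the tone name for a single syllable; unmarked syllables are ngang.
function toneOf(syllable) {
    for (const ch of syllable.normalize('NFD')) {
        if (TONE_MARKS.has(ch)) return TONE_MARKS.get(ch);
    }
    return 'ngang';
}

console.log(toneOf('ma'));    // ngang
console.log(toneOf('má'));    // sắc
console.log(toneOf('người')); // huyền
```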

Special Characters

{
    "glyphon": "đ",
    "interphon": "ɗ",
    "context": ["initial"],
    "examples": [
        { "word": "đúng", "pronunciation": "ɗuŋ˧˥", "meaning": "correct" }
    ]
}

Dialectal Variation Example

The same glyphon can have different pronunciations across languoids. For example, Vietnamese "r" varies by region:

// Northern Vietnamese (Hanoi)
{
    "orthoset": "viet-north-hanoi",
    "glyphon": "r",
    "interphon": "z"
}

// Southern Vietnamese (Saigon)
{
    "orthoset": "viet-south-saigon",
    "glyphon": "r",
    "interphon": "ʐ"
}

// Central Vietnamese (Hue): same symbol as Saigon, but with a different phonetic realization
{
    "orthoset": "viet-central-hue",
    "glyphon": "r",
    "interphon": "ʐ"
}
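
To compare realizations side by side, records like the three above can be collapsed into a single lookup keyed by orthoset. This is a small sketch assuming each record carries an `orthoset` field, as these examples do; `realizationsByOrthoset` is a hypothetical helper.

```javascript
// The three example records, gathered into one array.
const records = [
    { orthoset: 'viet-north-hanoi', glyphon: 'r', interphon: 'z' },
    { orthoset: 'viet-south-saigon', glyphon: 'r', interphon: 'ʐ' },
    { orthoset: 'viet-central-hue', glyphon: 'r', interphon: 'ʐ' },
];

// Map each orthoset id to its interphon for one glyphon.
function realizationsByOrthoset(data, glyphon) {
    return Object.fromEntries(
        data
            .filter(r => r.glyphon === glyphon)
            .map(r => [r.orthoset, r.interphon])
    );
}

const r = realizationsByOrthoset(records, 'r');
console.log(r['viet-north-hanoi']); // z
console.log(Object.keys(r).length); // 3
```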

Implementation Guide

File Structure

Each languoid implementation follows this structure:

omod-[languoid]/
├── omod-[languoid]-data.json        # Main orthogon mappings
├── omod-[languoid]-languoids.json   # Languoid metadata
├── omod-[languoid]-orthosets.json   # Writing system variants
├── omod-[languoid]-schema.json      # Validation schema
└── README.md                        # Languoid-specific notes
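
The naming convention is regular enough to generate. The sketch below builds the list of expected paths for a languoid folder, which can then be compared against what is actually on disk; `expectedFiles` is a hypothetical helper and the folder name passed in is whatever replaces `[languoid]`.

```javascript
// Build the expected file paths for one languoid folder,
// following the omod-[languoid] layout shown above.
function expectedFiles(languoid) {
    const dir = `omod-${languoid}`;
    const jsonFiles = ['data', 'languoids', 'orthosets', 'schema']
        .map(kind => `${dir}/${dir}-${kind}.json`);
    return [...jsonFiles, `${dir}/README.md`];
}

console.log(expectedFiles('tamil'));
// [
//   'omod-tamil/omod-tamil-data.json',
//   'omod-tamil/omod-tamil-languoids.json',
//   'omod-tamil/omod-tamil-orthosets.json',
//   'omod-tamil/omod-tamil-schema.json',
//   'omod-tamil/README.md'
// ]
```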

Data Format Example

Languoid File

{
    "languoid": "Vietnamese",
    "endonym": "Tiếng Việt",
    "glottologCode": "viet1252",
    "iso639_3": "vie",
    "region": "Southeast Asia",
    "population": "100400000"
}

Orthoset File

[
    {
        "id": "viet-quocngu-2025",
        "languoid": "viet1252",
        "name": "Vietnamese Quốc Ngữ (2025)",
        "description": "Modern Latin-based Vietnamese orthography with tone diacritics",
        "script": "Latin",
        "created": "2025-01-01"
    }
]
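
Since each orthoset names its languoid by Glottocode, a simple consistency check is to confirm that every such reference resolves to a languoid record. The sketch below assumes, as in the examples above, that orthosets store the code in `languoid` and languoid records store it in `glottologCode`. The sample records and the `malt-*` IDs are hypothetical, as is the helper name `unknownLanguoidRefs`.

```javascript
// Return the ids of orthosets whose "languoid" value matches no languoid record.
function unknownLanguoidRefs(orthosets, languoids) {
    const known = new Set(languoids.map(l => l.glottologCode));
    return orthosets.filter(o => !known.has(o.languoid)).map(o => o.id);
}

// Hypothetical sample data for illustration only.
const languoids = [{ languoid: 'Maltese', glottologCode: 'malt1254' }];
const orthosets = [
    { id: 'malt-standard', languoid: 'malt1254' },
    { id: 'malt-typo', languoid: 'malt1245' },
];

console.log(unknownLanguoidRefs(orthosets, languoids)); // [ 'malt-typo' ]
```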

Best Practices

Use standard IPA notation - Ensure consistency across languages
Provide context - Specify where each mapping applies
Include examples - Real words help users understand usage
Document edge cases - Note dialectal variations or exceptions
Cite your sources - Especially important for orthosets from specific texts or authors
Be granular when needed - Don't hesitate to create specific orthosets for unique sources
Validate with schema - Use JSON Schema validation before commits

Source Documentation

When creating an orthoset from a specific source, include metadata about:

Publication title and author
Year of publication
Page numbers (if relevant)
Any editorial decisions or normalizations made
Dialectal or regional context

{
    "id": "sencoten-elliott-1984",
    "name": "SENĆOŦEN as used in 'Saanich Placenames'",
    "source": {
        "title": "Saanich Placenames",
        "author": "Dave Elliott Sr.",
        "year": 1984,
        "publisher": "First Edition",
        "notes": "Orthography developed by PENÁĆ"
    }
}
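
Source metadata in this shape can be turned into a short attribution line for display next to an orthoset. This is a minimal sketch: the output format is an arbitrary choice for the example and not a citation style OMOD defines, and `formatSource` is a hypothetical helper.

```javascript
// Build a one-line attribution from an orthoset's "source" block.
// Missing fields are skipped and never guessed.
function formatSource(orthoset) {
    const s = orthoset.source;
    if (!s) return `${orthoset.id}: no source recorded`;
    const parts = [
        s.author,
        s.year !== undefined ? `(${s.year})` : null,
        s.title,
    ].filter(Boolean);
    return `${orthoset.id}: ${parts.join(' ')}`;
}

console.log(formatSource({
    id: 'sencoten-elliott-1984',
    source: { title: 'Saanich Placenames', author: 'Dave Elliott Sr.', year: 1984 },
}));
// sencoten-elliott-1984: Dave Elliott Sr. (1984) Saanich Placenames
```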

Contributing

OMOD is an open project welcoming contributions for new languages and improvements to existing mappings.

How to Contribute

Fork the repository on GitHub
Create a languoid folder following the naming convention
Implement required files:
- Data file with orthogon mappings
- Languoid metadata
- Orthoset definitions
- Schema (can extend base schema)
Validate your data against the schema
Submit a pull request with description of changes

Contribution Guidelines

Requirements

Use accurate IPA transcriptions
Cite linguistic sources where applicable
Include native speaker verification when possible
Follow existing formatting conventions
Add comprehensive examples

Priority Languages

We're particularly interested in:

Languages with complex orthographies
Endangered languages needing documentation
Languages with multiple writing systems
Historical orthographies and manuscript traditions
Dialectal variations within language families
Contact varieties and creoles
Source-specific orthographies from important texts

Documenting Variations

When contributing, consider creating separate orthosets for:

Regional variations: Northern vs Southern pronunciations
Historical periods: Medieval vs Modern spellings
Literary sources: Specific authors or manuscripts
Official vs colloquial: Government standards vs street usage
Digital adaptations: SMS spelling, chat alphabets

This granularity helps researchers track orthographic evolution and variation.

API Reference

OMOD data can be accessed programmatically through various methods:

Direct JSON Access

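// Fetch Tamil orthogons
const response = await fetch('/omod-tamil/omod-tamil-data.json');
if (!response.ok) throw new Error(`Failed to load Tamil data: ${response.status}`);
const tamilData = await response.json();

// Find specific mapping (find() returns undefined when nothing matches)
const zha = tamilData.find(o => o.glyphon === 'ழ்');
console.log(zha?.interphon); // Output: ɻ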

Query Functions

// Get all orthogons for a specific context
// ("context" is optional in the schema, so skip records that omit it)
function getByContext(data, context) {
    return data.filter(o => o.context?.includes(context));
}

// Get mappings by script type
function getByScript(orthosets, script) {
    return orthosets.filter(s => s.script === script);
}

// Get orthosets from a specific source
function getBySource(orthosets, author, year) {
    return orthosets.filter(s => 
        s.source?.author === author && 
        s.source?.year === year
    );
}

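// Find dialectal variations
// Assumes each orthogon record carries "orthoset" and "languoid" fields and
// each languoid record has an "id"; adjust the field names to match your data.
function getDialectalVariants(data, glyphon, languoids) {
    return data.filter(o => o.glyphon === glyphon)
        .map(o => ({
            orthoset: o.orthoset,
            languoid: languoids.find(l => l.id === o.languoid),
            pronunciation: o.interphon
        }));
}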

Validation

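// Validate orthogon against schema
// Note: importing a .json file directly relies on bundler or runtime support for JSON modules.
import Ajv from 'ajv';
import schema from './omod-orthogon.schema.json';

const ajv = new Ajv();
const validate = ajv.compile(schema);

// orthogonData is the orthogon object to check
const valid = validate(orthogonData);
if (!valid) console.error(validate.errors);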

Integration Examples

Text-to-Speech

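// Convert Tamil text to IPA for TTS
// Glyphons such as "ழ்" span several code points, so match the longest
// glyphon at each position instead of looking up one code point at a time.
function tamilToIPA(text, mappings) {
    const sorted = mappings
        .filter(m => m.glyphon.length > 0)
        .sort((a, b) => b.glyphon.length - a.glyphon.length);
    let ipa = '';
    let i = 0;
    while (i < text.length) {
        const m = sorted.find(m => text.startsWith(m.glyphon, i));
        ipa += m ? m.interphon : text[i]; // pass unmapped characters through
        i += m ? m.glyphon.length : 1;
    }
    return ipa;
}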

Language Learning

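// Generate pronunciation guide
// Splits by grapheme cluster instead of UTF-16 code unit, so combined glyphons
// such as "ழ்" stay together; digraphs such as "ng" still need a longest-match lookup.
function getPronunciationGuide(word, mappings) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(word), ({ segment: char }) => {
        const m = mappings.find(m => m.glyphon === char);
        return {
            char,
            ipa: m?.interphon ?? '?',
            examples: m?.examples ?? []
        };
    });
}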