OMOD Documentation

The Orthographic Mapping Ondemand Datasystem (OMOD) is a framework for mapping orthographic representations (written characters) to their phonetic realizations (pronunciations) across world languages, dialects, and written sources. Just as Glottocodes provide unique identifiers for languoids (languages, dialects, and language varieties), OMOD provides a systematic way to document orthographic-phonetic relationships at any level of granularity.

Key Features
  • Standardized JSON schema for orthographic-phonetic mappings
  • Context-aware mappings (initial, medial, final positions)
  • Support for multiple writing systems per languoid
  • Granular orthosets - from national standards to single manuscripts
  • Flexible languoid system - languages, dialects, continuums, and varieties
  • IPA (International Phonetic Alphabet) based pronunciations
  • Source documentation for scholarly attribution
  • Extensible framework for new languages, dialects, and sources

Core Concepts

Orthogon

An orthogon is the fundamental unit in OMOD - a single mapping between a written form and its pronunciation. Each orthogon contains:

{
    "glyphon": "ழ்",
    "interphon": "ɻ",
    "context": ["medial", "final"],
    "examples": [
        { 
            "word": "தமிழ்", 
            "pronunciation": "t̪amiɻ", 
            "meaning": "Tamil" 
        }
    ]
}

Glyphon

A glyphon is the orthographic unit - the written character or character sequence. This can be:

  • A single character: a, அ, ع
  • A diacritic combination: á, கா, عَ
  • A digraph: ng, ch, gh

Interphon

An interphon is the phonetic representation in IPA notation. For tonal languages, this includes tone markers:

  • Simple consonant: /k/
  • Vowel with tone: /a˧˥/ (rising tone)
  • Complex sound: /ʈʂ/ (retroflex affricate)
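
Because tone markers live inside the interphon string, consumers may want to separate them from the segmental part. The following is a minimal sketch, assuming tones are written with the Unicode Chao tone letters (˥ ˦ ˧ ˨ ˩); the names `TONE_LETTERS` and `splitTone` are illustrative and not part of OMOD.

```javascript
// Chao tone letters occupy U+02E5..U+02E9 (˥ ˦ ˧ ˨ ˩).
const TONE_LETTERS = /[\u02E5-\u02E9]+/g;

// Split an interphon into its segmental part and its tone contour.
// Hypothetical helper: OMOD itself does not define this function.
function splitTone(interphon) {
    const tone = (interphon.match(TONE_LETTERS) || []).join('');
    const segments = interphon.replace(TONE_LETTERS, '');
    return { segments, tone };
}
```

An interphon without tone letters, such as a plain consonant, comes back with an empty `tone` string.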

Orthoset

An orthoset is a collection of orthogons representing a specific writing system, orthographic variant, or even conventions from a single source. Orthosets can be extremely granular - they might represent:

  • A standardized national orthography
  • Regional spelling variations
  • Historical spelling conventions from a specific era
  • Orthographic choices found in a single manuscript or publication
  • Digital adaptations (like chat alphabets)
  • Individual author's spelling preferences

For example:

  • arb-msa - Modern Standard Arabic: contemporary Arabic script used in formal contexts
  • arb-chat - Arabic Chat Alphabet: Latin letters and numerals representing Arabic sounds in digital communication
  • viet-quocngu - Vietnamese Quốc Ngữ: Latin-based script with tone diacritics

Orthoset Granularity

Orthosets can be as specific as needed. For example, you might have separate orthosets for:

  • tam-sangam-300bce - Tamil orthography in Sangam literature (300 BCE)
  • tam-pallava-600ce - Pallava dynasty inscriptions (600 CE)
  • tam-modern-tn-gov - Tamil Nadu government standard (2020s)
  • tam-modern-sri-lanka - Sri Lankan Tamil conventions

Languoid

A languoid represents any linguistic variety - not just "languages" but also dialects, sociolects, chronolects, or even language continuums. Following Glottolog's approach, languoids acknowledge that linguistic boundaries are often fuzzy and politically charged. A languoid might represent:

  • A standardized national language
  • A regional dialect with distinct phonology
  • A historical stage of a language
  • A sociolect (social variety)
  • A point on a dialect continuum
  • A contact variety or creole

{
    "languoid": "Tamil",
    "endonym": "தமிழ்",
    "glottologCode": "tami1289",
    "iso639_3": "tam",
    "region": "South Asia",
    "population": "75000000"
}

Languoid Examples

Languoids can represent various levels of linguistic granularity:

  • Macro-language: arab1395 - Arabic macrolanguage
  • Regional variety: egyp1253 - Egyptian Arabic
  • Urban dialect: cair1241 - Cairene Arabic
  • Historical stage: clas1252 - Classical Arabic
  • Contact variety: malt1254 - Maltese (Arabic-derived, with heavy Romance influence)
  • Continuum point: nort3208 - Northern Vietnamese
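
Since every languoid record carries a Glottocode, a quick shape check can catch typos before data is committed. This is a sketch only: it tests the published Glottocode shape (four lowercase alphanumeric characters followed by four digits) and does not confirm that the code exists in Glottolog. The names `GLOTTOCODE_PATTERN` and `isGlottocodeShaped` are hypothetical.

```javascript
// Glottocode shape: four lowercase alphanumerics followed by four digits.
const GLOTTOCODE_PATTERN = /^[a-z0-9]{4}[0-9]{4}$/;

// Returns true when the value has the shape of a Glottocode.
// This does not look the code up in Glottolog.
function isGlottocodeShaped(code) {
    return typeof code === 'string' && GLOTTOCODE_PATTERN.test(code);
}
```

For instance, `isGlottocodeShaped('arab1395')` passes, while a seven-character value is rejected.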

Data Schema

OMOD uses JSON Schema for validation and consistency across all language implementations.

Orthogon Schema

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "OMOD Orthogon",
    "type": "object",
    "properties": {
        "glyphon": { 
            "type": "string",
            "description": "The written form/character(s)"
        },
        "interphon": { 
            "type": "string",
            "description": "IPA phonetic representation" 
        },
        "context": {
            "type": "array",
            "items": {
                "enum": ["initial", "medial", "final", "standalone", "allophonic"]
            },
            "description": "Positions where this mapping applies"
        },
        "examples": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "word": { "type": "string" },
                    "pronunciation": { "type": "string" },
                    "meaning": { "type": "string" }
                },
                "required": ["word", "pronunciation"]
            }
        }
    },
    "required": ["glyphon", "interphon"]
}

Context Values

Context | Description | Example
initial | Beginning of a word/syllable | Vietnamese ng in "người"
medial | Middle of a word | Tamil ழ் in "வாழ்க்கை"
final | End of a word/syllable | Arabic ه in "الله"
standalone | Character by itself | Tamil அ as "the letter a"
allophonic | Varies by phonetic environment | English t aspirated vs unaspirated
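
The positional context values can be derived mechanically from example words. The sketch below is an illustration under simplifying assumptions: it works at word level only (not syllable level), and it cannot detect "allophonic", which needs phonetic knowledge. The function name `contextsInWord` is hypothetical.

```javascript
// Report which positional context values a glyphon occupies in a word.
// Word-level only; "allophonic" cannot be inferred from spelling.
function contextsInWord(word, glyphon) {
    if (!glyphon) return [];
    if (word === glyphon) return ['standalone'];
    const found = new Set();
    let i = word.indexOf(glyphon);
    while (i !== -1) {
        const atStart = i === 0;
        const atEnd = i + glyphon.length === word.length;
        found.add(atStart ? 'initial' : atEnd ? 'final' : 'medial');
        i = word.indexOf(glyphon, i + 1);
    }
    return [...found];
}
```

Comparing the result with an orthogon's declared `context` array is one way to spot examples that do not illustrate the positions they claim to.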

Languoid Examples

Tamil (தமிழ்)

Tamil orthography demonstrates complex mappings, including the retroflex sounds characteristic of Dravidian languages:

Tamil Features
  • 247 orthogons including vowels, consonants, and combinations
  • Retroflex consonants: ட் /ʈ/, ண் /ɳ/, ழ் /ɻ/, ள் /ɭ/
  • Context-sensitive pronunciations
  • Brahmic abugida structure

Example: Retroflex Approximant

{
    "glyphon": "ழ்",
    "interphon": "ɻ",
    "context": ["medial", "final"],
    "examples": [
        { "word": "தமிழ்", "pronunciation": "t̪amiɻ", "meaning": "Tamil" },
        { "word": "வாழ்க்கை", "pronunciation": "ʋaːɻkkai̯", "meaning": "life" }
    ]
}

Arabic (العربية)

Arabic demonstrates multiple orthosets including traditional script and digital adaptations:

Traditional Arabic Script

{
    "glyphon": "ع",
    "interphon": "ʕ",
    "context": ["initial", "medial", "final"],
    "examples": [
        { "word": "عرب", "pronunciation": "ʕarab", "meaning": "Arabs" }
    ]
}

Arabic Chat Alphabet

{
    "orthoset": "arb-chat",
    "glyphon": "3",
    "interphon": "ʕ",
    "context": ["initial", "medial", "final"],
    "examples": [
        { "word": "3arab", "pronunciation": "ʕarab", "meaning": "Arabs (chat spelling)" }
    ]
}

Note on Arabic Numerals

The Arabic chat alphabet uses numerals to represent sounds that have no single-letter equivalent in the Latin alphabet: 2 = ء (hamza), 3 = ع (ain), 5 = خ (kha), 7 = ح (ha), 9 = ص (sad)
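
The numeral correspondences in the note can be held in a small lookup table. The sketch below covers only the five numerals listed there and leaves Latin letters untouched; `CHAT_NUMERALS` and `expandChatNumerals` are illustrative names, not part of OMOD.

```javascript
// Numeral-to-letter correspondences from the note above.
const CHAT_NUMERALS = {
    '2': '\u0621', // ء hamza
    '3': '\u0639', // ع ain
    '5': '\u062E', // خ kha
    '7': '\u062D', // ح ha
    '9': '\u0635'  // ص sad
};

// Replace chat-alphabet numerals with the Arabic letters they stand for.
// Latin letters and other digits pass through unchanged.
function expandChatNumerals(text) {
    return text.replace(/[23579]/g, digit => CHAT_NUMERALS[digit]);
}
```

A full chat-to-script converter would also need mappings for the Latin letters, which an arb-chat orthoset would supply.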

Vietnamese (Tiếng Việt)

Vietnamese showcases a tonal language with systematic tone marking:

Tonal System

Tone | Name | Diacritic | IPA Tone | Example
Level | ngang | a | ˧˧ (33) | ma "ghost"
Rising | sắc | á | ˧˥ (35) | má "mother"
Falling | huyền | à | ˨˩ (21) | mà "but"
Dipping-rising | hỏi | ả | ˧˩˧ (313) | mả "tomb"
Rising glottalized | ngã | ã | ˧ˀ˥ (3ˀ5) | mã "code"
Low glottalized | nặng | ạ | ˧ˀ˨ʔ (3ˀ2ʔ) | mạ "rice seedling"
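
The tone of a written syllable can be read off its diacritic. A minimal sketch, assuming standard Unicode text: NFD normalization separates each tone mark from its base vowel, and the first tone mark found decides the tone name. `TONE_MARKS` and `toneOf` are hypothetical names.

```javascript
// Combining marks used for the five marked Vietnamese tones.
const TONE_MARKS = {
    '\u0301': 'sắc',   // acute
    '\u0300': 'huyền', // grave
    '\u0309': 'hỏi',   // hook above
    '\u0303': 'ngã',   // tilde
    '\u0323': 'nặng'   // dot below
};

// Return the tone name of a single written syllable.
// A syllable with no tone mark carries the level tone (ngang).
function toneOf(syllable) {
    // NFD turns a precomposed vowel into base letter + combining marks.
    for (const ch of syllable.normalize('NFD')) {
        if (TONE_MARKS[ch]) return TONE_MARKS[ch];
    }
    return 'ngang';
}
```

Vowel-quality marks such as the circumflex, breve, and horn are not in the table, so they are skipped and do not affect the result.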

Special Characters

{
    "glyphon": "đ",
    "interphon": "ɗ",
    "context": ["initial"],
    "examples": [
        { "word": "đúng", "pronunciation": "ɗuŋ˧˥", "meaning": "correct" }
    ]
}

Dialectal Variation Example

The same glyphon can have different pronunciations across languoids. For example, Vietnamese "r" varies by region:

// Northern Vietnamese (Hanoi)
{
    "orthoset": "viet-north-hanoi",
    "glyphon": "r",
    "interphon": "z"
}

// Southern Vietnamese (Saigon)
{
    "orthoset": "viet-south-saigon",
    "glyphon": "r",
    "interphon": "ʐ"
}

// Central Vietnamese (Hue)
{
    "orthoset": "viet-central-hue",
    "glyphon": "r",
    "interphon": "ʐ"  // but with different phonetic realization
}
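
To see at a glance which orthosets agree on a pronunciation, the orthogons for one glyphon can be grouped by interphon. This is a sketch that assumes each orthogon carries an `orthoset` field as in the examples above; the function name `groupOrthosetsByInterphon` is illustrative.

```javascript
// Group the orthosets that share a pronunciation for one glyphon.
// Returns a Map from interphon to the list of orthoset ids using it.
function groupOrthosetsByInterphon(orthogons, glyphon) {
    const byInterphon = new Map();
    for (const o of orthogons) {
        if (o.glyphon !== glyphon) continue;
        const ids = byInterphon.get(o.interphon) || [];
        ids.push(o.orthoset);
        byInterphon.set(o.interphon, ids);
    }
    return byInterphon;
}
```

Applied to the three "r" entries above, the Hanoi orthoset ends up alone under one interphon and the Saigon and Hue orthosets share the other.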

Implementation Guide

File Structure

Each languoid implementation follows this structure:

omod-[languoid]/
├── omod-[languoid]-data.json        # Main orthogon mappings
├── omod-[languoid]-languoids.json   # Language metadata
├── omod-[languoid]-orthosets.json   # Writing system variants
├── omod-[languoid]-schema.json      # Validation schema
└── README.md                        # Languoid-specific notes
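
Tooling that loads several languoids may find it convenient to build these paths from the languoid name. A small sketch, assuming the folder layout shown above is followed exactly; `omodFilePaths` is a hypothetical helper.

```javascript
// Build the expected file paths for one languoid folder,
// following the omod-[languoid]/ layout shown above.
function omodFilePaths(languoid) {
    const base = `omod-${languoid}`;
    return {
        data: `${base}/${base}-data.json`,
        languoids: `${base}/${base}-languoids.json`,
        orthosets: `${base}/${base}-orthosets.json`,
        schema: `${base}/${base}-schema.json`,
        readme: `${base}/README.md`
    };
}
```

For example, `omodFilePaths('tamil').data` yields `omod-tamil/omod-tamil-data.json`.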

Data Format Example

Languoid File

{
    "languoid": "Vietnamese",
    "endonym": "Tiếng Việt",
    "glottologCode": "viet1251",
    "iso639_3": "vie",
    "region": "Southeast Asia",
    "population": "100400000"
}

Orthoset File

[
    {
        "id": "viet-quocngu-2025",
        "languoid": "viet1251",
        "name": "Vietnamese Quốc Ngữ (2025)",
        "description": "Modern Latin-based Vietnamese orthography with tone diacritics",
        "script": "Latin",
        "created": "2025-01-01"
    }
]
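
In the two files above, the orthoset's `languoid` field holds the same value as the languoid's `glottologCode`. Assuming that is the intended link between the files, a consistency check could look like the sketch below; `findOrphanOrthosets` is an illustrative name.

```javascript
// List the ids of orthosets whose languoid code has no matching
// languoid record. Assumes orthoset.languoid stores a glottologCode.
function findOrphanOrthosets(orthosets, languoids) {
    const known = new Set(languoids.map(l => l.glottologCode));
    return orthosets
        .filter(s => !known.has(s.languoid))
        .map(s => s.id);
}
```

An empty result means every orthoset points at a documented languoid.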

Best Practices

  • Use standard IPA notation - Ensure consistency across languages
  • Provide context - Specify where each mapping applies
  • Include examples - Real words help users understand usage
  • Document edge cases - Note dialectal variations or exceptions
  • Cite your sources - Especially important for orthosets from specific texts or authors
  • Be granular when needed - Don't hesitate to create specific orthosets for unique sources
  • Validate with schema - Use JSON Schema validation before commits

Source Documentation

When creating an orthoset from a specific source, include metadata about:

  • Publication title and author
  • Year of publication
  • Page numbers (if relevant)
  • Any editorial decisions or normalizations made
  • Dialectal or regional context

{
    "id": "sencoten-elliott-1984",
    "name": "SENĆOŦEN as used in 'Saanich Placenames'",
    "source": {
        "title": "Saanich Placenames",
        "author": "Dave Elliott Sr.",
        "year": 1984,
        "publisher": "First Edition",
        "notes": "Orthography developed by PENÁĆ"
    }
}
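
A reviewer might want to flag source-based orthosets whose attribution is incomplete. The sketch below treats `title`, `author`, and `year` as the minimum, which is an assumption drawn from the list above rather than a rule OMOD states; `missingSourceFields` is a hypothetical helper.

```javascript
// Fields treated as the minimum attribution (an assumption, not an OMOD rule).
const EXPECTED_SOURCE_FIELDS = ['title', 'author', 'year'];

// Return the names of expected source fields that are absent or empty.
function missingSourceFields(orthoset) {
    const source = orthoset.source || {};
    return EXPECTED_SOURCE_FIELDS.filter(field => {
        const value = source[field];
        return value === undefined || value === null || value === '';
    });
}
```

For the SENĆOŦEN example above the result is an empty array, since all three fields are present.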

Contributing

OMOD is an open project welcoming contributions for new languages and improvements to existing mappings.

How to Contribute

  1. Fork the repository on GitHub
  2. Create a languoid folder following the naming convention
  3. Implement required files:
    • Data file with orthogon mappings
    • Languoid metadata
    • Orthoset definitions
    • Schema (can extend base schema)
  4. Validate your data against the schema
  5. Submit a pull request with description of changes

Contribution Guidelines

Requirements
  • Use accurate IPA transcriptions
  • Cite linguistic sources where applicable
  • Include native speaker verification when possible
  • Follow existing formatting conventions
  • Add comprehensive examples

Priority Languages

We're particularly interested in:

  • Languages with complex orthographies
  • Endangered languages needing documentation
  • Languages with multiple writing systems
  • Historical orthographies and manuscript traditions
  • Dialectal variations within language families
  • Contact varieties and creoles
  • Source-specific orthographies from important texts

Documenting Variations

When contributing, consider creating separate orthosets for:

  • Regional variations: Northern vs Southern pronunciations
  • Historical periods: Medieval vs Modern spellings
  • Literary sources: Specific authors or manuscripts
  • Official vs colloquial: Government standards vs street usage
  • Digital adaptations: SMS spelling, chat alphabets

This granularity helps researchers track orthographic evolution and variation.

API Reference

OMOD data can be accessed programmatically through various methods:

Direct JSON Access

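// Fetch Tamil orthogons
const response = await fetch('/omod-tamil/omod-tamil-data.json');
if (!response.ok) throw new Error(`Failed to load Tamil data: HTTP ${response.status}`);
const tamilData = await response.json();

// Find specific mapping (find() returns undefined when nothing matches)
const zha = tamilData.find(o => o.glyphon === 'ழ்');
console.log(zha?.interphon); // Output: ɻ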

Query Functions

// Get all orthogons for a specific context
// "context" is optional in the schema, so orthogons without it are skipped
function getByContext(data, context) {
    return data.filter(o => o.context?.includes(context));
}

// Get mappings by script type
function getByScript(orthosets, script) {
    return orthosets.filter(s => s.script === script);
}

// Get orthosets from a specific source
function getBySource(orthosets, author, year) {
    return orthosets.filter(s => 
        s.source?.author === author && 
        s.source?.year === year
    );
}

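// Find dialectal variations
// Each orthogon names its orthoset; the orthoset names its languoid by Glottocode
function getDialectalVariants(data, glyphon, orthosets, languoids) {
    return data.filter(o => o.glyphon === glyphon)
        .map(o => {
            const orthoset = orthosets.find(s => s.id === o.orthoset);
            return {
                orthoset: o.orthoset,
                languoid: languoids.find(l => l.glottologCode === orthoset?.languoid),
                pronunciation: o.interphon
            };
        });
}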

Validation

// Validate orthogon against schema
import Ajv from 'ajv';
import schema from './omod-orthogon.schema.json';

const ajv = new Ajv();
const validate = ajv.compile(schema);

// orthogonData is a single orthogon object, e.g. one entry from the data file
const valid = validate(orthogonData);
if (!valid) console.error(validate.errors);

Integration Examples

Text-to-Speech

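// Convert Tamil text to IPA for TTS
// A glyphon such as "ழ்" spans several code points, so match the
// longest glyphon at each position instead of going character by character
function tamilToIPA(text, mappings) {
    const sorted = [...mappings].sort((a, b) => b.glyphon.length - a.glyphon.length);
    let ipa = '';
    let i = 0;
    while (i < text.length) {
        const mapping = sorted.find(m => m.glyphon && text.startsWith(m.glyphon, i));
        // Unmapped characters pass through unchanged
        const chunk = mapping ? mapping.glyphon : String.fromCodePoint(text.codePointAt(i));
        ipa += mapping ? mapping.interphon : chunk;
        i += chunk.length;
    }
    return ipa;
}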

Language Learning

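// Generate pronunciation guide
// Splits the word by longest matching glyphon, since one glyphon
// (a digraph or a consonant plus diacritic) can span several characters
function getPronunciationGuide(word, mappings) {
    const sorted = [...mappings].sort((a, b) => b.glyphon.length - a.glyphon.length);
    const guide = [];
    let i = 0;
    while (i < word.length) {
        const m = sorted.find(m => m.glyphon && word.startsWith(m.glyphon, i));
        const char = m ? m.glyphon : String.fromCodePoint(word.codePointAt(i));
        guide.push({
            char: char,
            ipa: m?.interphon || '?',
            examples: m?.examples || []
        });
        i += char.length;
    }
    return guide;
}

Version Information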

OMOD Documentation v0.7 • Last updated: May 2025
Schema Version: Draft-07 • Languages: Tamil, Arabic, Vietnamese

Supporting languoids from macro-languages to village dialects
Orthosets from national standards to individual manuscripts

© 2025 Squirrel Ridge Observatory Centre 🐿️