Uucd
Unicode character database decoder.
Uucd
decodes the data of the Unicode character database from its XML representation. It provides high-level (but not necessarily efficient) access to the data so that efficient representations can be extracted.
Uucd
decodes the representation described in the Annex #42 of Unicode %%UNICODE_VERSION%%. Subsequent versions may be decoded as long as no new cases are introduced in parsed enumerated properties.
Consult the basics.
Note. All strings returned by the module are UTF-8 encoded.
Unicode version %%UNICODE_VERSION%%
The type for Unicode code points, ranges from 0x0000
to 0x10_FFFF
.
is_cp n
is true
iff n
a Unicode code point.
is_scalar_value n
is true
iff n
is a Unicode scalar value.
Properties are referenced by their name and property values by their abbreviated name. To understand their semantics refer to the standard.
val unknown_prop : (string * string) -> string prop
unknown_prop (ns, n)
is a property read from an XML attribute whose expanded name is (ns, n)
. This can be used to access a property unknown to the module.
In alphabetical order.
val age : [ `Version of int * int | `Unassigned ] prop
val alphabetic : bool prop
val ascii_hex_digit : bool prop
val bidi_class : [ `AL | `AN | `B | `BN | `CS | `EN | `ES | `ET | `FSI | `L | `LRE | `LRI | `LRO | `NSM | `ON | `PDF | `PDI | `R | `RLE | `RLI | `RLO | `S | `WS ] prop
val bidi_control : bool prop
val bidi_mirrored : bool prop
val bidi_paired_bracket_type : [ `O | `C | `N ] prop
val block : [ `ASCII | `Adlam | `Aegean_Numbers | `Ahom | `Alchemical | `Alphabetic_PF | `Anatolian_Hieroglyphs | `Ancient_Greek_Music | `Ancient_Greek_Numbers | `Ancient_Symbols | `Arabic | `Arabic_Ext_A | `Arabic_Ext_B | `Arabic_Math | `Arabic_PF_A | `Arabic_PF_B | `Arabic_Sup | `Armenian | `Arrows | `Avestan | `Balinese | `Bamum | `Bamum_Sup | `Bassa_Vah | `Batak | `Bengali | `Bhaiksuki | `Block_Elements | `Bopomofo | `Bopomofo_Ext | `Box_Drawing | `Brahmi | `Braille | `Buginese | `Buhid | `Byzantine_Music | `CJK | `CJK_Compat | `CJK_Compat_Forms | `CJK_Compat_Ideographs | `CJK_Compat_Ideographs_Sup | `CJK_Ext_A | `CJK_Ext_B | `CJK_Ext_C | `CJK_Ext_D | `CJK_Ext_E | `CJK_Ext_F | `CJK_Ext_G | `CJK_Radicals_Sup | `CJK_Strokes | `CJK_Symbols | `Carian | `Caucasian_Albanian | `Chakma | `Cham | `Cherokee | `Cherokee_Sup | `Chess_Symbols | `Chorasmian | `Compat_Jamo | `Control_Pictures | `Coptic | `Coptic_Epact_Numbers | `Counting_Rod | `Cuneiform | `Cuneiform_Numbers | `Currency_Symbols | `Cypriot_Syllabary | `Cypro_Minoan | `Cyrillic | `Cyrillic_Ext_A | `Cyrillic_Ext_B | `Cyrillic_Ext_C | `Cyrillic_Sup | `Deseret | `Devanagari | `Devanagari_Ext | `Diacriticals | `Diacriticals_Ext | `Diacriticals_For_Symbols | `Diacriticals_Sup | `Dingbats | `Dives_Akuru | `Dogra | `Domino | `Duployan | `Early_Dynastic_Cuneiform | `Egyptian_Hieroglyph_Format_Controls | `Egyptian_Hieroglyphs | `Elbasan | `Elymaic | `Emoticons | `Enclosed_Alphanum | `Enclosed_Alphanum_Sup | `Enclosed_CJK | `Enclosed_Ideographic_Sup | `Ethiopic | `Ethiopic_Ext | `Ethiopic_Ext_A | `Ethiopic_Ext_B | `Ethiopic_Sup | `Geometric_Shapes | `Geometric_Shapes_Ext | `Georgian | `Georgian_Ext | `Georgian_Sup | `Glagolitic | `Glagolitic_Sup | `Gothic | `Grantha | `Greek | `Greek_Ext | `Gujarati | `Gunjala_Gondi | `Gurmukhi | `Half_And_Full_Forms | `Half_Marks | `Hangul | `Hanifi_Rohingya | `Hanunoo | `Hatran | `Hebrew | `High_PU_Surrogates | `High_Surrogates | `Hiragana | `IDC | `IPA_Ext | `Ideographic_Symbols | `Imperial_Aramaic | `Indic_Number_Forms | `Indic_Siyaq_Numbers | `Inscriptional_Pahlavi | `Inscriptional_Parthian | `Jamo | `Jamo_Ext_A | `Jamo_Ext_B | `Javanese | `Kaithi | `Kana_Ext_A | `Kana_Ext_B | `Kana_Sup | `Kanbun | `Kangxi | `Kannada | `Katakana | `Katakana_Ext | `Kayah_Li | `Kharoshthi | `Khitan_Small_Script | `Khmer | `Khmer_Symbols | `Khojki | `Khudawadi | `Lao | `Latin_1_Sup | `Latin_Ext_A | `Latin_Ext_Additional | `Latin_Ext_B | `Latin_Ext_C | `Latin_Ext_D | `Latin_Ext_E | `Latin_Ext_F | `Latin_Ext_G | `Lepcha | `Letterlike_Symbols | `Limbu | `Linear_A | `Linear_B_Ideograms | `Linear_B_Syllabary | `Lisu | `Lisu_Sup | `Low_Surrogates | `Lycian | `Lydian | `Mahajani | `Mahjong | `Makasar | `Malayalam | `Mandaic | `Manichaean | `Marchen | `Masaram_Gondi | `Math_Alphanum | `Math_Operators | `Mayan_Numerals | `Medefaidrin | `Meetei_Mayek | `Meetei_Mayek_Ext | `Mende_Kikakui | `Meroitic_Cursive | `Meroitic_Hieroglyphs | `Miao | `Misc_Arrows | `Misc_Math_Symbols_A | `Misc_Math_Symbols_B | `Misc_Pictographs | `Misc_Symbols | `Misc_Technical | `Modi | `Modifier_Letters | `Modifier_Tone_Letters | `Mongolian | `Mongolian_Sup | `Mro | `Multani | `Music | `Myanmar | `Myanmar_Ext_A | `Myanmar_Ext_B | `NB | `NKo | `Nabataean | `Nandinagari | `New_Tai_Lue | `Newa | `Number_Forms | `Nushu | `Nyiakeng_Puachue_Hmong | `OCR | `Ogham | `Ol_Chiki | `Old_Hungarian | `Old_Italic | `Old_North_Arabian | `Old_Permic | `Old_Persian | `Old_Sogdian | `Old_South_Arabian | `Old_Turkic | `Old_Uyghur | `Oriya | `Ornamental_Dingbats | `Osage | `Osmanya | `Ottoman_Siyaq_Numbers | `PUA | `Pahawh_Hmong | `Palmyrene | `Pau_Cin_Hau | `Phags_Pa | `Phaistos | `Phoenician | `Phonetic_Ext | `Phonetic_Ext_Sup | `Playing_Cards | `Psalter_Pahlavi | `Punctuation | `Rejang | `Rumi | `Runic | `Samaritan | `Saurashtra | `Sharada | `Shavian | `Shorthand_Format_Controls | `Siddham | `Sinhala | `Sinhala_Archaic_Numbers | `Small_Forms | `Small_Kana_Ext | `Sogdian | `Sora_Sompeng | `Soyombo | `Specials | `Sundanese | `Sundanese_Sup | `Sup_Arrows_A | `Sup_Arrows_B | `Sup_Arrows_C | `Sup_Math_Operators | `Sup_PUA_A | `Sup_PUA_B | `Sup_Punctuation | `Sup_Symbols_And_Pictographs | `Super_And_Sub | `Sutton_SignWriting | `Syloti_Nagri | `Symbols_And_Pictographs_Ext_A | `Symbols_For_Legacy_Computing | `Syriac | `Syriac_Sup | `Tagalog | `Tagbanwa | `Tags | `Tai_Le | `Tai_Tham | `Tai_Viet | `Tai_Xuan_Jing | `Takri | `Tamil | `Tamil_Sup | `Tangsa | `Tangut | `Tangut_Components | `Tangut_Sup | `Telugu | `Thaana | `Thai | `Tibetan | `Tifinagh | `Tirhuta | `Toto | `Transport_And_Map | `UCAS | `UCAS_Ext | `UCAS_Ext_A | `Ugaritic | `VS | `VS_Sup | `Vai | `Vedic_Ext | `Vertical_Forms | `Vithkuqi | `Wancho | `Warang_Citi | `Yezidi | `Yi_Radicals | `Yi_Syllables | `Yijing | `Zanabazar_Square | `Znamenny_Music ] prop
val canonical_combining_class : int prop
val cased : bool prop
val case_ignorable : bool prop
val changes_when_casefolded : bool prop
val changes_when_casemapped : bool prop
val changes_when_lowercased : bool prop
val changes_when_nfkc_casefolded : bool prop
val changes_when_titlecased : bool prop
val changes_when_uppercased : bool prop
val composition_exclusion : bool prop
val dash : bool prop
val decomposition_type : [ `Can | `Com | `Enc | `Fin | `Font | `Fra | `Init | `Iso | `Med | `Nar | `Nb | `Sml | `Sqr | `Sub | `Sup | `Vert | `Wide | `None ] prop
val default_ignorable_code_point : bool prop
val deprecated : bool prop
val diacritic : bool prop
val east_asian_width : [ `A | `F | `H | `N | `Na | `W ] prop
val emoji : bool prop
val emoji_presentation : bool prop
val emoji_modifier : bool prop
val emoji_modifier_base : bool prop
val emoji_component : bool prop
val expands_on_nfc : bool prop
val expands_on_nfd : bool prop
val expands_on_nfkc : bool prop
val expands_on_nfkd : bool prop
val extended_pictographic : bool prop
val extender : bool prop
val full_composition_exclusion : bool prop
val general_category : [ `Lu | `Ll | `Lt | `Lm | `Lo | `Mn | `Mc | `Me | `Nd | `Nl | `No | `Pc | `Pd | `Ps | `Pe | `Pi | `Pf | `Po | `Sm | `Sc | `Sk | `So | `Zs | `Zl | `Zp | `Cc | `Cf | `Cs | `Co | `Cn ] prop
val grapheme_base : bool prop
val grapheme_cluster_break : [ `CN | `CR | `EB | `EBG | `EM | `EX | `GAZ | `L | `LF | `LV | `LVT | `PP | `RI | `SM | `T | `V | `XX | `ZWJ ] prop
val grapheme_extend : bool prop
val grapheme_link : bool prop
val hangul_syllable_type : [ `L | `LV | `LVT | `T | `V | `NA ] prop
val hex_digit : bool prop
val hyphen : bool prop
val id_continue : bool prop
val id_start : bool prop
val ideographic : bool prop
val ids_binary_operator : bool prop
val ids_trinary_operator : bool prop
val indic_syllabic_category : [ `Avagraha | `Bindu | `Brahmi_Joining_Number | `Cantillation_Mark | `Consonant | `Consonant_Dead | `Consonant_Final | `Consonant_Head_Letter | `Consonant_Initial_Postfixed | `Consonant_Killer | `Consonant_Medial | `Consonant_Placeholder | `Consonant_Preceding_Repha | `Consonant_Prefixed | `Consonant_Repha | `Consonant_Subjoined | `Consonant_Succeeding_Repha | `Consonant_With_Stacker | `Gemination_Mark | `Invisible_Stacker | `Joiner | `Modifying_Letter | `Non_Joiner | `Nukta | `Number | `Number_Joiner | `Other | `Pure_Killer | `Register_Shifter | `Syllable_Modifier | `Tone_Letter | `Tone_Mark | `Virama | `Visarga | `Vowel | `Vowel_Dependent | `Vowel_Independent ] prop
val indic_matra_category : [ `Right | `Left | `Visual_Order_Left | `Left_And_Right | `Top | `Bottom | `Top_And_Bottom | `Top_And_Right | `Top_And_Left | `Top_And_Left_And_Right | `Bottom_And_Right | `Top_And_Bottom_And_Right | `Overstruck | `Invisible | `NA ] prop
val indic_positional_category : [ `Bottom | `Bottom_And_Left | `Bottom_And_Right | `Left | `Left_And_Right | `NA | `Overstruck | `Right | `Top | `Top_And_Bottom | `Top_And_Bottom_And_Left | `Top_And_Bottom_And_Right | `Top_And_Left | `Top_And_Left_And_Right | `Top_And_Right | `Visual_Order_Left ] prop
val iso_comment : string prop
val jamo_short_name : string prop
val join_control : bool prop
val joining_group : [ `African_Feh | `African_Noon | `African_Qaf | `Ain | `Alaph | `Alef | `Alef_Maqsurah | `Beh | `Beth | `Burushaski_Yeh_Barree | `Dal | `Dalath_Rish | `E | `Farsi_Yeh | `Fe | `Feh | `Final_Semkath | `Gaf | `Gamal | `Hah | `Hanifi_Rohingya_Kinna_Ya | `Hanifi_Rohingya_Pa | `Hamza_On_Heh_Goal | `He | `Heh | `Heh_Goal | `Heth | `Kaf | `Kaph | `Khaph | `Knotted_Heh | `Lam | `Lamadh | `Malayalam_Bha | `Malayalam_Ja | `Malayalam_Lla | `Malayalam_Llla | `Malayalam_Nga | `Malayalam_Nna | `Malayalam_Nnna | `Malayalam_Nya | `Malayalam_Ra | `Malayalam_Ssa | `Malayalam_Tta | `Manichaean_Aleph | `Manichaean_Ayin | `Manichaean_Beth | `Manichaean_Daleth | `Manichaean_Dhamedh | `Manichaean_Five | `Manichaean_Gimel | `Manichaean_Heth | `Manichaean_Hundred | `Manichaean_Kaph | `Manichaean_Lamedh | `Manichaean_Mem | `Manichaean_Nun | `Manichaean_One | `Manichaean_Pe | `Manichaean_Qoph | `Manichaean_Resh | `Manichaean_Sadhe | `Manichaean_Samekh | `Manichaean_Taw | `Manichaean_Ten | `Manichaean_Teth | `Manichaean_Thamedh | `Manichaean_Twenty | `Manichaean_Waw | `Manichaean_Yodh | `Manichaean_Zayin | `Meem | `Mim | `No_Joining_Group | `Noon | `Nun | `Nya | `Pe | `Qaf | `Qaph | `Reh | `Reversed_Pe | `Rohingya_Yeh | `Sad | `Sadhe | `Seen | `Semkath | `Shin | `Straight_Waw | `Swash_Kaf | `Syriac_Waw | `Tah | `Taw | `Teh_Marbuta | `Teh_Marbuta_Goal | `Teth | `Thin_Yeh | `Vertical_Tail | `Waw | `Yeh | `Yeh_Barree | `Yeh_With_Tail | `Yudh | `Yudh_He | `Zain | `Zhain ] prop
val joining_type : [ `U | `C | `T | `D | `L | `R ] prop
val line_break : [ `AI | `AL | `B2 | `BA | `BB | `BK | `CB | `CJ | `CL | `CM | `CP | `CR | `EX | `GL | `H2 | `H3 | `HL | `HY | `ID | `IN | `IS | `JL | `JT | `JV | `LF | `NL | `NS | `NU | `OP | `PO | `PR | `QU | `RI | `SA | `SG | `SP | `SY | `WJ | `XX | `ZW | `EB | `EM | `ZWJ ] prop
val logical_order_exception : bool prop
val lowercase : bool prop
val math : bool prop
val name : [ `Pattern of string | `Name of string ] prop
In the `Pattern
case occurrences of the character '#'
(U+0023
) in the string must be replaced by the value of the code point as four to six uppercase hexadecimal digits (the minimal needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#"
associated to code point U+3400
gives the name "CJK UNIFIED IDEOGRAPH-3400"
.
val name_alias : (string * [ `Abbreviation | `Alternate | `Control | `Correction | `Figment ]) list prop
val nfc_quick_check : [ `True | `False | `Maybe ] prop
val nfd_quick_check : [ `True | `False | `Maybe ] prop
val nfkc_quick_check : [ `True | `False | `Maybe ] prop
val nfkd_quick_check : [ `True | `False | `Maybe ] prop
val noncharacter_code_point : bool prop
val numeric_type : [ `None | `De | `Di | `Nu ] prop
val numeric_value : [ `NaN | `Frac of int * int | `Num of int64 ] prop
val other_alphabetic : bool prop
val other_default_ignorable_code_point : bool prop
val other_grapheme_extend : bool prop
val other_id_continue : bool prop
val other_id_start : bool prop
val other_lowercase : bool prop
val other_math : bool prop
val other_uppercase : bool prop
val pattern_syntax : bool prop
val pattern_white_space : bool prop
val prepended_concatenation_mark : bool prop
val quotation_mark : bool prop
val radical : bool prop
val regional_indicator : bool prop
type script = [
]
val sentence_break : [ `AT | `CL | `CR | `EX | `FO | `LE | `LF | `LO | `NU | `SC | `SE | `SP | `ST | `UP | `XX ] prop
val soft_dotted : bool prop
val sterm : bool prop
val terminal_punctuation : bool prop
val uax_42_element : [ `Reserved | `Noncharacter | `Surrogate | `Char ] prop
Not normative, artefact of Uucd
. Corresponds to the XML element name that describes the code point.
val unicode_1_name : string prop
val unified_ideograph : bool prop
val uppercase : bool prop
val variation_selector : bool prop
val vertical_orientation : [ `U | `R | `Tu | `Tr ] prop
val white_space : bool prop
val word_break : [ `CR | `DQ | `EB | `EBG | `EM | `EX | `Extend | `FO | `GAZ | `HL | `KA | `LE | `LF | `MB | `ML | `MN | `NL | `NU | `RI | `SQ | `WSegSpace | `XX | `ZWJ ] prop
val xid_continue : bool prop
val xid_start : bool prop
In alphabetic order. For now unihan properties are always represented as strings.
val kAccountingNumeric : string prop
val kAlternateHanYu : string prop
val kAlternateJEF : string prop
val kAlternateKangXi : string prop
val kAlternateMorohashi : string prop
val kBigFive : string prop
val kCCCII : string prop
val kCNS1986 : string prop
val kCNS1992 : string prop
val kCangjie : string prop
val kCantonese : string prop
val kCheungBauer : string prop
val kCheungBauerIndex : string prop
val kCihaiT : string prop
val kCompatibilityVariant : string prop
val kCowles : string prop
val kDaeJaweon : string prop
val kDefinition : string prop
val kEACC : string prop
val kFenn : string prop
val kFennIndex : string prop
val kFourCornerCode : string prop
val kFrequency : string prop
val kGB0 : string prop
val kGB1 : string prop
val kGB3 : string prop
val kGB5 : string prop
val kGB7 : string prop
val kGB8 : string prop
val kGSR : string prop
val kGradeLevel : string prop
val kHDZRadBreak : string prop
val kHKGlyph : string prop
val kHKSCS : string prop
val kHanYu : string prop
val kHangul : string prop
val kHanyuPinlu : string prop
val kHanyuPinyin : string prop
val kIBMJapan : string prop
val kIICore : string prop
val kIRGDaeJaweon : string prop
val kIRGDaiKanwaZiten : string prop
val kIRGHanyuDaZidian : string prop
val kIRGKangXi : string prop
val kIRG_GSource : string prop
val kIRG_HSource : string prop
val kIRG_JSource : string prop
val kIRG_KPSource : string prop
val kIRG_KSource : string prop
val kIRG_MSource : string prop
val kIRG_SSource : string prop
val kIRG_TSource : string prop
val kIRG_USource : string prop
val kIRG_UKSource : string prop
val kIRG_VSource : string prop
val kJHJ : string prop
val kJIS0213 : string prop
val kJa : string prop
val kJapaneseKun : string prop
val kJapaneseOn : string prop
val kJinmeiyoKanji : string prop
val kJis0 : string prop
val kJis1 : string prop
val kJoyoKanji : string prop
val kKPS0 : string prop
val kKPS1 : string prop
val kKSC0 : string prop
val kKSC1 : string prop
val kKangXi : string prop
val kKarlgren : string prop
val kKorean : string prop
val kKoreanEducationHanja : string prop
val kKoreanName : string prop
val kLau : string prop
val kMainlandTelegraph : string prop
val kMandarin : string prop
val kMatthews : string prop
val kMeyerWempe : string prop
val kMorohashi : string prop
val kNelson : string prop
val kOtherNumeric : string prop
val kPhonetic : string prop
val kPrimaryNumeric : string prop
val kPseudoGB1 : string prop
val kRSAdobe_Japan1_6 : string prop
val kRSJapanese : string prop
val kRSKanWa : string prop
val kRSKangXi : string prop
val kRSKorean : string prop
val kRSMerged : string prop
val kRSTUnicode : string prop
val kRSUnicode : string prop
val kReading : string prop
val kSBGY : string prop
val kSemanticVariant : string prop
val kSimplifiedVariant : string prop
val kSpecializedSemanticVariant : string prop
val kSpoofingVariant : string prop
val kSrc_NushuDuben : string prop
val kStrange : string prop
val kUnihanCore2020 : string prop
val kTGH : string prop
val kTGHZ2013 : string prop
val kTGT_MergedSrc : string prop
val kTaiwanTelegraph : string prop
val kTang : string prop
val kTotalStrokes : string prop
val kTraditionalVariant : string prop
val kVietnamese : string prop
val kWubi : string prop
val kXHC1983 : string prop
val kXerox : string prop
val kZVariant : string prop
type named_sequence = string * cp list
The type for named sequences. Sequence name, code point sequence.
The type for normalization corrections. Code point, old normalization, new normalization, version
type standardized_variant = cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list
The type for standarized variants. Code point sequence, description, when.
The type for CJK radicals. Radical number, CJK radical character, CJK unified ideograph.
type emoji_source = cp list * int option * int option * int option
The type for emoji sources. Unicode, docomo, kddi, softbank.
type t = {
description : string; |
repertoire : props Cpmap.t; |
blocks : block list; |
named_sequences : named_sequence list; |
provisional_named_sequences : named_sequence list; |
normalization_corrections : normalization_correction list; |
standardized_variants : standardized_variant list; |
cjk_radicals : cjk_radical list; |
emoji_sources : emoji_source list; |
}
The type for Unicode character databases.
Note. Absence of an optional top-level field in the database is denoted by the neutral element of its type (empty string, empty list, Cpmap
.empty). This means that the module doesn't distinguish between absence of a field and presence of the field with empty data (but incurs no problems in this context).
cp_prop ucd cp p
is the property p
of the code point cp
in db
's repertoire, if p
is in the repertoire and the property exists for cp
.
decode d
decodes a database from d
or returns an error.
val decoded_range : decoder -> (int * int) * (int * int)
decoded_range d
is the range of characters spanning the `Error
decoded by d
. A pair of line and column numbers respectively one and zero based.
The database and subsets of it for Unicode %%UNICODE_VERSION%% are available here. Databases with groups should be preferred, they maximize value sharing and improve parsing performance.
A database is decoded as follows:
let ucd_or_die inf = try
let ic = if inf = "-" then stdin else open_in inf in
let d = Uucd.decoder (`Channel ic) in
match Uucd.decode d with
| `Ok db -> db
| `Error e ->
let (l0, c0), (l1, c1) = Uucd.decoded_range d in
Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
exit 1
with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1
let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"
The convenience function cp_prop
can be used to query the property of a given code point. For example the general category of U+1F42B
is given by:
let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category