Uucd
Unicode character database decoder.
Uucd
decodes the data of the Unicode character database from its XML representation. It provides high-level (but not necessarily efficient) access to the data so that efficient representations can be extracted.
Uucd
decodes the representation described in the Annex #42 of Unicode %%UNICODE_VERSION%%. Subsequent versions may be decoded as long as no new cases are introduced in parsed enumerated properties.
Consult the basics.
Note. All strings returned by the module are UTF-8 encoded.
Unicode version %%UNICODE_VERSION%%
The type for Unicode code points, ranges from 0x0000
to 0x10_FFFF
.
is_cp n
is true
iff n
a Unicode code point.
is_scalar_value n
is true
iff n
is a Unicode scalar value.
Properties are referenced by their name and property values by their abbreviated name. To understand their semantics refer to the standard.
val unknown_prop : (string * string) -> string prop
unknown_prop (ns, n)
is a property read from an XML attribute whose expanded name is (ns, n)
. This can be used to access a property unknown to the module.
In alphabetical order.
val age : [ `Version of int * int | `Unassigned ] prop
val alphabetic : bool prop
val ascii_hex_digit : bool prop
val bidi_class :
[ `AL
| `AN
| `B
| `BN
| `CS
| `EN
| `ES
| `ET
| `FSI
| `L
| `LRE
| `LRI
| `LRO
| `NSM
| `ON
| `PDF
| `PDI
| `R
| `RLE
| `RLI
| `RLO
| `S
| `WS ]
prop
val bidi_control : bool prop
val bidi_mirrored : bool prop
val bidi_paired_bracket_type : [ `O | `C | `N ] prop
val block :
[ `ASCII
| `Adlam
| `Aegean_Numbers
| `Ahom
| `Alchemical
| `Alphabetic_PF
| `Anatolian_Hieroglyphs
| `Ancient_Greek_Music
| `Ancient_Greek_Numbers
| `Ancient_Symbols
| `Arabic
| `Arabic_Ext_A
| `Arabic_Ext_B
| `Arabic_Ext_C
| `Arabic_Math
| `Arabic_PF_A
| `Arabic_PF_B
| `Arabic_Sup
| `Armenian
| `Arrows
| `Avestan
| `Balinese
| `Bamum
| `Bamum_Sup
| `Bassa_Vah
| `Batak
| `Bengali
| `Bhaiksuki
| `Block_Elements
| `Bopomofo
| `Bopomofo_Ext
| `Box_Drawing
| `Brahmi
| `Braille
| `Buginese
| `Buhid
| `Byzantine_Music
| `CJK
| `CJK_Compat
| `CJK_Compat_Forms
| `CJK_Compat_Ideographs
| `CJK_Compat_Ideographs_Sup
| `CJK_Ext_A
| `CJK_Ext_B
| `CJK_Ext_C
| `CJK_Ext_D
| `CJK_Ext_E
| `CJK_Ext_F
| `CJK_Ext_G
| `CJK_Ext_H
| `CJK_Radicals_Sup
| `CJK_Strokes
| `CJK_Symbols
| `Carian
| `Caucasian_Albanian
| `Chakma
| `Cham
| `Cherokee
| `Cherokee_Sup
| `Chess_Symbols
| `Chorasmian
| `Compat_Jamo
| `Control_Pictures
| `Coptic
| `Coptic_Epact_Numbers
| `Counting_Rod
| `Cuneiform
| `Cuneiform_Numbers
| `Currency_Symbols
| `Cypriot_Syllabary
| `Cypro_Minoan
| `Cyrillic
| `Cyrillic_Ext_A
| `Cyrillic_Ext_B
| `Cyrillic_Ext_C
| `Cyrillic_Ext_D
| `Cyrillic_Sup
| `Deseret
| `Devanagari
| `Devanagari_Ext
| `Devanagari_Ext_A
| `Diacriticals
| `Diacriticals_Ext
| `Diacriticals_For_Symbols
| `Diacriticals_Sup
| `Dingbats
| `Dives_Akuru
| `Dogra
| `Domino
| `Duployan
| `Early_Dynastic_Cuneiform
| `Egyptian_Hieroglyph_Format_Controls
| `Egyptian_Hieroglyphs
| `Elbasan
| `Elymaic
| `Emoticons
| `Enclosed_Alphanum
| `Enclosed_Alphanum_Sup
| `Enclosed_CJK
| `Enclosed_Ideographic_Sup
| `Ethiopic
| `Ethiopic_Ext
| `Ethiopic_Ext_A
| `Ethiopic_Ext_B
| `Ethiopic_Sup
| `Geometric_Shapes
| `Geometric_Shapes_Ext
| `Georgian
| `Georgian_Ext
| `Georgian_Sup
| `Glagolitic
| `Glagolitic_Sup
| `Gothic
| `Grantha
| `Greek
| `Greek_Ext
| `Gujarati
| `Gunjala_Gondi
| `Gurmukhi
| `Half_And_Full_Forms
| `Half_Marks
| `Hangul
| `Hanifi_Rohingya
| `Hanunoo
| `Hatran
| `Hebrew
| `High_PU_Surrogates
| `High_Surrogates
| `Hiragana
| `IDC
| `IPA_Ext
| `Ideographic_Symbols
| `Imperial_Aramaic
| `Indic_Number_Forms
| `Indic_Siyaq_Numbers
| `Inscriptional_Pahlavi
| `Inscriptional_Parthian
| `Jamo
| `Jamo_Ext_A
| `Jamo_Ext_B
| `Javanese
| `Kaithi
| `Kaktovik_Numerals
| `Kana_Ext_A
| `Kana_Ext_B
| `Kana_Sup
| `Kanbun
| `Kangxi
| `Kannada
| `Katakana
| `Katakana_Ext
| `Kawi
| `Kayah_Li
| `Kharoshthi
| `Khitan_Small_Script
| `Khmer
| `Khmer_Symbols
| `Khojki
| `Khudawadi
| `Lao
| `Latin_1_Sup
| `Latin_Ext_A
| `Latin_Ext_Additional
| `Latin_Ext_B
| `Latin_Ext_C
| `Latin_Ext_D
| `Latin_Ext_E
| `Latin_Ext_F
| `Latin_Ext_G
| `Lepcha
| `Letterlike_Symbols
| `Limbu
| `Linear_A
| `Linear_B_Ideograms
| `Linear_B_Syllabary
| `Lisu
| `Lisu_Sup
| `Low_Surrogates
| `Lycian
| `Lydian
| `Mahajani
| `Mahjong
| `Makasar
| `Malayalam
| `Mandaic
| `Manichaean
| `Marchen
| `Masaram_Gondi
| `Math_Alphanum
| `Math_Operators
| `Mayan_Numerals
| `Medefaidrin
| `Meetei_Mayek
| `Meetei_Mayek_Ext
| `Mende_Kikakui
| `Meroitic_Cursive
| `Meroitic_Hieroglyphs
| `Miao
| `Misc_Arrows
| `Misc_Math_Symbols_A
| `Misc_Math_Symbols_B
| `Misc_Pictographs
| `Misc_Symbols
| `Misc_Technical
| `Modi
| `Modifier_Letters
| `Modifier_Tone_Letters
| `Mongolian
| `Mongolian_Sup
| `Mro
| `Multani
| `Music
| `Myanmar
| `Myanmar_Ext_A
| `Myanmar_Ext_B
| `NB
| `NKo
| `Nabataean
| `Nag_Mundari
| `Nandinagari
| `New_Tai_Lue
| `Newa
| `Number_Forms
| `Nushu
| `Nyiakeng_Puachue_Hmong
| `OCR
| `Ogham
| `Ol_Chiki
| `Old_Hungarian
| `Old_Italic
| `Old_North_Arabian
| `Old_Permic
| `Old_Persian
| `Old_Sogdian
| `Old_South_Arabian
| `Old_Turkic
| `Old_Uyghur
| `Oriya
| `Ornamental_Dingbats
| `Osage
| `Osmanya
| `Ottoman_Siyaq_Numbers
| `PUA
| `Pahawh_Hmong
| `Palmyrene
| `Pau_Cin_Hau
| `Phags_Pa
| `Phaistos
| `Phoenician
| `Phonetic_Ext
| `Phonetic_Ext_Sup
| `Playing_Cards
| `Psalter_Pahlavi
| `Punctuation
| `Rejang
| `Rumi
| `Runic
| `Samaritan
| `Saurashtra
| `Sharada
| `Shavian
| `Shorthand_Format_Controls
| `Siddham
| `Sinhala
| `Sinhala_Archaic_Numbers
| `Small_Forms
| `Small_Kana_Ext
| `Sogdian
| `Sora_Sompeng
| `Soyombo
| `Specials
| `Sundanese
| `Sundanese_Sup
| `Sup_Arrows_A
| `Sup_Arrows_B
| `Sup_Arrows_C
| `Sup_Math_Operators
| `Sup_PUA_A
| `Sup_PUA_B
| `Sup_Punctuation
| `Sup_Symbols_And_Pictographs
| `Super_And_Sub
| `Sutton_SignWriting
| `Syloti_Nagri
| `Symbols_And_Pictographs_Ext_A
| `Symbols_For_Legacy_Computing
| `Syriac
| `Syriac_Sup
| `Tagalog
| `Tagbanwa
| `Tags
| `Tai_Le
| `Tai_Tham
| `Tai_Viet
| `Tai_Xuan_Jing
| `Takri
| `Tamil
| `Tamil_Sup
| `Tangsa
| `Tangut
| `Tangut_Components
| `Tangut_Sup
| `Telugu
| `Thaana
| `Thai
| `Tibetan
| `Tifinagh
| `Tirhuta
| `Toto
| `Transport_And_Map
| `UCAS
| `UCAS_Ext
| `UCAS_Ext_A
| `Ugaritic
| `VS
| `VS_Sup
| `Vai
| `Vedic_Ext
| `Vertical_Forms
| `Vithkuqi
| `Wancho
| `Warang_Citi
| `Yezidi
| `Yi_Radicals
| `Yi_Syllables
| `Yijing
| `Zanabazar_Square
| `Znamenny_Music ]
prop
val canonical_combining_class : int prop
val cased : bool prop
val case_ignorable : bool prop
val changes_when_casefolded : bool prop
val changes_when_casemapped : bool prop
val changes_when_lowercased : bool prop
val changes_when_nfkc_casefolded : bool prop
val changes_when_titlecased : bool prop
val changes_when_uppercased : bool prop
val composition_exclusion : bool prop
val dash : bool prop
val decomposition_type :
[ `Can
| `Com
| `Enc
| `Fin
| `Font
| `Fra
| `Init
| `Iso
| `Med
| `Nar
| `Nb
| `Sml
| `Sqr
| `Sub
| `Sup
| `Vert
| `Wide
| `None ]
prop
val default_ignorable_code_point : bool prop
val deprecated : bool prop
val diacritic : bool prop
val east_asian_width : [ `A | `F | `H | `N | `Na | `W ] prop
val emoji : bool prop
val emoji_presentation : bool prop
val emoji_modifier : bool prop
val emoji_modifier_base : bool prop
val emoji_component : bool prop
val expands_on_nfc : bool prop
val expands_on_nfd : bool prop
val expands_on_nfkc : bool prop
val expands_on_nfkd : bool prop
val extended_pictographic : bool prop
val extender : bool prop
val full_composition_exclusion : bool prop
val general_category :
[ `Lu
| `Ll
| `Lt
| `Lm
| `Lo
| `Mn
| `Mc
| `Me
| `Nd
| `Nl
| `No
| `Pc
| `Pd
| `Ps
| `Pe
| `Pi
| `Pf
| `Po
| `Sm
| `Sc
| `Sk
| `So
| `Zs
| `Zl
| `Zp
| `Cc
| `Cf
| `Cs
| `Co
| `Cn ]
prop
val grapheme_base : bool prop
val grapheme_cluster_break :
[ `CN
| `CR
| `EB
| `EBG
| `EM
| `EX
| `GAZ
| `L
| `LF
| `LV
| `LVT
| `PP
| `RI
| `SM
| `T
| `V
| `XX
| `ZWJ ]
prop
val grapheme_extend : bool prop
val grapheme_link : bool prop
val hangul_syllable_type : [ `L | `LV | `LVT | `T | `V | `NA ] prop
val hex_digit : bool prop
val hyphen : bool prop
val id_continue : bool prop
val id_start : bool prop
val ideographic : bool prop
val ids_binary_operator : bool prop
val ids_trinary_operator : bool prop
val indic_syllabic_category :
[ `Avagraha
| `Bindu
| `Brahmi_Joining_Number
| `Cantillation_Mark
| `Consonant
| `Consonant_Dead
| `Consonant_Final
| `Consonant_Head_Letter
| `Consonant_Initial_Postfixed
| `Consonant_Killer
| `Consonant_Medial
| `Consonant_Placeholder
| `Consonant_Preceding_Repha
| `Consonant_Prefixed
| `Consonant_Repha
| `Consonant_Subjoined
| `Consonant_Succeeding_Repha
| `Consonant_With_Stacker
| `Gemination_Mark
| `Invisible_Stacker
| `Joiner
| `Modifying_Letter
| `Non_Joiner
| `Nukta
| `Number
| `Number_Joiner
| `Other
| `Pure_Killer
| `Register_Shifter
| `Syllable_Modifier
| `Tone_Letter
| `Tone_Mark
| `Virama
| `Visarga
| `Vowel
| `Vowel_Dependent
| `Vowel_Independent ]
prop
val indic_matra_category :
[ `Right
| `Left
| `Visual_Order_Left
| `Left_And_Right
| `Top
| `Bottom
| `Top_And_Bottom
| `Top_And_Right
| `Top_And_Left
| `Top_And_Left_And_Right
| `Bottom_And_Right
| `Top_And_Bottom_And_Right
| `Overstruck
| `Invisible
| `NA ]
prop
val indic_positional_category :
[ `Bottom
| `Bottom_And_Left
| `Bottom_And_Right
| `Left
| `Left_And_Right
| `NA
| `Overstruck
| `Right
| `Top
| `Top_And_Bottom
| `Top_And_Bottom_And_Left
| `Top_And_Bottom_And_Right
| `Top_And_Left
| `Top_And_Left_And_Right
| `Top_And_Right
| `Visual_Order_Left ]
prop
val iso_comment : string prop
val jamo_short_name : string prop
val join_control : bool prop
val joining_group :
[ `African_Feh
| `African_Noon
| `African_Qaf
| `Ain
| `Alaph
| `Alef
| `Alef_Maqsurah
| `Beh
| `Beth
| `Burushaski_Yeh_Barree
| `Dal
| `Dalath_Rish
| `E
| `Farsi_Yeh
| `Fe
| `Feh
| `Final_Semkath
| `Gaf
| `Gamal
| `Hah
| `Hanifi_Rohingya_Kinna_Ya
| `Hanifi_Rohingya_Pa
| `Hamza_On_Heh_Goal
| `He
| `Heh
| `Heh_Goal
| `Heth
| `Kaf
| `Kaph
| `Khaph
| `Knotted_Heh
| `Lam
| `Lamadh
| `Malayalam_Bha
| `Malayalam_Ja
| `Malayalam_Lla
| `Malayalam_Llla
| `Malayalam_Nga
| `Malayalam_Nna
| `Malayalam_Nnna
| `Malayalam_Nya
| `Malayalam_Ra
| `Malayalam_Ssa
| `Malayalam_Tta
| `Manichaean_Aleph
| `Manichaean_Ayin
| `Manichaean_Beth
| `Manichaean_Daleth
| `Manichaean_Dhamedh
| `Manichaean_Five
| `Manichaean_Gimel
| `Manichaean_Heth
| `Manichaean_Hundred
| `Manichaean_Kaph
| `Manichaean_Lamedh
| `Manichaean_Mem
| `Manichaean_Nun
| `Manichaean_One
| `Manichaean_Pe
| `Manichaean_Qoph
| `Manichaean_Resh
| `Manichaean_Sadhe
| `Manichaean_Samekh
| `Manichaean_Taw
| `Manichaean_Ten
| `Manichaean_Teth
| `Manichaean_Thamedh
| `Manichaean_Twenty
| `Manichaean_Waw
| `Manichaean_Yodh
| `Manichaean_Zayin
| `Meem
| `Mim
| `No_Joining_Group
| `Noon
| `Nun
| `Nya
| `Pe
| `Qaf
| `Qaph
| `Reh
| `Reversed_Pe
| `Rohingya_Yeh
| `Sad
| `Sadhe
| `Seen
| `Semkath
| `Shin
| `Straight_Waw
| `Swash_Kaf
| `Syriac_Waw
| `Tah
| `Taw
| `Teh_Marbuta
| `Teh_Marbuta_Goal
| `Teth
| `Thin_Yeh
| `Vertical_Tail
| `Waw
| `Yeh
| `Yeh_Barree
| `Yeh_With_Tail
| `Yudh
| `Yudh_He
| `Zain
| `Zhain ]
prop
val joining_type : [ `U | `C | `T | `D | `L | `R ] prop
val line_break :
[ `AI
| `AL
| `B2
| `BA
| `BB
| `BK
| `CB
| `CJ
| `CL
| `CM
| `CP
| `CR
| `EX
| `GL
| `H2
| `H3
| `HL
| `HY
| `ID
| `IN
| `IS
| `JL
| `JT
| `JV
| `LF
| `NL
| `NS
| `NU
| `OP
| `PO
| `PR
| `QU
| `RI
| `SA
| `SG
| `SP
| `SY
| `WJ
| `XX
| `ZW
| `EB
| `EM
| `ZWJ ]
prop
val logical_order_exception : bool prop
val lowercase : bool prop
val math : bool prop
val name : [ `Pattern of string | `Name of string ] prop
In the `Pattern
case occurrences of the character '#'
(U+0023
) in the string must be replaced by the value of the code point as four to six uppercase hexadecimal digits (the minimal needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#"
associated to code point U+3400
gives the name "CJK UNIFIED IDEOGRAPH-3400"
.
val name_alias :
(string * [ `Abbreviation | `Alternate | `Control | `Correction | `Figment ])
list
prop
val nfc_quick_check : [ `True | `False | `Maybe ] prop
val nfd_quick_check : [ `True | `False | `Maybe ] prop
val nfkc_quick_check : [ `True | `False | `Maybe ] prop
val nfkd_quick_check : [ `True | `False | `Maybe ] prop
val noncharacter_code_point : bool prop
val numeric_type : [ `None | `De | `Di | `Nu ] prop
val numeric_value : [ `NaN | `Frac of int * int | `Num of int64 ] prop
val other_alphabetic : bool prop
val other_default_ignorable_code_point : bool prop
val other_grapheme_extend : bool prop
val other_id_continue : bool prop
val other_id_start : bool prop
val other_lowercase : bool prop
val other_math : bool prop
val other_uppercase : bool prop
val pattern_syntax : bool prop
val pattern_white_space : bool prop
val prepended_concatenation_mark : bool prop
val quotation_mark : bool prop
val radical : bool prop
val regional_indicator : bool prop
type script = [
|
`Adlm
|
`Aghb
|
`Ahom
|
`Arab
|
`Armi
|
`Armn
|
`Avst
|
`Bali
|
`Bamu
|
`Bass
|
`Batk
|
`Beng
|
`Bhks
|
`Bopo
|
`Brah
|
`Brai
|
`Bugi
|
`Buhd
|
`Cakm
|
`Cans
|
`Cari
|
`Cham
|
`Cher
|
`Chrs
|
`Copt
|
`Cpmn
|
`Cprt
|
`Cyrl
|
`Deva
|
`Diak
|
`Dogr
|
`Dsrt
|
`Dupl
|
`Egyp
|
`Elba
|
`Elym
|
`Ethi
|
`Geor
|
`Glag
|
`Gong
|
`Gonm
|
`Goth
|
`Gran
|
`Grek
|
`Gujr
|
`Guru
|
`Hang
|
`Hani
|
`Hano
|
`Hatr
|
`Hebr
|
`Hira
|
`Hluw
|
`Hmng
|
`Hmnp
|
`Hrkt
|
`Hung
|
`Ital
|
`Java
|
`Kali
|
`Kana
|
`Kawi
|
`Khar
|
`Khmr
|
`Khoj
|
`Knda
|
`Kthi
|
`Kits
|
`Lana
|
`Laoo
|
`Latn
|
`Lepc
|
`Limb
|
`Lina
|
`Linb
|
`Lisu
|
`Lyci
|
`Lydi
|
`Mahj
|
`Maka
|
`Mand
|
`Mani
|
`Marc
|
`Medf
|
`Mend
|
`Merc
|
`Mero
|
`Mlym
|
`Modi
|
`Mong
|
`Mroo
|
`Mtei
|
`Mult
|
`Mymr
|
`Nagm
|
`Nand
|
`Narb
|
`Nbat
|
`Newa
|
`Nkoo
|
`Nshu
|
`Ogam
|
`Olck
|
`Orkh
|
`Orya
|
`Osge
|
`Osma
|
`Ougr
|
`Palm
|
`Pauc
|
`Perm
|
`Phag
|
`Phli
|
`Phlp
|
`Phnx
|
`Plrd
|
`Prti
|
`Qaai
|
`Rjng
|
`Rohg
|
`Runr
|
`Samr
|
`Sarb
|
`Saur
|
`Sgnw
|
`Shaw
|
`Shrd
|
`Sidd
|
`Sind
|
`Sinh
|
`Sogd
|
`Sogo
|
`Sora
|
`Soyo
|
`Sund
|
`Sylo
|
`Syrc
|
`Tagb
|
`Takr
|
`Tale
|
`Talu
|
`Taml
|
`Tang
|
`Tavt
|
`Telu
|
`Tfng
|
`Tglg
|
`Thaa
|
`Thai
|
`Tibt
|
`Tirh
|
`Tnsa
|
`Toto
|
`Ugar
|
`Vaii
|
`Vith
|
`Wara
|
`Wcho
|
`Xpeo
|
`Xsux
|
`Yezi
|
`Yiii
|
`Zanb
|
`Zinh
|
`Zyyy
|
`Zzzz
]
val sentence_break :
[ `AT
| `CL
| `CR
| `EX
| `FO
| `LE
| `LF
| `LO
| `NU
| `SC
| `SE
| `SP
| `ST
| `UP
| `XX ]
prop
val soft_dotted : bool prop
val sterm : bool prop
val terminal_punctuation : bool prop
val uax_42_element : [ `Reserved | `Noncharacter | `Surrogate | `Char ] prop
Not normative, artefact of Uucd
. Corresponds to the XML element name that describes the code point.
val unicode_1_name : string prop
val unified_ideograph : bool prop
val uppercase : bool prop
val variation_selector : bool prop
val vertical_orientation : [ `U | `R | `Tu | `Tr ] prop
val white_space : bool prop
val word_break :
[ `CR
| `DQ
| `EB
| `EBG
| `EM
| `EX
| `Extend
| `FO
| `GAZ
| `HL
| `KA
| `LE
| `LF
| `MB
| `ML
| `MN
| `NL
| `NU
| `RI
| `SQ
| `WSegSpace
| `XX
| `ZWJ ]
prop
val xid_continue : bool prop
val xid_start : bool prop
In alphabetic order. For now unihan properties are always represented as strings.
val kAccountingNumeric : string prop
val kAlternateHanYu : string prop
val kAlternateJEF : string prop
val kAlternateKangXi : string prop
val kAlternateMorohashi : string prop
val kAlternateTotalStrokes : string prop
val kBigFive : string prop
val kCCCII : string prop
val kCNS1986 : string prop
val kCNS1992 : string prop
val kCangjie : string prop
val kCantonese : string prop
val kCheungBauer : string prop
val kCheungBauerIndex : string prop
val kCihaiT : string prop
val kCompatibilityVariant : string prop
val kCowles : string prop
val kDaeJaweon : string prop
val kDefinition : string prop
val kEACC : string prop
val kFenn : string prop
val kFennIndex : string prop
val kFourCornerCode : string prop
val kFrequency : string prop
val kGB0 : string prop
val kGB1 : string prop
val kGB3 : string prop
val kGB5 : string prop
val kGB7 : string prop
val kGB8 : string prop
val kGSR : string prop
val kGradeLevel : string prop
val kHDZRadBreak : string prop
val kHKGlyph : string prop
val kHKSCS : string prop
val kHanYu : string prop
val kHangul : string prop
val kHanyuPinlu : string prop
val kHanyuPinyin : string prop
val kIBMJapan : string prop
val kIICore : string prop
val kIRGDaeJaweon : string prop
val kIRGDaiKanwaZiten : string prop
val kIRGHanyuDaZidian : string prop
val kIRGKangXi : string prop
val kIRG_GSource : string prop
val kIRG_HSource : string prop
val kIRG_JSource : string prop
val kIRG_KPSource : string prop
val kIRG_KSource : string prop
val kIRG_MSource : string prop
val kIRG_SSource : string prop
val kIRG_TSource : string prop
val kIRG_USource : string prop
val kIRG_UKSource : string prop
val kIRG_VSource : string prop
val kJHJ : string prop
val kJIS0213 : string prop
val kJa : string prop
val kJapaneseKun : string prop
val kJapaneseOn : string prop
val kJinmeiyoKanji : string prop
val kJis0 : string prop
val kJis1 : string prop
val kJoyoKanji : string prop
val kKPS0 : string prop
val kKPS1 : string prop
val kKSC0 : string prop
val kKSC1 : string prop
val kKangXi : string prop
val kKarlgren : string prop
val kKorean : string prop
val kKoreanEducationHanja : string prop
val kKoreanName : string prop
val kLau : string prop
val kMainlandTelegraph : string prop
val kMandarin : string prop
val kMatthews : string prop
val kMeyerWempe : string prop
val kMorohashi : string prop
val kNelson : string prop
val kOtherNumeric : string prop
val kPhonetic : string prop
val kPrimaryNumeric : string prop
val kPseudoGB1 : string prop
val kRSAdobe_Japan1_6 : string prop
val kRSJapanese : string prop
val kRSKanWa : string prop
val kRSKangXi : string prop
val kRSKorean : string prop
val kRSMerged : string prop
val kRSTUnicode : string prop
val kRSUnicode : string prop
val kReading : string prop
val kSBGY : string prop
val kSemanticVariant : string prop
val kSimplifiedVariant : string prop
val kSpecializedSemanticVariant : string prop
val kSpoofingVariant : string prop
val kSrc_NushuDuben : string prop
val kStrange : string prop
val kUnihanCore2020 : string prop
val kTGH : string prop
val kTGHZ2013 : string prop
val kTGT_MergedSrc : string prop
val kTaiwanTelegraph : string prop
val kTang : string prop
val kTotalStrokes : string prop
val kTraditionalVariant : string prop
val kVietnamese : string prop
val kWubi : string prop
val kXHC1983 : string prop
val kXerox : string prop
val kZVariant : string prop
type named_sequence = string * cp list
The type for named sequences. Sequence name, code point sequence.
The type for normalization corrections. Code point, old normalization, new normalization, version
type standardized_variant =
cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list
The type for standarized variants. Code point sequence, description, when.
The type for CJK radicals. Radical number, CJK radical character, CJK unified ideograph.
type emoji_source = cp list * int option * int option * int option
The type for emoji sources. Unicode, docomo, kddi, softbank.
type t = {
description : string;
repertoire : props Cpmap.t;
blocks : block list;
named_sequences : named_sequence list;
provisional_named_sequences : named_sequence list;
normalization_corrections : normalization_correction list;
standardized_variants : standardized_variant list;
cjk_radicals : cjk_radical list;
emoji_sources : emoji_source list;
}
The type for Unicode character databases.
Note. Absence of an optional top-level field in the database is denoted by the neutral element of its type (empty string, empty list, Cpmap
.empty). This means that the module doesn't distinguish between absence of a field and presence of the field with empty data (but incurs no problems in this context).
cp_prop ucd cp p
is the property p
of the code point cp
in db
's repertoire, if p
is in the repertoire and the property exists for cp
.
decode d
decodes a database from d
or returns an error.
val decoded_range : decoder -> (int * int) * (int * int)
decoded_range d
is the range of characters spanning the `Error
decoded by d
. A pair of line and column numbers respectively one and zero based.
The database and subsets of it for Unicode %%UNICODE_VERSION%% are available here. Databases with groups should be preferred, they maximize value sharing and improve parsing performance.
A database is decoded as follows:
let ucd_or_die inf = try
let ic = if inf = "-" then stdin else open_in inf in
let d = Uucd.decoder (`Channel ic) in
match Uucd.decode d with
| `Ok db -> db
| `Error e ->
let (l0, c0), (l1, c1) = Uucd.decoded_range d in
Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
exit 1
with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1
let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"
The convenience function cp_prop
can be used to query the property of a given code point. For example the general category of U+1F42B
is given by:
let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category