Overview
This specification details the process of reading Unicode characters into
CLforJava. There are 2 possible methods for ingesting Unicode characters.
The first method is the standard Lisp Reader, which itself defines 2 ways of entering characters. One is the simple act of reading a character from the input stream. This works correctly for all Unicode characters in the Basic Multilingual Plane (BMP), that is, those characters whose code points fit into 16 bits. The other is the #\ reader macro, which reads the next token in the stream and interprets it as the name of a character. If the token is a single character, it denotes that character itself. If the token is longer than 1 character, it is used as a name to look up in a table of Unicode characters. Every Unicode character has a unique name, and some characters have more than one name.
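The two rules for the token after #\ can be illustrated with a minimal Java sketch. It is not the CLforJava implementation: the class and method names are hypothetical, the name table is stood in for by the JDK's own Character.codePointOf (Java 9 and later), and the convention of writing underscores in place of spaces in a character name is an assumption made only for this example.

```java
// Hypothetical sketch; class and method names are illustrative only.
final class CharNameSketch {

    /** True when the code point lies in the BMP and fits in one 16-bit Java char. */
    static boolean fitsInOneChar(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    /**
     * Resolves the token that follows #\ to a code point.
     * A token of exactly one character stands for itself; a longer
     * token is treated as a Unicode character name.
     */
    static int resolve(String token) {
        if (token.codePointCount(0, token.length()) == 1) {
            return token.codePointAt(0);
        }
        // Assumption: underscores in the token stand for spaces in the name.
        // Character.codePointOf throws IllegalArgumentException for unknown names.
        return Character.codePointOf(token.replace('_', ' '));
    }
}
```

For example, resolve("a") yields the code point of the letter a directly, while resolve("LATIN_SMALL_LETTER_A") reaches the same code point through the name lookup.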
The second method is to provide an additional reader macro that lets the user specify the
Unicode code point of the character. This involves appropriating one of the unspecified # dispatch characters. In this case, we use the u or U characters. There are 2 variants of the syntax:
- #uXXXX - where the 'u' is followed by exactly 4 hexadecimal digits. This variant can specify all characters from the BMP.
- #u+XXXXXX - where the '+' is followed by 2 to 6 hexadecimal digits. This variant can specify all code points in the Unicode repertoire. Note that code points above FFFF are represented in Java as a surrogate pair, that is, an array of 2 characters (char[2]).
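The two variants above could be parsed along the following lines. This is a sketch under stated assumptions, not the specified behavior (which is still TBS): the class and method names are hypothetical, the input is assumed to be the text that follows the #u or #U dispatch character, and rejecting malformed input with IllegalArgumentException is a choice made for the example.

```java
// Hypothetical sketch; class and method names are illustrative only.
final class UnicodeReaderSketch {

    /** Parses "XXXX" (exactly 4 hex digits) or "+X..." (2 to 6 hex digits). */
    static int parseCodePoint(String body) {
        String hex;
        if (body.startsWith("+")) {
            hex = body.substring(1);
            if (hex.length() < 2 || hex.length() > 6) {
                throw new IllegalArgumentException("#u+ needs 2 to 6 hex digits: " + body);
            }
        } else {
            hex = body;
            if (hex.length() != 4) {
                throw new IllegalArgumentException("#u needs exactly 4 hex digits: " + body);
            }
        }
        // Check each digit so that a stray sign is not accepted by parseInt.
        for (int i = 0; i < hex.length(); i++) {
            if (Character.digit(hex.charAt(i), 16) < 0) {
                throw new IllegalArgumentException("not a hex digit in: " + body);
            }
        }
        int codePoint = Integer.parseInt(hex, 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside the Unicode range: " + body);
        }
        return codePoint;
    }

    /** Returns char[1] for a BMP code point and char[2] (a surrogate pair) otherwise. */
    static char[] toJavaChars(int codePoint) {
        return Character.toChars(codePoint);
    }
}
```

With this sketch, the text 0041 parses to code point 41 (hex) and converts to a single char, while +1F600 parses to a code point above FFFF and converts to a char[2] holding the surrogates D83D and DE00.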
Specification of the #\ Reader Macro
TBS
Specification of the #u/#u+ Reader Macro
TBS
FileSystem
References
The Unicode Organization has detailed information on Unicode characters and how to manipulate them.
Implementation
Details of implementation
Discussions
Links to Blog issues
Status:
Release Level:
Open bug count:
Test Suites
Links to JUnit results
--
JerryBoetje - 12 Jul 2003
Topic revision: r3 - 2009-02-11 - 18:52:38 - MeganLusher