CharacterSystem

This page summarizes the CLforJava character system. A much more comprehensive specification is included in the PDF file attached below (CharactersCLforJava2004.pdf).

Issues

  1. Base the specification on Unicode 4.0. Use of Unicode 4 requires the support built into Java 1.5.
  2. What should the definition of Extended-Character be? What makes sense in the context of Unicode?
  3. The discussion of using #u\hhhh is good, but some Unicode characters are now longer than 16 bits.
  4. How should non-Arabic numerals be handled? For example, Unicode supports Roman numerals as actual numeric characters.
  5. How should the directional attributes affect the Lisp Printer?
  6. How should the FileStream be altered to deal with character encodings?
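
Issues 3 and 4 can be made concrete with the standard java.lang.Character methods that accept an int code point (available since Java 1.5). The following is only a sketch: the class name UnicodeIssuesDemo and its method names are hypothetical and are not part of CLforJava.

```java
// Sketch only: UnicodeIssuesDemo is a hypothetical helper, not a CLforJava class.
final class UnicodeIssuesDemo {

    /** Number of 16-bit Java chars needed to hold the code point (1 or 2). */
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** Numeric value Unicode assigns to the code point; negative if it has none. */
    static int numericValue(int codePoint) {
        return Character.getNumericValue(codePoint);
    }

    /** True only for decimal digits (Unicode general category Nd). */
    static boolean isDecimalDigit(int codePoint) {
        return Character.isDigit(codePoint);
    }
}
```

For example, U+1D11E (a musical symbol outside the 16-bit range) needs two Java chars, and U+216C (ROMAN NUMERAL FIFTY) has a numeric value of 50 although Java does not classify it as a digit. Whether such characters should satisfy DIGIT-CHAR-P remains the open question in Issue 4.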

Overview

A character is the basic unit used by TheReader to build the components of a Lisp system or application. The specification of Common Lisp predates the development of the Unicode character encoding system. Common Lisp does define 3 classes of characters:

  • Standard-Character - the set of 96 ASCII characters that are always available on any modern-day (or ancient) computer system
  • Base-Character - at least the set of 96 standard characters. It is possible to define the entire Unicode set as base characters
  • Extended-Character - everything else that can be encoded as a character. The spec is flexible in the treatment of characters, allowing the CLforJava project to define the integration of the Unicode system used by Java into Common Lisp.

How Characters are Loaded

The character definitions are found in the lisp/common/resources/UnicodeForLisp.xml file in the build. An example of an XML document with only one character definition is below:

006A j LATIN SMALL LETTER J LATIN-SMALL-LETTER-J

All of the character definitions are wrapped with a beginning and ending tag. Each character has one unique codepoint and preferredname. Some characters do not have unicodenames, and all characters have at least one aliasname.
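
As a rough illustration of how such a file could be read with the JDK's SAX parser, here is a minimal sketch. The element names in it (characters, character, codepoint, unicodename, preferredname, aliasname) are assumptions based on the terms used above, since the actual tags are not shown on this page. The assignment of the example values to those elements is also a guess, and CharacterXmlSketch is a hypothetical class, not the CLforJava loader.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: element names and this class are assumptions, not the real loader.
final class CharacterXmlSketch {

    /** One parsed character definition. */
    static final class Definition {
        int codePoint = -1;
        String unicodeName;   // may stay null: some characters have no unicodename
        String preferredName;
        final List<String> aliasNames = new ArrayList<>();
    }

    /** Parses the XML text and returns one Definition per (assumed) character element. */
    static List<Definition> parse(String xml) {
        final List<Definition> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Definition current;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                text.setLength(0);
                if (qName.equals("character")) {
                    current = new Definition();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = text.toString().trim();
                text.setLength(0);
                if (current == null) {
                    return;
                }
                if (qName.equals("codepoint")) {
                    current.codePoint = Integer.parseInt(value, 16); // hex, e.g. 006A
                } else if (qName.equals("unicodename")) {
                    current.unicodeName = value;
                } else if (qName.equals("preferredname")) {
                    current.preferredName = value;
                } else if (qName.equals("aliasname")) {
                    current.aliasNames.add(value);
                } else if (qName.equals("character")) {
                    result.add(current);
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse character definitions", e);
        }
        return result;
    }
}
```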

To be written:

  • Talk about SAXparser: where is it now being loaded from?
  • Talk about the different character handler classes (parser, parslet, etc.)
  • Talk about the different classes (Standardchar, basechar, extendedchar)
  • Why we chose to use 96 static variables instead of letting them be loaded as well
  • Talk about the sharpyu and sharpslash macro functions
  • Talk about ...

Representing Common Lisp Characters

All characters in CLforJava are implemented as Unicode characters, and the entire set of valid Unicode characters is valid in CLforJava. The 96 standard ASCII characters of Common Lisp are the set of Standard-Characters. The rest of the Unicode set forms the set of Extended-Characters.

Commonly non-printable characters are represented through the use of the '#' (Sharpsign) dispatching macro character combined with 'X' (hex) or 'O' (octal) characters. Unicode characters in the Basic Multilingual Plane are 16 bits long, requiring 4 hexadecimal digits (see Issue 3 for characters that need more). The Java convention is to use 'u' to introduce a Unicode character. Common Lisp leaves the 'U' and 'u' characters undefined in the standard reader macros for Sharpsign. CLforJava will define a reader macro for these characters that reads the next 4 characters as the hex representation of a Unicode character.
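
The core step of such a reader macro, turning the 4 characters that follow into a character code, might look like the sketch below. SharpUSketch and readFourHexDigits are hypothetical names, and the sketch makes no claim about how TheReader actually dispatches to the macro.

```java
// Sketch only: a hypothetical helper, not the CLforJava reader macro itself.
final class SharpUSketch {

    /**
     * Reads exactly four hexadecimal digits from text, starting at offset,
     * and returns the character code they denote (0x0000 to 0xFFFF).
     */
    static int readFourHexDigits(String text, int offset) {
        if (offset < 0 || offset + 4 > text.length()) {
            throw new IllegalArgumentException("Expected 4 hex digits at offset " + offset);
        }
        int code = 0;
        for (int i = offset; i < offset + 4; i++) {
            int digit = Character.digit(text.charAt(i), 16); // -1 if not a hex digit
            if (digit < 0) {
                throw new IllegalArgumentException("Not a hex digit: " + text.charAt(i));
            }
            code = code * 16 + digit;
        }
        return code;
    }
}
```

Given the text 006A, this yields code 0x6A, the letter j. Four digits cannot reach code points above U+FFFF, which is the limitation noted in Issue 3.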

Character Attributes in CLforJava

There are 3 ways in which the phrase "character attributes" is used in Common Lisp:

  • The common notions such as "upper case", "digit", "alphanumeric" that pertain to a character regardless of context.
  • The attributes that are linked to keyboard modifiers (CTRL-C, META-V).
  • The attributes that are defined only within an execution context - the primary example being TheReader.

Java has built-in functions to test for most of the first set. Some of the keyboard modifiers manifest as standard Unicode characters. Some of the second set and all of the third will require special coding (see Implementation).

References

HyperSpec CLtL
Data Type Discussion Concepts
Reading Characters Types and Functions

Implementation

Case 1: Common Attributes

The most straightforward implementation is to use the standard Java character attribute functions.

| Common Lisp Function | Java Character Method |
| ALPHA-CHAR-P | Character.isLetter |
| UPPER-CASE-P | Character.isUpperCase |
| ALPHANUMERICP | Character.isLetterOrDigit |
| etc. | |

Another possibility is to use the Character.UnicodeBlock methods directly to implement these functions. This would be more in line with determining attributes in Case 3.
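
A minimal sketch of the first approach follows, assuming the intended Java methods are Character.isLetter, Character.isUpperCase, Character.isLowerCase and Character.isLetterOrDigit. The class CommonAttributesSketch and its method names are hypothetical.

```java
// Sketch only: hypothetical wrappers around java.lang.Character predicates.
final class CommonAttributesSketch {

    /** ALPHA-CHAR-P, approximated by Character.isLetter. */
    static boolean alphaCharP(int codePoint) {
        return Character.isLetter(codePoint);
    }

    /** UPPER-CASE-P, approximated by Character.isUpperCase. */
    static boolean upperCaseP(int codePoint) {
        return Character.isUpperCase(codePoint);
    }

    /** LOWER-CASE-P, approximated by Character.isLowerCase. */
    static boolean lowerCaseP(int codePoint) {
        return Character.isLowerCase(codePoint);
    }

    /** ALPHANUMERICP, approximated by Character.isLetterOrDigit. */
    static boolean alphanumericp(int codePoint) {
        return Character.isLetterOrDigit(codePoint);
    }

    /** The Unicode block containing the code point, or null if unassigned. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }
}
```

The int overloads are used so that the same wrappers accept code points above U+FFFF.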

Case 2: Keyboard Modifiers

TBD

Case 3: Context-Dependent Attributes

In some situations, most notably in TheReader, the attributes of a character depend on the dynamic environment. For example, a character may be a macro character, a non-terminating macro character, or an escape character. In these situations, the implementation creates appropriate subclasses of Character.Subset. TheReader or another processor can query to determine the appropriate attribute.

| Core Java Classes | CLforJava Subclasses |
| Character | |
| Character.Subset | TBS |
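
Since the subclasses are still to be specified, the following is only one possible shape for them. ReaderAttribute and ReadtableSketch are hypothetical names, and the set of attributes shown is an assumption drawn from the Common Lisp syntax types, not from the CLforJava source. The sketch relies on java.lang.Character.Subset having a protected constructor that takes the subset name.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: hypothetical subclass of java.lang.Character.Subset.
final class ReaderAttribute extends Character.Subset {
    static final ReaderAttribute CONSTITUENT = new ReaderAttribute("CONSTITUENT");
    static final ReaderAttribute TERMINATING_MACRO = new ReaderAttribute("TERMINATING-MACRO");
    static final ReaderAttribute NON_TERMINATING_MACRO = new ReaderAttribute("NON-TERMINATING-MACRO");
    static final ReaderAttribute SINGLE_ESCAPE = new ReaderAttribute("SINGLE-ESCAPE");

    private ReaderAttribute(String name) {
        super(name); // Character.Subset(String) is protected, so subclasses may call it
    }
}

// Sketch only: a context (such as a readtable) that maps characters to attributes.
final class ReadtableSketch {
    private final Map<Integer, ReaderAttribute> attributes = new HashMap<>();

    /** Records the attribute of a character in this context. */
    void set(int codePoint, ReaderAttribute attribute) {
        attributes.put(codePoint, attribute);
    }

    /** Returns the attribute in this context; unlisted characters are constituents. */
    ReaderAttribute attributeOf(int codePoint) {
        ReaderAttribute found = attributes.get(codePoint);
        return found != null ? found : ReaderAttribute.CONSTITUENT;
    }
}
```

Because the mapping lives in the context object rather than in the character, two contexts can assign different attributes to the same character, which is the behavior this case calls for.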

Character Type Implementation

The Character type is divided into 4 classes:

-Character: this is a superclass that all other concrete types inherit from. The following methods will be available to programmers in this class:

| Function Name | Description | Example |
| equals | returns true if Character a represents the same character value as Character b | a.equals(b) => true or false |
| lessThan | returns true if Character a is less than Character b | a='a' b='b', a.lessThan(b) => true |
| greaterThan | returns true if Character a is greater than Character b | a='a' b='b', a.greaterThan(b) => false |
| lessThanOrEqual | returns true if Character a is less than or equal to Character b | a='a' b='b', a.lessThanOrEqual(b) => true |
| greaterThanOrEqual | returns true if Character a is greater than or equal to Character b | a='a' b='b', a.greaterThanOrEqual(b) => false |
| compareTo | returns 0 if a is equal to b, -1 if a is less than b, and 1 if a is greater than b | a='a' b='b', a.compareTo(b) => -1 |
| hashCode | returns a hash code for the Character | a.hashCode() => a number that no Character with a different value will have |

-BaseCharacter: this class represents the Unicode Basic Multilingual Plane.

-StandardCharacter: subclass of BaseCharacter, represents the standard set of 96 basic characters that all Common Lisp implementations must support.

-ExtendedCharacter: represents all Unicode characters that need more than 16 bits, also known as supplementary characters.
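
The four classes and the comparison methods listed above could be sketched as follows. The superclass is named LispCharacter here only to avoid a clash with java.lang.Character. That name, the of factory method and the isStandard test are assumptions for illustration, not the CLforJava implementation.

```java
// Sketch only: LispCharacter stands in for the Character superclass described above.
abstract class LispCharacter implements Comparable<LispCharacter> {
    private final int codePoint;

    protected LispCharacter(int codePoint) {
        this.codePoint = codePoint;
    }

    int codePoint() { return codePoint; }

    boolean lessThan(LispCharacter other) { return codePoint < other.codePoint; }
    boolean greaterThan(LispCharacter other) { return codePoint > other.codePoint; }
    boolean lessThanOrEqual(LispCharacter other) { return codePoint <= other.codePoint; }
    boolean greaterThanOrEqual(LispCharacter other) { return codePoint >= other.codePoint; }

    /** Returns -1, 0 or 1, matching the table above. */
    @Override
    public int compareTo(LispCharacter other) {
        return codePoint < other.codePoint ? -1 : (codePoint > other.codePoint ? 1 : 0);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof LispCharacter && ((LispCharacter) other).codePoint == codePoint;
    }

    /** The code point itself, so characters with different values never collide. */
    @Override
    public int hashCode() { return codePoint; }

    /** Hypothetical factory: picks the concrete class from the code point. */
    static LispCharacter of(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return new ExtendedCharacter(codePoint);
        }
        return StandardCharacter.isStandard(codePoint)
                ? new StandardCharacter(codePoint)
                : new BaseCharacter(codePoint);
    }
}

/** Characters in the Basic Multilingual Plane. */
class BaseCharacter extends LispCharacter {
    BaseCharacter(int codePoint) { super(codePoint); }
}

/** The 96 standard characters: 95 printable ASCII characters plus Newline. */
final class StandardCharacter extends BaseCharacter {
    StandardCharacter(int codePoint) { super(codePoint); }

    static boolean isStandard(int codePoint) {
        return codePoint == '\n' || (codePoint >= 0x20 && codePoint <= 0x7E);
    }
}

/** Supplementary characters, which need more than 16 bits. */
final class ExtendedCharacter extends LispCharacter {
    ExtendedCharacter(int codePoint) { super(codePoint); }
}
```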

The Character Functions to be implemented

| Function Name | Description | Example | Java Implementation |
| CHAR= | Returns true if all characters are the same; otherwise, it returns false. | (char= #\d #\d) => true | [CharEqualCaseSensitive.java] |
| CHAR/= | Returns true if all characters are different; otherwise, it returns false. | (char/= #\d #\d) => false | [CharNotEqualCaseSensitive.java] |
| CHAR< | Returns true if the characters are monotonically increasing; otherwise, it returns false. | (char< #\d #\x) => true | [CharLessPCaseSensitive.java] |
| CHAR> | Returns true if the characters are monotonically decreasing; otherwise, it returns false. | (char> #\e #\d) => true | [CharGreaterPCaseSensitive.java] |
| CHAR<= | Returns true if the characters are monotonically nondecreasing; otherwise, it returns false. | (char<= #\d #\x #\x #\x) => true | [CharNotGreaterPCaseSensitive.java] |
| CHAR>= | Returns true if the characters are monotonically nonincreasing; otherwise, it returns false. | (char>= #\d #\c #\b #\b #\a #\a) => true | [CharNotLessPCaseSensitive.java] |
| CHAR-EQUAL | Same as CHAR=, except it is case-insensitive. | (char-equal #\D #\d #\d) => true | [CharEqualCaseInsensitive.java] |
| CHAR-NOT-EQUAL | Same as CHAR/=, except it is case-insensitive. | (char-not-equal #\d #\D) => false | [CharNotEqualCaseInsensitive.java] |
| CHAR-LESSP | Same as CHAR<, except it is case-insensitive. | (char-lessp #\D #\x) => true | [CharLessPCaseInsensitive.java] |
| CHAR-GREATERP | Same as CHAR>, except it is case-insensitive. | (char-greaterp #\E #\d) => true | [CharGreaterPCaseInsensitive.java] |
| CHAR-NOT-GREATERP | Same as CHAR<=, except it is case-insensitive. | (char-not-greaterp #\d #\X #\x) => true | [CharNotGreaterPCaseInsensitive.java] |
| CHAR-NOT-LESSP | Same as CHAR>=, except it is case-insensitive. | (char-not-lessp #\d #\C #\b #\A #\a) => true | [CharNotLessPCaseInsensitive.java] |
| CHARACTER | Returns the character denoted by the character designator. | (character #\a) => #\a, (character "a") => #\a | [Char.java] |
| CHARACTERP | Returns true if object is of type character. | (characterp #\a) => true | [CharacterP.java] |
| ALPHA-CHAR-P | Returns true if character is an alphabetic character. | (alpha-char-p #\5) => false | [AlphaCharP.java] |
| ALPHANUMERICP | Returns true if character is an alphabetic character or a numeric character. | (alphanumericp #\5) => true | [AlphaNumericP.java] |
| DIGIT-CHAR | If weight is less than radix, digit-char returns a character which has that weight when considered as a digit in the specified radix. If the resulting character is to be an alphabetic character, it will be an uppercase character. If weight is greater than or equal to radix, digit-char returns false. | (digit-char 10 11) => #\A | [DigitChar.java] |
| DIGIT-CHAR-P | Tests whether char is a digit in the specified radix (i.e., with a weight less than radix). If it is a digit in that radix, its weight is returned as an integer; otherwise nil is returned. | (digit-char-p #\5) => 5 | [DigitCharP.java] |
| GRAPHIC-CHAR-P | Returns true if character is a graphic character; otherwise, returns false. | (graphic-char-p #\Space) => true, (graphic-char-p #\Newline) => false | [GraphicCharP.java] |
| STANDARD-CHAR-P | Returns true if character is of type standard-char; otherwise, returns false. | (standard-char-p #\Space) => true | [StandardCharP.java] |
| CHAR-UPCASE | Returns the corresponding uppercase character. | (char-upcase #\a) => #\A | [CharUpCase.java] |
| CHAR-DOWNCASE | Returns the corresponding lowercase character. | (char-downcase #\A) => #\a | [CharDownCase.java] |
| UPPER-CASE-P | Returns true if character is an uppercase character. | (upper-case-p #\A) => true | [UpperCaseP.java] |
| LOWER-CASE-P | Returns true if character is a lowercase character. | (lower-case-p #\A) => false | [LowerCaseP.java] |
| BOTH-CASE-P | Returns true if character is a character with case. | (both-case-p #\5) => false | [BothCaseP.java] |
| CHAR-CODE | Returns the code attribute of character. | (char-code #\$) => 36 | [CharCode.java] |
| CHAR-INT | Returns a non-negative integer encoding the character object. The manner in which the integer is computed is implementation-dependent. If character has no implementation-defined attributes, the results of char-int and char-code are the same. | (char-int #\A) => 65 | [CharInt.java] |
| CODE-CHAR | Returns a character with the code attribute given by code. If no such character exists and one cannot be created, nil is returned. | (code-char 65) => #\A | [CodeChar.java] |
| CHAR-NAME | Returns a string that is the name of the character, or nil if the character has no name. | (char-name #\Space) => "Space" | [CharName.java] |
| NAME-CHAR | Returns the character object whose name is name (as determined by string-equal, i.e., lookup is not case sensitive). If such a character does not exist, nil is returned. | (name-char 'space) => #\Space | [NameChar.java] |
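
A few of these functions have semantics that are easy to get wrong: the comparison functions take any number of arguments, CHAR/= requires every pair to differ, and DIGIT-CHAR returns an uppercase letter for weights above 9. The sketch below illustrates them on plain code points. CharFunctionsSketch and its method names are hypothetical and do not correspond to the Java classes listed in the table, and a Java null stands in for nil.

```java
// Sketch only: hypothetical helpers working on code points, with null standing in for nil.
final class CharFunctionsSketch {

    /** CHAR<: true if each character is strictly less than the next one. */
    static boolean charLess(int... codePoints) {
        for (int i = 1; i < codePoints.length; i++) {
            if (codePoints[i - 1] >= codePoints[i]) {
                return false;
            }
        }
        return true;
    }

    /** CHAR/=: true only if no two characters are equal, adjacent or not. */
    static boolean charNotEqual(int... codePoints) {
        for (int i = 0; i < codePoints.length; i++) {
            for (int j = i + 1; j < codePoints.length; j++) {
                if (codePoints[i] == codePoints[j]) {
                    return false;
                }
            }
        }
        return true;
    }

    /** CHAR-LESSP: like CHAR<, but compares case-folded characters. */
    static boolean charLessp(int... codePoints) {
        int[] folded = new int[codePoints.length];
        for (int i = 0; i < codePoints.length; i++) {
            folded[i] = Character.toLowerCase(Character.toUpperCase(codePoints[i]));
        }
        return charLess(folded);
    }

    /** DIGIT-CHAR: the uppercase digit character for weight in radix, or null. */
    static Integer digitChar(int weight, int radix) {
        char digit = Character.forDigit(weight, radix); // '\0' if weight or radix is invalid
        if (digit == '\0') {
            return null;
        }
        return Integer.valueOf(Character.toUpperCase(digit));
    }

    /** DIGIT-CHAR-P: the weight of the character in radix, or null if it is not a digit. */
    static Integer digitCharP(int codePoint, int radix) {
        int weight = Character.digit(codePoint, radix); // -1 if not a digit in radix
        return weight < 0 ? null : Integer.valueOf(weight);
    }
}
```

Note that Character.digit also accepts non-ASCII decimal digits, so whether that matches the intended DIGIT-CHAR-P behavior depends on how Issue 4 is resolved.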

Discussions

Links to Blog issues

Test Suites

Links to JUnit results

-- JerryBoetje - 11 Jul 2003

SpecStatus: SpecInProgress

Topic attachments

| Attachment | Size | Date | Who | Comment |
| CharactersCLforJava2004.pdf | 6263.9 K | 2004-12-08 - 02:52 | DavidLyle | Character Group Presentation Fall 2004 |
Topic revision: r17 - 2009-02-11 - 03:19:35 - MeganLusher
 