This technical note describes how to construct auxiliary break tables for use
with the FindWord routine in the Script Manager.
[Nov 01 1987]
Constructing break tables
The FindWord algorithm finds word boundaries by determining where
words should not be broken. For example, "re-do" is one word: it should not be
broken at the hyphen. In other words, a sequence of the form (letter, hyphen,
letter) should not be broken between the first and second or the second and
third characters. This is called a continuation sequence. The algorithm used by
the FindWord routine allows for continuation sequences of lengths one, two,
and three. Examples of sequences of length two include (letter, letter) and
(number, number). For a length of one, there is only one sequence, consisting
of the characters of type nonBreaking: these characters are never
separated from preceding or following characters.
For most scripts, this information about continuation sequences is packed into
a table for use by the FindWord algorithm. (For complex scripts like
Japanese, a different algorithm is used for portions of the script.) The
default break tables for a given script can be overridden by passing a
user-specified table in the breakTable parameter, but a custom table should
only be used for known scripts. That is, before overriding the default with the
breakTable parameter, the programmer should first check the script of the
current font.
A break table consists of two sections: a 256-byte character type table
followed by a character triple table.
The character type table is indexed by the character's ASCII code and contains
one type value for each character. The character types in the table are limited
to values between 1 and 31. There are two distinguished values: the type
nonBreaking (= 1) indicates that the character is non-breaking; it
always continues a word. The type wild (= 0) indicates that the
character may or may not break, depending on information in the character
triple table, as described below. Otherwise, the choice of numbers to
represent character types is completely arbitrary.
For example, the following MPW Assembler code defines character types for use
in a word-selection break table, then sets up a character type table using an
assembly macro (setByte) to store character type values in an array.
(Note that the character types could have been defined with equate definitions
(EQU), rather than using the record structure.) Writing the setByte
macro is left as an exercise for the reader. Note that the break value is the
default. This value is not distinguished, but should have no continuation
sequences.
charWordRec record  0
wild        ds.b    1       ; constant! not in char table.
nonbreak    ds.b    1       ; constant! non-breaking space.
letter      ds.b    1       ; letters.
number      ds.b    1       ; digits.
break       ds.b    1       ; always breaks.
midLetter   ds.b    1       ; a'a.
midLetNum   ds.b    1       ; a'a 1'1.
preNum      ds.b    1       ; $, etc.
postNum     ds.b    1       ; %, etc.
midNum      ds.b    1       ; 1,1.
preMidNum   ds.b    1       ; .1234.
blank       ds.b    1       ; spaces and tabs.
cr          ds.b    1       ; add carriage return
            endr            ; close the record definition.

            setByte wordTable,blank,$00,' ',$09
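Because the setByte macro is not shown, the following Python model may help
make the layout of the character type table concrete. It is only a sketch under
stated assumptions: set_byte and make_word_table are hypothetical names, the
setByte call is read as "store this type at each listed character code", and
every assignment other than the blank one is illustrative rather than taken
from this note.

```python
# Type values follow the order of the charWordRec record: wild = 0, nonbreak = 1, ...
TYPE_NAMES = [
    "wild", "nonbreak", "letter", "number", "break", "midLetter", "midLetNum",
    "preNum", "postNum", "midNum", "preMidNum", "blank", "cr",
]
TYPES = {name: value for value, name in enumerate(TYPE_NAMES)}


def set_byte(table, char_type, *codes):
    """Hypothetical stand-in for setByte: store char_type at each character code."""
    if not 1 <= char_type <= 31:
        raise ValueError("table entries must be between 1 and 31")
    for code in codes:
        index = ord(code) if isinstance(code, str) else code
        table[index] = char_type


def make_word_table():
    # Every character starts out as 'break', the default type.
    table = bytearray([TYPES["break"]] * 256)
    set_byte(table, TYPES["blank"], 0x00, " ", 0x09)  # the call shown in the note
    # The remaining assignments are illustrative assumptions, ASCII only.
    set_byte(table, TYPES["cr"], 0x0D)
    set_byte(table, TYPES["number"], *range(ord("0"), ord("9") + 1))
    set_byte(table, TYPES["letter"], *range(ord("A"), ord("Z") + 1))
    set_byte(table, TYPES["letter"], *range(ord("a"), ord("z") + 1))
    return table
```

The type wild is deliberately rejected by set_byte, since it is a constant used
only in triples and never stored in the character type table.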
The character triple table is a coded representation of a list of continuation
sequences. It consists of a list of triples, each packed into one word,
preceded by a length word. This length word contains the number of triples
minus one. Each triple contains three character types, each either a type
derived from the charType table or the special type wild (= zero). The three
types in a triple are packed into fields five bits apiece, with the most
significant bit in the word cleared. The first type in the triple is the
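The packing can be illustrated with a short Python sketch. It assumes that the
first type of the triple occupies the most significant five-bit field; that
ordering is an assumption made here for illustration, and pack_triple and
unpack_triple are hypothetical helper names, not Script Manager routines.

```python
WILD = 0  # the special type wild


def pack_triple(first, second, third):
    """Pack three 5-bit character types into one 16-bit word (bit 15 clear).

    Assumption: the first type goes in the most significant field.
    """
    for char_type in (first, second, third):
        if not 0 <= char_type <= 31:
            raise ValueError("character types must fit in five bits")
    if second == WILD:
        raise ValueError("wild cannot be the middle element of a triple")
    return (first << 10) | (second << 5) | third


def unpack_triple(word):
    """Inverse of pack_triple under the same field-order assumption."""
    return (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F
```

With letter = 2 and midLetter = 5, as in the record above, the triple (letter,
midLetter, letter) would pack to $08A2 under this assumption.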
A continuation sequence of length three (xyz) is represented by entering three
triples into the triple list: xyz, *xy, and yz* (where '*' stands for the type
wild, which is always zero).
A continuation sequence of length two (xy) is represented by entering two
triples into this list: *xy and xy*. A continuation sequence of length one has
no entry in the triple list: the character type is simply nonBreaking.
Note that the type wild cannot appear as the middle element of a
triple. The words in the triple table must be sorted in ascending numerical
order for future compatibility.
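The expansion rules above can be sketched in Python as follows. This is a
model, not the defSeq and dumpSeq macros themselves: the function names are
hypothetical, the first type is assumed to occupy the most significant field of
each packed word, and dropping duplicate triples is a choice made here rather
than something this note specifies. Words are written big-endian, as on the
68000.

```python
import struct

WILD = 0  # the special type wild, written '*' in the text


def pack_triple(first, second, third):
    # Assumption: first type in the most significant 5-bit field, bit 15 clear.
    return (first << 10) | (second << 5) | third


def expand_sequence(seq):
    """Return the triples that represent one continuation sequence."""
    if len(seq) == 3:
        x, y, z = seq
        return [(x, y, z), (WILD, x, y), (y, z, WILD)]
    if len(seq) == 2:
        x, y = seq
        return [(WILD, x, y), (x, y, WILD)]
    raise ValueError("only sequences of length two or three have triples")


def build_triple_table(sequences):
    """Length word (count minus one) followed by the sorted packed triples."""
    words = sorted({pack_triple(*t) for s in sequences for t in expand_sequence(s)})
    if not words:
        raise ValueError("at least one continuation sequence is required")
    return struct.pack(">H", len(words) - 1) + b"".join(
        struct.pack(">H", word) for word in words
    )
```

For example, with letter = 2 and midLetter = 5, the sequences (letter, letter)
and (letter, midLetter, letter) expand to five triples, so the length word
holds 4.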
The following is an example of how a character triple table could be coded. The
defSeq macro takes a continuation sequence as a parameter, and enters
a set of triples into an internal array. The dumpSeq macro sorts the
triples, and stores them in the proper order with dc.w commands. Once
again, writing the macros defSeq and dumpSeq is left as an
exercise for the reader.
dc.w ((wordEnd-wordBegin)/2)-1 ; length word.
A series of blanks should generally be selected as a single word. Make certain,
however, that a carriage return does not continue a word to the right (note how
it has a separate character type from blank for this reason); otherwise word
selection and wrapping do not work properly across paragraphs.
The values 16-31 in the character type table entry for null ($00) (the first
byte in the character type table) are reserved by Apple for future expansion.
The use of one of these values indicates the presence of a supplementary table
after the triple table.
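As a rough illustration of the overall layout, the following Python sketch
splits a break table into its two sections and checks the null entry for a
reserved value. The function name split_break_table is hypothetical, and
big-endian words are assumed, as on the 68000. The format of any supplementary
table is not described in this note, so the sketch only reports whether the
reserved range is in use.

```python
import struct

TYPE_TABLE_SIZE = 256  # one type byte per character code


def split_break_table(data):
    """Return (type_table, packed_triples, has_supplement) for a break table."""
    if len(data) < TYPE_TABLE_SIZE + 2:
        raise ValueError("too short to hold a type table and a length word")
    type_table = bytes(data[:TYPE_TABLE_SIZE])
    # The length word holds the number of triples minus one.
    (count_minus_one,) = struct.unpack_from(">H", data, TYPE_TABLE_SIZE)
    count = count_minus_one + 1
    start = TYPE_TABLE_SIZE + 2
    if len(data) < start + 2 * count:
        raise ValueError("triple table is truncated")
    triples = list(struct.unpack_from(">%dH" % count, data, start))
    # Values 16-31 in the entry for null ($00) are reserved by Apple.
    has_supplement = 16 <= type_table[0] <= 31
    return type_table, triples, has_supplement
```

A table built by an application today should therefore keep the null entry
below 16.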
The Script Manager