aspell: Words With Symbols in Them

C.2 Words With Spaces or Other Symbols in Them
==============================================

Many languages, including English, have words with non-letter symbols in
them; the apostrophe is one example. These symbols generally appear in
the middle of a word, but they can also appear at the end, such as in an
abbreviation. If a symbol can _only_ appear as part of a word then
Aspell can treat it as if it were a letter.

The problem, however, is that most of these symbols have other uses.
For example, the apostrophe is often used as a single quote, and the
dot that marks an abbreviation is also used as a period. Thus, Aspell
cannot blindly treat them as if they were letters.

Aspell currently handles the case where the symbol can only appear in
the middle of the word fairly well. It simply assumes that if there is
a letter both before and after the symbol then it is part of the word.
This works most of the time but it is not foolproof. For example,
suppose the user forgot to leave a space after the period:

     ... and the dog went up the tree.Then the cat ...

Aspell would think "tree.Then" is one word. A better solution might be
to check "tree" and "Then" separately in that case. But what if one of
them is not in the dictionary? Should Aspell then assume "tree.Then" is
one word?

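The "letter on both sides" rule can be illustrated with a small Python
sketch. This is a toy model, not Aspell's actual tokenizer: the names
tokenize, is_correct and MIDDLE_SYMBOLS, the choice of symbols, and the
use of a plain set as the dictionary are all assumptions made for
illustration. The second function shows the fallback of checking the
parts separately.

```python
# Toy model of the "letter on both sides" rule; not Aspell's real tokenizer.
MIDDLE_SYMBOLS = "'."  # assumed set of symbols allowed inside a word


def tokenize(text):
    """Split text into words, keeping a symbol only between two letters."""
    words, current = [], []
    for i, ch in enumerate(text):
        letter_before = i > 0 and text[i - 1].isalpha()
        letter_after = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha() or (ch in MIDDLE_SYMBOLS and letter_before and letter_after):
            current.append(ch)
        elif current:
            # Any other character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def is_correct(token, dictionary):
    """Accept the token as a whole, or else if every part is a known word."""
    if token.lower() in dictionary:
        return True
    parts = [token]
    for symbol in MIDDLE_SYMBOLS:
        parts = [piece for part in parts for piece in part.split(symbol)]
    return all(piece.lower() in dictionary for piece in parts)


dictionary = {"the", "tree", "then", "cat", "don't"}
print(tokenize("up the tree.Then the cat"))
print(is_correct("tree.Then", dictionary))
```

With this rule "tree.Then" comes back as a single token, and the
fallback accepts it only because both halves are known words. The
sketch does not settle what to do when one half is unknown.
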
The case where the symbol can appear at the beginning or end of the
word is more difficult to deal with. The symbol may or may not
actually be part of the word. Aspell currently handles this case by
first checking the word with the symbol and, if that fails, checking it
without. The problem is, if the word is misspelled, should Aspell
assume the symbol belongs with the word or not? Currently Aspell
assumes it does, which is not always the correct thing to do.

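The "with the symbol first, then without" strategy might look roughly
like the following Python sketch. It is only an illustration under
stated assumptions: check_edge_symbol and EDGE_SYMBOLS are hypothetical
names, the dictionary is a plain set, and none of this is taken from
Aspell's source.

```python
# Illustrative only: try the word with its edge symbol, then without it.
EDGE_SYMBOLS = ".'"  # assumed symbols that may start or end a word


def check_edge_symbol(token, dictionary):
    """Return (is_correct, word), where word is the form that was judged."""
    if token in dictionary:
        return True, token  # e.g. an abbreviation stored with its dot
    stripped = token.strip(EDGE_SYMBOLS)
    if stripped in dictionary:
        return True, stripped  # the symbol was punctuation, not part of the word
    # Misspelled: keep the symbol attached, mirroring the behaviour
    # described in the text, even though that is not always right.
    return False, token


dictionary = {"etc.", "cat"}
for token in ("etc.", "cat.", "ect."):
    print(token, check_edge_symbol(token, dictionary))
```

The last branch is where the open question sits: for "ect." the sketch
reports the word with its dot, but for a misspelling such as "caat."
the dot is really sentence punctuation and should have been dropped.
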
Numbers in words present a different challenge to Aspell. If Aspell
treats digits as letters then every possible number a user might write
in a document must be listed in the dictionary. This could easily
be solved by having special code that assumes all numbers are correctly
spelled. Yet what about something like "4th"? Since the "th" suffix
can appear after any number we are left with the same problem. One
solution would be to have a special symbol for "any number".

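One way to picture the "any number" symbol is to replace each run of
digits with a placeholder before the dictionary lookup. The Python
sketch below assumes "#" as that placeholder and a plain set as the
dictionary; ANY_NUMBER, normalize_number and accepts are hypothetical
names, not part of Aspell.

```python
import re

ANY_NUMBER = "#"  # assumed placeholder standing for "any number"


def normalize_number(word):
    """Replace every run of digits in the word with the placeholder."""
    return re.sub(r"[0-9]+", ANY_NUMBER, word)


def accepts(word, dictionary):
    """Look the word up after collapsing its digits to the placeholder."""
    return normalize_number(word) in dictionary


# One entry covers all bare numbers, another covers "4th", "10th", ...
dictionary = {"#", "#th"}
for word in ("2024", "4th", "10th", "4xq"):
    print(word, normalize_number(word), accepts(word, dictionary))
```

A single "#th" entry then covers every number followed by "th", so the
dictionary no longer has to list the numbers themselves.
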
Words with spaces in them, such as foreign phrases, are even more
trouble to deal with. The basic problem is that when tokenizing a
string there is no good way to keep phrases together. One solution is to
use trial and error: if a word is not in the dictionary, try grouping it
with the previous or next word and see if the combined word is in the
dictionary. But if the combined word is not in the dictionary either,
should the misspelled word be grouped when looking for suggestions? One
answer is to also store each part of the phrase in the dictionary, but
tag it as part of a phrase and not an independent word.

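The trial-and-error grouping and the phrase-part tag can be sketched
together in Python. Everything here is an assumption made for
illustration: the two sets WORDS and PHRASE_PARTS stand in for a tagged
dictionary, find_misspellings is a hypothetical name, and only two-word
phrases are handled. Looking one word ahead at each position covers
both the "previous" and the "next" grouping for such phrases.

```python
# Toy dictionary: complete entries, plus parts valid only inside a phrase.
WORDS = {"an", "committee", "vice", "ad hoc", "vice versa"}
PHRASE_PARTS = {"ad", "hoc", "versa"}


def find_misspellings(words):
    """Return the misspelled units, grouping likely phrase fragments."""
    flagged = []
    i = 0
    while i < len(words):
        word = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt is not None and f"{word} {nxt}" in WORDS:
            i += 2  # the pair is a known phrase
        elif word in WORDS:
            i += 1  # an ordinary independent word
        elif nxt is not None and (word in PHRASE_PARTS or nxt in PHRASE_PARTS):
            # One side is tagged as a phrase part, so report the pair
            # as one unit when looking for suggestions.
            flagged.append(f"{word} {nxt}")
            i += 2
        else:
            flagged.append(word)
            i += 1
    return flagged


print(find_misspellings("an ad hoc committee".split()))
print(find_misspellings("an ad hok committee".split()))
```

In the second call "ad" is known only as a phrase part, so the sketch
reports "ad hok" as a single unit rather than flagging "hok" alone.
Note that the function needs the whole list of words at once.
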
To further complicate things, most applications that use spell
checkers are accustomed to parsing the document themselves and sending
it to the spell checker a word at a time. In order to support words with
spaces in them, a more complicated interface will be required.
