aspell: Phonetic Code
1
1 7.3 Phonetic Code
1 =================
1
1 Aspell is in fact the spell checker that comes up with the best
1 suggestions if it finds an unknown word. One reason is that it does
1 not just compare the word with other words in the dictionary (like
1 Ispell does) but also uses phonetic comparisons with other words.
1
1 The new table driven phonetic code is very flexible and setting up
1 phonetic transformation rules for other languages is not difficult but
1 there can be a number of stumbling blocks -- that's why I wrote this
1 section.
1
1 The main phonetic code is free of any language specific code and
1 should be powerful enough to allow setting up rules for any language.
1 Anything which is language specific is kept in a plain text file and
1 can easily be edited. So it's even possible to write phonetic
1 transformation rules if you don't have any programming skills. All you
1 need to know is how words of the language are written and how they are
1 pronounced.
1
1 7.3.1 Syntax of the transformation array
1 ----------------------------------------
1
1 In the translation array there are two strings on each line; the first
1 one is the search string (or switch name) and the second one is the
1 replacement string (or switch parameter). The line
1
1 version VERSION
1
1 is also required to appear somewhere in the translation array. The
1 version string can be anything but it should be changed whenever a new
1 version of the translation array is released. This is important
1 because it will keep Aspell from using a compiled dictionary with the
1 wrong set of rules. For example, when coming up with suggestions for
1 `hallo', Aspell will use the new rules to compute the soundslike, say
1 `H*L*'. But if `hello' is stored in the dictionary using the old rules
1 as `HL' instead of `H*L*', Aspell will never be able to come up with
1 `hello'. To solve this problem Aspell checks if the version
1 strings match and aborts with an error if they don't. Thus it is
1 important to update it whenever a new version of the translation array
1 is released. This is only a problem with the main word list as the
1 personal word lists are now stored as simple word lists with a single
1 header line (i.e. no soundslike data).
1
1 Each non switch line represents one replacement (transformation)
1 rule. Words beginning with the same letter must be grouped together;
1 the order inside this group does not depend on alphabetical issues but
1 it gives priorities; the higher the rule the higher the priority.
1 That's why the first rule that matches is applied. In the following
1 example:
1
1 GH _
1 G K
1
1 `GH -> _' has higher priority than `G -> K'
1
1 `_' represents the empty string "". If `GH -> _' came after `G ->
1 K', the second rule would never match because the algorithm would stop
1 searching for more rules after the first match. The above rules
1 transform any `GH' to an empty string (deleting it) and transform any
1 other `G' to `K'.
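
As a rough illustration of the ordering rule above, the following Python
sketch tries the rules from top to bottom and applies the first one that
matches. It is a simplified model and not Aspell's implementation: the
name `transform' is made up here, and characters without a rule are
copied unchanged (Aspell skips them) so that the effect of the two rules
stays visible.

```python
# Simplified model: rules are tried top to bottom, the first match wins.
RULES = [("GH", ""), ("G", "K")]  # "_" in the table stands for the empty string


def transform(word, rules=RULES):
    """Apply the first matching rule at each position of `word`."""
    out = []
    i = 0
    while i < len(word):
        for search, replacement in rules:
            if word.startswith(search, i):
                out.append(replacement)
                i += len(search)
                break
        else:
            # Assumption of this sketch: keep characters that have no rule.
            out.append(word[i])
            i += 1
    return "".join(out)


print(transform("GHOST"))                            # OST
print(transform("GOAT"))                             # KOAT
# With the order reversed, `GH' is never reached:
print(transform("GHOST", [("G", "K"), ("GH", "")]))  # KHOST
```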
1
1 At the end of the first string of a line (the search string) there
1 may optionally stand a number of characters in brackets. One (only
1 one!) of these characters must fit. It's comparable with the `[ ]'
1 brackets in regular expressions. The rule `DG(EIY) -> J' for example
1 would match any `DGE', `DGI' and `DGY' and replace them with `J'. This
1 way you can reduce several rules to one.
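
The bracket notation can be pictured as a regular expression character
class. The sketch below converts a search string such as `DG(EIY)' into
a Python regex; `search_to_regex' is a hypothetical helper, and it
handles only plain letters plus one optional bracket group, not the
other control characters described in this section.

```python
import re


def search_to_regex(search):
    """Turn a search string such as 'DG(EIY)' into a compiled regex."""
    m = re.fullmatch(r"([^()]+)(?:\(([^()]+)\))?", search)
    if m is None:
        raise ValueError(f"unsupported search string: {search!r}")
    literal, choices = m.groups()
    pattern = re.escape(literal)
    if choices:
        # Exactly one of the bracketed characters must follow the literal part.
        pattern += "[" + re.escape(choices) + "]"
    return re.compile(pattern)


rx = search_to_regex("DG(EIY)")
print(rx.pattern)             # DG[EIY]
print(rx.sub("J", "BADGER"))  # BAJR
```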
1
1 After the search string, one or more dashes `-' may be placed.
1 Those search strings will be matched in full but only the beginning of
1 the matched string will be replaced. Furthermore, for these rules no follow-up
1 rule will be searched (what this is will be explained later). The rule
1 `TCH-- -> _' will match any word containing `TCH' (like `match') but
1 will only replace the first character `T' with an empty string. The
1 number of dashes determines how many characters from the end will not
1 be replaced. After the replacement, the search for transformation
1 rules continues with the unreplaced `CH'.
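
To make the effect of the dashes concrete, here is a small Python sketch
that applies a single dashed rule once. The helpers `parse_dashes' and
`apply_once' are invented for this illustration; the sketch does not
model priorities, follow-up rules, or the continued search over the
characters that were left in place.

```python
def parse_dashes(search):
    """Split 'TCH--' into the text to match and the number of kept characters."""
    text = search.rstrip("-")
    keep = len(search) - len(text)  # one kept trailing character per dash
    return text, keep


def apply_once(word, search, replacement):
    """Replace the first occurrence, leaving the last `keep` characters alone."""
    text, keep = parse_dashes(search)
    pos = word.find(text)
    if pos < 0:
        return word
    replaced_len = len(text) - keep
    return word[:pos] + replacement + word[pos + replaced_len:]


print(apply_once("MATCH", "TCH--", ""))  # MACH
print(apply_once("PHONE", "PH", "F"))    # FONE
```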
1
1 If a `<' is appended to the search string, the search for
1 replacement rules will continue with the replacement string and not with
1 the next character of the word. The rule `PH< -> F' for example would
1 replace `PH' with `F' and then again start to search for a replacement
1 rule for `F...'. If there were also rules like `FO -> O' and `F
1 -> _' then words like `PHOXYZ' would be transformed to `OXYZ' and any
1 occurrences of `PH' that are not followed by an `O' would be deleted, as in
1 `PHIXYZ -> IXYZ'. The second replacement however is not applied if the
1 priority of this rule is lower than the priority of the first rule.
1
1 Priorities are added to a rule by putting a number between 0 and 9 at
1 the end of the search string, for example `ING6 -> N'. The higher the
1 number the higher is the priority.
1
1 Priorities are especially important for the previously mentioned
1 follow-up rules. Follow-up rules are searched beginning from the last
1 character of the first search string. This is a bit complicated but I
1 hope this example will make it clearer:
1
1 CHS X
1 CH G
1
1 HAU--1 H
1
1 SCH SH
1
1 In this example `CHS' in the word `FUCHS' would be transformed to
1 `X'. If we take the word `DURCHSCHNITT' then things look a bit
1 different. Here `CH' belongs together and `SCH' belongs together and
1 both are spoken separately. The algorithm however first finds the
1 string `CHS' which may not be transformed like in the previous word
1 `FUCHS'. At this point the algorithm can find a follow-up rule. It
1 takes the last character of the first matching rule (`CHS') which is
1 `S' and looks for the next match, beginning from this character. What
1 it finds is clear: It finds `SCH -> SH', which has the same priority
1 (no priority means standard priority, which is 5). If the priority is
1 the same or higher the follow-up rule will be applied. Let's take a
1 look at the word `SCHAUKEL'. In this word `SCH' belongs together and
1 may not be taken apart. After the algorithm has found `SCH -> SH' it
1 searches for a follow-up rule for `H' + `AUKEL'. It finds `HAU--1 -> H',
1 but does not apply it because its priority is lower than the one of the
1 first rule. You see that this is a very powerful feature but it also
1 can easily lead to mistakes. If you really don't need this feature you
1 can turn it off by putting the line:
1
1 followup 0
1
1 at the beginning of the phonetic table file. As mentioned, for rules
1 containing a `-' no follow-up rules are searched but giving such rules
1 a priority is not totally senseless because they can be follow-up rules
1 and in that case the priority makes sense again. Follow-up rules of
1 follow-up rules are not searched because this is in fact not needed
1 very often.
1
1 The control character `^' says that the search string only matches
1 at the beginning of words so that the rule `RH^ -> R' will only apply to
1 words like `RHESUS' but not `PERHAPS'. You can append another `^' to
1 the search string. In that case the algorithm treats the rest of the
1 word totally separately from the first matched string at the beginning.
1 This is useful for prefixes whose pronunciation does not depend on the
1 rest of the word and vice versa like `OVER^^' in English for example.
1
1 In the same way, `$' restricts a rule to words that end with
1 the search string. `GN$ -> N' only matches on words like `SIGN' but
1 not `SIGNUM'. If you use `^' and `$' together, both of them must fit:
1 `ENOUGH^$ -> NF' will only match the word `ENOUGH' and nothing else.
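
The anchoring behaviour of `^' and `$' can be sketched in Python as a
position check. The function names and the boolean parameters
`at_start' and `at_end' are assumptions made for this example; the
double `^^' form is not modeled.

```python
def matches_at(word, pos, text, at_start=False, at_end=False):
    """Return True if `text` occurs in `word` at `pos`, honouring ^ and $."""
    if not word.startswith(text, pos):
        return False
    if at_start and pos != 0:
        return False
    if at_end and pos + len(text) != len(word):
        return False
    return True


def rule_applies(word, text, at_start=False, at_end=False):
    """Return True if the rule can match anywhere in `word`."""
    return any(
        matches_at(word, pos, text, at_start, at_end)
        for pos in range(len(word))
    )


print(rule_applies("RHESUS", "RH", at_start=True))   # True
print(rule_applies("PERHAPS", "RH", at_start=True))  # False
print(rule_applies("SIGNUM", "GN", at_end=True))     # False
```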
1
1 Of course you can combine all of the mentioned control characters but
1 they must occur in this order: `< - priority ^ $'. All characters must
1 be written in CAPITAL letters.
1
1 If absolutely no rule can be found -- might happen if you use strange
1 characters for which you don't have any replacement rule -- the next
1 character will simply be skipped and the search for replacement rules
1 will continue with the rest of the word.
1
1 If you want double letters to be reduced to one you must set up a
1 rule like `LL- -> L'. If double letters in the resulting phonetic word
1 should be allowed, you must place the line:
1
1 collapse_result 0
1
1 at the beginning of your transformation table file; otherwise set the
1 value to `1'. The English rules for example strip all vowels from
1 words and so the word "GOGO" would be transformed to "K" and not to
1 "KK" (as desired) if `collapse_result' is set to 1. That's why the
1 English rules have `collapse_result' set to `0'.
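
As a sketch of what collapsing the result means, the Python snippet
below merges runs of identical adjacent letters in a phonetic code. This
assumes that `collapse_result 1' behaves like a simple run-length
merge, which is consistent with the "KK" to "K" example above but is
not taken from Aspell's source; the function name `collapse' is made up.

```python
from itertools import groupby


def collapse(code):
    """Merge each run of identical adjacent letters into a single letter."""
    return "".join(letter for letter, _run in groupby(code))


print(collapse("KK"))        # K
print(collapse("KNTRTKXN"))  # KNTRTKXN
```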
1
1 By default, all accents are removed from a word before it is matched
1 to the soundslike rules. If you do not want this then add the line
1
1 remove_accents 0
1
1 at the beginning of your file. The exact definition of an accent is
1 language dependent and is controlled via the character set file. If you
1 set `remove_accents' to `0' then you should also set `store-as' to `lower'
1 in the language data file (not the phonetic transformation file);
1 otherwise Aspell will have problems when both the accented and the
1 de-accented version of a word appear in the dictionary: it will
1 consider one of them as incorrectly spelled.
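
To summarize the file layout described in this section, the following
Python sketch separates switch lines from rule lines. It only knows the
four switches named above (`version', `followup', `collapse_result',
`remove_accents'); `parse_table' is a hypothetical helper, the sample
table is invented, and anything a real table file may contain beyond
what this section describes is not handled.

```python
SWITCHES = {"version", "followup", "collapse_result", "remove_accents"}


def parse_table(text):
    """Split a table into a dict of switches and an ordered list of rules."""
    switches = {}
    rules = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or incomplete line
        first, second = parts[0], parts[1].strip()
        if first in SWITCHES:
            switches[first] = second
        else:
            # "_" stands for the empty replacement string.
            rules.append((first, "" if second == "_" else second))
    return switches, rules


sample = "version 1.1\nfollowup 0\n\nGH _\nG K\n"
switches, rules = parse_table(sample)
print(switches)  # {'version': '1.1', 'followup': '0'}
print(rules)     # [('GH', ''), ('G', 'K')]
```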
1
1 7.3.2 How do I start finally?
1 -----------------------------
1
1 Before you start to write an array of transformation rules, you should
1 be aware that you have to do some work to make sure that things you do
1 will result in correct transformation rules.
1
1 7.3.2.1 Things that come in handy
1 .................................
1
1 First of all, you need to have a large word list of the language you
1 want to make phonetics for. It should contain about as many words as
1 the dictionary of the spell checker. If you don't have such a list,
1 you will probably find an Ispell dictionary at
1 `http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html' which will
1 help you. You can then do affix expansion via `ispell -e' and then
1 pipe it through `tr " " "\n"' to put one word on each line. After that
1 you may have to convert special characters like `é' from
1 Ispell's internal representation to latin1 encoding. `sed "s/e'/é/g"'
1 for example would replace all `e'' with `é'.
1
1 The second is that you know how to use regular expressions and know
1 how to use `grep'. You should for example know that:
1
1 grep '^[^aeiou]qu[io]' wordlist | less
1
1 will show you all words that begin with any character but `a', `e',
1 `i', `o' or `u' and then continue with `qui' or `quo'. This stuff is
1 important for example to find out if a phonetic replacement rule you
1 want to set up is valid for all words which match the expression you
1 want to replace. Taking a look at the regex(7) man page is a good idea.
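
If `grep' is not at hand, the same kind of check can be sketched in
Python with the standard `re' module. The function `grep' below and the
tiny word list are made up for illustration; a real check would read the
full word list from a file.

```python
import re


def grep(pattern, words, ignore_case=False):
    """Return the words that contain a match for `pattern`."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [word for word in words if rx.search(word)]


sample = ["squid", "quote", "equip", "squirrel", "aquifer", "quota"]
print(grep(r"^[^aeiou]qu[io]", sample))  # ['squid', 'squirrel']
```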
1
1 7.3.2.2 What the phonetic code should do
1 ........................................
1
1 Normal text comparison works well as long as the typist misspells a word
1 by pressing a key they didn't mean to press. In these
1 cases, usually only one character differs from the original word.
1
1 In cases where the writer didn't know about the correct spelling of
1 the word, the word may have several characters that differ from the
1 original word but usually the word would still sound like the original.
1 Someone might think that `tough' is spelled `taff'. No spell checker
1 without phonetic code will come up with the idea that this might be `tough',
1 but a spell checker that knows that `taff' would be pronounced like
1 `tough' will make good suggestions to the user. Another example could
1 be `funetik' and `phonetic'.
1
1 From these examples you can see that the phonetic transformation
1 should not be too fussy and too precise. If you implement a whole
1 phonetic dictionary as you can find it in books this will not be very
1 useful because then many characters could still differ between
1 the misspelled and the desired word. What you should do when you
1 implement the phonetic transformation table is to reduce the number of
1 letters used to those that are really necessary.
1
1 Characters that sound similar should be reduced to one. In the
1 English language for example `Z' sounds like `S' and that's why the
1 transformation rule `Z -> S' is present in the replacement table. `PH'
1 is spoken like `F' and so we have a `PH -> F' rule.
1
1 If you take a closer look you will even see that vowels sound very
1 similar in the English language: `contradiction', `cuntradiction',
1 `cantradiction' or `centradiction' in fact sound nearly the same, don't
1 they? Therefore the English phonetic replacement rules not only reduce
1 all vowels to one but even remove them all (removing is done by just
1 setting up no rule for those letters). The phonetic code of
1 "contradiction" is "KNTRTKXN" and if you try to read this
1 letter-monster aloud you will hear that it still sounds a bit like
1 `contradiction'. You also see that "D" is transformed to "T" because
1 they sound nearly the same.
1
1 If you think you have found a regularity you should _always_ take
1 your word list and `grep' for the corresponding regular expression you
1 want to make a transformation rule for. For example, you might come to the
1 conclusion that all English words ending in `ough' sound like `AF' at the end
1 because you are thinking of `enough' and `tough'. If you then `grep' for the
1 corresponding regular expression with `grep -i "ough$" wordlist' you will
1 see that the rule you wanted to set up is not correct because the rule
1 doesn't fit words like `although' or `bough'. So you have to define
1 your rule more precisely or you have to set up exceptions if the number
1 of words that differ from the desired rule is not too big.
1
1 Don't forget about follow-up rules which can help in many cases but
1 which also can lead to confusion and unwanted side effects. It's also
1 important to write exceptions in front of the more general rules (`GH'
1 before `G' etc.).
1
1 If you think you have set up a number of rules that may produce some
1 good results try them out! If you run Aspell as `aspell
1 --lang=YOUR_LANGUAGE pipe' you get a prompt at which you can type in
1 words. If you just type words Aspell checks them and makes
1 suggestions if they are misspelled. If you type in `$$Sw WORD' you
1 will see the phonetic transformation and you can test out if your work
1 does what you want.
1
1 Another good way to check that changes you make to your rules don't
1 have any bad side effects is to create another list from your word list
1 which contains not only the word of the word list but also the
1 corresponding phonetic version of this word on the same line. If you
1 do this once before the change and once after the change you can make a
1 diff (see `man diff') to see what _really_ changed. To do this use the
1 command `aspell --lang=YOUR_LANGUAGE soundslike'. In this mode Aspell
1 will output the original word and then its soundslike separated by
1 a tab character for each word you give it. If you are interested in
1 seeing how the algorithm works you can download a set of useful
1 programs from
1 `http://members.xoom.com/maccy/spell/phonet-utils.tar.gz'. This
1 includes a program that produces a list as mentioned above and another
1 program which illustrates how the algorithm works. It uses the same
1 transformation table as Aspell and so it helps a lot during the process
1 of creating a phonetic transformation table for Aspell.
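
The before/after comparison can also be done with a few lines of Python
instead of `diff'. The sketch below assumes two lists in the format
described above (word, tab character, soundslike, one pair per line);
the function names and the sample codes are invented for illustration
and are not real Aspell output.

```python
def parse(lines):
    """Map each word to its soundslike from 'word<TAB>code' lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, code = line.partition("\t")
        table[word] = code
    return table


def changed(before_lines, after_lines):
    """Return {word: (old_code, new_code)} for words whose code changed."""
    old, new = parse(before_lines), parse(after_lines)
    return {
        word: (old[word], new[word])
        for word in old
        if word in new and old[word] != new[word]
    }


# Hypothetical data: the rule set was edited between the two runs.
before = ["tough\tTF\n", "phone\tFN\n"]
after = ["tough\tTF\n", "phone\tPN\n"]
print(changed(before, after))  # {'phone': ('FN', 'PN')}
```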
1
1 During your work you should write down your basic ideas so that other
1 people are able to understand what you did (and you still know about it
1 after a few weeks). The English table has extensive documentation
1 appended as an example.
1
1 Now you can start experimenting with all the things you just read and
1 perhaps set up a nice phonetic transformation table for your language
1 to help Aspell come up with the best correction suggestions
1 for your language, too. Take a look at the Aspell homepage to see
1 if there is already a transformation table for your language. If there
1 is one you might also take a look at it to see if it could be improved.
1
1 If you think that this section helped you or if you think that this
1 is just a waste of time you can send any feedback to
1 <bjoern.jacke@gmx.de>.
1