aspell: Phonetic Code

1 
1 7.3 Phonetic Code
1 =================
1 
1 Aspell is in fact the spell checker that comes up with the best
1 suggestions if it finds an unknown word.  One reason is that it does
1 not just compare the word with other words in the dictionary (like
1 Ispell does) but also uses phonetic comparisons with other words.
1 
   The new table-driven phonetic code is very flexible, and setting up
phonetic transformation rules for other languages is not difficult.
There are, however, a number of stumbling blocks, which is why I wrote
this section.
1 
1    The main phonetic code is free of any language specific code and
1 should be powerful enough to allow setting up rules for any language.
1 Anything which is language specific is kept in a plain text file and
1 can easily be edited.  So it's even possible to write phonetic
1 transformation rules if you don't have any programming skills.  All you
1 need to know is how words of the language are written and how they are
1 pronounced.
1 
1 7.3.1 Syntax of the transformation array
1 ----------------------------------------
1 
1 In the translation array there are two strings on each line; the first
1 one is the search string (or switch name) and the second one is the
1 replacement string (or switch parameter).  The line
1 
1      version   VERSION
1 
is also required to appear somewhere in the translation array.  The
version string can be anything, but it should be changed whenever a new
version of the translation array is released.  This is important
because it keeps Aspell from using a compiled dictionary with the
wrong set of rules.  For example, when coming up with suggestions for
`hallo', Aspell uses the new rules to compute the soundslike, say
`H*L*'.  If `hello' is stored in the dictionary under the old rules as
`HL' instead of `H*L*', Aspell will never be able to come up with
`hello'.  To avoid this problem, Aspell checks whether the version
strings match and aborts with an error if they don't.  This only
matters for the main word list, as the personal word lists are now
stored as simple word lists with a single header line (i.e. no
soundslike data).
1 
   Each non-switch line represents one replacement (transformation)
rule.  Rules whose search strings begin with the same letter must be
grouped together.  The order inside such a group is not alphabetical;
it determines priority: the earlier a rule appears, the higher its
priority, because the first rule that matches is applied.  In the
following example:
1 
1      GH   _
1      G    K
1 
`GH -> _' has higher priority than `G -> K'.

   `_' represents the empty string "".  If `GH -> _' came after `G ->
K', it would never match, because the algorithm stops searching for
more rules after the first match.  The rules above transform any `GH'
to an empty string (that is, delete it) and transform any other `G' to
`K'.
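
   The effect of rule order can be tried out with a small Python
sketch.  This is only a toy model written for this section, not
Aspell's implementation: the helper name `apply_first_match' is made up,
and priorities, follow-up rules and all control characters are ignored.

```python
# Toy model of "the first matching rule wins" (hypothetical helper,
# not part of Aspell).  The empty string stands for `_'.
RULES = [("GH", ""), ("G", "K")]  # order matters: GH must precede G


def apply_first_match(word, rules):
    """Scan the word left to right; at each position apply the first
    rule whose search string matches there."""
    out = []
    i = 0
    while i < len(word):
        for search, replacement in rules:
            if word.startswith(search, i):
                out.append(replacement)
                i += len(search)
                break
        else:
            # No rule for this character: skip it.
            i += 1
    return "".join(out)


print(apply_first_match("GHG", RULES))                  # GH first
print(apply_first_match("GHG", list(reversed(RULES))))  # G first
```

With `GH' listed first, the `GH' is deleted and only the final `G'
becomes `K'.  With the order reversed, the `GH' rule is never reached.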
1 
   At the end of the first string of a line (the search string), a
number of characters may optionally be given in parentheses.  Exactly
one of these characters must match at that position.  This is
comparable to the `[ ]' brackets in regular expressions.  The rule
`DG(EIY) -> J', for example, matches any `DGE', `DGI' or `DGY' and
replaces it with `J'.  This way you can reduce several rules to one.
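
   The comparison with regular expressions can be made concrete.  The
following Python sketch translates a search string with an optional
group of alternatives into an equivalent regular expression.  The
function name `search_to_regex' is hypothetical, and control characters
such as `<', `-', `^' and `$' are deliberately not handled here.

```python
import re


def search_to_regex(search):
    """Translate e.g. 'DG(EIY)' into the compiled regex 'DG[EIY]'."""
    m = re.fullmatch(r"([^()]+)(?:\(([^()]+)\))?", search)
    if m is None:
        raise ValueError(f"unsupported search string: {search!r}")
    literal, alternatives = m.groups()
    pattern = re.escape(literal)
    if alternatives:
        # Exactly one of the listed characters must follow the literal part.
        pattern += "[" + re.escape(alternatives) + "]"
    return re.compile(pattern)


rule = search_to_regex("DG(EIY)")
print(rule.sub("J", "EDGE"))
print(rule.sub("J", "DGA"))
```

The first call replaces `DGE' with `J'; the second leaves the word
alone because `A' is not one of the listed alternatives.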
1 
   After the search string, one or more dashes `-' may be placed.
Such a search string must match in full, but only its beginning is
replaced.  Furthermore, no follow-up rule is searched for these rules
(follow-up rules are explained later).  The rule `TCH-- -> _' matches
any word containing `TCH' (like `match') but replaces only the first
character `T' with an empty string.  The number of dashes determines
how many characters at the end of the search string are not replaced.
After the replacement, the search for transformation rules continues
with the unreplaced `CH'.
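
   The bookkeeping for the dashes is easy to get wrong, so here is a
minimal Python sketch of a single matching step.  The function
`apply_dash_rule' and its arguments are invented for illustration; it
only shows how many characters are consumed, nothing more.

```python
def apply_dash_rule(word, pos, search, dashes, replacement):
    """If `search` matches `word` at `pos`, replace all but its last
    `dashes` characters.

    Returns (emitted_text, next_position), or None when the rule does
    not match at this position.
    """
    if not word.startswith(search, pos):
        return None
    replaced_len = len(search) - dashes
    return replacement, pos + replaced_len


# `TCH-- -> _' applied to MATCH at the position of the T:
emitted, next_pos = apply_dash_rule("MATCH", 2, "TCH", 2, "")
print(repr(emitted), "MATCH"[next_pos:])
```

Only the `T' is consumed, so the search resumes at `CH'.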
1 
   If a `<' is appended to the search string, the search for
replacement rules continues with the replacement string and not with
the next character of the word.  The rule `PH< -> F', for example,
replaces `PH' with `F' and then starts again to search for a
replacement rule for `F...'.  If there were also rules like `FO -> O'
and `F -> _', then words like `PHOXYZ' would be transformed to `OXYZ',
and any occurrence of `PH' that is not followed by an `O' would be
deleted, as in `PHIXYZ -> IXYZ'.  The second replacement, however, is
not applied if the priority of that rule is lower than the priority of
the first rule.
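
   The `PHOXYZ' example can be replayed with a short Python sketch.  It
is a simplification under several assumptions: the helper
`transform_head' is hypothetical, it only rewrites the beginning of the
word and leaves the remaining letters untouched, and it ignores
priorities entirely.

```python
# Each rule is (search, replacement, restart); restart=True models `<'.
RULES = [
    ("PH", "F", True),   # PH< -> F
    ("FO", "O", False),  # FO  -> O
    ("F", "", False),    # F   -> _
]


def transform_head(word, rules):
    """Apply the first matching rule at the start of the word.  For a
    `<' rule, search once more starting with the replacement text."""
    for search, replacement, restart in rules:
        if word.startswith(search):
            word = replacement + word[len(search):]
            if not restart:
                return word
            for search2, replacement2, _ in rules:
                if word.startswith(search2):
                    return replacement2 + word[len(search2):]
            return word
    return word


print(transform_head("PHOXYZ", RULES))
print(transform_head("PHIXYZ", RULES))
```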
1 
   Priorities are added to a rule by putting a number between 0 and 9 at
the end of the search string, for example `ING6 -> N'.  The higher the
number, the higher the priority.

   Priorities are especially important for the previously mentioned
follow-up rules.  Follow-up rules are searched beginning from the last
character of the first search string.  This is a bit complicated, but I
hope this example will make it clearer:
1 
1      CHS      X
1      CH       G
1 
1      HAU--1   H
1 
1      SCH      SH
1 
   In this example `CHS' in the word `FUCHS' would be transformed to
`X'.  If we take the word `DURCHSCHNITT' then things look a bit
different.  Here `CH' belongs together and `SCH' belongs together, and
the two are spoken separately.  The algorithm, however, first finds the
string `CHS', which must not be transformed as it was in the previous
word `FUCHS'.  At this point the algorithm can find a follow-up rule.
It takes the last character of the first matching rule (`CHS'), which
is `S', and looks for the next match beginning from this character.  It
finds `SCH -> SH', which has the same priority (no priority means
standard priority, which is 5).  If the priority is the same or higher,
the follow-up rule is applied.  Let's take a look at the word
`SCHAUKEL'.  In this word `SCH' belongs together and must not be taken
apart.  After the algorithm has found `SCH -> SH', it searches for a
follow-up rule for `H' + `AUKEL'.  It finds `HAU--1 -> H' but does not
apply it, because its priority is lower than that of the first rule.
As you can see, this is a very powerful feature, but it can also easily
lead to mistakes.  If you really don't need this feature, you can turn
it off by putting the line:
1 
1      followup      0
1 
at the beginning of the phonetic table file.  As mentioned, no
follow-up rules are searched for rules containing a `-'.  Giving such
rules a priority is still not pointless, because they can themselves be
follow-up rules, and in that case the priority matters again.
Follow-up rules of follow-up rules are not searched, because this is
rarely needed.
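
   The priority comparison for follow-up rules boils down to a single
test.  The Python sketch below restates it; the function name
`followup_applies' is made up, and the sketch says nothing about how
the follow-up rule is found in the first place.

```python
DEFAULT_PRIORITY = 5  # a rule without a number has standard priority 5


def followup_applies(first_priority=DEFAULT_PRIORITY,
                     followup_priority=DEFAULT_PRIORITY):
    """A follow-up rule is applied only if its priority is the same as
    or higher than that of the rule that matched first."""
    return followup_priority >= first_priority


# DURCHSCHNITT: `CHS' (5) is followed by `SCH' (5).
print(followup_applies(5, 5))
# SCHAUKEL: `SCH' (5) is followed by `HAU--1' (1).
print(followup_applies(5, 1))
```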
1 
   The control character `^' says that the search string only matches
at the beginning of words, so that the rule `RH^ -> R' will only apply
to words like `RHESUS' but not `PERHAPS'.  You can append another `^'
to the search string.  In that case the algorithm treats the rest of
the word completely separately from the matched string at the
beginning.  This is useful for prefixes whose pronunciation does not
depend on the rest of the word and vice versa, like `OVER^^' in
English.

   `$' works the same way as `^' but applies only to words that end
with the search string.  `GN$ -> N' only matches words like `SIGN' but
not `SIGNUM'.  If you use `^' and `$' together, both of them must fit:
`ENOUGH^$ -> NF' will only match the word `ENOUGH' and nothing else.
1 
1    Of course you can combine all of the mentioned control characters but
1 they must occur in this order: `< - priority ^ $'.  All characters must
1 be written in CAPITAL letters.
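
   To check a search string against the required order of control
characters, a small parser helps.  The Python sketch below is my own
reading of the rules in this section, not Aspell's parser: the names
`SearchSpec' and `parse_search' are hypothetical, only the letters A to
Z are accepted, and I assume that the group of alternatives in
parentheses comes directly after the letters and before any control
character.

```python
import re
from typing import NamedTuple


class SearchSpec(NamedTuple):
    text: str            # the literal letters
    alternatives: str    # characters given in parentheses, may be empty
    restart: bool        # `<' present
    dashes: int          # number of `-'
    priority: int        # 0..9, standard priority is 5
    at_start: bool       # `^' or `^^' present
    separate_rest: bool  # `^^' present
    at_end: bool         # `$' present


# Control characters must occur in the order: < - priority ^ $
_SEARCH = re.compile(
    r"(?P<text>[A-Z]+)"
    r"(?:\((?P<alts>[A-Z]+)\))?"  # assumed position of the alternatives
    r"(?P<restart><)?"
    r"(?P<dashes>-*)"
    r"(?P<prio>[0-9])?"
    r"(?P<start>\^{1,2})?"
    r"(?P<end>\$)?"
)


def parse_search(search):
    m = _SEARCH.fullmatch(search)
    if m is None:
        raise ValueError(f"malformed search string: {search!r}")
    start = m["start"] or ""
    return SearchSpec(
        text=m["text"],
        alternatives=m["alts"] or "",
        restart=m["restart"] is not None,
        dashes=len(m["dashes"]),
        priority=int(m["prio"]) if m["prio"] else 5,
        at_start=bool(start),
        separate_rest=start == "^^",
        at_end=m["end"] is not None,
    )


print(parse_search("HAU--1"))
print(parse_search("ENOUGH^$"))
```

A string with the control characters in the wrong order, such as
`GN$^', is rejected with a `ValueError'.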
1 
   If absolutely no rule can be found (this might happen if you use
unusual characters for which you don't have any replacement rule), the
next character will simply be skipped and the search for replacement
rules will continue with the rest of the word.
1 
1    If you want double letters to be reduced to one you must set up a
1 rule like `LL- -> L'.  If double letters in the resulting phonetic word
1 should be allowed, you must place the line:
1 
1      collapse_result     0
1 
at the beginning of your transformation table file; otherwise set the
value to `1'.  The English rules, for example, strip all vowels from
words, so the word `GOGO' would be transformed to `K' and not to `KK'
(as desired) if `collapse_result' were set to `1'.  That's why the
English rules have `collapse_result' set to `0'.
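
   What collapsing does to a finished phonetic code can be shown in a
few lines of Python.  This is a sketch of the idea only; the function
name `collapse' is hypothetical and says nothing about where in the
algorithm Aspell performs this step.

```python
def collapse(code):
    """Reduce every run of identical adjacent characters to one."""
    out = []
    for ch in code:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


# GOGO gives K, K once the vowels are stripped.
print(collapse("KK"))
```

With collapsing switched on, `GOGO' therefore ends up as a single `K'.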
1 
1    By default, all accents are removed from a word before it is matched
1 to the soundslike rules.  If you do not want this then add the line
1 
1      remove_accents      0
1 
at the beginning of your file.  The exact definition of an accent is
language dependent and is controlled via the character set file.  If you
set `remove_accents' to `0' then you should also set `store-as' to
`lower' in the language data file (not the phonetic transformation
file); otherwise Aspell will have problems when both the accented and
the de-accented version of a word appear in the dictionary: it will
consider one of them as incorrectly spelled.
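
   To get a feeling for what accent removal does to your word list,
you can approximate it in Python.  Note the assumption: Aspell takes
its definition of an accent from the character set file, whereas this
sketch uses Unicode decomposition as a stand-in, so the results may
differ for some characters.  The function name `strip_accents' is made
up.

```python
import unicodedata


def strip_accents(word):
    """Approximate accent removal: decompose each character and drop
    the combining marks (a stand-in for Aspell's character set data)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("\u00e9t\u00e9"))  # the French word for summer
```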
1 
1 7.3.2 How do I start finally?
1 -----------------------------
1 
1 Before you start to write an array of transformation rules, you should
1 be aware that you have to do some work to make sure that things you do
1 will result in correct transformation rules.
1 
1 7.3.2.1 Things that come in handy
1 .................................
1 
First of all, you need a large word list of the language you want to
write phonetic rules for.  It should contain about as many words as
the dictionary of the spell checker.  If you don't have such a list,
you will probably find an Ispell dictionary at
`http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html' which will
help you.  You can then expand the affixes via `ispell -e' and pipe the
output through `tr " " "\n"' to put one word on each line.  After that
you may have to convert special characters like `é' from Ispell's
internal representation to latin1 encoding.  `sed "s/e'/é/g"', for
example, would replace all `e'' with `é'.
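
   If you prefer to do the splitting and the character conversion in
one place, the two steps can also be sketched in Python.  The function
name `expanded_to_wordlist' is hypothetical, and the replacement table
contains only the `e'' example from above; the notation your Ispell
dictionary really uses has to be looked up.

```python
def expanded_to_wordlist(lines, replacements):
    """Yield one word per item from whitespace-separated expansion
    output, converting special-character notations on the way."""
    for line in lines:
        for word in line.split():
            for old, new in replacements.items():
                word = word.replace(old, new)
            yield word


# Example input in the style of one expanded line per root word.
expanded = ["passe' passe's\n", "tree trees\n"]
for word in expanded_to_wordlist(expanded, {"e'": "\u00e9"}):
    print(word)
```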
1 
   Second, you should know how to use regular expressions and `grep'.
You should, for example, know that:

     grep '^[^aeiou]qu[io]' wordlist | less
1 
1 will show you all words that begin with any character but `a', `e',
1 `i', `o' or `u' and then continue with `qui' or `quo'.  This stuff is
1 important for example to find out if a phonetic replacement rule you
1 want to set up is valid for all words which match the expression you
1 want to replace.  Taking a look at the regex(7) man page is a good idea.
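
   The same check can be done from Python if you want to post-process
the matches.  This is a sketch with a made-up helper `matching_words'
and a tiny inline word list; like `grep' without `-i', it is case
sensitive.

```python
import re

# Same expression as in the grep example above.
PATTERN = re.compile(r"^[^aeiou]qu[io]")


def matching_words(words, pattern=PATTERN):
    """Return the words in which the pattern is found."""
    return [w for w in words if pattern.search(w)]


print(matching_words(["squid", "quit", "equip", "squad", "squonk"]))
```

Only `squid' and `squonk' are printed: `quit' has no character before
`qu', `equip' starts with a vowel and `squad' continues with `a'.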
1 
1 7.3.2.2 What the phonetic code should do
1 ........................................
1 
Normal text comparison works well as long as the typist misspells a
word by pressing a key they didn't really want to press.  In these
cases, mostly one character differs from the original word.

   In cases where the writer didn't know the correct spelling of the
word, the word may have several characters that differ from the
original word, but usually it would still sound like the original.
Someone might think that `tough' is spelled `taff'.  No spell checker
without phonetic code will come up with the idea that this might be
`tough', but a spell checker that knows that `taff' would be
pronounced like `tough' will make good suggestions to the user.
Another example could be `funetik' and `phonetic'.
1 
   From these examples you can see that the phonetic transformation
should not be too fussy or too precise.  Implementing a full phonetic
dictionary, as you can find in books, would not be very useful, because
many characters could then still differ between the misspelled and the
desired word.  What you should do when you write the phonetic
transformation table is to reduce the set of letters used to those that
are really necessary.

   Characters that sound similar should be reduced to one.  In the
English language, for example, `Z' sounds like `S', and that's why the
transformation rule `Z -> S' is present in the replacement table.  `PH'
is spoken like `F', and so we have a `PH -> F' rule.
1 
   If you take a closer look you will even see that vowels sound very
similar in the English language: `contradiction', `cuntradiction',
`cantradiction' or `centradiction' in fact sound nearly the same, don't
they?  Therefore the English phonetic replacement rules not only reduce
all vowels to one but even remove them all (removing is done by just
setting up no rule for those letters).  The phonetic code of
`contradiction' is `KNTRTKXN', and if you try to read this
letter-monster aloud you will hear that it still sounds a bit like
`contradiction'.  You can also see that `D' is transformed to `T'
because they sound nearly the same.
1 
   If you think you have found a regularity you should _always_ take
your word list and `grep' for the regular expression you want to make a
transformation rule for.  For example, suppose you think that all
English words ending in `ough' sound like `AF' at the end because you
are thinking of `enough' and `tough'.  If you then `grep' for the
corresponding regular expression with `grep -i 'ough$' wordlist', you
will see that the rule you wanted to set up is not correct, because it
doesn't fit words like `although' or `bough'.  So you have to define
your rule more precisely, or set up exceptions if the number of words
that deviate from the desired rule is not too big.
1 
1    Don't forget about follow-up rules which can help in many cases but
1 which also can lead to confusion and unwanted side effects.  It's also
1 important to write exceptions in front of the more general rules (`GH'
1 before `G' etc.).
1 
   If you think you have set up a number of rules that may produce
good results, try them out!  If you run Aspell as `aspell
--lang=YOUR_LANGUAGE pipe' you get a prompt at which you can type in
words.  If you just type words, Aspell checks them and, if they are
misspelled, makes suggestions.  If you type in `$$Sw WORD' you will see
the phonetic transformation and can check whether your work does what
you want.
1 
   Another good way to check that changes you make to your rules don't
have any bad side effects is to create another list from your word list
which contains not only each word but also its phonetic version on the
same line.  If you do this once before the change and once after the
change you can make a diff (see `man diff') to see what _really_
changed.  To do this use the command `aspell --lang=YOUR_LANGUAGE
soundslike'.  In this mode Aspell will output the original word and
then its soundslike, separated by a tab character, for each word you
give it.  If you are interested in seeing how the algorithm works you
can download a set of useful programs from
`http://members.xoom.com/maccy/spell/phonet-utils.tar.gz'.  This
includes a program that produces a list as mentioned above and another
program which illustrates how the algorithm works.  It uses the same
transformation table as Aspell and so helps a lot during the process
of creating a phonetic transformation table for Aspell.
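
   Instead of reading a raw diff, the two soundslike lists can also be
compared word by word.  The Python sketch below assumes the
tab-separated format described above; the helper names
`parse_soundslike' and `changed_codes' are made up, and the sample
codes are only illustrative, not real Aspell output.

```python
def parse_soundslike(lines):
    """Read `word<TAB>soundslike' lines into a dictionary."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, code = line.partition("\t")
        table[word] = code
    return table


def changed_codes(before, after):
    """Map each word whose code changed to its (old, new) codes."""
    return {
        word: (before[word], after[word])
        for word in before
        if word in after and before[word] != after[word]
    }


before = parse_soundslike(["hello\tHL\n", "tough\tTF\n"])
after = parse_soundslike(["hello\tH*L*\n", "tough\tTF\n"])
print(changed_codes(before, after))
```

In practice you would pass two open files, one saved before and one
after the change to your rules, instead of the inline lists.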
1 
   During your work you should write down your basic ideas so that other
people are able to understand what you did (and so that you still
remember it after a few weeks).  The English table has extensive
documentation appended as an example.

   Now you can start experimenting with all the things you just read and
perhaps set up a nice phonetic transformation table for your language
to help Aspell come up with the best correction suggestions ever seen
for your language, too.  Take a look at the Aspell homepage to see if
there is already a transformation table for your language.  If there is
one, you might also take a look at it to see if it could be improved.
1 
1    If you think that this section helped you or if you think that this
1 is just a waste of time you can send any feedback to
1 <bjoern.jacke@gmx.de>.
1