aspell: Notes on 8-bit Characters

A.2 Notes on 8-bit Characters
=============================

There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity.  While many parts of my code can fairly easily be converted
to some sort of wide character, as the code is clean, other parts
cannot be.

   One of the reasons is that in many, many places I use a direct
lookup to find out various information about characters.  With 8-bit
characters this is very feasible because there are only 256 of them.
With 16-bit wide characters such a table would waste a LOT of space;
with 32-bit characters it is just plain impossible.  Converting the
lookup tables to another form is certainly possible, but it degrades
performance significantly.

   Furthermore, some of my algorithms rely on words consisting of only
a small number of distinct characters (often around 30 when case and
accents are not considered).  When a character can be any Unicode
character, this number grows to several thousand.  For these
algorithms to still be usable, some sort of limit would need to be
placed on the characters a word can contain.  If I impose that limit,
I might as well use some sort of 8-bit character set, which
automatically places the limit on what the characters can be.

   There is also the issue of how I should store the word lists in
memory.  As a string of 32-bit wide characters?  That uses four times
more memory than 8-bit characters would, and for languages that can
fit within an 8-bit character set that is, in my view, a gross waste
of memory.  So maybe I should store them in some variable-width format
such as UTF-8.  Unfortunately, way, way too many of the algorithms
simply will not work with variable-width characters without
significant modification, which would very likely degrade performance.
So the solution is to work with the characters as 32-bit wide
characters and then convert them to a shorter representation when
storing them in the lookup tables.  Now that can lead to an
inefficiency.  I could also use 16-bit wide characters; however, those
may not be enough to hold all future versions of Unicode and therefore
have the same problems.

   In response to the space wasted by storing word lists in some sort
of wide format, someone asked:

     Since hard drives are cheaper and cheaper, you could store a
     dictionary in a usable (uncompressed) form and use it directly
     with memory mapping.  Then the efficiency would directly depend on
     the disk caching method, and only the used part of the
     dictionaries would really be loaded into memory.  You would no
     longer have to load plain dictionaries into main memory; you'd
     just want to compute some indexes (or something like that) after
     mapping.

   However, the fact of the matter is that most of the dictionary will
be read into memory anyway if it is available.  If it is not
available, there will be a good deal of disk swapping.  Making
characters 32 bits wide increases the chance of disk swaps.  So the
bottom line is that it is more efficient to convert characters from
something like UTF-8 into some sort of 8-bit character.  I could also
use some sort of on-disk lookup table such as the Berkeley Database;
however, this will *definitely* degrade performance.

   The bottom line is that keeping Aspell 8-bit internally is a very
well-thought-out decision that is not likely to change any time soon.
Feel free to challenge me on it, but don't expect me to change my mind
unless you can bring up some point that I have not thought of before,
and quite possibly a patch that cleanly converts Aspell to Unicode
internally without a serious performance loss OR a serious increase in
memory usage.