readme

NOTES ABOUT THE DICTIONARY PROGRAMMES AND FILES

There are currently three (actually 5) dictionary versions, the original DOS version, the original Macintosh version, both in C, a C++ version for the Macintosh by myself which was superseded by an improved C++ version by James Jennings, and a Windows version by Mel Kanner. While the Macintosh and Windows C++ versions may eventually be posted, for the moment I have decided to post the original DOS C version as possibly the most suitable base for porting to Unix-type systems.
The main problem is perhaps the understanding of the nature of the data files. There are 4 files, plus an optional file prim.set if the complex-generating algorithm is wanted. This is essentially the prims file from the teach programme with some extra international and scientific prims added. The main data files are dict.log and dict.eng which present the dictionary words in lower-case alphabetical order in CR delimited records containing TAB delimited fields. However, the display is capitalized where appropriate, and the input should accept capitals but convert them to lower-case. Because of the CR delimiter, platforms that modify text delimiters should read these files as binary.
There are two index files, index.eng and index.log. These are B-tree structured files in 512 byte blocks. Originally the root sector was entered in the header file, but this had to be changed with each update. The new programmes calculate the root sector from the file length, since the root sector is the last sector in the file. The other important point is that the letters of the search word are written in 16-bit words in the Macintosh most-significant byte first manner. Each 16-bit word consists of two bytes, the first containing the test byte, and the second an ordinal number which indicates where the next search should occur. Initially separate DOS and Mac index files were produced, but current programmes use the Mac produced files and are endian-tolerant.
The search system was one I developed in the early 70s for a Selective Dissemination of Information programme, which speeded up matching by over 100 fold. While a linear search would probably have been quite effective here with sectors of 512 bytes, I adopted the same syetem for the dictionary lookup.
Briefly, comparison is done one letter at a time. If the test letter is lower in the alphabet than the first trial letter the search moves to the next lower node in the hierarchy. If still unsuccessful when it reaches a leaf node it reports failure. Else if higher it examines the pointer byte and follows it to the beginning of the next higher letter. If a match is found, it moves to the next word and repeats until it finds a
blank byte If the test word is still incomplete, search continues at the pointer of the blank byte. If it is finished, the following 4 bytes give the ordinal byte of the dict.x file containing the definition in question and of the next node. At any step if a pointer is needed and is blank, the word is not in the node and the next node is examined. If
failure occurs at the beginning of the next node, the word is not in the dictionary and failure is reported.
I did this summary from memory as a crutch to understanding the code and I may have overlooked some detail. There are more detailed comments in the code itself.