Document Reader File Format

Historical note:

The FRED document reader was written by Simon Cooke, and first appeared in FRED disc magazine, issue 17. It was the first of its kind to display compressed text faster than the equivalent BASIC uncompressed text reader.

Since then, two more versions have appeared. One was a minor bodge to provide printing facilities for the text reader (when version 1 was written, I didn't have a printer, or printer interface, so I had to give it my best guess on how it all worked. I was wrong). The other brought a slightly updated token set, cleaned-up file formats, a "Return to help page" button in place of the "Article menu doesn't exist" message which was put in with the printer fix, and a tidier compressor program to go with it.

Version 1.2 will be the last one to use this format explicitly; work is in progress on a new, multi-format text reader program (the Entropy Chimera project) which will interpret not only Document Reader files, but also COMET assembler source files, SAM Small C source, Tasword and Outwrite files, MSDOS text files, Hypertext files, SAM BASIC programs, etc etc. It's all just a matter of time.

Document reader data types

There are three types of file associated with Document Reader - DCP files, MAG files and JYN files. However, the JYN file type only applies to version 1.2 of the document reader program.

*.DCP files contain a magic string to identify themselves to the SAM system, as well as details of the number of pages contained in the MAG archive, and the page and offset in memory at which they can be found. DCP files are always loaded at 16384. NB: if the magic string is not found when the document reader program is run, it will abort with an error message.

*.MAG files contain (for version 1.0) article menu data, giving page numbers at which certain topics held in the document can be found, as well as (for all versions) the actual compressed data stream.
*.JYN (joined) files are in actuality merely version 1.2 DCP and MAG files concatenated into a single file, in order to reduce the directory space used. The document reader program cannot read these files by itself; a short BASIC program stub is needed to arrange the data in memory in a suitable format before calling the reader.

Version 1.0 format: *.DCP file (load to 16384)

Offset  Function                                            Length
0       Magic String "1991Cookie"                           (12)
12      Number of pages present in document                 (1)
15      Page data: page number at which document page
        starts (1) followed by an offset in the range
        0-16383 (2)

The page data repeats until all pages in the document have been accounted for.

(NB: The magic string includes the space character, &20 hex, and the symbol with code &7F hex. Neither survives in the string as printed above.)

The page data forms a 3-byte address within the SAM's memory for the start of each page of text in the document archive.

Version 1.0 format: *.MAG file (load to 38233)

Offset  Function                                            Length
0       Address of article page table                       (2)
2       Number of articles in article table                 (1)
3       Maximum article name width (in characters)          (1)
4       Start of article name data

Article name data can be any length, and each article topic is delimited by the last character of the name having bit 7 set. The actual document text starts immediately after the article text in the file, and is pointed to by the DCP file for speed. The article width field is used in rendering menus.

The article page table mentioned is a list determining which article topic corresponds to which page of the document.

Note that the first page in a MAG archive is the help page displayed when the document reader is first loaded, and that it corresponds to page number zero.

Version 1.1 file format

The version 1.1 file format is in fact identical to the version 1.0 format (so that it would be possible to load in old files for printing purposes).
The only difference is that where the v1.0 file would nearly always have an associated article menu, v1.1 files usually have the article menu set to zero entries. Thus, the first 4 bytes of a version 1.1 MAG file will nearly always be:

Address  Field                            Value
38233    Address of article page table    38237
38235    Number of articles found         0
38236    Max article descriptor width     0

There will be no article name entries, or page table entries; thus the address pointer at 38233 does not point to a real table, and the magazine page data itself will start at the address it holds (38237).

All a reader has to do in order to read both v1.0 and v1.1 files reliably is to determine the number of articles in the article menu, and to ignore the menu if the number of articles found field is set to zero. This criterion is not met by the actual v1.0 Document Reader program, which is why it often crashes when asked to view files made for v1.1 with no article menu.

Version 1.2 file format: *.DCP files (loads to 16384)

Offset  Function                                            Length
0       Magic String "1992ENTROPY"                          (14)
14      Number of pages present in document                 (1)
15      Page data: page number at which document page
        starts (1) followed by an offset in the range
        0-16383 (2)

The page data repeats until all pages in the document have been accounted for.

Version 1.2 files are differentiated from their v1.0 and v1.1 equivalents by the different magic string. This is necessary because all remnants of the article menu data have been removed from the v1.2 MAG file, and because v1.2 uses a different, slightly more optimised compression token table than versions 1.0 and 1.1 (which both use the same one).

Version 1.2 file format: *.MAG files (loads to 38300)

Whereas the earlier versions held article data (or remnants thereof) here, all such data has been removed in version 1.2 so that the files take up less disc space. Thus, the only data to be found in the MAG file is the compressed document text itself, as pointed to by the DCP file.
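To make the layouts above a little more concrete, here is a minimal Python sketch (not taken from the Document Reader itself) that reads the article menu header of a v1.0/v1.1 MAG file, honouring the zero-articles rule, and the page table of a v1.2 DCP file. The function names are made up for illustration, and the 2-byte fields are assumed to be stored low byte first, as is usual on the Z80; the text above does not state the byte order.

```python
from typing import List, Tuple


def read_article_menu(mag: bytes) -> Tuple[int, int, int, List[str]]:
    """Read the 4-byte header and article names of a v1.0/v1.1 MAG file."""
    # Assumption: 2-byte values are little-endian (not stated in the text).
    table_address = mag[0] | (mag[1] << 8)
    num_articles = mag[2]
    max_width = mag[3]
    names: List[str] = []
    pos = 4  # article name data starts at offset 4
    # With zero articles (typical for v1.1) this loop does not run,
    # so the menu is simply ignored.
    for _ in range(num_articles):
        chars = []
        while True:
            byte = mag[pos]
            pos += 1
            chars.append(chr(byte & 0x7F))
            if byte & 0x80:  # bit 7 marks the last character of a name
                break
        names.append("".join(chars))
    return table_address, num_articles, max_width, names


def read_page_table_v12(dcp: bytes) -> List[Tuple[int, int]]:
    """Return (memory page, offset) for each document page of a v1.2 DCP file."""
    num_pages = dcp[14]  # page count follows the 14-byte magic string
    if len(dcp) < 15 + 3 * num_pages:
        raise ValueError("DCP data is shorter than its page table")
    table = []
    for i in range(num_pages):
        base = 15 + 3 * i  # 3 bytes per entry, starting at offset 15
        offset = dcp[base + 1] | (dcp[base + 2] << 8)  # assumed little-endian
        table.append((dcp[base], offset))
    return table
```

Fed the typical v1.1 header bytes (5D 95 00 00 in hex), read_article_menu reports the address 38237 and an empty menu, which is the case the v1.0 reader fails to handle.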
Version 1.2 file format: *.JYN files

As mentioned above, JYN files are composed of a DCP file, with a MAG file tacked on immediately afterwards. To extract the DCP and MAG files, it is necessary to load the file at, say, 49152, calculate the DCP length (using the number of pages value stored 14 bytes into the file), poke the DCP section into memory at 16384, and then move the rest of the file down in memory to 38300. The length of the DCP file is calculated as: (NUMPAGES*3)+18. As the file is just a DCP file joined to a MAG file, it can be validity checked by looking for the string "1992ENTROPY" at the start of the file.

Text compression method

Document text in a MAG file is compressed using a combination of run-length compression and tokenised strings. Each page is 1344 bytes long when uncompressed (64 characters per line, 21 lines per page). Data bytes below 128 are passed directly to the output routine, bytes with the value 128 are passed on to the run-length subroutine, and the rest (129-255) are passed on to the detokenisation routine.

Arguably, better compression could be provided by allowing tokens to also have the values 0-31 - increasing the total number of tokens available from 127 to 159 - but as this is not done by the Document Reader, and there are no further incarnations of the original reader program planned, this is a moot point.

Run-Length Compression

Spaces are run-length compressed. The compressor works by looking for a string of 3 or more spaces in the document text. (In Outwrite and Tasword format text, 64 spaces are used for a blank line, so this can lead to considerable savings.) Whenever a code of &80 hex is found in the compressed text, the next byte is taken and used as a count of spaces to print to the screen. Occurrences of two consecutive spaces are compressed by the tokenising routine.
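As a rough illustration of two procedures described above, the Python sketch below shows how a compressor might encode runs of spaces, and how a JYN file might be split back into its DCP and MAG parts. It is not the original compressor or BASIC stub. The function names are hypothetical, and the handling of runs longer than 255 spaces (splitting them) is an assumption, since the text only says the count is held in the byte after &80.

```python
from typing import Tuple

RUN_MARKER = 0x80  # "compressed space" code
SPACE = 0x20


def compress_space_runs(text: bytes) -> bytes:
    """Replace runs of 3 or more spaces with &80 followed by a count byte.

    Input is assumed to be plain 7-bit text (all bytes below 128).
    """
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != SPACE:
            out.append(text[i])
            i += 1
            continue
        # Measure the run of spaces starting at position i.
        run = 1
        while i + run < len(text) and text[i + run] == SPACE:
            run += 1
        i += run
        # Assumption: the count is one byte, so long runs are split at 255.
        while run >= 3:
            chunk = min(run, 255)
            out += bytes([RUN_MARKER, chunk])
            run -= chunk
        # One or two leftover spaces stay literal here; pairs of spaces
        # are left for the tokenising stage.
        out += bytes([SPACE]) * run
    return bytes(out)


def split_jyn(jyn: bytes) -> Tuple[bytes, bytes]:
    """Split a JYN file into (DCP part, MAG part)."""
    num_pages = jyn[14]              # page count, 14 bytes into the file
    dcp_length = num_pages * 3 + 18  # (NUMPAGES*3)+18, as given in the text
    return jyn[:dcp_length], jyn[dcp_length:]
```

On the SAM itself the two parts would then be placed at 16384 and 38300 respectively; the sketch only performs the split.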
For example, if the data in the text stream was (in hexadecimal):

    21 41 45 80 10 82 83 41

This would print out as "!AE", then 16 spaces, then the text of two tokens, then "A". The &80 hex is the compressed space character key, and the byte after it - &10 - indicates that 16 spaces are to be printed out. The &82 and &83 bytes are compressed tokens, which are left unexpanded until we have gone over the operation of the detokeniser.

Tokenising Compression

Tokenising compression works by replacing strings of characters with a reference to a dictionary. This dictionary can then be used to recreate the text. It's similar to short-hand, except that for efficiency we do not use just whole words, but word fragments as well.

To compress the text, the tokeniser looks through the uncompressed data to see if it matches one of the "words" or tokens in the dictionary. (This is similar to what BASIC does when the editor puts a line into the program. BASIC does it for two reasons: to save space, and to make it much quicker to actually run a program. We're only doing it for space reasons.) If a token is found, 129 is added to the token's dictionary reference number, and the result takes the place of the equivalent text in the compressed data.

Thus all that is necessary to decompress the tokens is to have a copy of the appropriate dictionary, and then to use the token data to access that table. In the tables below, the dictionary entries are referenced by the normalised (ie with 129 subtracted from them) token numbers; this is to make it easier to write a fast decompressor routine. Each token has bit 7 set on the last character, to mark that the end of the token has been reached.

In the above example text string, the compressed data was:

    21 41 45 80 10 82 83 41

The token bytes &82 and &83 normalise to dictionary entries 01 and 02. For version 1.0 and version 1.1, we can now write the output as:

    !AE<16 spaces>screensscreenA

And for version 1.2, it is now:

    !AE<16 spaces>ouldouseA
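The three-way split of byte values and the bit 7 token terminator described above can be sketched in Python as follows. This is an illustrative reconstruction, not the reader's actual Z80 routine. The function names are hypothetical, and storing the dictionary as tokens packed back to back is an assumption.

```python
from typing import List, Sequence


def split_dictionary(raw: bytes) -> List[str]:
    """Split packed token data into strings; bit 7 ends each token."""
    tokens: List[str] = []
    current: List[str] = []
    for byte in raw:
        current.append(chr(byte & 0x7F))  # strip the end-of-token flag
        if byte & 0x80:
            tokens.append("".join(current))
            current = []
    return tokens


def decompress(stream: bytes, dictionary: Sequence[str]) -> str:
    """Expand a compressed text stream using the given token dictionary."""
    out: List[str] = []
    i = 0
    while i < len(stream):
        byte = stream[i]
        i += 1
        if byte < 128:
            out.append(chr(byte))              # plain character
        elif byte == 128:
            out.append(" " * stream[i])        # &80: next byte counts spaces
            i += 1
        else:
            out.append(dictionary[byte - 129])  # normalised token number
    return "".join(out)


# First three entries of the version 1.2 dictionary, as listed below.
V12_SAMPLE = ["you'll", "ould", "ouse"]
example = decompress(bytes.fromhex("2141458010828341"), V12_SAMPLE)
```

Here example comes out as "!AE", 16 spaces, "ouldouse" and "A", matching the version 1.2 result given above.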
Version 1.0 and 1.1 Token Dictionary

Row labels and column headers are in hex; a token's normalised number is its row label plus its column header. Apart from the final row of each table, rows showing fewer than four entries have an entry missing from this listing, so the column positions of the remaining entries in those rows are uncertain.

      00       01       02       03
00    address  screens  screen   issue
04    memory   screen   don't    SAMCO
08    SAMCo    Coupe    FRED     bytes
0C    data     it's     from     SAM
10    199      code     Code     Data
14    ould     out      had      Coupe
18    SAMCO    SAMCo    The      the
1C    tion     at       empt     199
20    comp     Comp     cons     Cons
24    ...      you      'll      ere
28    You      it       .)       n't
2C    ity      At       199      ing
30    een      and      And      ght
34    mag      pro      oum      ove
38    age      -        'm       's
3C    You      I        ant      ial
40    (        er       ,
44    .        !        ?
48    A        or       ss       ee
4C    ch       sh       un       ly
50    th       Th       To       to
54    ow       qu       Qu       Be
58    be       Up       up       Re
5C    re       en       En       us
60    Us       ed       oo       ."
64    !"       ?"       ;        :
68    )        pe       Pe       ir
6C    Ir       my       pp       I
70    dd       ea       ff       ss
74    it       rr       at       At
78    e        y        ic

Version 1.2 Document Reader Token Dictionary

      00       01       02       03
00    you'll   ould     ouse     cons
04    comp     I'll     entr     ight
08    ent      ing      out
0C    ang      cei      ial      ant
10    mag      pro      age      I'm
14    'll      had      n't      ean
18    eem      ove      I'd      een
1C    all      oup      SAM      the
20    The      dis      key      ave
24    opy      oil      air      eer
28    ure      ion      vis      ban
2C    mon      hor      ard      ish
30    nal      .        ,
34    's       om       sh       ch
38    ew       ng       ic       tr
3C    cr       it       ff       ss
40    ee       oo       ou       ie
44    ei       'm       nt       fl
48    ph       qu       be       up
4C    re       en       us       ed
50    to       ow       rr       ea
54    ar       pe       mu       th
58    Th       ll       ff       In
5C    in       pp       my       I
60    or       on       et       sc
64    ut       ex       ce       ck
68    at       At       A        a
6C    It       is       Is       su
70    Co       er       de       di
74    bi       ey       sp       go
78    aw       ay       il       op
7C    an       oc       id