Implementation of DCA Compression Method
- the 2m+1 representation introduced in [7]: we traverse the tree in depth-first order and write a 1 bit each time we find an internal node and a 0 bit each time we find an external node. A binary tree of m internal nodes therefore needs exactly 2m+1 bits, which is almost identical to the 2+2 scheme.
- Hamming coding: first we walk the trie in depth-first order and count how many times we go left, right and up. Then we compute a Hamming coding for these directions and encode the trie by a depth-first walk. Experiments with this coding showed that it is better only on very unbalanced trees; in all other cases it performs worse than 2+2.
- Text generating the antidictionary: an interesting idea for coding an antidictionary is to supply some short text that generates the target antidictionary. This representation could be very efficient; the option is discussed more thoroughly below.

In this DCA implementation the 2+2 representation was used, with the following meaning:

  code  meaning
  00    no subtree
  01    only the right subtree
  10    only the left subtree
  11    both subtrees

4.5.1 Text generating the antidictionary

While playing with antidictionaries and their automata, a new possibility of representing antidictionaries was thought up. It seems feasible to code the antidictionary even more efficiently: such an antidictionary would be generated from a special short text, and this antidictionary would then be used for text decompression.
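Returning to the 2+2 representation chosen above, a minimal C sketch of the depth-first encoding; the two-pointer node type is a hypothetical stand-in for the implementation's real structures, and bits are written as characters only for clarity:

#include <stdio.h>

/* Hypothetical trie node; the real implementation uses its own types. */
struct node {
    struct node *left, *right;
};

/* Emit two bits per node in depth-first order: first bit = left subtree
 * present, second bit = right subtree present (codes 00, 01, 10, 11). */
static void encode22(const struct node *n, FILE *out)
{
    int code = ((n->left != NULL) << 1) | (n->right != NULL);
    fputc('0' + ((code >> 1) & 1), out);
    fputc('0' + (code & 1), out);
    if (n->left)  encode22(n->left, out);
    if (n->right) encode22(n->right, out);
}

With this scheme each node costs exactly two bits; a real encoder would pack the bits into bytes instead of writing characters.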
      if str = substr(tofind, 0, |str|) then break
      else if str < substr(tofind, 0, |str|) then hi2 := pos − 1
      else lo2 := pos + 1
    else
      str := substr(bittext, sa[pos], |sa| − sa[pos])
      if str < substr(tofind, 0, |str|) then hi2 := pos − 1
      else lo2 := pos + 1
  if hi2 < lo2 then return false
  return true

Add_Antiword(root, lcp, aw, saPos)
  p := root
  for i := 0 to |aw| − 1 do
    if p.child[aw[i]] not defined then
      q := new state; asc(q) := p; level(q) := i + 1
      p.child[aw[i]] := q
    p := p.child[aw[i]]
  antiword(p) := true
  j := saPos; k := saPos; len := |aw| − 1
  while lcp[j] ≥ len do j := j − 1
  while k < n and lcp[k + 1] ≥ len do k := k + 1
  visited(asc(p)) := k − j + 1

4.8 Run Length Encoding

As discussed before, the classical DCA approach has a problem with compressing strings of the type 0^n 1, although it compresses strings of the type 1^n 0 well. We can improve the handling of these simple repetitions by including RLE compression before the input of the DCA algorithm. Run-length encoding is a very simple form of lossless data compression in which runs of data, i.e. sequences of the same symbol, are stored as a single data value and count rather than as the original run. We compress all sequences of length ≥ 3, with the count encoded using Fibonacci code [2]. Sequences shorter than 3 are kept untouched.

Example 4.2: Compress the text abbaaabbbbbbbbbbbbbbba using RLE.
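The run counts here, like the exception distances of the dynamic scheme, are stored in Fibonacci code [2]. A minimal C sketch of such an encoder, assuming the common variant in which every codeword ends with two consecutive 1 bits (the implementation's actual bit order and packing may differ):

#include <stdio.h>

/* Encode n >= 1 as a Fibonacci codeword into buf as '0'/'1' characters,
 * least significant Fibonacci term first; the code ends with "11".
 * Returns the number of bits written. */
static int fib_encode(unsigned n, char *buf)
{
    unsigned fib[48] = {1, 2};   /* covers all 32-bit values */
    int k = 1, i;

    while (fib[k] < n) {                 /* grow table to cover n */
        fib[k + 1] = fib[k] + fib[k - 1];
        k++;
    }
    if (fib[k] > n) k--;                 /* largest Fibonacci <= n */
    for (i = 0; i <= k; i++) buf[i] = '0';
    for (i = k; i >= 0; i--)             /* greedy Zeckendorf decomposition */
        if (fib[i] <= n) { buf[i] = '1'; n -= fib[i]; }
    buf[k + 1] = '1';                    /* terminating 1 -> "...11" */
    buf[k + 2] = '\0';
    return k + 2;
}

int main(void)
{
    char buf[64];
    fib_encode(12, buf);          /* the run count 12 from Example 4.2 */
    printf("%s\n", buf);          /* prints 101011, since 12 = 1+3+8 */
    return 0;
}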
… 113.4  83.1  101.4  0.0  101.0  41.5
ptt5:     74.9  74.7  74.1  74.8  77.8  83.2  87.9  77.7  78.1  94.8
sum:      66.7  68.7  68.1  62.2  61.7  57.6  51.6  76.9  64.2  50.0
xargs.1: 118.8 118.9 117.8 119.5 110.8  64.7  93.9   0.0  93.0  49.8

Table 5.2 (fragment; the column headers were lost in extraction): Parallel antidictionaries using the dynamic compression scheme.

[Figure 5.29: Memory requirements in relation to block size, compressing plrabn12.txt (suffix trie, suffix array, dynamic DCA, almost antiwords)]

[Figure 5.30: Time requirements in relation to block size, compressing plrabn12.txt (suffix trie, suffix array, dynamic DCA, almost antiwords)]

…to better compression ratios. However, in general it cannot be recommended.

5.10 Dividing Input Text into Smaller Blocks

The huge memory requirements of the DCA method were already presented; that is why we cannot build an antidictionary over whole large files, but should rather divide them into smaller blocks and compress them separately. Tests were made on the plrabn12.txt file from the Canterbury Corpus with maxdepth = 40, splitting the file into blocks of a particular size and then compressing all these parts separately.
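A hedged sketch of such a block-splitting driver; compress_block() is a hypothetical stand-in for compressing one block with its own antidictionary, and the simple length-prefixed container is illustrative only, not the program's real file format:

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 65536           /* one of the tested block sizes */

/* Hypothetical helper: compresses one block with its own antidictionary,
 * returning the compressed length (assumed to fit into 2x the input). */
extern size_t compress_block(const unsigned char *in, size_t n,
                             unsigned char *out);

int compress_file(FILE *in, FILE *out)
{
    unsigned char *buf = malloc(BLOCK_SIZE);
    unsigned char *cbuf = malloc(2 * BLOCK_SIZE); /* assumed worst case */
    size_t n;

    if (!buf || !cbuf) { free(buf); free(cbuf); return -1; }
    while ((n = fread(buf, 1, BLOCK_SIZE, in)) > 0) {
        size_t cn = compress_block(buf, n, cbuf);
        fwrite(&n, sizeof n, 1, out);   /* per-block header: lengths */
        fwrite(&cn, sizeof cn, 1, out);
        fwrite(cbuf, 1, cn, out);
    }
    free(buf);
    free(cbuf);
    return 0;
}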
Implementation

4.1 Used Platform

The main target of this thesis was to produce a working DCA implementation, to try different parameters of the method, and to choose the most appropriate ones. The program was intended to work on the command line and to be as efficient as possible. Many tests needed to run as a batch, and the program was to be licensed under a public licence; that is why the GNU/Linux platform was selected for development. Small memory requirements, low CPU usage and other optimization factors were also demanded. As C/C++ is a native language for development on most platforms, the C language was preferred, using the gcc compiler (g++ respectively), which also suits best thanks to its built-in optimizations. Using C we keep the management of dynamically allocated memory in our own hands; we do not need to rely on a garbage collector. The program was developed for 32-bit platforms; it would need further modifications to work in a 64-bit environment. For good portability the GNU tools Automake and Autoconf were used, which automatically generate build scripts for the target platform. The program was tested only on the i586 platform, which uses little endian; for big-endian platforms modifications would be needed. Nevertheless, this program serves rather for research and testing purposes; despite the efforts, it is not practically usable as a production tool because of its high system resource requirements. As the code is published under the GNU GPL, ev…
In the input text there are two runs of length ≥ 3: aaa and bbbbbbbbbbbbbbb. We compress the first as aaa0 (the zero means that no further a symbols follow). We compress the second sequence in a similar way, resulting in bbb12. Our compressed text will be abbaaa0bbb12a.

4.9 Almost Antiwords

In order to improve the compression ratio, the almost antiwords improvement discussed in Section 3.7 was also tested, implementing the one-pass heuristics. It is based on the suffix trie implementation, but a dedicated algorithm for building an antidictionary with almost antiwords support could probably also be developed. The results were a little disappointing at the beginning; later it showed that the modified algorithm performs very well on some types of texts while being worse than the classical approach on others. Another significant issue is that the gain of almost antiwords depends on an unknown factor, the exception length, which can only be roughly estimated. Fine-tuning this implementation would probably lead to better results. For coding the exceptions, Fibonacci code [2] was used again.

4.10 Parallel Antidictionaries

Unlike classical dictionary compression algorithms working with symbols, DCA works with a binary stream, where each considered symbol can take only the values 0 and 1. This is very limiting, because we lose the notion of symbol boundaries in the text; most English text documents use only 7-bit symbols, and we could just forget about the 7th bit
[Figure 5.20: Compression ratio obtained compressing grammar.lsp from the Canterbury Corpus (static DCA, static DCA + RLE, dynamic DCA, dynamic DCA + RLE, almost antiwords; compression ratio in % vs. maxdepth)]

[Figure 5.21: Compression ratio obtained compressing sum from the Canterbury Corpus]

[Figure 5.22: Memory requirements using almost antiwords (suffix trie, suffix array, dynamic DCA, almost antiwords; used memory in MB vs. maxdepth)]

[Figure 5.23: Time requirements using almost antiwords (suffix trie, suffix array, dynamic DCA, almost antiwords)]

[Figure 5.24: Compression ratio obtained compressing paper1 (static DCA, dynamic DCA, almost antiwords)]
…enlarged in Figure 5.8. You can notice a similarity between Figure 5.6 and Figure 5.7, which is caused by the dependency of the antiword count on the node count. This relation is more obvious from Figure 5.9.

5.4 Data Compression

Now we are going to look at the more interesting part: the compression and decompression performance of the static and dynamic compression schemes in relation to maxdepth. These results are again measured on paper1 from the Canterbury Corpus. Figure 5.10 shows some expected behaviour of the implemented methods. Memory requirements are worst for the static compression scheme using a suffix trie, while the dynamic compression scheme requires only about half of the memory. The performance of the static compression scheme with a suffix array is definitely superior to both of the others, as its memory requirements almost do not grow with maxdepth. Also, its initial requirements below maxdepth 25 are not very important, since below this value we do not get usable compression ratios. This is what the suffix array is really designed for. However, compression time is no longer so good for the suffix array in Figure 5.11; still, it outperforms the suffix trie for maxdepth > 25. The dynamic compression scheme is much faster here, as it does not need to read the text twice, do simple pruning and self-compression, compute gains or count visits, or even construct an antidictionary beforehand. It is much faster even for large maxdepth values.
…tors found yet. Memory requirements are smaller, and the method is quite fast, as it does not need to do a breadth-first search for building the antidictionary, or any other tree traversal for computing gains. It is also very simple to implement if we do not bother with the suffix trie's memory greediness, and we do not need to read the text twice as in the static scheme. There are some disadvantages too: decompression will be slower, which makes the method symmetric, because the decompression process must do almost the same work as the compression. As we build the antidictionary dynamically, parallel compressors/decompressors are not possible

[Table 3.4: Dynamic compression example — for each prefix of the input text 01110101 the table lists the input read, the output produced, and the current antidictionary (all antifactors); E marks an exception, C a compressed (erased) bit. The individual cells were lost in extraction.]

…to use; also, we lose one of DCA's strong properties, pattern matching in compressed text. Efficient representation of the exceptions is a problem, but it can be solved using some universal coding or…
[Figure 5.11: Time requirements for compressing paper1 (suffix trie, suffix array, dynamic DCA; time in s vs. maxdepth)]

[Figure 5.12: Compression ratio obtained compressing paper1 (static DCA, dynamic DCA)]

[Figure 5.13: Compressed file structure created using the static scheme compressing paper1 (total, compressed data, antidictionary; size in kB vs. maxdepth)]

Another thing that plays in favour of the dynamic compression scheme is the compression ratio (see Figure 5.12), utilizing its main advantage of not needing to store the antidictionary separately. But notice the algorithm's instability: whereas the compression ratio of the static compression scheme improves with increasing maxdepth, the compression ratio of dynamic DCA floats for maxdepth > 30. This is caused by compressing more data but at the same time getting more exceptions, which are expensive to code. Summarizing the results, the dynamic compression scheme achieves about 5% better compression ratios than the static compression scheme. In Figure 5.13 we can see the structure of a compressed file obtained using the static compression scheme. With growing maxdepth…
[Figure 5.32: Compressed file structure created using the static compression scheme in relation to block size, compressing plrabn12.txt]

[Figure 5.33: Dynamic compression scheme — histogram of exception distances (exception count vs. exception distance)]

5.11 Dynamic Compression

In the dynamic compression scheme we need to deal with exceptions, but before that we need to know something about them. Figure 5.33 is a histogram of exception distances for maxdepth = 40, compressing the file paper1 from the Calgary Corpus. We can see that most distances have a value below 10. Fibonacci coding [2] was picked, but another universal coding or even Huffman coding could be useful too. In Figure 5.34 we can see the exception count in relation to maxdepth, which explains the strange dependency of the compression ratio on maxdepth mentioned in Section 5.4. Generally, the dynamic compression scheme achieves good compression ratios (e.g. with fields.c, Figure 5.28), but sometimes things can go…
…as it remains 0. What was subject to testing was to slice a file lengthwise, creating 8 files fileN, with the 0th bit in file0, the 1st bit in file1 and so on, and then compressing these portions separately using different antidictionaries. Using this approach, 8 parallel antidictionaries over a single file were simulated.

4.11 Used Optimizations

To optimize the code, some well-known optimizations were used, such as pointer arithmetic when working with arrays, loop unrolling, and dynamic allocation in large pools of prepared data. Dynamic allocation has a significant impact on the performance; that is why more memory than needed is allocated at a time. Compiler optimizations were used too; it is even possible to use a profiler log to gain better performance.

4.12 Verifying Results

During development, many compression problems occurred. For verifying results, extensive logging was used, and also some tools for checking the algorithm's performance had to be written, covering antidictionary generation, self-compression results and gain computation. The best verification we have is that after decompression we get the original text; however, this tells us nothing about the optimality of the used antidictionary, which has to be checked some other way. After antidictionary construction using a suffix array was implemented, there has been the advantage of two completely different methods for generating the antidictionary. Comparing…
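A hedged sketch of the lengthwise slicing just described: bit i of every input byte goes to file i, with eight bits of each plane packed per output byte (the file names file0..file7 follow the text; buffering and bit order are assumptions):

#include <stdio.h>

/* Split the input into 8 bit-plane files file0 .. file7. */
int slice_bitplanes(FILE *in)
{
    FILE *out[8];
    unsigned char acc[8] = {0};
    int filled = 0, c, i;
    char name[16];

    for (i = 0; i < 8; i++) {
        sprintf(name, "file%d", i);
        if (!(out[i] = fopen(name, "wb"))) return -1;
    }
    while ((c = fgetc(in)) != EOF) {
        for (i = 0; i < 8; i++)            /* append bit i of this byte */
            acc[i] = (unsigned char)((acc[i] << 1) | ((c >> i) & 1));
        if (++filled == 8) {               /* 8 bits gathered per plane */
            for (i = 0; i < 8; i++) fputc(acc[i], out[i]);
            filled = 0;
        }
    }
    /* a real implementation would flush the remaining (filled) bits here */
    for (i = 0; i < 8; i++) fclose(out[i]);
    return 0;
}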
  i:      1    2    3       4    5      6
  SA:     4    5    1       6    2      3
  LCP:    –    1    2       0    1      0
  suffix: aab  ab   abcaab  b    bcaab  caab

Table 3.1: Suffix array with LCPs for the text abcaab.

A memory representation requires two arrays of pointers, one for the suffix array and one for the lcps; their sizes are equivalent to the length of the input text. Let us suppose we have an efficient algorithm for suffix array and lcp construction. What we need to realize for antidictionary construction is that we are constructing the suffix array over the binary alphabet, so the suffix array and lcp lengths will be 8 times the length of the input text. Still, the memory requirements for suffix array construction depend only on the length of the input text, i.e. O(N), instead of the suffix trie's almost exponential complexity depending on the trie depth.

3.6.2 Antidictionary construction

Suffix arrays offer text searching capabilities similar to suffix tries, so why not use them for antidictionary construction? The first mention of this idea can be found in [19]. Now antidictionary construction using a suffix array with asymptotic complexity O(k · N · log N) will be explained, where k is the maximal antiword length. The process takes two adjacent strings at a time and finds antifactors; special handling is needed for the last item, which is not in a pair.

1. Take two adjacent strings u_i and u_{i+1} from the suffix array.
2. Skip their common prefix c utilizing LCP: u_i = cxv, u_{i+1} = cyw, x ≠ y, and test…
[Figure 5.43: Memory needed by selected methods to compress 1 byte of input text on the Canterbury Corpus]

file          original   gzip     bzip2    almostaw 30  dynamic rle 32  rle 34
alice29.txt     152089    54423    43202      58965        61365         63386
asyoulik.txt    125179    48938    39569      52550        54462         54756
cp.html          24603     7991     7624      10927         9743         11250
fields.c         11150     3134     3039       4865         3784          4885
grammar.lsp       3721     1234     1283       1823         1441          1880
kennedy.xls    1029744   206767   130280     268827       431929        384241
lcet10.txt      426754   144874   107706     158887       159724        164221
plrabn12.txt    481861   195195   145577     187914       207005        205967
ptt5            513216    56438    49759      45556        95175        100432
sum              38240    12920    12909      19655        17527         21619
xargs.1           4227     1748     1762       2659         2102          2543

Table 5.4: Compressed file sizes obtained on the Canterbury Corpus.

file          original   gzip     bzip2    almostaw 30  dynamic rle 32  rle 34
bib             111261    35059    27467      40247        38078         41177
book1           768771   313370   232598     306609       360005        363396

[Table fragment, presumably Table 5.5 for the Calgary Corpus; the remaining rows were lost in extraction.]
[Figure 5.7: Number of antiwords in relation to maxdepth (total found, simple pruning only, single self-compression, multiple self-compression)]

[Figure 5.8: Number of used antiwords in relation to maxdepth (simple pruning only, single self-compression, multiple self-compression)]

[Figure 5.9: Relation between the number of nodes and the number of antiwords (nodes leading to antiwords, antiwords, nodes after pruning, antiwords after pruning)]

[Figure 5.10: Memory requirements for compressing paper1 (suffix trie, suffix array, dynamic DCA; used memory in MB vs. maxdepth)]
…all suffixes of T is denoted σ(T). The suffix trie of T is a tree representing σ(T). More formally, we denote the suffix trie of T as STrie(T) = (Q ∪ {⊥}, Σ, root, F, g, f) and define such a trie as an augmented deterministic finite automaton which has a tree-shaped transition graph representing the trie for σ(T), and which is augmented with the so-called suffix function f and an auxiliary state ⊥. The set Q of the states of STrie(T) can be put in a one-to-one correspondence with the substrings of T. We denote by x̄ the state that corresponds to a substring x. The initial state (the root node) corresponds to the empty string ε, and the set F of the final states corresponds to σ(T). The transition function g is defined as g(x̄, a) = ȳ for all x̄, ȳ in Q such that y = xa, where a ∈ Σ. The suffix function f is defined for each state x̄ as follows: let x ≠ root; then x = ay for some a ∈ Σ, and we set f(x̄) = ȳ; moreover, f(root) = ⊥. The automaton STrie(T) is identical to the Aho–Corasick string matching automaton [1] for the key word set {T_i | 1 ≤ i ≤ n}; suffix links are called failure transitions in [1].

Definition 2.19 (Suffix trie depth): The suffix trie depth k is the maximum height allowed for the trie. We will denote it as maxdepth = k.

Note 2.4 (Suffix trie depth limit): Due to the suffix trie depth limit k, the suffix trie represents only suffixes S ⊆ σ(T), ∀x ∈ S : |x| ≤ k.

Theorem 2.1 (Suffix t…
…are pruned from the trie. Both single and multiple self-compression/simple-prune rounds are demonstrated in Figure 3.8.

3.6 Antidictionary Construction Using Suffix Array

In the previous sections we used a suffix trie for antidictionary construction. One of the main problems of the suffix trie structure is its memory consumption: the large amount of information in each edge and node makes it very expensive. Even for depth k larger than 30 and small input files, the suffix trie size grows very fast and needs tens to hundreds of megabytes of memory. Also, the creation of and traversal through the whole trie is quite slow. We can consider other methods for collecting all text factors. As we are dealing with a binary alphabet, this limits the usage of some of them — suffix trees, DAWGs or CDAWGs — which are designed mainly for larger alphabets. Also, the antidictionary constructing algorithms would need to be modified fundamentally, and an appropriate way of efficiently representing these structures would have to be developed. This work focuses on the usage of suffix arrays.

[Figure 3.8: Antidictionary construction using single (a) and multiple (b) self-compression/simple-prune rounds: all possible antiwords → build antidictionary → self-compression → simple prune → (a) a single reduction pass, or (b) repeat while the antidictionary is being reduced → optimal antidictionary]
        aw := substr(w_cur, 0, ll) + '1'
        if not SA_Bin_Search(bittext, sa, aw, lo, hi) then
          Add_Antiword(root, lcp, aw, i)
    for ll := l + 1 to |w_next| − 1 do
      if w_next[ll] = '1' then
        aw := substr(w_next, 0, ll) + '0'
        if not SA_Bin_Search(bittext, sa, aw, lo2, hi2) then
          Add_Antiword(root, lcp, aw, i + 1)
  if |sa| > 0 then
    // process the last word
    lo := 1; hi := |sa|
    w_cur := substr(bittext, sa[|sa|], maxdepth)
    for ll := 0 to |w_cur| − 1 do
      if w_cur[ll] = '0' then
        aw := substr(w_cur, 0, ll) + '1'
        if not SA_Bin_Search(bittext, sa, aw, lo, hi) then
          Add_Antiword(root, lcp, aw, |sa|)

SA_Bin_Search(bittext, sa, aw, lo, hi)
  pos := 0
  tofind := substr(aw, 1, |aw| − 1)
  while hi ≥ lo do
    // at first find |tofind| − 1 characters
    pos := (hi + lo) / 2
    if sa[pos] + |tofind| − 1 ≤ |sa| then
      str := substr(bittext, sa[pos], |tofind| − 1)
      if str = substr(tofind, 0, |str|) then break
      else if str < substr(tofind, 0, |str|) then hi := pos − 1
      else lo := pos + 1
    else
      str := substr(bittext, sa[pos], |sa| − sa[pos])
      if str < substr(tofind, 0, |str|) then hi := pos − 1
      else lo := pos + 1
  if hi < lo then return false
  lo2 := lo; hi2 := hi
  while hi2 ≥ lo2 do
    // find the exact string
    pos := (hi2 + lo2) / 2
    if sa[pos] + |tofind| ≤ |sa| then
      str := substr(bittext, sa[pos], |tofind|)
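Because the listing above is split across pages, a compact C restatement of the search may help. It is a hedged simplification — one binary search for the whole candidate instead of the two-phase narrowing of the listing — and assumes the usual ascending suffix-array order (memcmp stands in for the substr comparisons):

#include <string.h>

/* Return 1 if some suffix in sa[lo..hi] starts with the antiword
 * candidate aw of length m; sa holds suffix start positions into the
 * bit-expanded text bittext of length n. */
static int sa_has_prefix(const char *bittext, long n, const long *sa,
                         long lo, long hi, const char *aw, long m)
{
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        long avail = n - sa[mid];              /* length of this suffix */
        long len = avail < m ? avail : m;
        int cmp = memcmp(bittext + sa[mid], aw, (size_t)len);
        if (cmp == 0 && len < m) cmp = -1;     /* shorter suffix sorts first */
        if (cmp == 0) return 1;                /* aw is a prefix here */
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}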
…compiler that builds the antidictionary first and then constructs the automaton over it.

3.3 Antidictionary Construction Using Suffix Trie

As it turns out later, the most complex task of the DCA method is the antidictionary construction itself. It is natural to use the suffix trie structure for collecting all factors of the given text, although any other data structure for storing factors of words can be used, such as suffix trees, suffix automata (DAWGs), compacted suffix automata (CDAWGs), or suffix arrays. Let us consider a text t = c1 c2 c3 … cn of length n, where ci is the symbol at position i. We add the words c1, c1c2, c1c2c3, …, c1c2c3…cn step by step. Because we are adding the words to a suffix trie structure representing all suffixes of the given words, we get all factors of the text t. See Figure 3.2 for an example of constructing the suffix trie for the text cacao. To construct an antidictionary from the suffix trie we add all antifactors of the text: for every factor u, u = u′x, we add an antifactor v = u′y, x ≠ y, if the factor v does not already exist. The resulting antidictionary will not be minimal, so we need to select only the minimal antifactors.

[Figure 3.3: Antidictionary construction over the text 01110101 using a suffix trie]

The antifactor v is minimal when there does not already exist an antifactor w such that v = v′w, i.e. th…
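A minimal C sketch of this construction for the binary alphabet, under the depth limit maxdepth; the node layout and the visit counter are illustrative assumptions, not the thesis' actual structures (error handling omitted):

#include <stdlib.h>

struct trie {
    struct trie *child[2];   /* transitions for bits 0 and 1 */
    long visits;             /* how many factors pass through this node */
};

static struct trie *node_new(void)
{
    return calloc(1, sizeof(struct trie));
}

/* Insert all factors of the bit array t[0..n-1] up to length maxdepth
 * by adding, for every position i, the prefix of the suffix t[i..]. */
struct trie *build_factor_trie(const unsigned char *t, long n, int maxdepth)
{
    struct trie *root = node_new();
    for (long i = 0; i < n; i++) {
        struct trie *p = root;
        for (long j = i; j < n && j - i < maxdepth; j++) {
            int b = t[j] & 1;
            if (!p->child[b]) p->child[b] = node_new();
            p = p->child[b];
            p->visits++;
        }
    }
    return root;
}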
…good performance.

Bibliography

[1] A. V. Aho and M. J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975.

[2] A. Apostolico and A. S. Fraenkel. Robust transmission of unbounded strings using Fibonacci representations. IEEE Transactions on Information Theory, 33(2):238–245, 1987.

[3] M.-P. Béal, F. Mignosi, and A. Restivo. Minimal forbidden words and symbolic dynamics. In C. Puech and R. Reischuk, editors, STACS, volume 1046 of Lecture Notes in Computer Science, pages 555–566. Springer, 1996.

[4] T. C. Bell, I. H. Witten, and J. G. Cleary. Modeling for text compression. Computer Science Technical Reports, pages 327–339, 1988.

[5] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. SRC Research Report 124, Digital Equipment Corporation, 1994.

[6] M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi. Text compression using antidictionaries. In J. Wiedermann, P. van Emde Boas, and M. Nielsen, editors, ICALP, volume 1644 of Lecture Notes in Computer Science, pages 261–270. Springer, 1999.

[7] M. Crochemore and G. Navarro. Improved antidictionary based compression. In SCCC, pages 7–13. IEEE Computer Society, 2002.

[8] M. Davidson and L. Ilie. Fast data compression with antidictionaries. Fundam. Inform., 64(1–4):119–134, 2005.

[9] R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching…
…is to erase symbol b from an antiword w = ubv if and only if there exists an antiword y = xb where x is a proper suffix of u, which makes sure that |w| > |y|.

Example 3.5: Self-compress the trie of the antidictionary AD = {00, 0110, 1011, 1110}. Only the 1011 antiword path can be compressed: we remove the node 101, as it can be predicted due to the antiword 00, and connect the nodes 10 and 1011. The antiword 1011 will actually become 101 in the new representation. The antiword 0110 cannot be compressed, because compressing 01 to just 0 would lead to a nondeterministic antidictionary reconstruction. See Figure 3.7.

[Figure 3.7: Self-compression example]

The self-compression algorithm will be explained thoroughly in Section 4.4.4. With antidictionary self-compression we can further improve our static compression scheme. And what about combining this technique with simple pruning? In fact it makes things a bit harder, because self-compression changes the antidictionary representation and influences the antiword gains. For better precision we do simple pruning on a self-compressed tree, and after the pruning we self-compress the antidictionary and consider it final. However, this simplification is not accurate: after self-compression the trie still may not be optimal, because, on the other side, simple pruning affects the self-compression. We can fix this by applying simple pruning and self-compression iteratively as long as some nodes…
…k × N × log N, which is similar to the complexity of building the antidictionary from a suffix trie, but with memory requirements of O(N) thanks to the suffix array representation. This assumes that we are able to count the number of node visits in O(k), which can be done while walking through the suffix/lcp array; in this implementation it is O(N) in the worst case, but on average it is much smaller. This implies a total asymptotic complexity of O(k × N + N × log N).

Build_AD_SA(root, crc, maxdepth)
  read the whole input into bittext
  crc := computeCRC32(bittext)
  expand each single bit of bittext to 1 byte
  sa := makeSA(bittext)
  lcp := makeLCP(bittext, sa)
  root := new state; level(root) := 0; fail(root) := root
  for i := 1 to |sa| − 1 do
    lo := 1; hi := |sa|
    l := lcp[i + 1]
    if l ≥ maxdepth then continue
    w_cur := substr(bittext, sa[i], maxdepth)
    w_next := substr(bittext, sa[i + 1], maxdepth)
    if |w_cur| = l then
      // end of the current word
      if w_next[l] = '1' then
        aw := substr(w_next, 0, l) + '0'
        if not SA_Bin_Search(bittext, sa, aw, lo, hi) then
          Add_Antiword(root, lcp, aw, i + 1)
    else if |w_next| = l then
      // end of the next word
      if w_cur[l] = '0' then
        aw := substr(w_cur, 0, l) + '1'
        if not SA_Bin_Search(bittext, sa, aw, lo, hi) then
          Add_Antiword(root, lcp, aw, i)
    lo2 := lo; hi2 := hi
    for ll := l + 1 to |w_cur| − 1 do
      if w_cur[ll] = '0' then
…outputs of these two different algorithms gives us a very good hint whether the generated antidictionary is correct. So verifying the complete antidictionary is quite easy, but verification of the simple pruning process is in fact impossible, as there is no algorithm available for constructing the most efficient antidictionary. It is much easier with self-compression: a self-compression check was implemented using a dummy Python script which compresses all antiwords using all shorter ones. Its performance is quite slow, but the result is sufficient. Also, after rewriting code or implementing something new, the new code has to be tested every time on a set of files to see whether it compresses and decompresses them correctly. All this was possible thanks to the strong scripting abilities of the GNU/Linux environment; without all the verification tools it would not be so easy.

4.13 Dividing Input Text into Smaller Blocks

Earlier, the memory greediness of the DCA method and some options for reducing these requirements were discussed. Still, one of the simplest options is to divide the file into smaller blocks and compress them separately. This reduces the memory requirements a lot, as the trie memory size depends strongly on the length of the input text. On the other hand, we need to realize that this affects the compression ratio, which will be worse, because the antidictionary will be smaller and the gains of the antifactors will be lower.
…sb (s ∈ Σ*, b ∈ Σ) appears just a few times in the text, but its prefix s appears so many times that it is better to consider sb an antifactor. Of course, to be able to recover the original text, we need to somehow code those text positions where the bit predicted by taking the string as an antifactor is wrong. We call exceptions the positions in the original text where this happens, that is, the final positions of the occurrences of sb in the text.

3.7.1 Compression ratio improvement

Usage of almost antifactors can theoretically bring a compression ratio improvement to the original DCA algorithm, but it is not as easy as it looks at first sight. By introducing some almost antifactors we can also remove good antifactors whose gain was better than that of the newly introduced almost antiword. Also, we completely prune the branches connected to the factors we turned into antifactors. In contrast to improving the gain by introducing new almost antiwords, we lose some gain elsewhere; the whole tree changes a lot. The key problem [7] is that the decision of what is an almost antifactor depends in turn on the gains produced, so we cannot separate the process of creating the almost antifactors from computing their gains: creating an almost antifactor changes the gains upwards in the tree, as well as the gains downwards via suffix links. So there seems to be no suitable traversal order. It is not possible either to do a first pass computing gains and then a second pass de…
…the gains using the self-compressed tree.

4.4.6 Simple pruning

The simple pruning function prunes from the trie all nodes which do not have a positive gain. The gain function is computed using the self-compressed trie. As we walk the trie bottom-up from the terminal nodes, the traversal is not deterministic, and it is not guaranteed that in each node we process the gains of both subtrees have already been computed. For this we have set g(r) of each node r to the value −1, meaning an uninitialized value; if we hit a node with an uninitialized subtree, we stop walking bottom-up and continue with the next antiword. After processing all antiwords, all trie nodes will have a defined gain. Whenever we find a node with negative gain, we prune it together with the whole subtree belonging to it, and at the same time we prune the subtree from the original trie as well. As we forbid nodes with negative gains, we can simplify the function M = max(g(S1) + g(S2), g(S1), g(S2)) − 2 from Section 4.4.5 to M = g(S1) + g(S2) − 2. Because this is not the only possibility of implementing the static compression scheme, for testing purposes simple pruning was also implemented without self-compression. Because it is very similar to the version using self-compression, only the more interesting version of simple pruning, over the self-compressed trie, is going to be presented. Using O(N) for the antidictionary size according to [14] and O(2…) for pruning a subtree we get…
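A hedged C sketch of the gain rule from Section 4.4.5, written as a plain recursive traversal; the implementation instead walks bottom-up from the terminal nodes with g = −1 marking uninitialized values, and c(S) — the bits saved by a leaf's antiword — is assumed to be precomputed by the caller:

/* Sketch only; types are illustrative. Each represented node costs
 * 2 bits of trie encoding, hence the "- 2" terms. */
struct tnode {
    struct tnode *child[2];
    long saved;              /* c(S): bits this antiword saves (leaves) */
    long gain;               /* -1 = not yet computed */
};

static long max3(long a, long b, long c)
{
    long m = a > b ? a : b;
    return m > c ? m : c;
}

long compute_gain(struct tnode *s)
{
    if (!s) return 0;                              /* empty subtree */
    if (!s->child[0] && !s->child[1])
        s->gain = s->saved - 2;                    /* leaf (antiword) */
    else if (s->child[0] && s->child[1]) {
        long g0 = compute_gain(s->child[0]);
        long g1 = compute_gain(s->child[1]);
        s->gain = max3(g0 + g1, g0, g1) - 2;       /* M from 4.4.5 */
    } else {
        struct tnode *c = s->child[0] ? s->child[0] : s->child[1];
        s->gain = compute_gain(c) - 2;             /* single child */
    }
    return s->gain;
}

Pruning then simply removes every subtree whose computed gain is negative, both here and in the original trie.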
file            best compression ratios (%)
alice29.txt    37.23  39.25  38.95  38.90  38.54
asyoulik.txt   41.19  43.48  43.49  41.66  41.67
cp.html        44.37  38.68  38.83  44.69  44.81
fields.c       43.47  32.84  33.44  42.50  43.11
grammar.lsp    48.99  37.92  38.46  49.48  50.01
kennedy.xls    25.04  26.36  27.18  25.85  25.91
lcet10.txt     34.36  35.17  34.64  34.90  34.30
plrabn12.txt   36.96  41.37  41.34  38.40  38.39
ptt5            8.88  93.44  16.49  96.78  18.36
sum            51.01  46.88  43.42  61.39  55.15
xargs.1        62.67  49.58  49.58  60.04  60.04

Table 5.3: Best compression ratios obtained on the Canterbury Corpus. (The column headers were lost in extraction; the five tested method variants are almost AW, static DCA, dynamic DCA, static DCA + RLE and dynamic DCA + RLE.)

[Bar chart (caption not recovered): compression ratios per file on the Canterbury Corpus — almost AW, static DCA, dynamic DCA, static DCA + RLE, dynamic DCA + RLE; files include alice29.txt, asyoulik.txt, fields.c, kennedy.xls, lcet10.txt, xargs.1]

[Figure 5.37: Average compression ratio obtained by each method on the Canterbury Corpus (static DCA, static DCA + RLE, dynamic DCA, dynamic DCA + RLE, almost antiwords; vs. maxdepth)]
…up to date. An example of a collaboration diagram of the class DCA, created by Doxygen in cooperation with Graphviz, can be seen in Figure 4.1. We cannot rely just on this type of documentation, as it does not describe how the algorithms work in general; it only describes the meaning of each variable and how the functions are called. For that type of documentation some wiki system, LaTeX, or another kind of offline documentation is more appropriate; it should also support including tables and graphics for better descriptions of the used data structures and algorithms. At this point, this text serves that purpose.

4.3 Debugging

As was mentioned before, it was necessary to debug how the program was really performing some tasks, including building the trie and its online modification. Normal program debuggers such as gdb are not very useful, as we need to see the program outputs and the contents of large data structures in memory. Therefore the program was equipped with different debugging options providing information about each part of the process; they can be turned on at compile time by defining the following macros:

- DEBUG: if not set, turns off all debugging messages; otherwise shows debugging messages according to the LOG_MASK filter; also enables counters of total nodes and of nodes deleted during simple pruning.
- DEBUG_DCA: enables trie and antidictionary debugging; every node contains also its complete string representation…
[Figure 5.25: Compressed file structure created using almost antiwords, compressing paper1 (total, compressed data, antidictionary, exceptions; size in kB vs. maxdepth)]

[Figure 5.26: Compression ratio obtained compressing alice29.txt (static DCA, dynamic DCA, almost antiwords)]

[Figure 5.27: Compression ratio obtained compressing ptt5 (static DCA, static DCA + RLE, dynamic DCA, dynamic DCA + RLE, almost antiwords)]

[Figure 5.28: Compression ratio obtained compressing xargs.1 (static DCA, dynamic DCA, almost antiwords)]

On the ptt5 file it gives a better compression ratio than standard compression programs such as gzip or bzip2 (Figure 5.27). However, not everything is good using this technique; it looks like it has problems with small f…
…249–260, 1995.

[…] B. Zhao, K. Iwata, S. Itoh, and T. Kato. A new approach of DCA by using BWT. In DCC, page 495.

Appendix A: User Manual

Usage: dca [-d] [-f <filename>] [-l <level>] [-h] [-v] <FILE>

Compress or uncompress FILE (by default, compress FILE).

Options:
  -d             Decompress the specified FILE.
  -f <filename>  Save output data to the specified file; in case of compression that is the antidictionary, the compressed data and a checksum; when decompressing, it is the original uncompressed data. If no file is specified, output is directed to standard output.
  -l <level>     Compression level (the maximal antiword length); this option can take values from 1 up to 40 or even more, but larger values do not improve compression much and only require excessive system resources. Default: 30.
  -h             Show help.
  -v             Be verbose, displaying the compression ratio, the antidictionary and compressed data sizes, and the time taken. This is useful for measurements.

Examples:
  Compress file foo and save it to foo.dz:               dca -f foo.dz foo
  Decompress file foo.dz to foo:                         dca -d -f foo foo.dz
  Compress file foo using level 40 and save it to foo.dz: dca -l40 -f foo.dz foo
Experiments

5.1 Measurements

All measurements were executed on an AMD Athlon XP 2500+ with 1024 MB RAM, running Mandriva Linux with kernel version 2.6.17, i686. All time measurements were made 5 times and the minimal achieved value was selected; for memory measurements a single measurement was sufficient, as the algorithm is deterministic and uses the same amount of memory every time for the same configuration. Memory was measured using memusage from the GNU libc development tools, summarizing the heap and stack peak usage. Time was measured using getrusage, as the sum of the user and system time used. The program was compiled with -O3, with the profiling and measuring options turned on and all debug logging turned off. The program was called with the -v option, displaying a summary with the compressed data length, antidictionary size, compression ratio achieved and total time taken. We are going to choose appropriate parameters for the static as well as the dynamic compression scheme. In the static compression scheme we have these parameters: maxdepth, how to use self-compression, and whether to use a suffix array. In the dynamic compression scheme we can affect only maxdepth.

5.2 Self Compression

Self-compression is one of the static compression scheme's parameters. Using self-compression is not mandatory, so we can skip it and use simple pruning only. Another option is to use it together with simple pruning; we will denote it as single self-compression. For better precision…
[Figure 3.9: Example of using almost antiwords; next to each node its visit count is written in brackets]

[Table 3.3: Example of node gains as antiwords — the gains g(r) of the nodes 0, 1, 00, 01, 000 and 001; the individual values were lost in extraction]

…suffix tries we can get worse results than with the classical approach; it depends on the particular file. Using the multi-pass heuristics, k/2 passes are recommended for good results, but it is not suitable for us because of its time complexity.

Example 3.7: Compress the text 0000000000000001 (i.e. 0^15 1) using the suffix trie depth limit k = 3, with the classical approach and with almost antiwords, then compare the results.

We build a suffix trie from the text, as can be seen in Figure 3.9. Let us suppose the following representation: every node in the antidictionary trie takes A = 2 bits (plus 2 extra bits for coding the root node), and the representation of each exception takes E = 5 bits. Now we can compute the gain of each node r as an antiword using the function g(r), where p(r) is the parent of r and v(r) is the visit count of r:

  g(r) = v(p(r)) − E · v(r)

After the gain computation (see Table 3.3) we convert node 1 to a terminal node because of its positive gain. Nodes 01 and 001 are not converted, because they would not be minimal antifactors. We obtain the antidictionary AD = {1}, which compresses the input data to … Using the classical approach we get an empty dict…
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering

Diploma Thesis
Implementation of DCA Compression Method
Martin Fiala

Supervisor: Ing. Jan Holub, Ph.D.
Master Study Program: Electrical Engineering and Information Technology
Specialization: Computer Science and Engineering
May 2007

Declaration

I declare that I have written this diploma thesis on my own and that I have used only the sources listed in the enclosed bibliography. I have no serious objection to the use of this school work in the sense of §60 of Act No. 121/2000 Coll., on copyright, on rights related to copyright, and on amendments to certain acts (the Copyright Act).

Prague, June 14, …

Abstract

Data compression using antidictionaries (DCA) is a new data compression method based on the fact that certain sequences of symbols never occur in the text. This work deals with the implementation of various DCA methods, based on the works of Crochemore, Mignosi, Restivo, Navarro and others, and compares the results on the standard file sets used for evaluating compression methods. Antidictionary construction using a suffix array is introduced, aimed at reducing the memory requirements of the static compression scheme. Dynamic DCA compression is further explained and implemented, and some possible improvements are tested and imp…
- DEBUG_PROFILE: show profiling info.

One of the most useful options is the antidictionary debugging, which stores the antidictionary state at different stages to an external file for later examination. From this we can find out whether the antidictionary construction is working properly, or which antiwords are problematic; with this we have an essential hint for finding implementation errors. Another important option is the trie debugging, which outputs the trie structure in a text format as well as in a graphical format created using Graphviz (Graph Visualization Software, an open-source graph/network visualization project from AT&T Research, http://www.graphviz.org). It is even possible to watch the suffix trie construction step by step. Although these graphs are drawn automatically and are not as nice as diagrams drawn by hand, they can still be very handy. An example of a suffix trie graph generated by Graphviz can be seen in Figure 4.2.

4.4 Implementation of Static Compression Scheme

Section 3.5 has already outlined how the antidictionary and the compression transducer are prepared for the static compression scheme. Let us look at an overview of what we actually do with a ready-made transducer. In Figure 4.3 both compression and decompression schemes are shown.

[Figure 4.3: File compression (a) and file decompression (b) schemes — compression: antidictionary automaton → save antidictionary, compress data → compressed data]
The original(r) value is a pointer to the original node in the non-compressed suffix trie. The time asymptotic complexity is O(N × k).

Self_Compress(root)
 1  rootCompr := new state
 2  add (root, rootCompr) to an empty queue Q
 3  while Q ≠ ∅ do
 4    extract (p, p′) from Q
 5    if q0 and q1 are children of p then
 6      create q0′ and q1′ as children of p′
 7      original(q0′) := q0; original(q1′) := q1
 8      level(q0′) := level(q1′) := level(p′) + 1
 9      add (q0, q0′) and (q1, q1′) to Q
10    else if q is the unique child of p (q = p.a, a ∈ {0, 1}) then
11      if there exists an antiword xa, x a proper suffix of p, then
12        add (q, p′) to Q        // q is predictable, skip it here
13      else
14        create q′ as a child of p′
15        original(q′) := q
16        level(q′) := level(p′) + 1
17        add (q, q′) to Q
18  return rootCompr

4.4.5 Gain computation

Gain computation is based on the fact that we can estimate the gain of a node being an antifactor if we know how much its representation costs. For this, the 2+2 representation described in Section 4.5 is used for antidictionary storage. The gain g(S) of a subtree S is defined as in [6]:

  g(S) = 0            if S is empty,
  g(S) = c(S) − 2     if S is a leaf (antiword),
  g(S) = g(S1) − 2    if S has one child S1,
  g(S) = M            if S has two children S1 and S2,

where M = max(g(S1) + g(S2), g(S1), g(S2)) − 2.

It is clear that it is possible to compute the gains in linear time with respect to the size of the trie, in a single bottom-up traversal of the trie. But how does the self-compression affect our gain computation? The answer is that it does not; in fact, we simply compute…
Lemma 3.1 (Synchronizing property, [17]): If |w| ≥ k − 1, then δ(u, w) = δ(root, w) for any state u ∈ Q such that the walk over w from u is defined (no factor read along it is in AD), where w is the searched pattern, k is the length of the longest forbidden word in the antidictionary, Q is the set of all compression/decompression transducer states, and the function δ(u, w) is the state reached after applying all the transitions u_{i+1} = δ(u_i, b_i), i = 1, …, |w|, u_1 = u, w = b_1 b_2 … b_{|w|}, b_i ∈ {0, 1}.

Unfortunately, this works only for patterns longer than k, so we need to search the compressed text using a different technique, presented in [17], which solves the problem in O(m + M + n + r) time using O(m + M + n) space, where m and n are the pattern length and the compressed text length, respectively, M denotes the total length of the strings in the antidictionary, and r is the number of pattern occurrences. This is achieved using a decompression transducer with eliminated ε-states, similar to the one mentioned in [8], and an automaton built over the search pattern. The algorithm has a linear complexity proportional to the compressed text length when we exclude the pattern preprocessing. This pattern searching algorithm can be used on texts compressed using the static compression method, because it needs to preprocess the antidictionary before the searching starts; it cannot be used on texts produced by the dynamic method, which also lacks the synchronization property of statically compressed texts.

Chapter 4
Implementation
[Figure 3.6: Antidictionary construction using simple pruning — suffix trie with limited depth k → build antidictionary → simple pruning]

3.5.2 Antidictionary self-compression

As with the static approach we need to store the antidictionary together with the compressed data, it might cross our minds that there is a possibility to compress the antidictionary itself as well. This depends heavily on the form in which we are going to represent the antidictionary list. Basically we have two options:

1. Antiword list: the antidictionary size, the length of each antiword, and the antiword itself. With this we could use all previous antiwords to compress/decompress the following antiwords; e.g. for AD = {00, 0110, 1011, 1110} we get AD′ = {00, 010, 101, 1110}.
2. Antiword trie: the trie structure represented in some suitable way. Using this method we actually save a binary tree; of course, the tree can also be self-compressed.

Longer antiwords can be compressed using shorter antiwords, but with some limitations. Let us consider the following: w = ubv is an antiword, u, v ∈ Σ*, b ∈ Σ. If we compressed the antiword w = ubv using an antiword z = ub, it would become w′ = uv, and |w′| = |z| could happen, which means that the nodes representing the antiwords z and w would be on the same level in the compressed suffix trie and could overlap. This is generally not what we want, because we would not be able to reconstruct the original tree. A reasonable solution
4.1 Used Platform
4.2 Documentation and Versioning
4.3 Debugging
4.4 Implementation of Static Compression Scheme
    4.4.1 Suffix trie construction
    4.4.2 Building antidictionary
    4.4.3 Building automaton
    4.4.4 Self compression
    4.4.5 Gain computation
    4.4.6 Simple pruning
4.5 Antidictionary Representation
    4.5.1 Text generating the antidictionary
4.6 Compressed File Format
4.7 Antidictionary Construction Using Suffix Array
    4.7.1 Suffix array construction
    4.7.2 Antidictionary construction
4.8 Run Length Encoding
4.9 Almost Antiwords
4.10 Parallel Antidictionaries
4.11 Used Optimizations
4.12 Verifying Results
4.13 Dividing Input Text into Smaller Blocks

5 Experiments
    5.1 Measurements
    5.2 Self Compression…
…ally it slightly improves the compression ratio for smaller maxdepth values, but with larger maxdepth it can, on the contrary, make it slightly worse, as in Figure 5.20. The main advantage comes with particular types of files, where RLE improves the compression ratio significantly, as in Figure ….

5.8 Almost Antiwords

For testing purposes, an almost antiwords implementation was developed using the one-pass heuristics based on a suffix trie. As we can see in Figure 5.22 and Figure 5.23, for the same maxdepth value this method needs more time and memory due to the more complicated antidictionary construction. More interesting is the compression ratio in Figure 5.24, where the almost antiwords technique is better for smaller maxdepth values, but for maxdepth > 30 it is unable to improve the compression ratio further. This implementation is not fine-tuned, and using the multi-pass heuristics could lead to better values. In Figure 5.25 we can see the structure of the compressed file; the exceptions coding takes less space than the antidictionary. On many files the almost antiwords technique surprisingly outperforms all the others (Figure 5.26), which makes this method very interesting for future experiments. For example…
…an antifactor w = ub, where w, u ∈ Σ*, b ∈ Σ, and while reading the text we find an occurrence of the string u. Because the next symbol cannot be b, we can predict it. Therefore, when we compress the text, we erase the symbols that can be predicted, and in reverse, when decompressing, we predict the erased symbols back. The general idea of the method is quite interesting: first we analyze the input text and find all antifactors. Using the binary alphabet {0, 1}, we erase the symbols whose occurrences can be predicted using the set of antifactors. As we can see, DCA is a lossless compression method which, so far, operates on binary streams, i.e. it works with single bits, not with symbols of larger alphabets; but some current research is dealing with this too.

Example 3.1: Compress the string s = u·1, s ∈ Σ*, using the antifactor u·0. Because u0 is an antifactor, the next symbol after u must be 1, so we can erase the symbol 1 after u.

To be able to compress the text we need to know the forbidden words. First we analyze the input text and find all antifactors which can be used for the text compression (Figure 3.1). For our purpose we do not need all antifactors, but just the minimal ones.

[Figure 3.1: DCA compression basic scheme — analyze text and find all antifactors → antidictionary → compression process → compressed text]

The antifactor is minimal when it does not have any proper factor that is forbidden.
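A minimal C sketch of the compression loop this scheme implies, under a hypothetical transducer interface (dca_predicted and dca_next are assumptions, not the program's real API); decompression mirrors the loop, writing the predicted bit back instead of erasing it:

/* Hypothetical antidictionary automaton interface. */
typedef struct dca_state dca_state;
extern dca_state *dca_next(dca_state *s, int bit);
/* Returns -1 if both continuations are allowed, otherwise the single
 * bit that must follow (the other one would form an antiword). */
extern int dca_predicted(const dca_state *s);

/* Compress n bits: emit a bit only when it is not predictable. */
void dca_compress(const int *bits, long n, dca_state *s,
                  void (*emit)(int bit))
{
    for (long i = 0; i < n; i++) {
        if (dca_predicted(s) < 0)    /* not forced: must be transmitted */
            emit(bits[i]);
        /* else bits[i] equals the prediction and is simply erased */
        s = dca_next(s, bits[i]);
    }
}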
…ary AD = {00, 0110, 1011, 1111}.

  step    1    2    3     4      5       6        7         8
  input   0    01   011   0111   01110   011101   0111010   01110101
  output  0    0    01    01     01      01       01        01

1. Current prefix ε: there is no applicable word in AD, so we pull 0 from the input and push it to the output.
2. Current prefix 0: there is the word 00 in AD, so we erase the next symbol, 1.
3. Current prefix 01: there is no word u·x in AD where u is a suffix of 01. We read 1 from the input and push it to the output.
4. Current prefix 011: there is the word 0110 in AD, so we erase the next symbol, 1.
5. Current prefix 0111: there is the word 1111 in AD, so we erase the next symbol, 0.

The result of compressing the text 01110101 is 01. To be able to decompress this text, we need to store the antidictionary and also the original text length, which should become more obvious from the following decompression example.

Example 3.3: Decompress the text 01 using the antidictionary AD = {00, 0110, 1011, 1111}. Decompression is just the inverted compression algorithm.

1. Current prefix ε: there is no applicable word in AD, so we pull 0 from the input and push it to the output.
2. Current prefix 0: there is the word 00 in AD, so we predict the next symbol as 1 and push it to the output.
3. Current prefix 01: there is no word u·x in AD where u is a suffix of 01. We read 1 from the input and push it to the output.
4. Current prefix 011: there is the word 0110 in AD, so we predict t…
…by the null pointers that are children of internal nodes. The exception is the forcedly external nodes at depth k + 1, children of internal nodes at the maximum depth k, which may or may not be antifactors.

Definition 2.25 (Terminal nodes): Each external node of the trie that surely corresponds to an antifactor (i.e. at depth at most k) is converted into an internal leaf node. These new internal nodes are called terminal nodes.

Note 2.7 (Terminal nodes): Note that not all leaves are terminal, as some leaves at depth k are not antifactors.

Definition 2.26 (Almost antifactor, [7]): Let us assume that a given string s appears m times in the text, and that s·0 and s·1 (where · means concatenation) appear m0 and m1 times, respectively, so that m = m0 + m1, except if s is at the end of the text, where m = m0 + m1 + 1. Let us assume that we need e bits to code an exception. Hence, if m1 > e · m0, then we improve the compression by considering s·0 as an antifactor; similarly with s·1. Almost antifactors are string factors that improve compression when considered as antifactors.

Definition 2.27 (Suffix array): A suffix array is a sorted list of all suffixes of a given text, represented by pointers.

Note 2.8 (Suffix array, [12]): When a suffix array is coupled with information about the longest common prefixes (lcps) of adjacent elements in the suffix array, string searches can be answered in O(P + log N) time wit…
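As a sketch, the decision rule of Definition 2.26 reduces to a one-line test; the names are hypothetical and the inequality follows the reconstruction of the condition above:

/* s occurs m0 times followed by 0 and m1 times followed by 1; coding an
 * exception costs e bits. Forcing "s0" as an antifactor saves one bit
 * per occurrence of "s1" and pays e bits per occurrence of "s0". */
static int force_s0_as_antifactor(long m0, long m1, long e)
{
    return m1 > e * m0;   /* net gain m1 - e*m0 must be positive */
}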
…deciding which nodes will be terminals, because if one converts a node into a terminal, its gain changes, and this modifies the gains of all its ancestors in the tree. It is not possible either to leave the removal of redundant terminals for later, because the removal can also change previous decisions on the ancestors of the removed node.

3.7.2 Choosing nodes to convert

In the original document, two ways of solving this problem were introduced: the one-pass and the multi-pass heuristics. Both heuristics work with the whole suffix trie, not just with the trie with antifactors. This is very limiting for designing a fast DCA implementation: the multi-pass heuristics needs a repetitious tree traversal over the whole suffix trie, which is very expensive. Although we can use the one-pass heuristics, according to [7] it does not perform as well as the multi-pass one. The one-pass heuristics first makes a breadth-first top-down traversal determining which nodes will be terminal, and then applies the normal bottom-up optimization algorithm to compute the gains and decide which nodes deserve to belong to the antidictionary. The problem of this heuristics is that it is not accurate: considering that it may be a bad decision to convert into a terminal a node that turns out to have a subtree with a large gain, we lose that subtree; and when we give preference to the highest node, it is not necessarily always the best choice. Even when testing the one-pass heuristics with deeper…
…d about 5% better. Given the worse compression ratios, higher memory usage and similar time needed, we can rule out the simple-pruning-only option. We cannot judge between single and multiple self-compression from just one file; the most interesting difference should be in the compression ratios, so the tests were performed on the Canterbury Corpus with maxdepth = 40 (Figure 5.4). Again, the compression ratios were practically the same, so we can choose single self-compression as the better option, because it needs less memory and time to achieve the same compression ratio.

5.3 Antidictionary Construction and Optimization

Another important part is to understand why the DCA algorithm is so greedy when constructing the suffix trie. In Figure 5.5 we can see how many nodes we create when building the suffix trie, and how their count decreases to the almost negligible count that we really use for compression. The number of nodes drops the most just after the antidictionary construction and the selection of only the nodes leading to antiwords. Another reduction follows with self-compression and the subsequent simple pruning; we can see that both are quite effective. Figure 5.6 shows this in more detail. Notice also the lowest node-count plot, representing the simple-pruning-only option without self-compression: more nodes are pruned because their gains are not improved using self-compression.
…instead of a suffix trie. Also, the dynamic compression scheme was explained, and some suggestions of how to improve the compression ratios and how to solve the huge memory requirements were provided. Several data compression using antidictionaries methods were implemented: the static compression scheme using a suffix trie or a suffix array for antidictionary construction, the dynamic compression scheme using a suffix trie, and the static compression scheme with almost antiwords using a suffix trie. Their results compressing files from the Canterbury and Calgary corpora were evaluated. It turned out that one of the biggest problems concerning data compression using antidictionaries is actually building the antidictionary: it requires not only much time but also much memory. Using a suffix array for the antidictionary lowers the memory requirements significantly. As the antidictionary is usually built over the whole input file, it is a good idea to restrict the block length processed in one round and to split the file into more blocks using different antidictionaries. This limit depends on the memory available, as larger blocks mean a better compression ratio, and they also lower the time requirements. The most important parameter of all the considered methods is maxdepth (k), limiting the length of antiwords; a suitable value of this parameter was selected for each method. For the static compression scheme, an idea of representing the antidictionary by a text generating it is introduced, which could improve the compression ratio by reducing…
Figure 5.39: Average time needed to compress 1MB of input text on Canterbury Corpus.
Figure 5.40: Memory needed on average by each method to compress 1 byte of input text on Canterbury Corpus (legend: almost antiwords; dynamic DCA + RLE; static DCA + RLE).
Figure 5.41: Compression ratio obtained by selected methods on Canterbury Corpus.
...e.g. Huffman coding, and storing the exceptions separately. With this method it is possible to reach better compression ratios than with the static compression scheme; these results were presented in the original paper [6]. But as this method is not asymmetric and the decompression has the same complexity as compression, it is not suitable for the compress once, decompress many times behaviour.

3.9 Searching in Compressed Text

Compressed pattern matching is a very interesting topic and many studies have already been made on this problem for several compression methods. They focus on linear time complexity, proportional not to the original text length but to the compressed text length. One of the most interesting properties of text compressed using antidictionaries is exactly its ability of pattern matching in the compressed text. This data compression method doesn't transform the text in a complex way, it just erases some symbols. From this point of view it could be possible to erase just some symbols from the searched pattern and look for the result. This is really possible, but with some limitations, because which symbols we erase depends also on the current context. If we search for a long pattern, we can utilize its synchronizing property, from which we obtain

Figure 3.11: Suffix trie construction over text 01110101 for dynamic compression.
...ead the input and compress it using the current antidictionary.
3. Add the read symbol to the factor list and recompute the antidictionary.
4. Code every exception that occurs and save it into a separate file.
5. Repeat the steps until the whole input is processed.

Because we don't know the correct antidictionary in advance, we are making mistakes in predicting the text. Every time we read a symbol that violates the current antidictionary and brings a forbidden word, we need to handle this exception. We do it by saving distances between two adjacent exceptions. This can be represented by the number of successful compressions (bits erased) between the exceptions; there is no need to count symbols just passed to the algorithm in non-determined transitions. An exception occurs only when there exists a transition with output from the current state but we don't take it. Exceptions can be represented by some kind of universal code, arithmetic or Huffman coding. They need to be stored along the compressed data in a separate file, or, when we use output buffers large enough, they can be a direct part of the compressed data.

3.8.1 Using suffix trie online construction

For compressing text using the dynamic compression model we don't need a pruned antidictionary or even a set of minimal antifactors: because the compressor and the decompressor share the same suffix trie, they can use all available antifactors d
...ely, summarizing the results.

The influence of block size on time and memory can be found in Figure 5.29 and Figure 5.30, respectively, but be careful: the x-axis is logarithmic. For the suffix array, memory requirements double with double file size; the suffix trie and dynamic DCA need more memory, but with higher block sizes their requirements grow slower. Considering time, larger files are better, as it is not needed to run some phases, such as building the antidictionary, self-compression and simple pruning, more times. It looks like for files larger than 512kB the suffix array will be the slowest, but for smaller files it is the fastest of the static compression scheme methods.

The selected block size has a significant influence on the compression ratio, as Figure 5.31 shows; this means we should use blocks as large as possible. A little surprise is the improvement of the almost antiwords method with growing block size. In Figure 5.32 we see the structure of the compressed file: with smaller blocks, much information in the antidictionaries is duplicated, which is why the relative antidictionary size gets lower with larger blocks.

Figure 5.31: Compression ratio obtained compressing plrabn12.txt.
...en. The set of all minimal antifactors, the antidictionary AD, is sufficient, because for every antifactor w = uv, where u is a string over Σ, there exists a minimal antifactor v in the antidictionary AD.

Currently there is not any known well-working implementation of the DCA compression method. Though we are trying to develop one, we are still far from practical use, due to the excessive system resources needed to compress even a small file. However, thanks to rapid research progress on strings, suffix arrays, suffix automata (DAWGs), compacted suffix automata (CDAWGs) and other related issues, we might be able to design a practical implementation soon.

3.2 Data Compression and Decompression

Let w be a text over the binary alphabet {0, 1} and let AD be an antidictionary for w [6]. By reading the text w from left to right, if at a certain moment the current prefix v of the text admits as a suffix a word u' such that u = u'x ∈ AD with x ∈ {0, 1}, i.e. u is forbidden, then surely the symbol following v in the text cannot be x, and since the alphabet is binary it is the symbol y = x̄. In other terms, we know in advance the next symbol y, which turns out to be redundant or predictable. The main idea of this method is to eliminate redundant symbols in order to achieve compression. The decoding algorithm recovers the text w by predicting the symbol following the current prefix v of w already decompressed.

Example 3.2: Compress text 01110101 using antidiction
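To make the prediction rule concrete, here is a minimal C++ sketch (an illustration only, not the thesis implementation; the function name and the naive quadratic search over the antidictionary are assumptions made for readability). Given the current prefix v and an antidictionary over {0, 1}, it returns the forced next symbol, or '?' when nothing can be predicted.

#include <string>
#include <set>

// If some u'x in AD has u' equal to a suffix of v, then x is impossible
// and the complement of x is predictable.
char predictNext(const std::string& v, const std::set<std::string>& ad) {
    for (char x : {'0', '1'}) {
        for (const std::string& w : ad) {
            if (w.empty() || w.back() != x) continue;
            std::string u = w.substr(0, w.size() - 1);   // w = u'x
            if (u.size() <= v.size() &&
                v.compare(v.size() - u.size(), u.size(), u) == 0)
                return x == '0' ? '1' : '0';             // y = complement of x
        }
    }
    return '?';                                          // no forbidden word applies
}

For example, with AD = {00, 0110, 1011, 1111} and v = "0", the word 00 applies, so the next symbol is predicted to be 1.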
...erbury Corpus compression scheme with RLE. Other interesting evaluations are the average compression speed and time (input characters compressed per second) in Figure 5.38 and in Figure 5.39. What we are also interested in is the memory needed to compress 1 byte of input text, in Figure 5.40.

5.13 Selected Parameters

The results on the Canterbury Corpus were analyzed and everything is prepared for the final test. The smallest memory and time requirements were demanded while still obtaining reasonable compression ratios. For each method an appropriate maxdepth k was selected and their results are to be compared. The selection follows:

• almost antiwords: k = 30
• dynamic DCA + RLE: k = 32
• static DCA + RLE: k = 30 (suffix trie, suffix array)
• static DCA + RLE: k = 34 (suffix trie, suffix array)
• static DCA + RLE: k = 40 (suffix trie, suffix array)

From the results presented in Figures 5.41, 5.42 and 5.43 we can make some conclusions; the exact values obtained can be found in Table 5.4.

Figure 5.38: Average compression speed (compressed characters per second) of each method on Canterbury Corpus.
...ere is no such antifactor w that is a suffix of v. This can be easily checked using the suffix link (dashed line in Figure 3.2). The antifactor v = uy is minimal when f(u)y is an internal node; otherwise a shorter antifactor w certainly exists.

Example 3.4: Build an antidictionary for text 01110101 with maximal trie depth k = 5. We construct a suffix trie over the text, then we add all antifactors. The antifactors together form the set {00, 100, 0100, 0110, 1011, 1100, 1111, 01111, 11100, 11011, 10100}, which is obviously not an antidictionary (a set of minimal antifactors): e.g. 00 is a suffix of the antifactors 100, 0100, 1100, 11100, 10100. We have to remove the antifactors that are not minimal. The resulting set is the antidictionary AD = {00, 0110, 1011, 1111}. See Figure 3.3 for the suffix trie construction process. In the final diagram we can see the trie containing only the nodes necessary to represent the antidictionary; the leaf nodes are antifactors.

Figure 3.4: Antidictionary AD = {00, 0110, 1011, 1111} and the corresponding compression/decompression transducer.

For representing the antidictionary we don't need the whole tree, so we keep only the nodes and links which lead to antifactors. This simplified suffix trie is going to be used later for building the compression/decompression transducer.

3.4 Compression/Decompression Transducer

As we now have the suffix trie of the antidictionary, which is in fact an au
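The reduction used in Example 3.4 can be sketched as follows; this is an illustrative quadratic filter, not the trie-based procedure of the implementation. On the set from the example it yields exactly AD = {00, 0110, 1011, 1111}.

#include <set>
#include <string>

// Keep only minimal antifactors: those that do not contain a shorter
// antifactor as a proper factor.
std::set<std::string> minimalAntifactors(const std::set<std::string>& af) {
    std::set<std::string> ad;
    for (const std::string& w : af) {
        bool minimal = true;
        for (const std::string& u : af)
            if (u.size() < w.size() && w.find(u) != std::string::npos) {
                minimal = false;                 // u is a proper factor of w
                break;
            }
        if (minimal) ad.insert(w);               // w has no forbidden proper factor
    }
    return ad;
}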
...eryone can use the code, experiment with it and try to improve it.

4.2 Documentation and Versioning

Because the program code was changing a lot during development, a versioning system, Subversion, was used; it remembers all the changes, can find differences to the current version, or provide a working version from the past. This ability was used more than once, as getting the algorithm to work correctly is quite difficult. Huge data structures are built in the memory, suffix tries are modified online and traversed in different ways and directions, and it is a challenge to debug what is really going on. Although we can verify compressed data by decompressing them, this doesn't tell us anything about the optimal selection of forbidden words or the correct and complete building of the data structure representing all text factors.

Figure 4.1: Collaboration diagram of class DCAcompressor.

Documentation was written along with the code using the documentation generator Doxygen. This documentation generator takes the program source code, extracts all class, function and variable definitions and provides them with the comments specified in the source; the output formats are HTML and LaTeX. Unlike offline documentation systems, Doxygen keeps the documentation
...ession process using suffix trie (static compression scheme).

5.6 Different Stages

For optimizing the implementation and for future research it is needed to know how long each phase of the compression process lasts. This measurement was performed over paper1 using the suffix trie and the suffix array. The graphs in Figure 5.16 illustrate the time needed by each phase; the graphs in Figure 5.17 display the time contribution of each phase to the total compression time.

Looking at the graphs we can see that building the antidictionary and counting visits are the most expensive phases, and their times rise exponentially with maxdepth, while the suffix trie building time rises only linearly.

Also looking at the suffix array graphs (Figure 5.18 and Figure 5.19), we see the constant complexity of building the suffix array and the least common prefix array, as they don't depend on maxdepth, while the time of building the antidictionary from the suffix array rises exponentially. It looks like most efforts should be targeted at speeding up the antidictionary construction, whether using the suffix trie or the suffix array.

5.7 RLE

For the experiments, static and dynamic compression versions with an RLE (run-length encoding) filter on the input were also implemented. This filter compresses all runs of characters
Figure 5.4: Compression ratios on Canterbury Corpus for different self-compression options (no self-compression, single self-compression, multiple self-compression).
Figure 5.5: Number of nodes in relation to maxdepth.
Figure 5.6: Number of nodes leading to antiwords in relation to maxdepth (legend: nodes leading to antiwords; simple pruned; self-compress; self-compress + prune; self-compress + prune + self-compress).

The dependency of the antiword count on maxdepth is shown in Figure 5.7, with the lower part en
...ffix arrays, which were a favourite subject of study in recent years, and many implementations are already available.

3.6.1 Suffix array

Suffix trees were originally developed for searching for suffixes in the text. Later the suffix arrays were discovered; they are used, for example, in the Burrows-Wheeler Transformation [5] and the bzip2 compression method. The suffix array is a sufficient replacement for suffix trees, allowing the same tasks that were done with suffix trees before. The major advantage is much smaller memory requirements and also smaller complexity of some tasks performed during DCA compression, e.g. node visit counting. It is possible to save even more space using compressed suffix arrays [9].

The suffix array is built on top of the input text, representing all text suffixes alphabetically sorted. In fact it contains indexes pointing into the original text. Because for most algorithms this is not enough, we also build the lcp array on top of the input text and the suffix array. An example of a suffix array can be seen in Table 3.1; a special end-of-text symbol, lexicographically the smallest, denotes the end of the text. The lcp (Least Common Prefix) array contains, for each word, the prefix length common with the previous word. With just these two structures we can do all needed operations, as we will show later. String searches can be answered with a complexity similar to the binary
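As an illustration of these two structures, the following sketch builds both arrays naively (plain sorting and direct comparison). The real implementation uses the optimized Manzini-Ferragina construction discussed later, so this O(n² log n) version is for exposition only.

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Build the suffix array by sorting suffix start positions, then the lcp
// array by directly comparing each suffix with its predecessor.
void buildSaLcp(const std::string& t,
                std::vector<int>& sa, std::vector<int>& lcp) {
    int n = (int)t.size();
    sa.resize(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return t.compare(a, n - a, t, b, n - b) < 0;   // compare suffixes
    });
    lcp.assign(n, 0);                                  // lcp[0] = 0 by convention
    for (int i = 1; i < n; ++i) {
        int a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < n && b + k < n && t[a + k] == t[b + k]) ++k;
        lcp[i] = k;                                    // common prefix with previous word
    }
}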
...g antidictionaries idea were implemented: the static compression scheme as well as the dynamic compression scheme, and the static compression scheme with support for almost antifactors, that is, words which occur rarely in the input text. Their results compressing files from the Canterbury and Calgary corpora are presented.

1.4 Organization of the Thesis

In Chapter 2 the basic definitions and terminology used in the thesis can be found. Chapter 3 describes how the different DCA methods work, brings some examples and also some new ideas. Furthermore, the dynamic compression scheme is described and antidictionary construction using a suffix array is introduced. Also the basics of the different stages in the static compression scheme are explained. In Chapter 4 possible implementations of the different DCA methods are presented, along with some ideas for improving their performance or limiting time and memory requirements. Chapter 5 focuses on experiments with different parameters of the implemented methods. Their comparison and results on the Canterbury and Calgary corpora can be found there, provided with comments and recommendations for their usage. The last chapter concludes the thesis and suggests some ideas for future research.

Chapter 2 Preliminaries

Definition 2.1 (Lossless data compression). A lossless data compression method is one where compressing a file and decompressing it retrieves the data back to its original form without any loss. The decompressed file and the orig
5.3 Antidictionary Construction and Optimization  49
5.4 Data Compression  51
5.5 Data Decompression  55
5.6 Different Stages  57
5.7 RLE  57
5.8 Almost Antiwords  59
5.9 Sliced Parallel Antidictionaries  64
5.10 Dividing Input Text into Smaller Blocks  66
5.11 Dynamic Compression  68
5.12 Canterbury Corpus  68
5.13 Selected Parameters  71
5.14 Calgary Corpus  76

6 Conclusion and Future Work  77
6.1 Summary of Results  77
6.2 Suggestions for Future Research  78

A User Manual  81

List of Figures

2.1 Suffix trie vs. suffix tree  7
3.1 DCA basic scheme  10
3.2 Suffix trie construction  12
3.3 Antidictionary construction  13
3.4 Compression/decompression transducer  14
3.5 Basic antidictionary construction  15
3.6 Antidictionary construction using simple pruning  16
3.7 Self-compression example  17
3.8 Self-compression combined with simple pruning  18
3.9 Example of using almost ant
...h a simple augmentation to a classic binary search (P is the searched string length). The suffix array and the associated lcp information occupy a mere 2N integers, and searches are shown to require at most P + ⌈log₂(N − 1)⌉ single-symbol comparisons. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space.

Definition 2.28 (Stopping pair [6]). A pair of words (v, v₁) is called a stopping pair if v = ua, v₁ = u₁b ∈ AD with a, b ∈ {0, 1}, a ≠ b, and u is a suffix of u₁.

Lemma 2.1 (Only one stopping pair [6]). Let AD be an antifactorial antidictionary of a text t. If there exists a stopping pair (v, v₁) with v₁ = u₁b, b ∈ {0, 1}, then u₁ is a suffix of t and does not appear elsewhere in t. Moreover, there exists at most one pair of words having these properties. (The concatenation mark is omitted when it is obvious.)

Chapter 3 Data Compression Using Antidictionaries

3.1 DCA Fundamentals

DCA (Data Compression using Antidictionaries) is a novel data compression method presented by M. Crochemore in [6]. It uses current theories about finite automata and suffix languages to show their abilities for data compression. The method takes advantage of words that do not occur as factors in the text, i.e. that are forbidden; we call them forbidden words or antifactors. Thanks to the existence of these forbidden words we can predict some symbols in the text. Just imagine that we have an
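The plain binary search over the suffix array can be sketched as follows (without the lcp refinement that brings the bound down to P + ⌈log₂(N − 1)⌉ comparisons); the function name is illustrative and sa is assumed to be built as in the earlier sketch.

#include <string>
#include <vector>

// Returns true if pattern p occurs in text t, by binary search over sa.
bool saContains(const std::string& t, const std::vector<int>& sa,
                const std::string& p) {
    int lo = 0, hi = (int)sa.size() - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        // compare the first |p| symbols of the suffix with the pattern
        int cmp = t.compare(sa[mid], p.size(), p);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    return false;
}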
...have an antidictionary AD built over the string 01010101: AD = {100, 11}. If we encoded it using the 2+2 representation, we would get antidictionary size |AD| = 2 × k = 10 bits, as the trie has k = 5 nodes. It is easy to see that the text 010 generates the same antidictionary, while it needs only 3 bits for coding the generating text and a few bits for coding the text length.

Two promising ways of generating the text T_g were found. The first uses the compression transducer processing the input text, avoiding loops that don't bring new nodes into the suffix trie. The second walks back from the stopping pair, using transitions and suffix links in reversed directions, until it walks through all nodes. Both methods may not work on general suffix tries, rather on non-pruned ones constructed from a real text string, but this needs more experiments.

A disadvantage of this representation is slower decompression, because we need to build the antidictionary first from the text T_g; also the decompressor must use the same method as the compressor to reconstruct the antidictionary. Another problem could be the complexity of constructing the text T_g in the compression process. Still, antidictionary representation using a generating text looks promising.

4.6 Compressed File Format

To be able to store the files compressed using DCA, a compressed file format had to be developed. The proposed file format is similar to the one used by gzip and contains only the necessary fields, targeting it to rep
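For reference, a 2+2 serializer over a hypothetical binary-trie node can look like the sketch below (two bits per node in pre-order: 00 no subtree, 01 right only, 10 left only, 11 both). The node type and the character-based output are assumptions for readability; the real implementation emits packed bits.

#include <string>

struct TrieNode {
    TrieNode* child[2] = {nullptr, nullptr};   // hypothetical node type
};

// Append two flag bits per node, then recurse into existing subtrees.
void encode22(const TrieNode* p, std::string& out) {
    out += p->child[0] ? '1' : '0';            // left-subtree flag
    out += p->child[1] ? '1' : '0';            // right-subtree flag
    if (p->child[0]) encode22(p->child[0], out);
    if (p->child[1]) encode22(p->child[1], out);
}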
...he next symbol as 1 and push it to the output.
5. Current prefix 0111: there is the word 1111 in AD, so we predict the next symbol 0 and push it to the output.

step    1   2    3    4     5      6       7        8
input   0   ε    1    ε     ε      ε       ε        ε
output  0   01   011  0111  01110  011101  0111010  01110101

After decompression of the text 01 we get the original text 01110101. What is important is that we don't know exactly when to stop the algorithm knowing just the compressed text and the antidictionary. This means we need to know the length of the original text, or we could decompress even infinitely. Another possibility is to store the number of erased symbols after using the last input bit, which could be sufficient for most implementations, but this supposes that we can determine exactly the end of the input text.

Figure 3.2: Constructing suffix trie for text cacao.

For the compression and decompression process, the antidictionary must be able to answer the query on a word v: does there exist a word u = u'x, x ∈ {0, 1}, u ∈ AD, such that u' is a suffix of v? The answer determines whether the symbol x will be kept or erased in the compressed text. To speed up the queries we can represent the antidictionary as a finite transducer, which leads to fast, linear-time compression and decompression; then we can compare it to the fastest compression methods. To build the compression/decompression transducer we need a special
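A sketch of such a transducer-driven decompressor is shown below. The State structure with a precomputed predicted symbol is a simplification of the transducer just described, and stopping by the known original length follows the discussion above; names are illustrative.

#include <string>

struct State {
    State* delta[2];   // automaton transitions for symbols 0 and 1
    char predict;      // '0' or '1' when a symbol is forced, '?' otherwise
};

// Emit predicted symbols for free, consume one compressed bit otherwise.
std::string decompress(State* s, const std::string& in, size_t origLen) {
    std::string out;
    size_t i = 0;
    while (out.size() < origLen) {
        char a = (s->predict != '?') ? s->predict   // symbol was erased
                                     : in[i++];     // symbol was kept
        out += a;
        s = s->delta[a - '0'];
    }
    return out;                                     // origLen tells us when to stop
}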
Figure 5.17: Suffix trie, static compression scheme: compression phases' contribution to the total compression time.
Figure 5.18: Time consumption of individual phases during the compression process using suffix array (static compression scheme).
Figure 5.19: Suffix array, static compression scheme: compression phases' contribution to the total compression time.

...longer than 3 and encodes the sequence length using Fibonacci coding [2]. Using RLE has practically no influence on the time or memory needed to compress a file, but it has a significant impact on compressing particular files. Gener
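The Fibonacci code for the run lengths can be sketched as follows; this is an illustrative encoder producing '0'/'1' characters (the implementation emits bits). A number n ≥ 1 is written as a sum of non-consecutive Fibonacci numbers 1, 2, 3, 5, 8, ..., the flags are output from the lowest term up, and a closing '1' is appended, so e.g. 1 → 11, 2 → 011, 3 → 0011, 4 → 1011.

#include <string>
#include <vector>

std::string fibEncode(unsigned n) {              // requires n >= 1
    std::vector<unsigned> fib = {1, 2};
    while (fib.back() + fib[fib.size() - 2] <= n)
        fib.push_back(fib.back() + fib[fib.size() - 2]);
    std::string bits(fib.size(), '0');
    for (int i = (int)fib.size() - 1; i >= 0; --i)
        if (fib[i] <= n) { bits[i] = '1'; n -= fib[i]; }   // greedy from the top
    while (bits.back() == '0') bits.pop_back();  // drop unused high terms (n = 1)
    return bits + '1';                           // the final "11" marks the end
}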
...iles, and it is not stable, as it quickly gets to a good compression ratio at maxdepth about 25 but then it is not able to improve the ratio further (Figure 5.28); it even gets worse compression ratios with increasing maxdepth. Most important about this method is that we can get compression ratios similar to the ratios obtained by the other compression schemes, but at lower maxdepth, requiring less memory.

5.9 Sliced Parallel Antidictionaries

This test is based on the idea from Section 4.10. It was tested to compress the bits from one byte separately, using 8 parallel antidictionaries. The results for both the static and the dynamic compression scheme can be found in Tables 5.1 and 5.2. This measurement was performed on the Canterbury Corpus with maxdepth = 40. Columns b0-b7 represent the compression ratio obtained by compressing files created from the n-th bit of the original file only. Column "total" represents the overall compression ratio obtained using parallel antidictionaries; column "orig" contains the compression ratio obtained by not using parallel antidictionaries.

The static compression scheme using parallel antidictionaries reached much worse compression ratios in almost all cases. The dynamic compression scheme performed in a similar way, with the exception of the ptt5 file, where it surprisingly obtained a significantly better compression ratio. The experiment demonstrated that there exists some type of files where this method would be useful. Also compressing the 7th bit separately c
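The bit slicing itself is straightforward; a sketch with illustrative names follows. It distributes bit n of every input byte into the n-th stream, each of which is then compressed with its own antidictionary.

#include <array>
#include <string>
#include <vector>

// The n-th output stream receives bit n of every input byte.
std::array<std::string, 8> sliceBits(const std::vector<unsigned char>& data) {
    std::array<std::string, 8> streams;
    for (unsigned char byte : data)
        for (int b = 0; b < 8; ++b)
            streams[b] += ((byte >> b) & 1) ? '1' : '0';
    return streams;
}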
...inal are identical; lossless compression preserves data integrity.

Definition 2.2 (Lossy data compression). A lossy data compression method is one where compressing a file and then decompressing it retrieves a file that may well be different from the original, but is close enough to be useful in some way.

Definition 2.3 (Symmetric compression). Symmetric compression is a technique that takes about the same amount of time to compress as it does to decompress.

Definition 2.4 (Asymmetric compression). Asymmetric compression is a technique that takes a different time to compress than it does to decompress.

Note 2.1 (Asymmetric compression). Typically, asymmetric compression methods take more time to compress than to decompress. Some asymmetric compression methods take a longer time to decompress, which would be suited for backup files that are constantly being compressed and rarely decompressed. But basically, faster decompression than compression is what we want for the usual compress once, decompress many times behaviour.

Note 2.2 (Canterbury Corpus). The Canterbury Corpus [15] is a collection of typical files for use in the evaluation of lossless compression methods. The Canterbury Corpus consists of 11 files, shown in Table 2.1. Previously, compression software was tested using a small subset of one or two non-standard files. This was a possible source of bias in experiments, as the data used may have caused the programs to exhibit anoma
...ing. SIAM J. Comput., 35(2):378-407, 2005.

[10] IEEE Computer Society. 2005 Data Compression Conference (DCC 2005), 29-31 March 2005, Snowbird, UT, USA. IEEE Computer Society, 2005.

[11] S. Kurtz. Reducing the space requirement of suffix trees. Softw. Pract. Exper., 29(13):1149-1171, 1999.

[12] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. In SODA, pages 319-327, 1990.

[13] G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. In R. H. Möhring and R. Raman, editors, ESA, volume 2461 of Lecture Notes in Computer Science, pages 698-710. Springer, 2002.

[14] H. Morita and T. Ota. A tight upper bound on the size of the antidictionary of a binary string. In C. Martínez, editor, 2005 International Conference on Analysis of Algorithms, volume AD of DMTCS Proceedings, pages 393-398. Discrete Mathematics and Theoretical Computer Science, 2005.

[15] M. Powell. Evaluating lossless compression methods, 2001.

[16] S. J. Puglisi, W. F. Smyth, and A. Turpin. The performance of linear time suffix sorting algorithms. In DCC [10], pages 358-367.

[17] Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. Pattern matching in text compressed by using antidictionaries. In M. Crochemore and M. Paterson, editors, CPM, volume 1645 of Lecture Notes in Computer Science, pages 37-49. Springer, 1999.

[18] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):2
...ionary, which means no compression at all. Our output will look like:

    len = empty AD 2b + original length 5b + data 16b = 23b    (classical)

Using the almost-antiword approach we get an antidictionary with one word, one exception and empty compressed data; our output size will be:

    len = AD 2b + 2b + original length 5b + data 0b + exceptions 1 × 5b = 14b    (almost antiwords)

As we can see, using the almost-antiword improvement we saved 9 bits in comparison with the classical approach, and this could be even more interesting for longer texts.

3.8 Dynamic Compression Scheme

Till now we were considering the static compression model, where the whole text must be read twice: once when the antidictionary is computed and once when we are actually compressing the input. Using this method we need to store the antidictionary separately; to do it in an efficient way we use techniques like simple pruning and self-compression. But there is also another possible way to use the DCA algorithm. With the dynamic approach we read the text only once: we compress the input and modify the antidictionary at the same time. Whenever we read some input, we recompute the antidictionary and use it for compressing the next input (Figure 3.10).

Figure 3.10: Dynamic compression scheme.

The compression process can be described in these steps:
1. Begin with an empty antidictionary.
2. R
...irectly. With advantage we can use suffix tries for representing the already collected factors as well as for compressing the input online. We use the suffix trie as an automaton, maintaining suffix links. For this we can use the following algorithm:

1   Dynamic-Build-Fact(int maxdepth > 0)
2       root ← new state
3       level(root) ← 0
4       cur ← root
5       while not EOF do
6           read a
7           p ← cur
8           if level(p) = maxdepth then
9               p ← fail(p)
10          while p has no child and p ≠ root do
11              p ← fail(p)
12          if p has only one child p.x then
13              if x = a then
14                  erase symbol a
15              else
16                  write exception
17          else
18              write a to output
19          cur ← Next(cur, a, maxdepth)
20      return root

1   Next(state cur, bool a, int maxdepth > 0)
2       if cur.a defined then
3           return cur.a
4       else if level(cur) = maxdepth then
5           return Next(fail(cur), a, maxdepth)
6       else
7           q ← new state
8           level(q) ← level(cur) + 1
9           δ(cur, a) ← q
10          if cur = root then fail(q) ← root
11          else fail(q) ← Next(fail(cur), a, maxdepth)
12          return q

The function Dynamic-Build-Fact builds the suffix trie and compresses the data at once; the function Next takes care of creating suffix links and missing nodes up to the root node. For dynamic compression we need to know only the p.a transitions, the fail function and the current level of the node, where p is some node and a ∈ {0, 1}. This
...is less than half the information needed for each node and edge in comparison with the suffix trie representation in the static approach, which means smaller memory requirements. Considering time, the asymptotic complexity of Dynamic-Build-Fact gives us only O(N × k), which is identical to the suffix trie construction complexity in Section 4.4.1, but it is already compressing the text where the static compression scheme only starts.

Example 3.8: Compress text 11010111 using the dynamic compression scheme. We start compressing from the beginning, but it could be useful to skip some symbols and get the suffix trie filled a bit; it is typical to get many exceptions at the beginning, because the suffix trie is only forming at first. This is a subject for further experiments. The suffix trie construction process, with the antifactors found in every step, can be seen in Figure 3.11. The compression process in steps can be seen in Table 3.4. We get the output shown there, where C means compressed, E means exception and _ is an empty space after an erased symbol. In total, 3 exceptions and 3 compressions occurred and 2 symbols were passed. What we can see is that the antidictionary changes only when we pass a new symbol or when an exception occurs, while the set of all antifactors can change at any time.

3.8.2 Comparison with static approach

We can think about the advantages of this method: we don't have to separately code the antidictionary, do self-compression or even simple-prune it. We simply use all antifac
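For illustration, the Next function from the algorithm above translates to C++ roughly as follows; the Node structure is a simplified stand-in for the DCAstate structure of the implementation.

// Simplified online trie node: transitions, suffix link, depth.
struct Node {
    Node* next[2] = {nullptr, nullptr};  // p.a transitions
    Node* fail = nullptr;                // suffix link
    int level = 0;                       // depth in the trie
};

Node* nextState(Node* cur, int a, Node* root, int maxdepth) {
    if (cur->next[a]) return cur->next[a];
    if (cur->level == maxdepth)          // depth limit: follow the suffix link
        return nextState(cur->fail, a, root, maxdepth);
    Node* q = new Node;                  // create the missing node
    q->level = cur->level + 1;
    cur->next[a] = q;
    q->fail = (cur == root) ? root
                            : nextState(cur->fail, a, root, maxdepth);
    return q;
}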
...iwords  22
3.10 Dynamic compression scheme  23
3.11 Dynamic compression example  27
4.1 Collaboration diagram of class DCAcompressor  30
4.2 Suffix trie generated by graphviz  32
4.3 File compression/decompression  33
4.4 Implementation of static scheme  34
5.1 Memory requirements of different self-compression options  48
5.2 Time requirements of different self-compression options  48
5.3 Compression ratio of different self-compression options  49
5.4 Self-compression compression ratios on Canterbury Corpus  50
5.5 Number of nodes in relation to maxdepth  50
5.6 Number of nodes leading to antiwords in relation to maxdepth  51
5.7 Number of antiwords in relation to maxdepth  52
5.8 Number of used antiwords in relation to maxdepth  52
5.9 Relation between number of nodes and number of antiwords  53
5.10 Memory requirements for compressing paper1  53
5.11 Time requirements for compressing paper1  54
5.12 Compression ratio obtained compressing paper1  54
5.13 Compressed file structure created using static scheme compressing
...The implemented DCA methods are compared in terms of the achieved compression ratio, memory requirements, and the speed of compression and decompression. Suitable parameters are recommended for each method, and finally the pros and cons of the compared methods are summarized.

Abstract

Data compression using antidictionaries is a novel compression technique based on the fact that some factors never appear in the text. Various DCA (Data Compression using Antidictionaries) method implementations, based on works of Crochemore, Mignosi, Restivo, Navarro and others, are presented and their performance evaluated on standard sets of files for evaluating compression methods.

Antidictionary construction using a suffix array is introduced, focusing on minimizing memory requirements of the static compression scheme. Also the dynamic compression scheme is explained and implemented. Some possible improvements are tested; the implemented DCA methods are evaluated in terms of compression ratio, memory requirements and speed of both compression and decompression. Finally, appropriate parameters for each method are suggested. At the end, pros and cons of the evaluated methods are discussed.

Acknowledgements

I would like to thank my thesis supervisor, Ing. Jan Holub, Ph.D., not only for the basic idea of this thesis but also for many suggestions and valuable contributions. I would also like to thank my parents for their support.

Dedication

To my mother.

Contents
...lous behaviour. Running compression software experiments using the same carefully selected set of files gives us a good evaluation and comparison with other methods.

File          Category           Size
alice29.txt   English text       152089
asyoulik.txt  Shakespeare        125179
cp.html       HTML source         24603
fields.c      C source            11150
grammar.lsp   LISP source          3721
kennedy.xls   Excel spreadsheet  1029744
lcet10.txt    Technical writing   426754
plrabn12.txt  Poetry              481861
ptt5          CCITT test set      513216
sum           SPARC executable     38240
xargs.1       GNU manual page       4227

Table 2.1: Files in the Canterbury Corpus

Note 2.3 (Calgary Corpus). The Calgary Corpus is the most referenced corpus in the data compression field, especially for text compression, and is the de facto standard for lossless compression evaluation. It was founded in 1987. It is a predecessor of the Canterbury Corpus.

Definition 2.5 (Compression ratio). Compression ratio is an indicator evaluating compression performance. It is defined as

    compression ratio = length of compressed data / length of original data

Definition 2.6 (Alphabet). An alphabet Σ is a finite non-empty set of symbols.

Definition 2.7 (Complement of symbol). The complement of a symbol a over Σ, where a ∈ Σ, is the set Σ \ {a} and is denoted ā.

Definition 2.8 (String). A string over Σ is any sequence of symbols from Σ.

Definition 2.9 (Set of all strings). The set of all strings over Σ is denoted Σ*.

Defini
...n, which enlarges memory requirements a lot.

• PROFILING turns on/off profiling info designated for performance measurements.

Profiling uses the POSIX getrusage function and reads ru_utime and ru_stime, representing the user time and system time used by the current process. These times are measured at the beginning and at the end of the measured program part; if the procedure runs more times, the measured time is accumulated. For this purpose the classes TimeDiff_t and AccTime_t are provided: the first measures a time interval, the second measures accumulated time. What we need to know is that using the getrusage function also influences the program performance, especially in the accumulated-time measurement of a program part repeated many times. With the verbose option the program reports the whole time taken to perform the operation, as well as the antidictionary size and the compression ratio achieved.

CPU time consumption is just one part of the system requirements; what we are also very interested in is memory consumption. It was measured using the memusage utility that comes with the GNU libc development tools and is a part of many GNU/Linux distributions. This tool reports the peak memory usage of the heap and stack. What we are interested in most is the heap peak size, as most memory used is allocated dynamically. Also the amount of allocated and correctly deallocated memory can be seen, but in fact we don't really need to check it, as the program com
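A minimal sketch of such a getrusage-based measurement follows; this is a standalone helper, not the actual TimeDiff_t/AccTime_t classes.

#include <sys/resource.h>

// Returns user + system CPU seconds consumed so far by the process.
double cpuSeconds() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    return user + sys;
}

// Usage: double t0 = cpuSeconds();  /* measured phase */
//        double elapsed = cpuSeconds() - t0;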
...ng space needed for the antidictionary representation.

Another improvement is to use RLE (run-length encoding) as an input filter to hide the static and dynamic compression schemes' inability to compress strings of the form 0ⁿ1 (respectively their repetitions in the case of the dynamic compression scheme). According to the experiments, methods equipped with RLE performed better.

It is not possible to pick the best method for everything; different usage scenarios have to be considered. If we need fast decompression speed, we can use the static compression scheme or almost antiwords; the dynamic compression scheme is not appropriate, as its time and memory requirements for decompressing text are the same as when compressing. Nevertheless, the dynamic compression scheme could be useful for its best compression speed, good compression ratios and easy implementation. The situation is somewhat different when we need fast compressed pattern matching: then currently the only choice is the static compression scheme. If we have only a little amount of memory available, the static compression scheme with suffix array construction has the smallest memory requirements. And finally, for the best compression ratios we can choose from almost antifactors or the dynamic compression scheme equipped with RLE, in dependence on our decompression speed demands.

Based on the measurements, some candidates were selected and compared with the standard compression programs gzip and bzi
...o worse with increasing maxdepth, like with plrabn12.txt in Figure 5.35.

5.12 Canterbury Corpus

All tests were run on the whole Canterbury Corpus to find the best compression ratios obtained by each compression method. They can be found in Table 5.3 and also in Figure 5.36. The static compression scheme stays behind; only on plrabn12.txt is it better, while dynamic compression and almost antiwords alternately give the best compression ratios, but both show some odd behaviour on particular files.

An interesting evaluation can be seen from the graph averaging compression ratios over the Canterbury Corpus in relation to maxdepth (Figure 5.37). Here the dynamic compression scheme with RLE looks the best, followed by almost antifactors and later by the static

Figure 5.34: Exception count in relation to maxdepth.
Figure 5.35: Compression ratio obtained compressing plrabn12.txt.
Figure 5.36: Best compression ratio obtained by each method on Canterbury Corpus.

Table 5.3 begins:
method  almostaw  dynamic  dynamic rle  static  static rle
alice29
...o the node. The difference between a suffix tree and a suffix trie should be more obvious from Figure 2.1. The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times [11] the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures.

Definition 2.21 (Antifactor [3]). An antifactor, or forbidden word, is a word that never appears in a given text. Let Σ be a finite alphabet and Σ* the set of finite words of symbols from Σ, the empty word ε included. Let L ⊆ Σ* be a factorial language, i.e. ∀u, v ∈ Σ*: uv ∈ L ⇒ u, v ∈ L. The complement Σ* \ L of L is a two-sided ideal of Σ*. Denote by MF(L) its base: Σ* \ L = Σ* MF(L) Σ*.

Figure 2.1: Comparison of suffix trie (left) and suffix tree over string cacao.

MF(L) is the set of minimal forbidden words for L. A word v ∈ Σ* is forbidden for L if v ∉ L. A forbidden word is minimal if it has no proper factors that are forbidden.

Definition 2.22. The set of all minimal forbidden words we call an antidictionary AD.

Definition 2.23 (Internal nodes). The internal nodes of the suffix trie correspond to nodes actually represented in the trie, that is, to factors of the text.

Definition 2.24 (External nodes). The external nodes correspond to antifactors, and they are implicitly represented in the tree.
...og N) time, where P is the length of v.

Example 3.6: Build an antidictionary for text 01110101 using a suffix array. First we build the suffix array and lcp structure; it can be seen in Table 3.2 (the suffix trie for the same text can be found in Figure 3.3). Using the algorithm introduced above we find the possible antifactors; their positions are marked with a frame around the symbols, and positions with minimal antifactors are underlined. This leads to the set of minimal antifactors (the antidictionary) AD = {00, 0110, 1011, 1111}, which corresponds with the antidictionary computed using the suffix trie.

3.7 Almost Antifactors

The idea of almost antifactors was introduced in [7]. After a more detailed examination of antidictionaries we can discover also their odd behaviour. If we try to compress the string 10ⁿ⁻¹ with k ≥ 2, then the result is satisfying, because we can use {01, 11} as our antidictionary. This permits compressing the string to a single bit plus the small antidictionary. However, if we reverse the string to 0ⁿ⁻¹1, then for any k < n the set of antifactors contains {10, 11}, which indeed does not yield any compression: the classical algorithm produces an empty antidictionary. Yet both strings have the same zero-order entropy.

As we can see, the main problem is that a single occurrence of a string in the text (in our second example the string 01) outrules it as an antifactor. In a less extreme case, it may be possible that a string sb
...ompression. The idea is based on the fact that for an antidictionary AD(T), generated from a string T whose automaton contains cycles, there exists some string T_g which generates the same antidictionary: AD(T_g) = AD(T), T_g ≠ T.

To be more efficient than a representation R, we need to find a text T_g generating the antidictionary AD with |T_g| < rsize(R, AD), where rsize is the number of bits used by representation R to code the antidictionary AD. A problem is that this method probably won't work for general pruned antidictionaries; in-depth future research on this topic is needed. Another problem is that the generating text T_g should contain the original stopping pair: a different text ending could lead to new, unwanted antiwords. An example of an antidictionary whose compression/decompression transducer contains a loop is provided.

size  item             description
3B    "DCA"            text identificator, to unambiguously identify a DCA-compressed file
4B    original length  length of the original input text; this length is important when
                       decompressing data, to determine the end of the decompression process
4B    CRC32            cyclic redundancy check of the original data; it is computed and
                       checked during decompression
      antidictionary   antidictionary representation as specified in Section 4.5
      compressed data  data with bits erased using DCA compression

Table 4.2: Compressed file format

Example 4.1: Let's suppose that we
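The fixed-size part of this header can be sketched as follows; the field names and the little-endian byte order are illustrative assumptions, only the layout of Table 4.2 itself is taken from the format.

#include <cstdint>
#include <cstring>

struct DcaHeader {
    char     magic[3];      // "DCA" identificator
    uint32_t originalLen;   // length of the original input text
    uint32_t crc32;         // checksum of the original data
};

// Serialized field by field to avoid struct padding issues.
void writeHeader(unsigned char* out, const DcaHeader& h) {
    std::memcpy(out, h.magic, 3);
    for (int i = 0; i < 4; ++i) out[3 + i] = (h.originalLen >> (8 * i)) & 0xFF;
    for (int i = 0; i < 4; ++i) out[7 + i] = (h.crc32 >> (8 * i)) & 0xFF;
    // the antidictionary and the compressed bits follow
}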
...ompression ratio, we can choose from almost antiwords or dynamic DCA. A methods overview is presented in Table 5.6; algorithm instability means that the method does not give a better compression ratio with increasing k.

Looking at the static compression performance, maxdepth k = 34 looks the best: it gives a better compression ratio than k = 30, comparable to the ratios obtained using almost antiwords or dynamic DCA, and it also runs in a much shorter time than k = 40. Using a suffix array for antidictionary construction is generally a good idea, not only considering memory requirements: it is also faster at k = 34 than antidictionary construction using a suffix trie.

5.14 Calgary Corpus

Results for the Calgary Corpus are presented too, as it is still broadly used for evaluating compression methods; see Table 5.5.

Chapter 6 Conclusion and Future Work

6.1 Summary of Results

In this thesis, data compression using antidictionaries was implemented with different techniques, and their performance on standard sets of files for evaluating compression methods was presented. This work extends the research of Crochemore, Mignosi, Restivo, Navarro and others [6, 7], who introduced the original idea of data compression using antidictionaries, described the static compression scheme thoroughly, with antidictionary construction using a suffix trie, and also introduced the almost antifactors improvement. This thesis has introduced antidictionary building using the suffix array structure instea
...on, we can use self-compression and simple pruning as long as we prune some antiwords from the trie; we call this multiple self-compression. The following tests were performed over the paper1 file from the Calgary Corpus. In Figure 5.1 we can see that "simple pruning only" requires more memory than the methods using self-compression. In Figure 5.2 we can see that "simple pruning only" is the fastest, but the

Figure 5.1: Memory requirements of different self-compression options.
Figure 5.2: Time requirements of different self-compression options.
Figure 5.3: Compression ratio obtained compressing paper1 for different self-compression options.

difference is not very interesting in comparison with the whole time needed. Finally, in Figure 5.3 we can see the compression ratio achieved, where both self-compression versions performe
...ook2   610856  206681  157443  237339  242923  253609
geo     102400   68489   56921   77219   88801   78570
news    377109  144835  118600  173141  164269  180022
obj1     21504   10318   10787   16154   13160   14544
obj2    246814   81626   76441  132067  108350  130868
paper1   53161   18570   16558   24496   21408   24146
paper2   82199   29746   25041   33499   34076   35792
pic     513216   56438   49759   45556   95175  100432
progc    39611   13269   12544   18602   15961   18496
progl    71646   16267   15579   24898   21711   25339
progp    49379   11240   10710   18670   14157   17256
trans    93695   18979   17899   34159   24166   32223

Table 5.5: Compressed file sizes obtained on Calgary Corpus

Method            Advantages                          Disadvantages
almost antiwords  good compression ratios,            hard implementation, compression
                  decompression speed                 speed and memory requirements,
                                                      algorithm instability
static DCA        memory requirements when using      compression speed, slightly worse
                  suffix array, decompression speed,  compression ratios
                  compressed pattern matching
dynamic DCA       good compression ratios, fast       slow decompression and
                  compression speed                   decompression memory requirements

Table 5.6: Pros and cons of different methods

If we have only a little amount of memory, antidictionary construction using a suffix array would be the right choice. When we are not interested in low decompression time, dynamic DCA gives us the best performance. If we are interested in c
...ould lead

file          b0     b1     b2     b3     b4     b5    b6    b7     total  orig
alice29.txt   99.9   99.8   99.9   99.7   99.5   85.3  93.6    0.0   84.7   39.9
asyoulik.txt  99.9   99.5   99.9   99.8   99.5   85.7  94.6    0.0   84.9   42.4
cp.html       89.7   92.4   91.7   91.7   90.2   75.4  76.0  100.0   88.4   45.1
fields.c      99.3   98.3   98.4   98.8   94.7   78.3  88.5    0.1   82.1   42.9
grammar.lsp  100.2   96.3  100.0   99.6   97.4   73.8  89.5    0.2   82.3   49.7
lcet10.txt    99.0   98.9   99.2   99.1   99.0   89.1  91.5    0.0   84.5   36.3
plrabn12.txt  99.9  100.0  100.0  100.0   99.9   90.3  88.1    0.0   84.8   39.9
ptt5          91.1   91.5   91.2   91.3   93.2   93.3  93.3   92.9   92.2   97.3
sum           87.1   85.5   84.6   78.9   77.2   71.6  67.4   80.7   79.1   62.8
xargs.1      100.0  100.2  100.2  100.2   99.4   85.6  90.4    0.2   84.6   60.1

Table 5.1: Parallel antidictionaries using static compression scheme

file          b0     b1     b2     b3     b4     b5    b6     b7    total  orig
alice29.txt  124.8  120.1  127.8  122.2  108.7  73.2  104.3   0.0   97.6   39.4
asyoulik.txt 127.1  123.3  128.8  125.9  113.4  71.8  103.6   0.0   99.2   44.0
cp.html       95.8   99.1   99.0  100.0   93.5  61.8   75.1   2.3   78.3   39.0
fields.c      99.3   97.5   97.3   97.2   84.6  59.7   79.5   0.0   76.8   33.1
grammar.lsp  104.3  100.5   99.1  101.5   92.5  56.7   81.6   0.0   79.5   38.1
lcet10.txt   122.4  120.1  124.1  119.2  110.2  82.7   96.4   0.0   96.9   35.5
plrabn12.txt 128.2  125.0  130.6  126.2
...p2. DCA showed that on particular files it could be a good competitor to them, but still with much larger resource requirements. Nevertheless, DCA is an interesting compression method, using current theories about finite automata and data structures for representing all text factors. Its potential could be in its compressed pattern matching abilities.

6.2 Suggestions for Future Research

The almost antiwords technique performed surprisingly well, even using the one-pass heuristics, but its time and memory requirements for building the antidictionary from the suffix trie were too greedy. More research could be done to reduce them, as well as on compressed pattern matching in texts compressed using almost antiwords.

Another possible improvement concerns the static compression scheme, where representing the antidictionary by a text generating it seems promising, improving the compression ratio.

In the considered methods the Fibonacci code was used for storing exception distances or lengths of runs in RLE, but maybe some more optimal code could be used.

The suffix array construction uses a modified Manzini-Ferragina implementation for sorting text with a binary alphabet, whereas it was originally developed for larger alphabets, which means that the used suffix array and lcp array construction may not be optimal.

Finally, some more work could be done on optimizing the codes, as they were developed rather for testing and research purposes, with many customizable parameters, than for
...paper1  55
5.14 Memory requirements for decompressing paper1  56
5.15 Time requirements for decompressing paper1  56
5.16 Individual phases of compression process using suffix trie  57
5.17 Individual phases' time contribution using suffix trie  58
5.18 Individual phases of compression process using suffix array  58
5.19 Individual phases' time contribution using suffix array  59
5.20 Compression ratio obtained compressing grammar.lsp  60
5.21 Compression ratio obtained compressing sum  60
5.22 Memory requirements using almost antiwords  61
5.23 Time requirements using almost antiwords  61
5.24 Compression ratio obtained compressing paper1  62
5.25 Compressed file structure created using almost antiwords  62
5.26 Compression ratio obtained compressing alice29.txt  63
5.27 Compression ratio obtained compressing ptt5  63
5.28 Compression ratio obtained compressing xargs.1  64
5.29 Memory requirements in relation to block size compressing plrabn12.txt  65
5.30 Time requirements in relation to block size compressing plrabn12.txt  66
5.31 Compression ratio obtained compressing plrabn12.txt  67
5.32 Compressed file structure in relation to block size  67
5.33 Dynamic compression scheme exception distances histogram  68
5.34 Exception count in relation to maxdepth
...presses only one file at a time and the memory is automatically deallocated when the program terminates.

For debugging purposes LOG_MASK was introduced, to have the ability to select what we really want to debug and not to be choked up with other extensive, useless debug messages. LOG_MASK is a set of the following items:

• DBG_MAIN prints information about which part of the algorithm is currently being run.
• DBG_TRIE debugs the suffix trie data structures and displays their contents in different parts of the algorithm; it also exports the contents into graphviz (a graph-file language for drawing directed graphs using the dot tool).
• DBG_COMP debugs the compression process: shows read symbols, used transitions in the compression transducer, and compressed symbols.
• DBG_DECOMP debugs the decompression process: shows read symbols, used transitions and symbol output.

Figure 4.2: Suffix trie for text 11010111 generated by graphviz.

• DBG_PRUNE debugs the simple pruning process: gain computation, traversal over the suffix trie, tested and pruned nodes.
• DBG_AD prints the antidictionary and the self-compressed antidictionary contents in different stages of the algorithm, such as before and after simple pruning.
• DBG_STATS prints some useful statistics, like the results of simple pruning, antidictionary self-compression and overall algorithm performance.
• DBG_ALMOSTAW debugs almost-antifactors-related information.
• DB
...resenting files compressed by the static compression scheme. CRC32 is used to check the integrity of the decompressed data; see Table 4.2.

4.7 Antidictionary Construction Using Suffix Array

4.7.1 Suffix array construction

These days we have a choice between many algorithms for suffix array construction. They were the subject of tests in [16], according to which the used implementation was chosen. The problem is that these algorithms are designed for 8-bit symbols, while we need just 0's and 1's for antidictionary construction. It was obvious that it was necessary to make some modifications to the existing implementation to handle a binary alphabet.

According to the paper [16], none of the new linear-time construction algorithms would be a good choice, for their large constants and memory requirements; instead, a highly optimized Manzini-Ferragina implementation [13], with non-linear asymptotic complexity, performed better in almost all the tests, in both memory consumption and CPU time used. Their C implementation was modified to compile under C++, and the algorithm was also adjusted to use just 1-bit symbols instead of 8 bits in the input text. Corrections were applied to the deep sort, shallow sort and bucket sort subroutines; the LCP algorithm did not need this adjustment. But because the whole algorithm wasn't examined well after this 8-bit to 1-bit modification, the whole algorithm could be non-optimal now. This
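For completeness, the CRC32 used for the integrity check mentioned above can be computed with the standard table-driven algorithm (polynomial 0xEDB88320, the variant gzip uses). This is a generic sketch; the thesis only requires that the compressor and decompressor agree on the variant.

#include <cstdint>
#include <cstddef>

uint32_t crc32(const unsigned char* data, size_t len) {
    static uint32_t table[256];
    static bool init = false;
    if (!init) {                                  // build the lookup table once
        for (uint32_t i = 0; i < 256; ++i) {
            uint32_t c = i;
            for (int k = 0; k < 8; ++k)
                c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : (c >> 1);
            table[i] = c;
        }
        init = true;
    }
    uint32_t c = 0xFFFFFFFFu;                     // standard pre-inversion
    for (size_t i = 0; i < len; ++i)
        c = table[(c ^ data[i]) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;                       // standard post-inversion
}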
...rie [18]. Suffix trie STrie(T) can be constructed in time proportional to the size of STrie(T), which in the worst case is O(|T|²).

Definition 2.20 (Suffix tree [18]). Suffix tree STree(T) of T is a data structure that represents STrie(T) in space linear in the length |T| of T. This is achieved by representing only a subset Q ∪ {⊥} of the states of STrie(T). We call the states in Q ∪ {⊥} the explicit states. Set Q consists of all branching states (states from which there are at least two transitions) and all leaves (states from which there are no transitions) of STrie(T). By definition, the root is included in the branching states. The other states of STrie(T) (the states other than the root and ⊥ from which there is exactly one transition) are called implicit states; as states of STree(T) they are not explicitly present in STree(T).

Note 2.5 (Suffix link). The suffix link is a key feature for linear-time construction of the suffix tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. The suffix link corresponds to the function f(r) of state r. If the path from the root to a node spells the string bv, where b ∈ Σ is a symbol and v is a (possibly empty) string, the node has a suffix link to the internal node representing v.

Note 2.6 (Suffix tree). A suffix tree represents all suffixes of a given string. It is designed for fast substring searching: each node represents a substring, which is determined by the path t
...rue
4       p ← asc(p)

4.4.3 Building automaton

After building the antidictionary and removing all nodes not leading to antiwords, the trie is not consistent: some of the suffix links point to deleted nodes. This has to be fixed before self-compressing the trie. At the same time as we correct the suffix links, we also define new δ' transitions and ε transitions, creating a compression transducer. The time asymptotic complexity of Build-Automaton is O(N × k).

1   Build-Automaton(root)
2       for a ∈ {0, 1} do
3           if δ(root, a) defined and not deleted(δ(root, a)) then
4               δ'(root, a) ← δ(root, a)
5               fail(δ(root, a)) ← root
6           else
7               δ'(root, a) ← root
8       if antiword(δ(root, a)) for a ∈ {0, 1} then
9           ε(root) ← δ'(root, ā)
10      for each node p, p ≠ root, of the trie in breadth-first order do
11          for a ∈ {0, 1} do
12              if δ(p, a) defined and not deleted(δ(p, a)) then
13                  δ'(p, a) ← δ(p, a)
14                  fail(δ(p, a)) ← δ'(fail(p), a)
15              else if not antiword(p) then
16                  δ'(p, a) ← δ'(fail(p), a)
17              else
18                  δ'(p, a) ← p
19          if not antiword(p) then
20              if antiword(δ(p, a)) for a ∈ {0, 1} then
21                  ε(p) ← δ'(p, ā)

4.4.4 Self-compression

Using self-compression we create a new compressed trie from the original trie and initialize the gain values g(r) of all nodes to −1. This is necessary for the simple pruning algorithm, as it walks the trie bottom-up and needs to know whether both subtrees were already processed.
List of Figures  xvii
List of Tables  xxi

1 Introduction  1
1.1 Problem Statement  1
1.2 State of The Art
1.3 Contribution of the Thesis
1.4 Organization of the Thesis  2

2 Preliminaries  3

3 Data Compression Using Antidictionaries
3.1 DCA Fundamentals
3.2 Data Compression and Decompression
3.3 Antidictionary Construction Using Suffix Trie
3.4 Compression/Decompression Transducer
3.5 Static Compression Scheme
3.5.1 Simple pruning
3.5.2 Antidictionary self-compression
3.6 Antidictionary Construction Using Suffix Array
3.6.1 Suffix array
3.6.2 Antidictionary construction
3.7 Almost Antifactors
3.7.1 Compression ratio improvement  21
3.7.2 Choosing nodes to convert  21
3.8 Dynamic Compression Scheme  23
3.8.1 Using suffix trie online construction  24
3.8.2 Comparison with static approach  25
3.9 Searching in Compressed Text  26

4 Implementation  29
4.1 Used
…sion and decompression process schemes can be found. Let us examine both processes further.

With the static compression scheme, after building the antidictionary and the compression transducer we have already read the whole text, so we know its length, and we can easily compute its CRC32 checksum while building the antidictionary. As the decompression process needs to know the length of the original data, we save it, along with the CRC32 for verifying data integrity, into the output file. Then we save the antidictionary using one of the methods discussed in Section 4.5. Only after this do we start compressing the input data using the compression transducer, writing its product to the output file.

The decompression process should be clear. First we read the data length and the CRC32, and we load the antidictionary into memory in a suffix trie representation. Then we self-decompress the trie, updating all suffix links and creating the decompression transducer at the same time. With the prepared transducer we run decompression until we get originalLen amount of data, writing the product to the output file and calculating the CRC32 of the decompressed data. At the end we verify the checksum and notify the user of the decompression result (CRC OK, or a description of the error that occurred).

In Section 3.5 the antidictionary construction process was presented, but it was not complete. Actually, the Build-Automaton phase must be executed one more time, just after the Build-Antidictionary stage, to fix all incorrect suffix links and forward edges pointing to removed nodes, as they are required for self-compression.
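A minimal sketch of the header handling just described; the field order, the struct name DcaHeader and the helper names are assumptions (only the presence of the original length and the CRC32 is given by the text), and a real format would also pin down endianness:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical header stored in front of the antidictionary and the
    // compressed bit stream.
    struct DcaHeader {
        std::uint64_t originalLen;  // length of the uncompressed input
        std::uint32_t crc32;        // checksum of the uncompressed input
    };

    bool writeHeader(std::FILE* out, const DcaHeader& h) {
        return std::fwrite(&h.originalLen, sizeof h.originalLen, 1, out) == 1
            && std::fwrite(&h.crc32, sizeof h.crc32, 1, out) == 1;
    }

    bool readHeader(std::FILE* in, DcaHeader* h) {
        return std::fread(&h->originalLen, sizeof h->originalLen, 1, in) == 1
            && std::fread(&h->crc32, sizeof h->crc32, 1, in) == 1;
    }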
…sizes obtained on Calgary Corpus
Pros and cons of different methods

Chapter 1
Introduction

1.1 Problem Statement

DCA (Data Compression using Antidictionaries) is a novel data compression method presented by M. Crochemore in [6]. It uses current theories about finite automata and suffix languages to show their abilities for data compression. The method takes advantage of words that do not occur as factors in the text, i.e. that are forbidden. Thanks to the existence of these forbidden words, some symbols in the text can be predicted. The general idea of the method is quite interesting: first, the input text is analyzed and all forbidden words are found. Using the binary alphabet {0, 1}, symbols whose occurrences can be predicted using the set of forbidden words are erased. DCA is a lossless compression method which operates on binary streams.

1.2 State of The Art

Currently there are no available implementations of DCA; all were developed for experimental purposes only. Some research is being done on using larger alphabets rather than binary, and on using compacted suffix automata (CDAWGs) for antidictionary construction.

1.3 Contribution of the Thesis

In the thesis, dynamic compression using DCA is explained and implemented, and antidictionary construction using a suffix array is introduced, focusing on minimizing memory requirements. Several methods based on data compression usin…
…sn
Compression ratio obtained compressing plrabn12.txt
Best compression ratio obtained by each method on Canterbury Corpus
5.37 Average compression ratio obtained on Canterbury Corpus
5.38 Average compression speed on Canterbury Corpus
5.39 Average time needed to compress 1MB of input text
5.40 Memory needed to compress 1B of input text
5.41 Compression ratio obtained by selected methods on Canterbury Corpus
5.42 Time needed to compress 1MB of input text
5.43 Memory needed by selected methods to compress 1B of input text

List of Tables

2.1 Canterbury Corpus
3.1 Suffix array for text abeaab
3.2 Suffix array used for antidictionary construction
3.3 Example of node gains as antiwords
3.4 Dynamic compression example
4.1 DCAstate structure implementation
4.2 Compressed file format
5.1 Parallel antidictionaries using static compression scheme
5.2 Parallel antidictionaries using dynamic compression scheme
5.3 Best compression ratios obtained on Canterbury Corpus
5.4 Compressed file sizes obtained on Canterbury Corpus
5.5 Compressed file …
…structure DCAstateC (Table 4.1b), which is more efficient; the meaning of its variables is similar to the DCAstate variables, and original is a pointer to the original node in the non-compressed trie.

4.4.2 Building antidictionary

Function Build-AD walks through the tree and adds all minimal antifactors. Using the function MarkPath, it marks all nodes on the path from the root node to the antiword. Then, traversing in depth-first order, it computes the nodes' total visits and removes all nodes that do not lead to any antiword. We also omit stopping-pair antifactors, as they do not bring any compression. Time asymptotic complexity of Build-AD is O(N × k), which makes it strongly dependent on maxdepth k.

Build-AD(root)
  for each node p with level(p) < k in breadth-first order do
    for a ∈ {0, 1} do
      if δ(p, a) is not defined and δ(fail(p), a) is defined then
        q ← new state
        δ(p, a) ← q
        antiword(q) ← true
        MarkPath(q)
  for each node p with not antiword(p), in depth-first order do
    if fvis(p) > 0 then
      vis ← 0
      q ← p
      while q ≠ root do
        if fvis(q) > 0 then
          vis ← vis + fvis(q)
          fvis(q) ← 0
        q ← fail(q)
        visited(q) ← visited(q) + vis
    if not awpath(p) then
      if asc(p) is defined then
        δ(asc(p), b(p)) ← NULL
      deleted(p) ← true

MarkPath(p)
  while p is defined and not awpath(p) do
    awpath(p) ← true
    p ← asc(p)
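The minimality test at the heart of Build-AD can be stated compactly in C++; a hedged sketch with illustrative names (the real code additionally creates the antiword state and maintains the visit counters and awpath flags):

    struct TNode {          // only the fields needed for the test
        TNode* next[2];     // trie edges on symbols 0 and 1
        TNode* fail;        // suffix link
    };

    // word(p)·a is a minimal antifactor iff p has no a-edge while fail(p)
    // has one: word(p) occurs, the longest proper suffix of word(p) followed
    // by a occurs, but word(p)·a itself never occurs in the text.
    bool isMinimalAntiword(const TNode* p, int a, const TNode* root) {
        return p != root && p->next[a] == nullptr && p->fail->next[a] != nullptr;
    }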
  …visited(root) ← 0; deleted(root) ← false; antiword(root) ← false; awpath(root) ← false
  cur ← root
  while not EOF do
    read a
    cur ← Next(cur, a, maxdepth)
    fvis(cur) ← fvis(cur) + 1
  return root

Next(state cur, bool a, int maxdepth > 0)
  if δ(cur, a) is defined then
    return δ(cur, a)
  else if level(cur) = maxdepth then
    return Next(fail(cur), a, maxdepth)
  else
    q ← new state
    level(q) ← level(cur) + 1
    fvis(q) ← 0; visited(q) ← 0
    b(q) ← a
    deleted(q) ← false; antiword(q) ← false; awpath(q) ← false
    δ(cur, a) ← q
    if cur = root then fail(q) ← root
    else fail(q) ← Next(fail(cur), a, maxdepth)
    return q

    struct DCAstate {                      // (a)
        DCAstate *next[2];                 // δ transitions
        DCAstate *fail;                    // suffix link
        DCAstate *asc;                     // parent node
        DCAstate *epsilon;                 // ε transition
        int visited, fvis, level;
        bool b, awpath, antiword, deleted;
    };

    struct DCAstateC {                     // (b)
        DCAstate  *snext[2];               // δ transitions into the original trie
        DCAstateC *next[2];
        DCAstateC *asc;
        DCAstate  *original;               // corresponding node in the original trie
        int gain;
        bool b;
    };

Table 4.1: DCAstate structure (a), representing a suffix trie node, and DCAstateC structure (b), representing a node in the self-compressed trie.

Suffix trie nodes are represented by the DCAstate structure (Table 4.1a): next represents the δ transitions, snext represents the δ transitions into the original trie, epsilon represents the ε transitions, and the other variables represent the functions with corresponding names. For the representation of nodes in the self-compressed trie, another structure, DCAstateC, is used.
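A hedged C++ rendering of the Next routine, written against the DCAstate structure of Table 4.1a; the allocation strategy and the explicit root parameter are simplifications of this sketch, not necessarily the thesis code:

    DCAstate* Next(DCAstate* cur, bool a, int maxdepth, DCAstate* root) {
        if (cur->next[a])                        // transition already exists
            return cur->next[a];
        if (cur->level == maxdepth)              // depth limit reached: defer
            return Next(cur->fail, a, maxdepth, root); // to the suffix-link target
        DCAstate* q = new DCAstate();            // create the missing node
        q->level = cur->level + 1;
        q->b     = a;
        q->asc   = cur;
        cur->next[a] = q;
        q->fail = (cur == root) ? root
                                : Next(cur->fail, a, maxdepth, root);
        return q;
    }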
…th antidictionary size increases to shorten the compressed data; together, we get a decreasing total file size.

5.5 Data Decompression

In turn, in Figure 5.14 we can see that the static compression scheme requires only a small amount of memory to decompress data, while the dynamic compression scheme requires as much memory as during the compression process. A smaller difference can be seen in Figure 5.15, in the time required to decompress the file paper1: the static compression scheme is much faster again. Decompression speed and low memory requirements are an apparent advantage of the static compression scheme.

[Figure 5.14: Memory requirements for decompressing paper1.dz; used memory (MB) versus maxdepth (10-50), static DCA versus dynamic DCA]

[Figure 5.15: Time requirements for decompressing paper1.dz; time (s) versus maxdepth, static DCA versus dynamic DCA]

[Figure 5.16: Time consumption of individual phases during compr…; phases plotted versus maxdepth: build suffix trie, build antidictionary, count visits, build automaton, self-compress, simple prune, self-compress, build automaton, save antidictionary, compress data]
…the first differing symbol: u_i = cxv and u_{i+1} = cyw, where c, v, w ∈ Σ* and x, y ∈ Σ.

2. If y − x > 1, then add antifactor c(x + 1).
3. For each symbol v_j of string u_i = cxv such that v_j = 0, add antifactor cxv_1…v_{j−1}1.
4. For each symbol w_j of string u_{i+1} = cyw such that w_j = 1, add antifactor cyw_1…w_{j−1}0.
5. Repeat the previous steps for all suffix array items.

[Table 3.2: Suffix array for binary text 01110101, highlighting antifactor positions]

6. For the last item u_n = v, for each symbol v_j = 0 add antifactor v_1…v_{j−1}1.

This simple algorithm finds all text antifactors; we only need to limit the antifactor length. Using this technique we find all antifactors, but what we really want are the minimal antifactors. One possible way is to construct a suffix trie from the found antifactors and then choose just the minimal antifactors using suffix links. The second option is to utilize the suffix array's ability to search for strings: to check whether an antifactor u = av, a ∈ {0, 1}, is minimal, try to find the string v in the suffix array. If string v appears in the text, then the antifactor u is minimal. Searching for a string in a suffix array equipped with the lcp array takes O(|P| + log n) time.
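The occurrence test behind the minimality check can be sketched over a plain suffix array with binary search; without the lcp refinement mentioned above this sketch runs in O(|P| log n), and the names occurs, text and sa are illustrative:

    #include <string>
    #include <vector>

    // Does pattern p occur in text? sa holds the starting positions of all
    // suffixes of text in lexicographic order.
    bool occurs(const std::string& text, const std::vector<int>& sa,
                const std::string& p) {
        int lo = 0, hi = static_cast<int>(sa.size()) - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            // Compare the suffix starting at sa[mid] with p, looking at no
            // more than |p| characters of the suffix.
            int cmp = text.compare(sa[mid], p.size(), p);
            if (cmp == 0) return true;   // p is a prefix of this suffix
            if (cmp < 0) lo = mid + 1;
            else         hi = mid - 1;
        }
        return false;
    }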
…time asymptotic complexity O(N × k × 2).

Simple-Prune(rootCompr)
  for each node p with antiword(p) do
    while p ≠ rootCompr do
      q ← asc(p)
      if antiword(original(p)) then
        g(p) ← visited(asc(original(p))) − 2
      else if p has children p1, p2 then
        if g(p1) = −1 or g(p2) = −1 then
          break
        g(p) ← g(p1) + g(p2) − 2
      else if p has a single child pp then
        if g(pp) = −1 then
          break
        g(p) ← g(pp) − 2
      if g(p) < 0 then
        prune subtree p; prune subtree original(p)
      p ← q

4.5 Antidictionary Representation

In the static compression scheme we need to provide the compressed data with the antidictionary used for compression, in order to be able to decompress it. And as the antidictionary size reaches about 50% of the compressed data length, we know the antidictionary is really important and has great significance for the final compression ratio. As in the original paper, the 2/2 representation is used, but other possibilities are also discussed:

• antiword list: probably the least efficient storage for antiwords; it does not use any common prefixes, but it has better abilities for self-compression, as mentioned in Section 3.5.2.
• 2/2: a binary tree with k nodes is encoded using 2k bits for the whole tree. Walking in depth-first order, we encode in each node whether it has no subtree, only the right subtree, only the left subtree, or both subtrees, by the strings 00, 01, 10 and 11, respectively (a sketch of this encoding follows below).
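As promised above, a sketch of the 2/2 encoder following the code/meaning table (00 no subtree, 01 only the right subtree, 10 only the left subtree, 11 both subtrees); the BitWriter and TrieNode2 shapes are assumptions of this sketch, not the thesis implementation:

    #include <vector>

    struct BitWriter {                       // minimal bit sink
        std::vector<bool> bits;
        void put(bool b) { bits.push_back(b); }
    };

    struct TrieNode2 { TrieNode2* left; TrieNode2* right; };

    // Depth-first 2/2 encoding: two bits per node flag which subtrees exist
    // (left bit first, so "10" means left only), hence exactly 2k bits for
    // a tree of k nodes.
    void encode22(const TrieNode2* p, BitWriter& out) {
        out.put(p->left  != nullptr);
        out.put(p->right != nullptr);
        if (p->left)  encode22(p->left,  out);
        if (p->right) encode22(p->right, out);
    }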
…tion 2.10 (Substring). String x is a substring (factor) of string y if y = uxv, where x, y, u, v ∈ Σ*.

Definition 2.11 (Prefix). String x is a prefix of string y if y = xv, where x, y, v ∈ Σ*.

Definition 2.12 (Suffix). String x is a suffix of string y if y = ux, where x, y, u ∈ Σ*.

Definition 2.13 (Proper prefix, factor, suffix). A prefix, factor, and suffix of a string u is said to be proper if it is not u.

Definition 2.14 (Length of string). The length of string w is the number of symbols in string w ∈ Σ* and is denoted |w|.

Definition 2.15 (Empty string). An empty string is a string of length 0 and is denoted ε.

Definition 2.16 (Deterministic finite automaton). A deterministic finite automaton (DFA) is a quintuple (Q, Σ, δ, q₀, F), where Q is a finite set of states, Σ is a finite input alphabet, δ is a mapping Q × Σ → Q, q₀ ∈ Q is an initial state, and F ⊆ Q is the set of final states.

Definition 2.17 (Transducer, finite state machine). A transducer (finite state machine) is a sixtuple (Q, Σ, Γ, δ, q₀, ω), where Q is a finite set of states, Σ is a finite input alphabet, Γ is a finite output alphabet, δ is a mapping Q × Σ → Q, q₀ ∈ Q is an initial state, and ω is an output function Q × Σ → Γ ∪ {ε}.

Definition 2.18 (Suffix trie [18]). Let T = t₁t₂…tₙ be a string over an alphabet Σ. String x is a substring of T. Each string Tᵢ = tᵢ…tₙ, where 1 ≤ i ≤ n + 1, is a suffix of T; in particular, Tₙ₊₁ = ε is the empty suffix. The set of…
…tomaton accepting antifactors, we are able to construct a compression/decompression transducer from it. From every node r, except the terminal nodes, we have to make sure that transitions for both symbols 0, 1 are defined. For a missing transition (r, x), x ∈ {0, 1}, we create this transition as δ(f(r), x). As we do this in breadth-first search order, δ(f(r), x) is always defined. The only exception is the root node, which needs special handling. Transducer construction can be found in more detail in [6] as the L-automaton. Then we remove the terminal states and assign output symbols to the edges. The output symbols are computed as follows: if a state has two outgoing edges, the output symbol is the same as the input one; if a state has only one outgoing edge, the output symbol is an empty word. An example is presented in Figure 3.4.

[Figure 3.5: Basic antidictionary construction; stages: build trie (suffix trie with limited depth k), build antidictionary, build transducer, yielding the compression transducer]

3.5 Static Compression Scheme

An antidictionary is needed to build the compression/decompression transducer, but in practical applications the antidictionary is not a priori given; we need to derive it from the input text or from some similar data source. We build the antidictionary using one of the techniques mentioned in Section 3.3. With a bigger antidictionary we could obtain better compression, but it grows with the length of the input text, and we need to control its size, or its representation will be inefficient and the compression could be very slow.
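Returning to the transducer construction described at the start of this passage, its output-function rule (two outgoing edges: echo the input bit; one outgoing edge: emit the empty word) can be sketched as follows; State and outputSymbol are illustrative names, not the thesis code:

    #include <optional>

    struct State { State* out[2]; };   // transducer state with edges on 0/1

    // Output for reading bit a in state s: the bit is echoed when it is not
    // predictable (both edges exist) and suppressed (empty word) when the
    // antidictionary makes it predictable (single edge).
    std::optional<int> outputSymbol(const State* s, int a) {
        if (s->out[0] && s->out[1]) return a;  // unpredictable: copy the bit
        return std::nullopt;                   // predictable: output ε
    }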
After this correction, antidictionary construction with a single self-compression and a single simple pruning round looks like this (Figure 4.4). Now the individual stages will be described in more detail.

[Figure 4.4: Real implementation of the static-scheme antidictionary construction with self-compression and a single simple pruning round; stages: build trie (suffix trie with limited depth k), build antidictionary, build automaton, self-compression, simple prune, build transducer, yielding the compression transducer]

4.4.1 Suffix trie construction

For suffix trie construction, an algorithm very similar to the one presented in [6] is used. Function Build-Fact reads the input and builds the suffix trie, adding new nodes; fvis(r) is the count of direct visits of node r. Function Next takes care of creating all suffix links and missing nodes up to the root node. b(r) is the input symbol leading to node r; visited(r) is the total number of visits of the node, used later for gain computation; asc(r) is the parent of node r; fail(r) is the suffix link of node r; antiword(r) and deleted(r) indicate the node status; awpath(r) indicates whether there exists a path from the root node to some antiword through node r. Time asymptotic complexity of Build-Fact is O(N × k).

Build-Fact(int maxdepth > 0)
  root ← new state
  level(root) ← 0; fvis(root) ← 0; visited(root) ← 0; …
…would need a further study, and maybe some more optimal algorithm for computing the suffix array of a bit array could be developed. The whole byte is now used for the representation of just a single bit; that means the algorithm is now wasting 7 bits for each single bit of input text. Addressing single bits in bytes was not tested, but it is very likely that addressing a single bit and swapping bits would need more CPU time, while the memory occupied by the whole input text, expanded 8 times, is still marginal in comparison with the total memory needed for representing the compression transducer.

4.7.2 Antidictionary construction

Anyway, we have a working suffix array construction algorithm, and we are going to use it for antidictionary construction. An outline of antidictionary construction using the suffix array was already presented in Section 3.6. In this implementation, the functions Build-AD-SA, SA-Bin-Search and Add-Antiword are used, where the first one is the most important, doing most of the job: finding all possible antifactors. SA-Bin-Search checks whether the antifactor is minimal, and Add-Antiword adds the new antiword to the suffix trie and computes the visit counts of the added nodes. A special symbol denotes the end of the string; function substr(str, ind, len) returns the substring of string str starting at position ind of length len; functions makeSA and makeLCP return the suffix array and the LCP array for the specified text. Time asymptotic complexity of Build-AD-SA is O…
A rough solution is to limit the length of the words belonging to the antidictionary, which is done by limiting the suffix trie depth during its construction. This will simplify and lower the system requirements for building the antidictionary. This simple antidictionary construction scheme is presented in Figure 3.5.

3.5.1 Simple pruning

In the static compression scheme we compress and decompress the data with the same antidictionary. However, the decompression process has to know the original antidictionary that was used for compression. That is why we need to store the used antidictionary together with the compressed data. The question is whether the stored antiword will erase more bits than the bits needed to actually store the antiword. Possible antidictionary representations will be discussed in Section 4.5.

Let us assume that we know how many bits are needed for the representation of each antiword; then we can compute the gain of each antiword and prune all antiwords with negative gain. We call this simple pruning. After applying this function to the antidictionary, we can improve the compression ratio of the static compression scheme by storing only the good antiwords and using just them for compression. Our static compression scheme will now look like Figure 3.6.
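As a worked restatement of the pruning rule, with illustrative names rather than the thesis's notation: if erased(w) is the number of bits that antiword w lets the compressor drop and store(w) is the number of bits needed to represent w in the stored antidictionary, then

    gain(w) = erased(w) − store(w)

and simple pruning keeps w only if gain(w) ≥ 0, removing every antiword whose storage costs more than it saves.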