You see a lot on offer for compressing sorted lists of integers, but not so much on the rather uncomfortable question of how to compress a list of unsorted numbers. So why would you want to do that? Surely all one needs to do is sort the list and then compress it? The problem is that in many cases the order of the integers carries information that sorting would destroy. Here is my case in point.
Actual hit-positions in a search engine
In a search engine you have to keep a list of the documents in which each term is found. Each document is assigned an identifier. So say we had a list of the documents in which the word 'dog' occurs; it might be found in 7 documents: 0, 1, 2, 3, 4, 6, 11. We can use an algorithm like FastPFOR because the sorted list can be converted into a list of deltas, the differences between successive entries, which are typically much smaller numbers than the actual values. In my case an array of 100 document identifiers compressed down to just 13 integers. Cool.

But what if you wanted to store the locations in those documents where the word 'dog' occurred as well? This would bloat the index considerably, since 'dog' might occur 100 times in a single document. I could sort and compress each position list the same way, but quite often it would be really short, maybe only one entry. Then the compression algorithm would actually increase the list size three or four-fold, since the overhead of a compressed list adds a number of ints to the start, and (this is the real killer) one compressed list would be needed for each document the word was found in. So trying to compress positions the conventional way would, first, probably increase the overall index size, and second, leave you maintaining a huge number of compressed lists.
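For illustration, here is the delta step on that sorted document list (just the plain delta transform; FastPFOR additionally bit-packs the gaps):

    // Sorted document IDs for 'dog' (the example above)
    int[] docs = {0, 1, 2, 3, 4, 6, 11};
    int[] deltas = new int[docs.length];
    int prev = 0;
    for (int i = 0; i < docs.length; i++) {
        deltas[i] = docs[i] - prev;   // gap between successive IDs
        prev = docs[i];
    }
    // deltas = {0, 1, 1, 1, 1, 2, 5} -- small numbers that pack into few bits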
The solution
Ideally, we would like just one list of word positions per word, just as we had one list of document identifiers per word. But such a list would have to be unsorted, because if we sorted it we would lose the information about which documents the positions refer to. Fortunately, most positions are quite small. They can't be greater than the length of the longest document the word is found in, and if it was found in only a few documents, or always near the start, the values would be smaller still. Let's say that all the values are no bigger than 127. (Convenient, huh?) Then the list could be stored as 8-bit integers, with one 32-bit integer at the start to say how many bytes there are per integer. Or if 16 bits were needed, we could use 2-byte ints, or 3 bytes for 24 bits, and so on. The worst case is a document so big that its positions no longer fit in 24 bits, when we need full 32-bit integers. But that's rare.
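As a hypothetical example of the saving: 100 positions, all under 127, take 1 byte each plus the 4-byte header, so 104 bytes instead of the 400 bytes needed for plain 32-bit ints, roughly a 75% reduction.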
So the strategy is pretty simple. Scan all the numbers first to find the biggest (or most negative) value and work out how many bits we will need. I kind of cheat by rounding this up to the nearest 8 bits; if you're interested you can refine it to use an arbitrary number of bits, but you won't gain much in compression and you will lose some speed. Here's my Java code. It just uses ByteBuffer to build arbitrary-sized ints of up to 32 bits, in 8-bit hops, and then stores the list as an array of 32-bit ints, compressed, of course. Typically you'd expect a 25%-50% reduction in the array size, and in some cases 75%. Compressing it much further seems to be impossible, given the near-random nature of the data.
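A minimal sketch of that packing scheme, with names of my own choosing (UnsortedIntPacker, compress, decompress are hypothetical, not the downloadable code), looks roughly like this; here a single header word carries both the count and the byte width, which is just one way of recording them:

    import java.nio.ByteBuffer;

    // Sketch only: pack each value of an unsorted int list into 1, 2, 3 or 4
    // bytes, prefixed by one 32-bit header word recording count and byte width.
    public class UnsortedIntPacker {

        // Smallest whole number of bytes that holds every (signed) value.
        static int bytesNeeded(int[] values) {
            int width = 1;
            for (int v : values) {
                while (width < 4 && (v >= (1 << (8 * width - 1)) || v < -(1 << (8 * width - 1)))) {
                    width++;
                }
            }
            return width;
        }

        // Header word = (count << 3) | width, then the packed bytes,
        // padded out to a whole number of 32-bit words.
        static int[] compress(int[] values) {
            int width = bytesNeeded(values);
            int payload = values.length * width;
            ByteBuffer buf = ByteBuffer.allocate(4 + ((payload + 3) & ~3));
            buf.putInt((values.length << 3) | width);   // count limited to 2^29 in this sketch
            for (int v : values) {
                for (int b = width - 1; b >= 0; b--) {
                    buf.put((byte) (v >> (8 * b)));     // big-endian, low-order bytes only
                }
            }
            int[] out = new int[buf.capacity() / 4];
            buf.rewind();
            buf.asIntBuffer().get(out);
            return out;
        }

        // Reverse the packing, sign-extending values narrower than 32 bits.
        static int[] decompress(int[] packed) {
            ByteBuffer buf = ByteBuffer.allocate(packed.length * 4);
            buf.asIntBuffer().put(packed);
            int header = buf.getInt(0);
            int count = header >>> 3;
            int width = header & 7;
            int[] out = new int[count];
            int pos = 4;
            for (int i = 0; i < count; i++) {
                int v = 0;
                for (int b = 0; b < width; b++) {
                    v = (v << 8) | (buf.get(pos++) & 0xFF);
                }
                int shift = 32 - 8 * width;
                out[i] = shift > 0 ? (v << shift) >> shift : v;   // sign-extend narrow values
            }
            return out;
        }
    }

With this layout, a list of 100 small positions round-trips through compress() into 26 ints and back out of decompress() unchanged, and the original order, and hence the mapping back to documents, is preserved.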
I offer no guarantees that this works for all cases etc. But it is freeware. Do what you like with it.