Linux compression – all about compromise
Compression is useful for file storage, archiving, and transfer. It reduces the amount of storage space required, cuts bandwidth usage, and can save money.
When it comes to compression, it’s all about compromise. You can have fast compression, thorough compression, or something in between. Likewise, you can sacrifice some quality for a smaller file, or sacrifice some of the size reduction to preserve quality.
Lossy or lossless compression
There are two types of compression: lossy and lossless. Lossy compression works by removing less important information from the file. Taking a JPEG image as an example, with lossy compression you would, as the name suggests, expect some loss of image quality. In a high-resolution image, lossy compression finds groups of colours that are very similar in shade and replaces them with a single combined colour. Reducing the number of distinct colours in the image compresses it, but at a cost in quality.
The degree of quality loss depends on how heavily the user chooses to compress the file. More compression gives a smaller file, but an image of noticeably reduced quality; less compression gives a better-quality image, but a larger file. Crucially, once a file is compressed this way, the discarded information is gone: decompressing it cannot restore the original quality.
Lossless compression, by contrast, compresses without any loss of quality: when the compressed file is decompressed, it returns to exactly the state it was in before. It works by using algorithms to recognise repeated information and replace it with shorter placeholders.
For example, a row of pixels might be stored as r2w4b3 to represent two red pixels, followed by four white pixels, followed by three blue pixels – the textual shorthand clearly takes less storage space than recording every pixel individually. Similarly, in a document, the string ‘hhhhhaaaaaaaaa’ might be compressed to h5a9. This shorthand shrinks the file, but because no information is actually discarded, the file can be restored to its exact original state on decompression.
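The run-length idea above can be sketched with standard shell tools (a toy illustration of the principle, not how real compressors work internally):

```shell
# Toy run-length encoding: split the string into one character per line,
# count consecutive repeats with uniq, then print each character followed
# by the length of its run.
echo "hhhhhaaaaaaaaa" | fold -w1 | uniq -c | awk '{ printf "%s%d", $2, $1 } END { print "" }'
# prints h5a9: five h's followed by nine a's
```

Reversing the process (expanding h5a9 back to the original string) loses nothing, which is exactly what makes the scheme lossless.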
Lossless compression is generally preferred because there is no reduction in quality, but because lossy compression actually removes information, it usually achieves a smaller compressed file.
Comparing compression tools – fast or thorough?
Much like the lossy/lossless trade-off, choosing a tool to compress files with involves a compromise: fast compression, thorough compression, or somewhere between the two.
Compression can be quick, but then the file size will not be reduced as drastically. Reducing the file size further requires more processing time, so the compression is slower.
Because of these compromises, no single compression tool is always the perfect choice. Instead, there is a wide range of tools, each with strengths in particular use cases. All of the tools below are available on Linux operating systems, and all perform lossless compression.
Gzip is possibly the ‘original’ and best-known compression tool, and with good reason. Gzip uses very little system memory, and as such offers very quick compression and decompression compared to other tools. But, as we’ve learnt, this quick compression comes with a compromise: files are compressed less thoroughly with Gzip than with other tools, which leads to comparatively larger file sizes. This speed/size ratio can be adjusted – Gzip offers compression levels 1 to 9 – to compress more thoroughly, but, again, speed will be compromised.
One benefit of Gzip is its compatibility. Because it’s been around for so long – since 1992 – almost all Linux systems will be compatible with the tool.
Gzip has the disadvantage that it can only compress one file at a time, whereas other tools can compress entire directories at once (in practice, this is usually worked around by bundling files together with tar first). As such, Gzip is the best choice mainly when a single file needs compressing quickly and a modest compression ratio is acceptable. Its fast decompression follows from the file not being thoroughly compressed in the first place, but it is a benefit nonetheless. Gzip is not a logical option for larger-scale use cases where more thorough compression is needed.
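In practice, typical Gzip usage looks like this (the filenames are placeholders; the flags are standard gzip and GNU tar options):

```shell
# Compress a single file in place (produces report.txt.gz, removing the original)
gzip report.txt

# Trade speed for size: -1 is fastest, -9 is most thorough (the default is -6)
gzip -9 big-log.txt

# Decompress
gunzip report.txt.gz

# Gzip only handles one file at a time, so directories are bundled with tar first;
# -z routes the archive through gzip
tar -czf project.tar.gz project/
```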
7-Zip compression is a better option for larger-scale needs, because it can compress entire directories and folder trees. 7-Zip offers impressively thorough compression, and can reduce file sizes significantly with no loss of quality. However, this high compression ratio demands significant CPU time and memory, which results in slow compression.
The 7-Zip utility does perform well in decompression, though: even thoroughly compressed files can be decompressed quickly on the other end. This is useful in use cases like the distribution of apps and software. A developer can compress their app to a small file size with 7-Zip and upload it to be downloaded. The compression might take a while, but it is thorough, and the download and extraction on the user’s side are quick. And, assuming the developer has a powerful machine, the heavy resource requirements shouldn’t be too much of an issue.
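On Linux, 7-Zip is typically provided by the p7zip package, which supplies the 7z command (the archive and directory names below are placeholders):

```shell
# Archive and compress a whole directory tree - 7z handles directories natively
7z a app-release.7z app-release/

# Maximum compression: slower to create, but a smaller download for users
7z a -mx=9 app-release.7z app-release/

# On the user's side, extraction is comparatively quick
7z x app-release.7z
```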
Bzip2 compression falls in the space somewhere between Gzip and 7-Zip. Its compression rate is more thorough than Gzip, but less so than 7-Zip, and its compression speed is faster than 7-Zip, but not as fast as Gzip. Bzip2 results in smaller file sizes than Gzip, but the compression takes about four times as long.
Like Gzip, it also has the limitation of only being able to compress one file at a time. This makes it unsuited to archiving large directories on its own, but well suited to single-file compression where a smaller file size is required.
Again, this more thorough compression demands more CPU and memory than Gzip.
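Bzip2 usage mirrors Gzip’s closely (placeholder filenames; -k and tar’s -j are standard options):

```shell
# Compress a single file, keeping the original (-k); produces backup.sql.bz2
bzip2 -k backup.sql

# Decompress
bunzip2 backup.sql.bz2

# As with gzip, directories go through tar; -j selects bzip2 compression
tar -cjf project.tar.bz2 project/
```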
A more modern solution to the compression problem, lbzip2 looks to solve the speed drawback of bzip2 by spreading the compression over multiple cores. Traditional compression tools like Bzip2 and Gzip compress files on one single core, while modern computers have two, four, eight, sixteen-plus cores. By splitting the input into blocks, compressing them in parallel across multiple cores, and stitching the results back together at the end, lbzip2 can compress files far more quickly.
Because lbzip2 uses essentially the same compression algorithm as bzip2, it produces essentially the same file size – it just gets there faster.
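Assuming lbzip2 is installed, it works as a drop-in replacement for bzip2 (placeholder filenames again):

```shell
# Compress in parallel, keeping the original; uses all available cores by default
lbzip2 -k backup.sql

# Limit the number of worker threads with -n
lbzip2 -n 4 -k backup.sql

# The output is an ordinary .bz2 file, so plain bunzip2 can still decompress it.
# GNU tar can route an archive through lbzip2 with -I:
tar -cf project.tar.bz2 -I lbzip2 project/
```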
There are many more compression utilities available – such as xz, lzop and p7zip – but they all follow the same basic trend: the smaller the compressed file, the slower the compression. Technology like that in lbzip2 aims to buck this trend by adopting modern, multi-core techniques. Want fast? Gzip. Want thorough? 7-Zip. Somewhere in between? Bzip2. So far, there is no catch-all answer for compression; it’s about finding the utility that suits your specific needs.
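One way to see the trade-off concretely is to compress the same input with two of the tools and compare the results (gzip and bzip2 are assumed to be installed; exact sizes and timings will vary with the input data):

```shell
# Generate some compressible sample data
f=$(mktemp)
seq 1 100000 > "$f"

# Compress the same file with each tool (-c writes to stdout, keeping the original)
gzip  -9 -c "$f" > "$f.gz"
bzip2 -9 -c "$f" > "$f.bz2"

# Compare the resulting sizes...
ls -l "$f" "$f.gz" "$f.bz2"

# ...and the time each tool takes
time gzip  -9 -c "$f" > /dev/null
time bzip2 -9 -c "$f" > /dev/null
```

On typical text-like input, both tools shrink the file substantially, with bzip2 producing the smaller file and taking longer – the fast-versus-thorough compromise in miniature.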