bit 101: data compression


how lossless and lossy compression reduce data size and speed up transmission

Transmitting information from one computer to another can be a lengthy process, especially if the file contains hundreds or thousands of megabytes of data. Data compression is the process of using encoding techniques to reduce the amount of data needed to store or transmit a given piece of information. The idea dates all the way back to Morse code, where common letters like ‘e’ or ‘i’ were given the shortest codes, and less frequent letters like ‘q’ or ‘z’ were given longer ones.

This article by Britannica gives a comprehensive overview of data compression and how it works to reduce file size and speed up transmission.

As we have learned in previous entries, digital information is encoded in 0s and 1s, which are called binary digits, or bits. ASCII text is typically stored with a fixed 8 bits (one byte) per character. However, 8 bits per letter is more than is needed, especially for letters that are used frequently. One method of compression assigns a variable-length binary code to each letter, with shorter codes going to more frequent letters. For example, a: 0, while r: 110. Since ‘a’ appears more often, giving it the shortest code means fewer bits overall for a word like “abracadabra.” Infrequent letters, like ‘x’ or ‘z’ in English, are assigned longer codes in exchange (three or four bits in a small example like this one), but because they appear rarely, the total length still shrinks. This type of compression lets text be stored and transferred in far fewer bits than if each character were encoded in 8 bits.
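To make the “abracadabra” example concrete, here is a minimal sketch of a variable-length code in Python. Only the codes for ‘a’ and ‘r’ come from the text above; the codes for ‘b’, ‘c’, and ‘d’ are assumptions, chosen so that no code is the beginning of another (which is what lets the decoder tell where each letter ends).

```python
# Hypothetical code table. Only 'a' and 'r' come from the article;
# 'b', 'c', and 'd' are assumed here for illustration. No code is a
# prefix of another, so the bit stream can be decoded unambiguously.
CODES = {"a": "0", "b": "10", "r": "110", "c": "1110", "d": "1111"}


def encode(text, codes=CODES):
    """Concatenate the variable-length code of each character."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes=CODES):
    """Read bits left to right, emitting a character whenever a code matches."""
    reverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


word = "abracadabra"
encoded = encode(word)
fixed_bits = 8 * len(word)    # size at a fixed 8 bits per character
variable_bits = len(encoded)  # size with the variable-length code

print(encoded, variable_bits, "bits instead of", fixed_bits)
```

With this particular table, the word takes 23 bits instead of 88, and `decode(encoded)` returns the original word exactly, which is what makes the scheme lossless.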

There are two types of data compression: lossless (exact) and lossy (inexact). Each serves its own purpose in the digital world; lossless compression is necessary for text, where each character is crucial, while lossy compression is used for image or voice data that can stand to lose some quality. A common image format is GIF, which compresses losslessly but limits images to a palette of 256 colors. JPEG and MPEG are lossy techniques used for images and videos, where the user can select how high or low quality they want the media to be, which determines the amount of data lost in compression. Compression programs rely on a model of the data, which describes the distribution of its elements, such as how frequently characters and words occur in English text. Adaptive models instead estimate that distribution as they go, based on what they have already processed.
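The idea of an adaptive model can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class name `AdaptiveModel` is made up for this example, and it tracks single-character counts only, whereas real compressors use more elaborate models.

```python
from collections import Counter


class AdaptiveModel:
    """Estimates symbol probabilities from the symbols seen so far."""

    def __init__(self, alphabet):
        # Start every symbol at a count of 1 so nothing has zero probability.
        self.counts = Counter({symbol: 1 for symbol in alphabet})
        self.total = len(self.counts)

    def probability(self, symbol):
        """Current estimate of how likely this symbol is to come next."""
        return self.counts[symbol] / self.total

    def update(self, symbol):
        """Record one more occurrence of a symbol that was just processed."""
        self.counts[symbol] += 1
        self.total += 1


model = AdaptiveModel("abcdr")
before = model.probability("a")  # every letter starts out equally likely
for ch in "abracadabra":
    model.update(ch)
after = model.probability("a")   # 'a' is now estimated as more likely

print(before, after)
```

Before any text is processed, each of the five letters has probability 0.2; after processing “abracadabra”, the estimate for ‘a’ rises to 0.375. An encoder could use these estimates to give ‘a’ a shorter code as it goes.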

Although the file size is smaller, meaning it will load faster, the downside of lossy compression is that some detail is permanently lost. For example, when compressing digital images, colors that are very similar to each other may be merged into a single color, since the difference is indistinguishable to the human eye. The most common method uses a mathematical transform that breaks the image into separate components of differing importance to image quality, so the least important ones can be discarded. Video is compressed in a similar way, and additionally by storing only the slight differences between successive frames. MPEG-1 is used to compress video for CDs, and its audio layer is the basis of MP3 files, while MPEG-2 is a higher quality format used for DVDs and television. Video compression can achieve compression ratios approaching 20-to-1 with minimal distortion.
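Storing only the differences between successive frames can be illustrated with a toy Python sketch. It assumes each frame is a flat list of pixel values of the same length, and the function names are made up for this example. Real video formats such as MPEG are far more sophisticated (and lossy); this only shows why a mostly unchanged frame is cheap to store.

```python
def frame_delta(previous, current):
    """Return {pixel_index: new_value} for pixels that changed between frames."""
    return {
        i: new
        for i, (old, new) in enumerate(zip(previous, current))
        if old != new
    }


def apply_delta(previous, delta):
    """Rebuild the next frame from the previous frame plus the stored changes."""
    frame = list(previous)
    for i, value in delta.items():
        frame[i] = value
    return frame


# Two tiny 8-pixel "frames": a bright region shifts one pixel to the right.
frame1 = [10, 10, 10, 200, 200, 10, 10, 10]
frame2 = [10, 10, 10, 10, 200, 200, 10, 10]

delta = frame_delta(frame1, frame2)
rebuilt = apply_delta(frame1, delta)

print(delta)
```

Here only two of the eight pixels change, so the second frame can be stored as two entries rather than eight values, and `rebuilt` matches `frame2`.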

How much data can be compressed depends on the type of media; English text can be reduced to about 1/3 of its original size, while images can be compressed by a factor of 20 or more. Although computer storage continues to grow, data compression remains crucial for storing and transmitting large sets of data at a rapid pace.
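For a quick way to see a compression ratio in practice, the sketch below uses Python's built-in `zlib` module, a general-purpose lossless compressor. The sample text is made up and deliberately repetitive, so the ratio it produces is not representative of ordinary English text.

```python
import zlib

# Made-up sample text, repeated so there is plenty of redundancy to remove.
text = (
    "Data compression reduces the number of bits needed to store "
    "or transmit information. "
) * 20

raw = text.encode("utf-8")
packed = zlib.compress(raw)        # lossless compression
restored = zlib.decompress(packed)  # exact reconstruction of the input

ratio = len(packed) / len(raw)
print(len(raw), "bytes ->", len(packed), "bytes, ratio", round(ratio, 3))
```

Because the compression is lossless, `restored` is byte-for-byte identical to `raw`; only the stored size changes.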