The science behind audio compression

Written by: Gina Collecchia

Gina is an audio engineer and author of Numbers and Notes: An Introduction to Musical Signal Processing. She currently works as a Senior Audio DSP Engineer at Antares.

Published: Mar 8, 2024

Each week at Highnote, we ship new features and performance refinements based on your feedback and requests. 

We recently released a new feature that lets listeners choose their audio playback quality, located at the bottom of the audio playback bar. This improves playback performance by matching the stream to the strength of the listener’s connection.

We thought we’d bring you into some of the research that went into the decision-making process behind the audio quality selector.

Audio compression allows audio to be stored in much smaller files than its uncompressed counterpart. This means it can be downloaded and streamed more quickly, and listeners with poor network connections can still play back content on Highnote without waiting for it to buffer.

In a previous article on our blog, we went into detail about the listening experiments we performed on several different pieces of music to decide how low was too low, and how high was good enough. Now we’re going to get even more technical and explore the science behind audio compression.

There are a few pieces of terminology we’d like to clear up right away. 

Compression

This overloaded word means a few different things when it comes to audio, but here we’re concerned with compressed and uncompressed audio: audio that has been reduced in size (“compressed”) in order to take up less bandwidth when streaming, and audio that is presented as raw digital values (“uncompressed”). Compressed audio can be “lossy”, wherein some of the original samples cannot be exactly recovered when converting back to an uncompressed format, or “lossless”, wherein the original uncompressed file can be fully recovered. Examples of lossy formats include MP3 and AAC, and examples of lossless formats are FLAC and ALAC. Common uncompressed formats include WAV and AIFF.

Sample rate

The sample rate refers to the number of samples per second in a digital audio file. According to the Nyquist limit, if we want to accurately reproduce everything humans can hear from an analog signal, we need to choose a sampling rate of at least twice the upper limit of hearing, which is 20,000 Hz. Streaming audio is therefore typically sampled at 44,100 or 48,000 Hz, commonly abbreviated as 44.1kHz or 48kHz. Professional audio for movies and TV, or streaming services like Tidal and Pono that boast hi-fi audio, can be as high as 96kHz or 192kHz.

Bit depth

Digital audio is represented as a time series of numbers between -1.0 and 1.0, stored in binary. The bit depth is the number of binary digits (bits) used to store each of those numbers. A bit depth of 8 is pretty low; bit depths of 16 and 24 are more common. A number such as 0.48294, for example, would be stored (approximately) in these different ways under 8-bit, 16-bit, and 24-bit audio:

  • 8-bit: 0.01111011

  • 16-bit: 0.0111101110100001

  • 24-bit: 0.011110111010000111110100

As you can see, precision that is lost in the 8-bit and 16-bit scenarios is preserved in the 24-bit scenario. But there are values out there, irrational numbers among them, that even 24-bit audio cannot represent exactly, a key difference between digital and analog audio!
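If you’d like to see this quantization error yourself, here is a small Python sketch (purely illustrative, not Highnote code) that truncates a sample value to 8, 16, and 24 binary places and measures how far each version drifts from the original. Real PCM formats store scaled integers rather than binary fractions, but the precision story is the same.

```python
# Truncate a value in [0, 1) to a fixed number of binary places, the way a
# fixed bit depth limits the precision of each sample, and measure the error.

def to_binary(x, bits):
    digits, frac = [], x
    for _ in range(bits):
        frac *= 2
        digits.append(int(frac))
        frac -= int(frac)
    approx = sum(d / 2 ** (i + 1) for i, d in enumerate(digits))
    return "0." + "".join(map(str, digits)), approx

sample = 0.48294
for bits in (8, 16, 24):
    binary, approx = to_binary(sample, bits)
    print(f"{bits:2d}-bit  {binary:<26}  ~{approx:.8f}  error {sample - approx:.1e}")
```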

Bit rate

Commonly quoted in “kbps”, or kilobits per second, the bit rate is the number of bits used to represent each second of audio. For uncompressed audio it is simply the sample rate times the bit depth times the number of channels (divide by 8 to convert bits to bytes). For compressed audio, however, due to the complex way compression treats different frequencies, the quoted number is essentially an average over the whole song. An MP3 with a 320 kbps bit rate and a 48kHz sample rate, for example, works out to an average of just over 3 bits per sample per channel for stereo, compared with the 16 bits per sample of the uncompressed original.
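As a sanity check on that arithmetic, here is a tiny Python sketch (illustrative only, with assumed CD-style parameters) comparing the bit rate of uncompressed stereo audio with the average bits per sample implied by a 320 kbps MP3.

```python
# Bit rate of uncompressed audio vs. the average bits per sample of an MP3.

sample_rate = 48_000   # samples per second
bit_depth = 16         # bits per sample
channels = 2           # stereo

uncompressed_bps = sample_rate * bit_depth * channels
print(f"uncompressed: {uncompressed_bps / 1000:.0f} kbps "
      f"({uncompressed_bps / 8 / 1000:.0f} kilobytes per second)")

mp3_bps = 320_000      # a 320 kbps MP3
print(f"320 kbps MP3: ~{mp3_bps / (sample_rate * channels):.1f} bits "
      f"per sample per channel on average")
```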

Now that we have some jargon cleared up, let’s learn more about how compression works.

Humans with the healthiest of ears are able to hear frequencies in the range of 20-20,000 Hz. Frequency translates to musical pitch, so this range extends from roughly four octaves below middle C to more than six octaves above it. Uncompressed and compressed audio alike make use of this fact through digital sampling: high quality audio on CDs has a sampling rate of 44.1kHz. The sampling rate needs to be at least twice the maximum frequency we wish to represent (20kHz) due to the Nyquist theorem, which is central to digital signal processing (DSP).
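You can see the Nyquist theorem at work in a few lines of NumPy. In this toy sketch (not production code), a tone above half the sampling rate produces exactly the same samples as a lower-frequency “alias”, which is why content above 22.05kHz simply cannot be captured at 44.1kHz.

```python
import numpy as np

fs = 44_100                       # sampling rate in Hz; Nyquist frequency is fs / 2
t = np.arange(0, 0.01, 1 / fs)    # 10 ms worth of sample times

f_high = 30_000                   # a tone above the 22,050 Hz Nyquist frequency
f_alias = fs - f_high             # the 14,100 Hz tone it gets confused with

high = np.sin(2 * np.pi * f_high * t)
alias = np.sin(2 * np.pi * f_alias * t)

# The two tones produce (sign-flipped) identical samples, so the sampled
# signal cannot tell them apart: this is aliasing.
print(np.max(np.abs(high + alias)))   # ~0
```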

So how does an MP3 file achieve quality audio at 1/8th the size of a WAV file of the same piece of music? The answer boils down to psychoacoustics: the study of how humans perceive sound. Take a look at the graph below showing our sensitivity to different frequencies.

This graph (the “Fletcher-Munson curves”, also known as equal-loudness contours) shows us that at frequencies below around 50Hz and above 10kHz, sound has to be really, really loud for us to hear it at all. It also shows that our ears are particularly sensitive to frequencies between 500 and 8,000 Hz, and especially between 3kHz and 5kHz (an important range for speech). Lossy compression exploits this by spending less data on frequencies outside of these important ranges, since we aren’t as good at hearing very low and very high frequency content as we are at hearing this middle range.
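To give a flavor of how an encoder can exploit this, here is a deliberately simplified Python sketch that moves a frame of audio into the frequency domain and discards bins that fall below a crude sensitivity threshold. The frequencies, levels, and thresholds here are invented for illustration; real codecs use carefully measured psychoacoustic models.

```python
import numpy as np

fs, n = 44_100, 4410
t = np.arange(n) / fs
# A frame with a clearly audible 1 kHz tone plus a very faint 18 kHz tone.
frame = np.sin(2 * np.pi * 1000 * t) + 0.001 * np.sin(2 * np.pi * 18_000 * t)

spectrum = np.fft.rfft(frame)
freqs = np.fft.rfftfreq(n, 1 / fs)
level_db = 20 * np.log10(np.abs(spectrum) + 1e-12)

# Crude stand-in for the ear's sensitivity curve: demand 40 dB more level
# below 100 Hz and above 10 kHz than in the sensitive middle range.
threshold_db = np.where((freqs < 100) | (freqs > 10_000),
                        level_db.max() - 20, level_db.max() - 60)
keep = level_db >= threshold_db

for f in (1000, 18_000):
    i = np.argmin(np.abs(freqs - f))
    print(f"{f:>6} Hz bin kept: {bool(keep[i])}")
```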


Another important aspect of the way we hear is something called masking. As depicted in the image below, when two frequencies are close to one another, the louder one can mask, or hide, the quieter one. Therefore, MP3s and other compressed audio schemes can discard frequency information that we can’t hear due to masking.
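Here is an equally simplified sketch of that idea: treat the loudest spectral peaks as maskers, and mark any component that sits within a few hundred hertz of a masker and well below it in level as inaudible. The 300 Hz window and the 30 dB drop are invented numbers for illustration; real psychoacoustic models derive these from listening data.

```python
import numpy as np

fs, n = 44_100, 4410
t = np.arange(n) / fs
# A loud 1,000 Hz tone right next to a much quieter 1,100 Hz tone.
frame = np.sin(2 * np.pi * 1000 * t) + 0.005 * np.sin(2 * np.pi * 1100 * t)

spectrum = np.fft.rfft(frame)
freqs = np.fft.rfftfreq(n, 1 / fs)
level_db = 20 * np.log10(np.abs(spectrum) + 1e-12)

masked = np.zeros(len(freqs), dtype=bool)
for i in np.argsort(level_db)[::-1][:5]:        # treat the 5 loudest bins as maskers
    nearby = np.abs(freqs - freqs[i]) < 300     # within 300 Hz of the masker
    quieter = level_db < level_db[i] - 30       # and at least 30 dB below it
    masked |= nearby & quieter

i_1100 = np.argmin(np.abs(freqs - 1100))
print("1,100 Hz tone masked by the 1,000 Hz tone:", bool(masked[i_1100]))
```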

One final way that MP3 files save space is in quiet sections, or sections without a lot going on. When the frequency content of a passage is “simple” in terms of harmony, it is often more efficient to represent the sound by that frequency content rather than sample by sample. This is part of why electronic music is much easier to compress than acoustic music or music with vocals in it.
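One way to see why “simple” content compresses well: in the frequency domain, a steady synthesized tone packs nearly all of its energy into a handful of coefficients, while a noisy acoustic texture spreads it across thousands. A quick NumPy sketch (illustrative only, using white noise as a stand-in for a busy acoustic texture):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n = 44_100, 4410
t = np.arange(n) / fs

tone = np.sin(2 * np.pi * 440 * t)       # a steady synthesized tone
noise = 0.3 * rng.standard_normal(n)     # a busy, noise-like texture

def coeffs_for_95_percent(signal):
    """How many spectral coefficients hold 95% of the signal's energy?"""
    energy = np.abs(np.fft.rfft(signal)) ** 2
    cumulative = np.cumsum(np.sort(energy)[::-1]) / energy.sum()
    return int(np.searchsorted(cumulative, 0.95)) + 1

print("tone :", coeffs_for_95_percent(tone), "significant coefficients")
print("noise:", coeffs_for_95_percent(noise), "significant coefficients")
```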


Join the thousands bringing calm to their creative process with Highnote

