Music fingerprinting is the process of extracting essential information from digitized audio recordings (e.g. CD tracks) so it can be searched in a database. Typically a large database is compiled of known recordings and then the database is searched for fingerprints of unknown audio samples. When a match is found, the previously unknown audio sample can be identified reliably. Music fingerprinting has a lot of applications, some of them are:
Music fingerprinting is big business, but entry into the market with independently developed software (not just buying one of the existing solutions) is not simple. And this is a gross understatement. There are several reasons why it is so hard to enter the market:
This article focuses on the copyright implications of music fingerprinting.
This article by Bryan Jacobs describes how the matching algorithm in Shazam works. Here is an explanation of the music fingerprinting and matching algorithm that is accompanied by a Matlab implementation. Finally Roy van Rijn in The Netherlands created a rough implementation of a music fingerprinting program in Java as described here, but is unable to post the source code due to patent threats. Yet I will use the published details of his program as an example to illustrate the process.
The fingerprinting operation of an audio sample consists of the following steps.
This is only the fingerprinting operation itself. What is yields is a sequence of hash values, which do in fact represent dominant frequencies in the spectrum at a certain instance of time. This finding will be crucial in a later part of this article.
So let's briefly sketch the matching process. The unknown audio sample and the known recording that we want to identify are usually not aligned in time. The sample may be taken from the middle of a song. The unknown sample is usually not aligned a whole number of sections with respect to the known recording. So the spectral information from a single section of one sample will be divided over two sections in the other sample. This will cause most hash values not to match exactly. Distortion and noise also cause the number of exactly matching hash values to decrease. Yet there are typically enough matching hash values to reliably identify a song from a 10s sample. For this we are not interested in the number of matching hash values as such, but in the number of time consistent hash values. Suppose that the hash value 1 second from the start of the sample matches the hash value 20 seconds from the start of the song. When the hash value 2.5 seconds from the start of the sample also matches the value 21.5 seconds from the start of the song, this is a time consistent match with respect to the first match. If the hash value at 5s into the sample again matches the sample at 24s into the song, this adds one more time consistent match to the set. A match at 6s into the sample with the hash at 40s into the song would not be time consitent with the rest of the set. The song with the largest set of time consistent matches is the one identified.
The example program is a very crude implementation of fingerprinting. It chops the audio signal into fixed sized sections and takes a fingerprint of each section. The patents suggest that fingerprints should be taken at variable points in time which are determined by the content of the signal. Therefore the time must be stored with each fingerprint. In the example application, timing information is only implicit in the order of the sequence of hash values. The matching process can then be reduced to fuzzy substring matching, i.e. finding a sequence of hash values that agrees in as many positions as possible but with disagreeing points in between. This approach may even avoid infringing some patent claims.
Assuming the file fingerprints are themselves unencumbered by copyrights, can we create them without copyright infringement? Yes we can, but it takes some care.
Before you can fingerprint a recording, you need to obtain access to it in the first place. The Shazam catalog holds 8 million songs, which is an estimated 600 thousand CDs. This is indeed a huge number. How would you ever hope to compile a large fingerprint catalog and do it legally? There are several options:
If your goal is to have a complete music catalog, no single method will get you all recordings and obtaining that many recordings takes lots of time anyway. Some recordings may only be available in analog formats or DRM-protected digital formats.
Once you have all the recordings in one way or another, you need to fingerprint them. You have two options:
However, in many countries intermediate copies cannot be made without permission. In the USA this is probably permissible as fair use. In the Netherlands a private person may make copies for personal use, but for business use a license is probably required. In the UK a license is certainly required. This may be obtainable from the relevant copyright collecting society (PRS for Music in the UK). In most cases the license to make and keep intermediate copies will be conditional on possession of the original recording. This is also true for the fair use permission to make copies in the USA. So you may store a copy of a physical CD on your hard disk, but you are obliged to keep the original CD, even though you will never handle it again. It is probably not allowed to keep copies of CDs borrowed from libraries.
The main question of this article: are the fingerprints themselve encumbered by copyright? Must a fingerprint be viewed as a derivative work, a modified reproduction of the original work for which permission is required? This is not an easy question to answer and the answer may depend on technical details of the fingerprinting process itself.
As long as a fingerprint is only a fingerprint that helps uniquely identify an item, but it conveys no other information about the item, it can be considered unencumbered by copyright. An examples of such a fingerprint would be the MD5 hash of a complete file. We cannot derive any information of the contents of the file from its MD5 hash. But music fingerprints are different.
As discussed earlier, the fingerprint of a recording is a sequence of hashes, each of which represents the dominant frequencies in the spectrum at a certain time instant. The dominant frequency is roughly equivalent to the pitch of the note being played and a sequence of note pitches is roughly equivalent to a melody. A melody is copyrighted! It is likely that a recognizable melody can be extracted from the fingerprints in the example case, at least for a significant fraction of the recordings. It would be nice to implement the first steps of the fingerprinting process described by Roy van Rijn. Instead of using the dominant frequencies to make a fingerprint, you can construct a spectrum with just these frequencies at a constant amplitude and convert it back to the time domain with FFT. Now you can hear for yourself whether there is a recognizable melody, even when using a trivial method to extract it.
The fingerprinting algorithms used by commercial programs are of course different from the example. But they are essentially capturing significant points in the spectrogram, which roughly translate to pitch at a specific time. Points that are relevant for recognizing a recording are probably also helpful for recognizing a melody. Al long as a hash value can be translated back into spectral components, melody information can be extracted from the fingerprint.
There are three possible solutions to the problem:
There is a way to make it much harder to extract melody information from music fingerprints. Consider the example program. Each hash consist of four small integers that together can be represented by 22 bits. There are about 4 million different hash values. As long as the hash values can be converted back to the orignal spectral patterns, melody extraction is easy. Instead of using a one-to-one mapping use a many-to-one mapping as the hash function. Instead of a 22 bit number it yields a 20 bit number; the mapping has 1 million possible outputs for 4 million possible inputs. This must be a highly nonlinear hash function, so each hash value can be the result of four totally unrelated spectral patterns.
A sample of 10 FFT sections (less than 1 second in duration). can now be the result of 1 million combinations of spectral patterns (2 bits of uncertainty per section). As not all spectral patterns are equally likely and certainly not all spectral patterns are equally likely given the previous spectral patterns, it may still be possible to guess the most likely sequence, but by throwing away 2 bits per hash we certainly made the melody extraction job much harder. We have reduced the reliability of the matching process, but only slightly. Any hash that would match in the original algorithm, would still match. We only introduce a few false matches (pattern A in the sample matches pattern B in the reference recording), but these false matches occur only rarely.
So we may conclude that by throwing away a small amount of fingerprint information we can dramatically increase the diffuculty to extract meaningful information from it, while only slightly affecting the reliability of the matching process.