Observations on spectrum and spectrum histograms in BpmDj

Werner Van Belle¹^* - werner@yellowcouch.org, werner.van.belle@gmail.com

1- Yellowcouch;

Abstract: BpmDj is a program for DJs. It helps to select songs and play them. To achieve this the program relies on a number of signal processing techniques. One of the available techniques compares song spectra. The program performs both a standard spectrum analysis and a distribution analysis (the 'echo' characteristic). In this short article we describe how these properties are calculated and how they are used in song comparison. We also present a short analysis of the structure of the space using correlation techniques.

Keywords: PCA analysis, sound spectrum, song clustering, bark scale, BpmDj, psychoacoustics
Reference: Werner Van Belle; Observations on spectrum and spectrum histograms in BpmDj; September 2005
See also:
BpmDj homepage

1 Introduction

BpmDj is a program that helps DJs to select songs and play them. To do this the program relies on a number of signal processing techniques. One of the techniques the program offers is the comparison of song spectra. It performs both a standard spectrum analysis and a distribution analysis (the 'echo' characteristic). In this short article we describe how these properties are calculated and how they are used to compare songs in BpmDj.

The dataset we used for test purposes contains 16361 songs, ranging from techno music (which in general has a coherent 'sampled' signal) to the other end of the scale: metal and guitar noise (the signal of such music is typically not coherent; rather, the energy content and its positioning seem more important). The dataset also includes classical music, which is notoriously different from modern music. The music does not belong to a single era (it ranges from the Beatles and Deep Purple to modern-day R&B). This broad dataset has not been limited in any way: not a single song has been considered improper for the dataset.

2 Sound Color

2.1 Calculation: The Bark Frequency Scale

Calculating the spectrum of a song is done using a sliding window Fourier transform (S-FFT). We rely on a window size of 2048. At a sample rate of 44100 Hz, this measures (in a position dependent manner) frequencies ranging from 21.5 Hz to 22050 Hz in steps of 21.5 Hz (44100/2048). For every position the Fourier transform takes 2048 samples and converts them to a spectral frame. Every such frame is converted to a nonlinear psychoacoustic scale taken from the literature: the Bark scale [1]. Table 2 gives the boundaries and central position of every band. The center frequencies are to be interpreted as samplings of a continuous variation in the frequency response of the ear to a sinusoid or narrow-band noise process. That is, critical-band shaped masking patterns are seen around these frequencies [1, 2]. Denote by $S_{i}$ the full spectrum of song $i$, and by $low_{j}$ and $hi_{j}$ the lower and upper bounds of bark band $j$; then we calculate the bark band as

$\begin{displaymath} B_{i,j}=\frac{\sum_{k=low_{j}}^{hi_{j}}S_{i,k}}{hi_{j}-low_{j}} \end{displaymath}$

(1)
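As an illustration of equation 1, the following minimal Python sketch averages the bins of one spectral frame per bark band. It is not taken from the BpmDj sources: the function name `bark_bands`, the rounding of the band edges to FFT bin indices and the half-open bin range (chosen so that the number of summed bins equals $hi_{j}-low_{j}$) are assumptions made for this example. The band edges are those of table 2.

```python
SAMPLE_RATE = 44100
WINDOW_SIZE = 2048
BIN_WIDTH = SAMPLE_RATE / WINDOW_SIZE  # about 21.5 Hz per FFT bin

# Band edges in Hz, copied from table 2 (24 bands need 25 edges).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2380, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]


def bark_bands(frame):
    """Average the values of one spectral frame within each bark band.

    `frame` holds one value per FFT bin; bin k starts at k * BIN_WIDTH Hz.
    """
    bands = []
    for lo_hz, hi_hz in zip(BARK_EDGES_HZ, BARK_EDGES_HZ[1:]):
        # Assumption: band edges are rounded to the nearest bin index.
        lo = int(round(lo_hz / BIN_WIDTH))
        hi = int(round(hi_hz / BIN_WIDTH))
        # Half-open range [lo, hi): exactly hi - lo bins are summed.
        bands.append(sum(frame[lo:hi]) / (hi - lo))
    return bands
```

A flat frame therefore maps to a flat bark spectrum, which is the reason for dividing by the band width: without it the wide high-frequency bands would dominate.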

Table 2: The bark frequency scale and its 'perfect' values.

Bark band	Lower bound (Hz)	Upper bound (Hz)	Center (Hz)	Strength	Variance
1	0	100	50	7.48973	2.76566
2	100	200	150	8.60855	2.15834
3	200	300	250	7.30108	2.05418
4	300	400	350	6.08735	2.16343
5	400	510	455	5.05958	2.13226
6	510	630	570	4.22398	1.98973
7	630	770	700	3.55564	1.71427
8	770	920	845	2.84894	1.56728
9	920	1080	1000	1.99757	1.39549
10	1080	1270	1175	1.37338	1.25741
11	1270	1480	1375	0.814002	1.17718
12	1480	1720	1600	0.479557	1.14893
13	1720	2000	1860	-0.125073	1.18648
14	2000	2380	2190	-0.522579	1.41797
15	2380	2700	2540	-0.939505	1.60925
16	2700	3150	2925	-1.68769	1.55704
17	3150	3700	3425	-2.2336	1.68244
18	3700	4400	4050	-3.15377	1.70208
19	4400	5300	4850	-4.26618	1.82271
20	5300	6400	5850	-5.23179	1.83829
21	6400	7700	7050	-6.09273	1.88582
22	7700	9500	8600	-7.06462	1.88918
23	9500	12000	10750	-8.35773	1.90112
24	12000	15500	13750	-10.1641	1.79223

2.2 First and second moments: the 'perfect' sound

Figure 1: Perfect spectral band based on analysis of 15000+ songs, plotted as energy (dB) per bark band (#). Every green line is 3 dB. The grey bars represent the standard deviation for that specific band. The actual numbers are given in table 2.

Every song in our dataset has been analyzed using the above formula. To avoid any influence of the total energy of a song, we translated every spectrum by its own mean (that is, we subtracted the mean band strength of the song from every band). Those vectors were then used as input for the subsequent statistics: the first moment (the mean) and the square root of the second central moment (the standard deviation) of every band.
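The two steps can be sketched in a few lines of Python. This is only an illustration: the function names are invented for the example, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption, since the text does not say which one was used.

```python
from statistics import mean, pstdev


def center(spectrum):
    """Subtract a spectrum's own mean so that total energy no longer matters."""
    m = mean(spectrum)
    return [v - m for v in spectrum]


def band_statistics(spectra):
    """Return (means, deviations) per band, computed over centered spectra."""
    centered = [center(s) for s in spectra]
    columns = list(zip(*centered))  # one tuple of values per bark band
    means = [mean(c) for c in columns]
    # Assumption: population standard deviation.
    deviations = [pstdev(c) for c in columns]
    return means, deviations
```

Two songs with the same spectral shape but a different overall loudness, such as `[1.0, 3.0]` and `[11.0, 13.0]` in a toy two-band setting, become identical after centering and so contribute no spread to the statistics.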

The result of this experiment is a set of 48 values: table 2 contains the mean and standard deviation for every bark band, and figure 1 shows a graphical representation. An interesting thing one can do with these values is to generate noise shaped according to this 'perfect'¹ spectrum. An example of such a sound can be found at http://cryosleep.yellowcouch.org/index.html.

Looking at the numbers, we observe a maximum distance of 30 dB between the high and low frequencies. We also find two curious phenomena. a) The first bark band (#0), representing the lowest tones, is not as strong as one would expect. This is strange, and we have no good explanation for it. Possibly a resolution of 16 bits cannot be used to its full extent when the bass levels are too high, since that limits the bits left for the high frequencies. Another possible explanation is that most sound engineers don't like 'rumble' (everything below 50 Hz); as such it can be expected that those frequencies, which are indeed located in the first bark band, are removed.

The second phenomenon, b), is that the standard deviation suddenly jumps up from bark band 12 onwards. This is consistent with the bump noted in [1] and with the experiments we performed with the PCA analysis, which we describe later on. It is, however, an observation that requires further attention.

2.3 Comparison

BpmDj uses the above statistics (the perfect spectrum and the standard deviation) to compare the spectra of two songs. After normalization using the first moment (translation over the mean frequency strength) and the second moment (division by the standard deviation), one can use an $L_{2}$ metric for further comparison. If $A$ and $B$ are two song spectra ($\in\Re^{24}$) and $M$ and $D$ the mean and standard deviation vectors, then we define the spectral distance as

$\begin{displaymath} dist(A,B)=\sum_{i=0}^{23}\left(\frac{A_{i}-M_{i}}{D_{i}}-\frac{B_{i}-M_{i}}{D_{i}}\right)^{2}=\sum_{i=0}^{23}\frac{(A_{i}-M_{i}-B_{i}+M_{i})^{2}}{D_{i}^{2}}=\sum_{i=0}^{23}\frac{(A_{i}-B_{i})^{2}}{D_{i}^{2}}\end{displaymath}$

This turned out to be a good measure for comparison.
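A minimal Python sketch of this distance is given below; the function names are chosen for the example and do not come from BpmDj. It normalizes both spectra explicitly, which makes it easy to check that the mean vector cancels and only the per-band deviation influences the result.

```python
def normalize(spectrum, mean, dev):
    """Translate by the mean spectrum and divide by the standard deviation."""
    return [(s - m) / d for s, m, d in zip(spectrum, mean, dev)]


def spectral_distance(a, b, mean, dev):
    """Sum of squared differences between two normalized spectra."""
    na = normalize(a, mean, dev)
    nb = normalize(b, mean, dev)
    return sum((x - y) ** 2 for x, y in zip(na, nb))
```

Because the mean cancels, calling the function with any other `mean` vector yields the same distance; the division by the deviation is what keeps highly variable bands from dominating the comparison.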

2.4 Visualization

Table 5: First 12 primary vectors of PCA analysis.

	Vectors
	$v_{1}$	$v_{2}$	$v_{3}$	$v_{4}$	$v_{5}$	$v_{6}$	$v_{7}$	$v_{8}$	$v_{9}$	$v_{10}$	$v_{11}$	$v_{12}$
1	0.0515	0.2486	-0.2135	0.5267	0.2569	-0.3474	-0.3694	0.2681	0.0021	0.0954	0.1363	-0.3194
2	0.0878	0.2254	-0.4314	0.3045	0.0947	0.0682	0.2497	-0.3087	0.013	-0.2138	-0.1621	-0.2469
3	0.2028	0.1737	-0.3045	-0.08	-0.131	0.356	0.3366	-0.1641	-0.0326	0.0708	0.1143	-0.2342
4	0.255	0.1141	-0.1374	-0.2219	-0.1668	0.2222	0.014	0.1552	-0.0084	0.1553	0.1087	-0.2475
5	0.2706	0.0543	-0.0202	-0.2559	-0.1205	0.0826	-0.1984	0.3344	-0.0959	0.1176	-0.0191	-0.2439
6	0.274	-0.0002	0.053	-0.2549	-0.0511	-0.0448	-0.2633	0.1739	-0.102	-0.0489	-0.2183	-0.2277
7	0.2652	-0.058	0.1009	-0.2224	0.0812	-0.1737	-0.1548	-0.219	0.35	-0.419	-0.1233	-0.1959
8	0.253	-0.1059	0.174	-0.1168	0.1456	-0.2433	-0.0382	-0.419	0.1565	0.0822	0.1506	-0.1793
9	0.224	-0.1575	0.2434	0.0325	0.2248	-0.1925	0.1935	-0.2325	-0.1331	0.2616	0.2432	-0.1593
10	0.1812	-0.219	0.2353	0.15	0.2183	-0.0265	0.3872	0.1515	-0.4728	0.1396	-0.2482	-0.1437
11	0.1194	-0.2761	0.2108	0.2633	0.1172	0.2519	0.1759	0.336	0.1737	-0.4604	-0.1188	-0.1343
12	0.0377	-0.3227	0.0987	0.2823	-0.1159	0.3524	-0.0832	0.0277	0.3272	0.0086	0.315	-0.1322
13	-0.0386	-0.3326	0.0041	0.2337	-0.2329	0.1876	-0.2703	-0.1533	0.0421	0.3934	-0.0809	-0.1367
14	-0.119	-0.3096	-0.1402	0.0813	-0.2813	-0.0607	-0.1673	-0.2487	-0.1684	0.0247	-0.2721	-0.1635
15	-0.1626	-0.274	-0.1978	-0.0525	-0.2084	-0.182	-0.0336	-0.0993	-0.2423	-0.2752	-0.0676	-0.1852
16	-0.1983	-0.2374	-0.1705	-0.1295	-0.0972	-0.2137	0.1142	0.165	-0.1752	-0.1802	0.2922	-0.1785
17	-0.231	-0.1834	-0.1675	-0.1916	0.0412	-0.1936	0.1901	0.1949	0.1007	-0.001	0.2293	-0.193
18	-0.2647	-0.0992	-0.0972	-0.1803	0.2028	-0.0348	0.1776	0.1259	0.3458	0.238	0.0419	-0.195
19	-0.2804	-0.01	-0.0269	-0.1438	0.2665	0.0783	0.0524	-0.0282	0.2206	0.2086	-0.3409	-0.2087
20	-0.2735	0.071	0.0932	-0.0988	0.2693	0.2163	-0.1435	-0.0619	-0.0546	0.0094	-0.2699	-0.2102
21	-0.2478	0.1448	0.2067	-0.0677	0.168	0.2402	-0.1827	-0.0802	-0.2279	-0.1243	0.0825	-0.2156
22	-0.2148	0.2012	0.2703	-0.0066	-0.0257	0.155	-0.0833	-0.1208	-0.2003	-0.146	0.3037	-0.2161
23	-0.1694	0.2411	0.3118	0.0649	-0.272	-0.0487	0.0934	-0.0934	-0.0148	-0.068	0.1406	-0.218
24	-0.0873	0.2597	0.3105	0.1376	-0.484	-0.2856	0.2705	0.1218	0.246	0.1109	-0.2783	-0.2054

Table 7: Last 12 vectors of PCA analysis.

	Vectors
	$v_{13}$	$v_{14}$	$v_{15}$	$v_{16}$	$v_{17}$	$v_{18}$	$v_{19}$	$v_{20}$	$v_{21}$	$v_{22}$	$v_{23}$	$v_{24}$
1	0.0798	0.1181	0.2208	0.0927	0.0281	0.0801	0.003	-0.0073	0.0186	-0.0156	0.0142	-0.0456
2	-0.2046	-0.261	-0.3895	-0.2019	-0.0167	-0.2026	0.003	0.005	-0.0333	0.0273	-0.0238	0.1093
3	0.132	0.1597	0.1917	0.3653	-0.0591	0.4337	0.0019	0.0157	0.0492	-0.0627	0.0929	-0.2271
4	0.1736	0.0795	0.3627	-0.2112	-0.0317	-0.4634	0.0053	-0.0415	-0.1497	0.092	-0.1769	0.4035
5	0.0242	-0.015	-0.3544	-0.4257	0.041	0.0165	-0.0155	-0.0052	0.176	-0.0677	0.1655	-0.4793
6	-0.2865	-0.1186	-0.2622	0.4116	0.1698	0.272	0.0173	0.0249	-0.0656	0.0195	-0.1377	0.4231
7	0.0447	-0.1871	0.2584	0.2665	-0.0261	-0.3345	-0.0125	0.0006	0.0683	-0.0352	0.1498	-0.3008
8	-0.2185	0.1558	0.1236	-0.4483	-0.2579	0.3918	-0.0013	-0.0028	-0.0668	-0.0032	-0.0254	0.1544
9	0.4708	0.0894	-0.4111	0.2113	0.1598	-0.1976	-0.0001	0.0036	-0.0024	0.0259	-0.076	0.0245
10	-0.3787	-0.0795	0.3131	0.0175	0.0063	-0.1316	-0.0131	-0.0238	0.0284	0.0141	0.0818	-0.145
11	0.3896	-0.0724	-0.0009	-0.19	-0.0223	0.2778	0.016	0.0149	-0.0368	-0.0187	-0.0677	0.1516
12	-0.4435	0.373	-0.1462	0.1202	0.1359	-0.2036	-0.0085	-0.0032	-0.0064	-0.0176	0.0357	-0.0504
13	0.1053	-0.511	0.0397	0.0845	-0.3597	0.029	0.0124	0.0328	0.1677	0.0625	-0.1407	-0.037
14	0.1667	0.0965	0.0968	-0.1198	0.3921	0.0405	0.0011	-0.0115	-0.4136	-0.2604	0.325	0.0225
15	0.0385	0.2966	0.0332	-0.0409	0.0663	0.0219	-0.0129	-0.0249	0.3151	0.5125	-0.3521	-0.1096
16	-0.0086	0.005	-0.0968	0.0894	-0.408	-0.0853	0.0036	-0.0414	0.2335	-0.28	0.4178	0.2867
17	-0.1051	-0.1955	-0.0098	0.023	-0.0187	0.0272	0.0007	0.1716	-0.4504	-0.2252	-0.4761	-0.2725
18	-0.0237	-0.2343	0.0326	-0.0248	0.2752	0.1131	-0.0327	-0.3203	-0.005	0.459	0.3286	0.0668
19	0.0292	0.192	0.0436	-0.0495	0.129	-0.0506	0.1654	0.4673	0.4065	-0.274	-0.076	0.1183
20	0.0475	0.2445	-0.1022	0.0724	-0.2957	-0.0578	-0.4118	-0.4877	-0.1246	-0.1502	-0.1429	-0.0575
21	0.0071	0.0243	-0.0543	0.0331	-0.2027	-0.0316	0.6047	0.1478	-0.2945	0.2695	0.156	-0.0772
22	-0.0261	-0.2237	0.0881	-0.0647	0.2051	0.0396	-0.5684	0.3823	0.0608	0.0733	0.081	0.0502
23	-0.0467	-0.1611	0.0853	-0.064	0.2714	0.0392	0.3206	-0.4576	0.2696	-0.316	-0.2146	0.0168
24	0.0123	0.1705	-0.1161	0.0714	-0.2474	-0.0363	-0.0942	0.165	-0.1655	0.1711	0.0989	-0.0216

In BpmDj every song is visualized using a specific color. After using equation 1 to reduce the FFT frames, we still have 24 values describing the sound color of every song. These 24 dimensions cannot be used directly to define a color for a song. As such, we rely on a PCA analysis and rearrange the 24-dimensional space to store the most energy in the first three dimensions (implementing a dimensionality reduction). Those three dimensions then determine the red, green and blue parts of the song color. To avoid colors which are too dark, every dimension is normalized after the PCA analysis and mapped to a restricted range.

For readers unaware of what a PCA analysis is, we briefly summarize it [3]. Given a matrix $A\in\Re^{m\times n}$ ($m$ being the number of songs, $n$ being the number of frequency bands), a singular value decomposition will find matrices $U=[u_{1},\ldots,u_{m}]\in\Re^{m\times m}$ and $V=[v_{1},\ldots,v_{n}]\in\Re^{n\times n}$ such that

$\begin{displaymath} A=U.diag(\delta_{1},\ldots,\delta_{p}).V^{T}\in\Re^{m\times n}\qquad p=min\{ m,n\}\end{displaymath}$

and $\delta_{1}\geq\delta_{2}\geq\ldots\geq\delta_{p}\geq0$. In our case a PCA analysis will calculate $\delta_{i}$ and $v_{i}$ incrementally and only keep $V$. This is then used to remap the original $A$ to its new orthogonal basis, being $A.V$.
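To give a feeling for how principal directions can be found one at a time, the sketch below uses power iteration with deflation on the covariance matrix. This is a deliberately swapped-in technique that needs only the Python standard library; it is not the singular value decomposition described above and not necessarily what BpmDj does. All function names, the centering of the data and the fixed iteration count are assumptions of the example.

```python
import math


def covariance(rows):
    """Covariance matrix of `rows` (one row per song, one column per band)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(dims)]
            for i in range(dims)]


def leading_component(cov, iterations=200):
    """Power iteration: (eigenvalue, unit eigenvector) of the largest eigenvalue."""
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims  # fixed start vector (assumption)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    value = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dims))
                for i in range(dims))
    return value, v


def principal_components(rows, count):
    """First `count` principal directions, found by repeated deflation."""
    cov = covariance(rows)
    dims = len(cov)
    result = []
    for _ in range(count):
        value, vec = leading_component(cov)
        result.append((value, vec))
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(dims)]
               for i in range(dims)]
    return result
```

For points lying on the line $y=2x$, the first direction returned is proportional to $(1,2)$. Projecting a song onto the first three directions gives the three numbers from which a color can be derived. Power iteration can stall when the start vector happens to be orthogonal to the direction sought, which is one reason a library SVD is preferable in practice.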

**Figure 2:** Importance of dimensionality in bark PCA analysis (purple line). A test on a unit vector (orange line)

We performed the PCA analysis on our test set. The resulting primary vectors are given in tables 5 and 7. Their strength is shown in figure 2. When using only the first three components we are able to capture 83.17%² of the information content, which is very reasonable from a visualization point of view. A test on the unit vector reveals that the most important bands are bark band 12 (around 1600 Hz), bark band 2 (around 150 Hz), and bark band 7 (around 700 Hz).

The interpretation of this data should of course be done carefully. One could assume that, because a 'principal' component picks out these frequencies, those frequencies are 'important'. There is no good basis for this assumption. It might be exactly the opposite: because those frequencies don't matter that much, nobody really cares to tune them properly, hence they have the largest variability, and thus become a major component in the PCA analysis.

For the selected primary frequencies we might argue that they have the largest variation because standard mixing desks and audio engineers can tune them easily. Mixing desks typically offer low-, mid- and high-frequency controls, indeed targeting 150 Hz, 700 Hz and 1600 Hz. Furthermore, sound engineers wanting to modify the sound will target frequencies with the highest impact, hence the maximal spread over the bark scale.

Whether we can assume that variation equals importance is difficult to prove. When only interested in song visualization the PCA analysis will perform as wanted: songs sounding similar will automatically have a similar color.

2.5 Spectrum Clustering

**Figure 3:** Visualization of the spectrum of about 15000 songs, based on the two first principal components.

A possibility arising from a PCA analysis is checking for clusters. A quick test reveals that sound color analysis is not very useful for clustering songs. Figure 3 shows an XY plot of all songs used in the analysis: X is the first primary component, Y is the second primary component.

3 Energy accents

The echo characteristic of a song is a small image. On the abscissa we plot a negative dB range (0 on the left, -96 on the right). On the ordinate we put the different bark bands: the bottom of the image is bark band 0, the top of the image is bark band 23 (figure 4 is rotated 90 degrees). The value of an image pixel relates to the number of times this frequency occurs in the song with that specific strength. One horizontal slice of this image is a histogram of the energy within the song for that specific bark band.

3.1 Calculation

The 'echo' characteristic measures the distribution of the energy amplitude throughout the song. It is measured using a sliding window Fourier transform which is normalized to the Bark psychoacoustic scale. Every frame contributes to the creation of the image. Every frame (which normally consists of real and imaginary values) is further reduced to only its real part (skipping the imaginary part of the Fourier transform implements a partial DCT, which has the nice property that it compacts energy [4]). The real value of a specific bark band then determines which (dB) bin's value is increased. Once all those values are accumulated, we normalize the image.

Normalizing the image is done by autocorrelating and differentiating the 24 binned histograms. This highlights the relations between the different energy levels and removes sound color information.
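The accumulation step (before normalization) can be sketched as follows. The sketch assumes one bin per dB, amplitudes expressed relative to a full scale of 1.0 and the usual $20\log_{10}$ amplitude-to-dB conversion; none of these details are stated in the text, and the function names are invented for the example. The autocorrelation and differentiation used for normalization are not shown.

```python
import math

BANDS = 24
DB_BINS = 96  # assumption: one bin per dB, from 0 dB down to -96 dB


def db_bin(value, full_scale=1.0):
    """Map an amplitude to a bin index: 0 for 0 dB, 95 for -95 dB or lower."""
    magnitude = abs(value) / full_scale
    if magnitude <= 0.0:
        return DB_BINS - 1  # silence lands in the lowest bin
    db = 20.0 * math.log10(magnitude)
    return min(DB_BINS - 1, max(0, int(-db)))


def echo_histogram(frames):
    """Count, per bark band, how often each dB level occurs.

    `frames` is a sequence of frames, each holding 24 bark band values.
    """
    image = [[0] * DB_BINS for _ in range(BANDS)]
    for frame in frames:
        for band, value in enumerate(frame):
            image[band][db_bin(value)] += 1
    return image
```

Each row of the returned image is the histogram of one bark band, so the sum of every row equals the number of frames analyzed.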

Figure 4 shows the energy accent picture of the song 'Kittens' by Underworld. It presents 24 bark bands (0 being low, 23 being high frequencies). The first band shows how the bass tones are present at all kinds of energy amplitudes. This is normal since this bin will also capture all residual energy that could not be captured by any other bin, which also explains the previously encountered anomaly of the first bark band in the mean spectrum. The second bark band (band #1) shows that the bass drum has very little accent. The higher the frequency, the more structure we start to recognize. From band 12 upwards we see a split, indicating that the hi-hats do have an accent. This can be either because they were programmed with a volume difference or because a delay is present. If you know the song, you will recognize this.

**Figure 4:** The horizontal axis contains the 24 bark bands (0 is low, 23 is high). The vertical axis shows the energy level in dB. A pixel is colored bright when that frequency occurs often with that energy. The song shown is Kittens by Underworld.

3.2 Comparing Echo Characteristics

BpmDj stores the echo characteristic of a song in the meta information associated with that song. Due to space constraints every image is reduced to $24\times96$ pixels. The dynamic range of the picture is always reduced (or increased) to $[0\ldots255]$. Comparison of the characteristics is done using the $L_{2}$ norm. If $A$ and $B$ are the echo characteristics of two songs ($\in\Re^{24\times96}$), then

$\begin{displaymath} dist(A,B)=\sqrt{\sum_{x}\sum_{y}(A_{x,y}-B_{x,y})^{2}}\end{displaymath}$
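A small Python sketch of both steps, the rescaling to $[0\ldots255]$ and the distance itself, is given below. The linear min/max rescaling is an assumption (the text only states the target range), and the function names are chosen for the example.

```python
import math


def rescale(image, top=255.0):
    """Stretch or shrink pixel values linearly so that they span 0..top."""
    flat = [p for row in image for p in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in image]
    scale = top / (high - low)
    return [[(p - low) * scale for p in row] for row in image]


def echo_distance(a, b):
    """L2 distance between two images of identical dimensions."""
    return math.sqrt(sum((pa - pb) ** 2
                         for row_a, row_b in zip(a, b)
                         for pa, pb in zip(row_a, row_b)))
```

Rescaling before comparison means that the distance reflects the shape of the distribution and not how many frames a song happened to contain.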

3.3 Understanding the echo characteristic

In order to understand the echo characteristic better, we relied on a song database in which each song is annotated with the echo characteristic as described. I tried to perform a PCA analysis on the energy accent maps, but there seemed to be no convergence, probably due to accumulated round-off errors. As such I fell back to an old technique I recently reinvented to analyze larger spaces: correlation analysis.

3.4 Structural Analysis

Figure 5: Behavioral similarities, shown as 16 planes labeled a to p (four rows of four: a to d, e to h, i to l, m to p). The blue spot is the position to which the audio stack is correlated. Green is positive correlation, red is anti-correlation. The 16 planes are superimposed in figure 6.

The correlation analysis process is based on the idea of correlating some parameter with the stack of song images. To explain the process I first need to explain how a single correlation analysis works with respect to one external parameter. If we call the stack of echo accent images $A_{x,y,z}$ and we have an external parameter $B$ associated with every image, $B_{z}$, then we can correlate every pixel position across the stack with the external parameter and obtain a new image:

$\begin{displaymath} C_{x,y}=\rho(A_{x,y},B)\end{displaymath}$

This image shows which areas correlate with the external parameter. The trick we now use is to make the external parameter internal. By choosing a position $(a,b)$ in the audio stack which a) has a high variance, b) has a large mean and c) is not covered by a previous analysis, we can assign that position to become $B$. So $B_{z}=A_{a,b,z}$. The correlation then becomes

$\begin{displaymath} C_{x,y}=\rho(A_{x,y},A_{a,b})\end{displaymath}$
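One such correlation plane can be computed with the short Python sketch below. It assumes that $\rho$ is the Pearson correlation coefficient and that the stack is indexed as `stack[z][x][y]`; both are assumptions of the example, as are the function names and the choice to report 0 for pixels that never vary.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # constant pixel: correlation is undefined, reported as 0
    return cov / math.sqrt(vx * vy)


def correlation_plane(stack, a, b):
    """Correlate every pixel position with position (a, b) across the stack.

    `stack[z][x][y]` is pixel (x, y) of the image belonging to song z.
    """
    reference = [image[a][b] for image in stack]
    width, height = len(stack[0]), len(stack[0][0])
    return [[pearson([image[x][y] for image in stack], reference)
             for y in range(height)]
            for x in range(width)]
```

The reference position always correlates perfectly with itself, so every plane contains at least one pixel with value 1; positions that move in the opposite direction across songs show up with negative values.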

If we perform this step once, we color all positions that behave similarly to the originally chosen position $(a,b)$. We then repeat the process for different positions. Figure 5 shows different planes from the correlation analysis. Finally, once we have enough coverage of the entire stack, we can stop and color the image in pseudo-colors. Of course this resembles a coloring problem, because we want planes that are close to each other to have different colors in order to distinguish them. To achieve this we performed a PCA analysis to determine which 'color' belongs to which picture.

Pseudo-Coloring of the superimposed behavioral planes

Figure 6: Pseudo-colored overlay of the different same-behavior planes.

The coloring used to obtain a useful visualization is based on a PCA analysis. This analysis gives us the projection of the different correlation images that differentiates them maximally, which allows us to map the correlation stack onto a two-dimensional plane. In this plane we search for the central position and then measure the angle of every projected correlation plane relative to it. Intuitively, if two correlation images are close to each other (in the sense that they overlap a lot) then their angles will be close to each other. This information can be used in two ways.

In the first approach one could assume that we want to keep them close to each other and assign very similar colors to similar planes. This, however, would not accentuate the different areas that well. In the second approach, which we used, we want the maximum color distance between neighboring correlation planes. To achieve this we sort the correlation planes by their angle, take the index of each plane in that ordering, bit-reverse it and use the result as the hue of the color associated with that correlation plane.
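The bit-reversal trick can be sketched as follows. The function names, the number of bits (the smallest number that can index all planes) and the scaling of the reversed index to a hue in $[0,1)$ are assumptions made for the illustration.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (for example 0011 -> 1100)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def assign_hues(angles):
    """Return one hue in [0, 1) per plane, given the angle of each plane."""
    count = len(angles)
    bits = max(1, (count - 1).bit_length())  # enough bits to index all planes
    order = sorted(range(count), key=lambda i: angles[i])
    hues = [0.0] * count
    for rank, plane in enumerate(order):
        hues[plane] = bit_reverse(rank, bits) / (1 << bits)
    return hues
```

With four planes, consecutive ranks 0, 1, 2, 3 receive the hues 0, 0.5, 0.25 and 0.75, so planes that are neighbors in angle end up far apart on the color wheel. The hue can then be converted to RGB, for instance with `colorsys.hsv_to_rgb` from the Python standard library.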

Discussion

The results of the correlation analysis are striking. Figure 6 shows the pseudo-colored overlay. We clearly see how the lower frequencies (bass and bass drum) run out into a less accentuated body. We also find that the body of the bass drum is below the signal strength of the real content. The real content has a variation at the left area (pictures 5b, c, g, h, l, p): if one side correlates then the other anti-correlates. There can however be an accent (seen in pictures 5a, e, j, m). We also find crosstalk back from the main content (green) to the lower frequencies. The blue and red sections are very likely due to compression, meaning that MP3/OGG compression limits those frequencies. To the right of the compression area we find plain noise.

4 Conclusions

In this article we discussed how a number of interesting properties available in BpmDj are calculated. We presented an intriguing analysis of normal energy distributions in songs and we gave values for the 'perfect' spectrum based on an analysis of 16361 songs.

Acknowledgments

I would like to thank Kristel Joossens for pointing out the existence of PCA analysis a couple of years ago. This information seed has grown since then. I would also like to thank Sam Liddicot and Camille d'Alméras for financially supporting BpmDj through donations. Sourceforge.net supports this work by offering web space and support facilities.

Bibliography

1.	The Bark and ERB bilinear transforms. Julius O. Smith, Jonathan S. Abel. IEEE Transactions on Speech and Audio Processing, December 1999. https://ccrma.stanford.edu/~jos/bbt/
2.	Psychoacoustics, Facts & Models. E. Zwicker, H. Fastl. Springer Verlag, 2nd edition, Berlin, 1999.
3.	Matrix Computations. Gene H. Golub, Charles Van Loan. Johns Hopkins University Press, second edition, 1993.
4.	Discrete-Time Signal Processing. Alan V. Oppenheim, Ronald W. Schafer, John R. Buck. Signal Processing Series, Prentice Hall, 1989.

Footnotes

¹ This 'perfect' is in analogy with the 'perfect' face, which is composed of the mean distances of many different human faces.
² The values plotted in figure 2 are relative to the number of dimensions. Hence, the sum of the energy captured by the first 3 vectors is 19.9608, which divided by 24 gives 83.17%.

http://werner.yellowcouch.org/
werner@yellowcouch.org