Zathras-1: time stretching audio through matrix-multiplication
Werner Van Belle1* - email@example.com, firstname.lastname@example.org
Abstract : Zathras-1 extends BpmDj with a time stretcher that does not modify the pitch of the sound. This article describes how we implemented a sinusoidal modelling by measuring the amplitude and phase envelope of peaks.
time stretching, audio speed change
Modulation of real signals
Envelope Sampling Rate
|Time stretching/shrinking the envelope
Resampling the envelope as a matrix multiplication
Fourier transform of c
The timestretcher builds around a sliding window fourier transform. From the input signal, overlapping wave segments are retrieved, each of which is converted to its spectrum. Each frame contains multiple peaks. Each peak its envelope is analyzed, time stretched and then synthesized. Then the phases of the analyzed peak are investigated and aligned against the phases of a synthesized peak from the previously synthesized frame. Once all peaks are copied, the synthesized frame is converted to a wave and added to the output.
The output stride is a constant, while the input stride is variable. E.g: with a window size of 4096 samples and an overlap of 4, we have an output stride of 1024 samples. If the input stride is 1000 samples then the song will play at 97.65% of its original speed.
The spectrum of a discrete time domain signal is given by Fourier's great mathematical poem:
Generally, we will use a capital letter to refer to the spectrum, and a small letter to refer to the time domain representation.
With the above definitions we have the convolution theorem
This can also be written as
The complex conjugfate of is written as .
Reversed signals are written with an overline. E.g: .
We will also use a unit sample at a specific position: is a vector with a 1 at position .
Sinusoidal modeling aims to model a signal x as a sum of sine waves
, and are the parameters of 1 sine wave. They are respectively the gain, the frequency and the phase. This model can be generalized such that not only the gain is time dependent, but also the frequency/phase combination.
Given an input signal, there will be many solutions to this model. The main challenge is to find appropriate models , such that we can modify the time, pitch parameters in a manner that is acoustically sensible.
|Illustration of a sine wave (green) modulated with an envelope (the red signal). The result is shown in blue. Left the time domain version. Right, the spectrum. As can be seen, the envelope spectrum is merely shifted to the location of the sine wave in the spectrum
To obtain the phase and gain modulation different techniques can be used. Some people don't extract those parameters and simply rely on the sequence of Fourier frames to generate appropriate gain and phases [1, 2]. Some use linear prediction , others map a parabola through the peak of a Fourier bin to better estimate the actual frequency . In this paper we develop a novel method that determines the gain and phase envelope directly from the spectrum.
The underlying idea is that a modulation of a signal in the time domain is expressed as a convolution in the frequency domain:
In our case we assume that we have a signal , which is a complex oscillation at frequency bin (thus and its spectrum is ). We further assume that has been modulated through an envelope , which we don't know yet.
Writing the product of and as a convolution of their spectra and then taking the Fourier transform of yields
Thus, if is known, then convolving with will merely shift to position (it is after all the only non-zero bin in ). Turning this around, when and are known, we can retrieve by shifting bins to the left.
The modulation equation above shows how an ampitude modulation is encoded as a convolution of the envelope spectrum with the spectrum of the sine. Yet that is not all there is to it. The equation also includes a phase component. When complex is, then the phase and gain of are
To demonstrate how we can modify the frequency of our sine model, let us assume that adds rad over n samples to . That is done by defining . The picture shows how the central peak is blurred into neighboring channels. In a sense, we modified the frequency of whilst preserving an envelope gain of 1.
We could as well have started with a frequency for of 5.5 bins (instead of 5 bins) and have an envelope modulation of 1. Theoretically, in such situation our assumption to deconvolve by shifting it to the origin is no longer valid, because there is no longer only 1 bin in non-zero. Yet it does not matter. We can still shift the peak to the origin, assume its frequency was an integer multiple of and then investigate the phase modulation to determine the correct frequency.
A complex oscillation can be efficiently phase and gain modulated in the frequency domain. When we work with real signals we should investigate its conjugate symmetry. That is: and since the end result is also real, it follows that . Yet that means that if , itself must be conjugate symmetric as well. That further means that must be real. In other words it would be impossible to introduce a phase modification if we were to merely convolve two conjugate symmetric peaks.
Luckily, it is possible if we only look at half of the spectrum. Assume that is a real valued signal that consists of a gain () and phase modulation () of a sine wave of frequency .
If we now define then
The Fourier transform of this is
While the first peak is modulated with the forward envelope. The second peak has to be modulated with its reverse conjugate.
That the envelope of a signal is a localized phenomenon in the spectrum, makes it possible to extract it, together with its sine wave, as a single peak from the spectrum. By shifting the peak to the origin, we get rid of the sine wave and solely model the envelope.
Assume, we find a peak at position in the spectrum of . If we extract a window from to , we get
models the envelope at a sample rate of . If we use only 1 bin, at a sample rate of and a window size , then we sample the envelope at 10 Hz. Using 4 bins allows us to sample the envelope at 40 Hz and so on. In practice we will not always choose a fixed modulation sample rate. Instead, the width of the extracted modulation spectrum will depend on the position of local peaks.
5 positions in the originating frame are defined.
When extracting a peak, we place it in a Fourier frame of appropriate size. Source bins and are the previous and next peak. They will receive a weight of 0, and thus don't need to be transferred. The right side of the peak, from to will be placed in the positive frequencies of the target spectrum. The left side of the peak, from to , will be placed in the negative frequencies of the target spectrum.
The middle bin of the target spectrum (with radian frequency ) cannot be used because it is both a positive and negative frequency at the same time. When synthesizing a new peak, this bin would be a problem because we would not be able to figure out whether its value belongs to the left or right side. From all this it follows that we need at least positive frequencies and at least negative frequencies. The smallest higher power of two larger than twice the maximum of these two values is the size of the envelope spectrum we need.
In a frame many peaks are present. The values between two peaks should be correctly assigned to each of them.
One method could be a linear weighing. That would mean that the middle between the two peaks is assigned 50% to the left peak and 50% to the right peak. That is however a slightly inappropriate approach because sines tend to behave as decaying exponentials when they do not perfectly align to a bin.
A better assignment is determined by obtaining the position of the minimum between peaks. The interpolation of the values between one peak and the other is then given as an exponential curve, which goes through the minimum position with a value of 0.5.
For the left flank we define and . For the right flank we define and . The following function, defines how to weigh the contribution of each value, given its location .
This curve is loosely based on our article 'Fast exponential envelopes' .
That this is a good model can be seen by generating sine waves of different frequencies and then plotting the contribution of each sine to the value of a particular bin. In parallel, we can plot our estimate . Such an experiment shows that the main inaccuracy of our model appears when the actual minimum position falls between bins, close to either peak. In that case the observed local minimum is not necessarily the exact location of the minimum.
With all this in mind we can now write down the extraction of a single peak. Values are extracted starting at the previous peak (excluding the peak itself) up to the next peak (again excluding the peak itself).
Once we have the spectrum of the envelope (), we can compute the backward Fourier transform to obtain a time domain representation (). Then we can create a new envelope , which is a time stretched version of . Because shrinking time might increase the frequency content of , it follows that and are not necessarily of the same size. The size of will be denoted . The size of denoted . and are nonetheless powers of two and might be larger than strictly necessary.
Assume the time stretch ratio is given by . When , the song plays slower, the song plays faster. The time stretch ratio can be obtained by comparing the input stride against the output stride.
The position mapping between the original envelope and the new envelope is given by calculating the distance to the middle position and then rescaling it. The following equation specifies the source position (in ), when a target position (in ) is known.
In the above, the factor retrieves the data from the correct position in the source envelope. is a window to smoothen out the edges.
“I have to find the edge of the envelope and put my stamp on it.” ― Terry Pratchett, Raising Steam
The first reason to apply a window is that might not always have a multiple of radians in total; nor do we have any guarantee that the amplitude at the end of the frame will match the amplitude at the beginning of the frame. In other words: in all likelihood will be discontinuous. That in turn might lead to strong oscilations of the gains/phases throughout the entire envelope. This problem is effectively mediated by means of a hann window.
A second reason is that we don't always have the information we need. When shrinking time we will have positions in that map from source positions outside the analysed frame. Because we cannot say much about such locations, it is useful to set them to zero, as picture above.
The intermediate window is thus defined as
With this definition we have that .
The last problem involves the resampling of . Although the target position will be an integer, the source position might not be. That means we that we must interpolate . The Fourier coefficients make this easy:
In equation [eq:resampling-integral], we directly wrote the Fourier integral as a summation of the positive and negative frequencies. From a computational point of view this is useful because we only need to calculate/retrieve one coefficient and compute two terms with it. A second reason we explicitly split positive and negative frequencies is because resampling works only correctly when falls within to . Multiplying an aliased frequency with a non integer number will result in the wrong frequency being generated. E.g: . Whilst . Yet is not an alias of .
We can now determine the value of
We can set up a matrix that will allow us to compute from as a straightforward matrix multiplication. For a fixed and we define the following coefficient matrix.
With this definition is defined as
|Demos of the c matrix for time stretch factors k=0.75, 0.99 and 1.5. White reflect values with a large norm. Dark are values with a small norm. The intermediate window I can be perceived as dark bands at the top and bottom of each image.
There exist 2 symmetry relations in that can help speed up resampling of into the time domain of .
The first symmetry property is that
For this is trivial since . For we have
For we have
The second symmetry relation is
In summary, given a fixed time stretch ratio , an input window size of and output window of , we only need to obtain , and to calculate and .
|fabrics doesn't make exquisite dresses, it is the stitches.
With the ability to extract a sinusoidal components from a frame, and synthesize them into a timestretched signal, we should now look at piecing together consecutive frames. Or in our particular case how to determine a previous peak to which to attach, and how to bring the phases of the new peak in alignment with the old one.
To determine a previous peak we merely look at the previous frame and find the nearest peak, irrespective of its phase or volume. With our time stretcher there is no need to set out a path or try to detect whether traces are dying or coming to live 
To align the synthesized waves, we must know the entry and exit phases of the analysis stage as well as those of the synthesis stage. To determine these, we will deal with 4 positions, each dependent on the relevant stride.
These positions are relative to the frame (they range from 0 to 1), so they must be multiplied with to obtain a correct sample position. They have been created such that when an input or output stride is applied, the exit position of the previous frame overlaps with the entry position of the new frame. The subscript denotes whether we deal with the analysis or synthesized frames.
To lighten the notation of the upcoming discussion, we will automatically use the appropriate analysis or synthesis positions: and
The phases of the translated peak are not the phases we would measure in itself. When the peak is repositioned to position , then the fully correct phase of the output requires us to add , multiplied with the entry/exit position, to the angle of . Therefore, the entry and exit phases of the analyzed peak and the synthesized peaks are:
Because we will now discuss how we make sure that the various waves align correctly, we need to discuss the entry and exit positions of the analysis and synthesis spectra of the previous and current frame. Current peak values are annotated with a superscript \'cur\'. Previous peak values are annotated with a superscript \'prev\'.
The start phase of the synthesized sine () must be in alignment with the exit phase () of the previously synthesized wave. Or to be more correct, it should have the same error as the entry and exit phases had during the analysis stage, which is given by .
The current entry phase of the synthesized sine is yet it should be . Thus we must phase shift the synthesized wave by
To expand the full phases, we need to know the positions of the previous peak () and that of the position of the current peak ()
Reorganizing the various terms and using the fact that exit=1-entry, gives us the required phase modification of the synthesized peak:
In words this reads as: the required phase modification is the sum of 1) the entry phase differences between the synthesized and analyzed envelopes of the current frame and 2) the exit phase differences between the synthesized and analyzed envelopes of the previous frame, and 3) the peaklocation difference multiplied with an appropriate phase trend.
The calculation of a time domain representation of from the spectrum of is straightforward, yet it requires a backward Fourier transform from to , before adding it to the output frame. It is possible to omit the conversions and completely.
This is done as follows. has been previously defined as
has rows and columns. has rows.
Taking the Fourier transform of both sides gives
Thus by computing the Fourier transform of each column in the matrix , we can directly compute . Interesting thereby is that since each column in is conjugate symmetric, the Fourier transformed matrix will consist entirely of real numbers.
|Some examples of the matrix C, given a window size of 64 bins and various tempo modifications. Dark are large values. White are small values. Only a quarter of the full matrix is shown. All other quarters are determined through various symmetry relations.
To compute the Fourier transform of we will only determine the transform for values of . How the other half can be ignored will be explained later.
Were we to include the intermediate hann window in the integral directly, then might not be able to solve it easily. Yet, since is a multiplication of two functions, their Fourier integrals will be a convolution. Therefore, we currently omit the hann window, and will later smear the bins appropriately.
Which further reduces to
Subtracting from the value we pass into the sine does not change the outcome, but might swap the sign, depending on the oddity of . Similarly, the oddity of also determines the sign of the outcome. Overall we get the following:
Application of window requires us to calculate the convolution of the neighboring bins. Luckily the convolution window is very small. With coefficients such as -0.25,0.5 and -0.25 we can easily smoothen with .
Because is a sparse matrix it makes parameter changes to easy to deal with. Only a selected few old coefficients must be removed and regenerated at a new position.
From the fact that , it follow that is real. From the symmetry relation , follows that because . These two symmetry relations are useful to reduce the size of the matrix and speed up computation:
Thus by obtaining coefficients and , and combining them with the input values and , we compute terms for and . When we can reduce this even further to
The described method recreates sound with high fidelity, yet it does not honor the phase content of a track. Time stretching a stereo track by processing each individual channel makes the sound appear phased and out of the room. Reverbs will sound wrong and the overall ambiance will be affected. To solve that, to a certain extent, it is useful to merge stereo tracks and then time stretch the merged and difference signal individually. Once time stretched, we can again calculate the left and right channel.
The joined and difference channel are defined as:
To obtain the Fourier transform of these two at the same time we can calculate the Fourier transform of . (This trick is taken from ).
When a new and have been created, we need to restore a left and right signal. This again can be done with 1 Fourier transform
The inverse transform of of has a real part that will be and an imaginary part that is
This entire process of joining and mangling the input channels and output channels introduces a gain of 4, which should be compensated for with the normalisation gain into .
Aside from the stereo convertion and the 2-for-the-price-of-one FFT tricks, other operations also affect the gain. In particular the various windows used by the time stretcher:
The first window applied is the analysis window. It is necessary to use a hamming window here because its first side lobe is removed. Compared to other windows, this one makes detecting peaks easier and makes interpolation between peaks more accurate. The necessary gain compensation for the analysis window depends on the time stretch ratio.
The second window we use is a modified hann window . It is applied to the synthesized peaks such that the Fourier representation can deal with the discontinuity at the edge of the frame.
The last window is the synthesis window. We choose a welch window, mainly to get rid of phase cancellation artifacts. Whenever we select a wrong peak or have an insufficiently large peak window, we find that we retain some waves that would ordinarily have been canceled out. These problems are mostly apparent at the edges of the synthesized window. A welch window offers a great solution to this whilst retaining as much overlap as possible between frames.
Another place where we omitted gains are the forward and backward Fourier transforms. A forward transform followed by a backward transform introduces a gain that matches the windowsize. This becomes especially confusing when matrix converts a smaller envelope to a larger one.
The latency calculations can be fairly tedious depending on where we anker ourselves. That is, when we input data into the timestretcher, do we measure the latency until the first sample of our input comes out, the middle or the last sample ? Contrary to most latency calculations, if we do not take the middle sample as a reference point, we will have a latency which is affected by the playbackspeed.
To get a grasp on the latency, we drew a number of scenarios: halve speed, double speed and normal speed. For each scenario we placed ourselve at time 0. At T0 we put frame E into the timestretcher and see how it comes out, and more importantly, how it interacts with previous and future frames provided to the timestretcher. Here it must be noted that allthough the graphics might suggest that the input frame stride is fixed, in reality it is the output stride that is fixed. Also the length of all segments is fixed, independent from the playbackspeed.
For a 0.5 playback speed, we see that the starts of the output segments, compared to the starts of the input segments, are stretched out in time. While each input sample is offered to ~5 frames, each output sample only comes from ~3 frames. The graphic shows how the beginning of segment E (drawn in green) is placed at 5 output positions, with a center in the middle of C. The middle of segment E, drawn in blue, appears in the output frames also at different positions. However, in this case, its center is still centered at E. Lastly, the end of segment E maps to the middle of G in the output. Practically, speaking, segment E will be stretched out in time, starting at N*(1-1/speed)/2, up to N*(1+1/speed)/2 (with N referring to the framesize, typically 4096 samples). These bounds define the latencies of the timestretcher.
Similar scenarios are plotted at normal and double speed.
In this article we presented the Zathras-1 time stretcher. The time stretcher detects and transfers all peaks from the source to the target frame. For each peak the envelope is extracted and recreated after time stretching. The phases of consecutive peaks are aligned with the same error they had in the input signal.
The presented algorithm works well with a window size of 4096 samples and an overlap of 4. It can be used to time stretch ~14.6 stereo tracks simultaneously on a 1.4GHz computer. That means that per frame we spend 1.59 milliseconds. On average we have about 630 sine waves per frame, which leaves us with a computational time of about 2.5 microseconds per peak. Results of the time stretcher can be heard here .
Potential improvements include a proper rescaling of the phase modulation. Currently we stretch the phase modulation, which introduces a small error. Secondly, preserving the phase-error as it appears in the source frame might be improved by tuning the error in function of the time stretch ratio. Another improvement would be to have some estimate on the probability that a local maximum is indeed a local maximum and if not switch to another time stretching method.
|Phase Vocoder Flanagan J.L. and Golden, R. M. Bell System Technical Journal 45: 1493–1509; 1966
|Speech Analysis/Synthesis Based on a Sinusoidal Representation R. McAulay and Th. Quatieri IEEE Transactions on Acoustics, Speech, and Signal Processing; August 1986
|Enhanced Partial Tracking using Linear Prediction M. Lagrange, S. Marchand, M. Raspaud, J.B. Rault Proc. of the 6th Int. Conference on Digital Audio Effects (DAFx-03), September 2003
|PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation J. Smith III, X. Serra Center for Computer Research in Music and Acoustics (CCRMA). Department of Music, Stanford University; 1987 https://ccrma.stanford.edu/~jos/sasp/Quadratic_Interpolation_Spectral_Peaks.html
|Fast Exponential Envelopes Werner Van Belle Audio Processing; YellowCouch; April 2012 http://werner.yellowcouch.org/Papers/fastenv12/index.html
|FFT of pure real sequencey.
|Zathras-1 Werner Van Belle Yellowcouch Signal Processing http://werner.yellowcouch.org/log/zathras/