Style transfer using neural networks

The pytorch examples have a very nice explanation of how to do style transfer. That is: take a photo and ‘redraw’ it as if it were made by a certain artist. http://pytorch.org/tutorials/advanced/neural_style_tutorial.html

First the input image:

Then the style images:

And now Arwen, redone given the above styles:

Quite amazing, isn’t it?

I just tested adadelta on the superresolution example that is part of the pytorch examples. The results are quite nice and I like the fact that LeCun’s intuition to use a hessian estimate actually got implemented in an optimizer (I tried doing it myself but couldn’t get through the notation in the original paper).
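The update rule itself is simple enough to sketch in a few lines of plain Python. This is my own toy version on f(x) = x², not the superresolution code; rho and eps are the usual Adadelta defaults:

```python
# Minimal Adadelta (Zeiler, 2012) on f(x) = x^2 -- a sketch, not the
# superresolution training code.  The RMS(dx)/RMS(g) ratio plays the
# role of a diagonal inverse-Hessian estimate, which is the LeCun-style
# intuition mentioned above.
import math

def adadelta(x0, steps, rho=0.9, eps=1e-6):
    x = x0
    eg = 0.0   # running average of squared gradients
    ed = 0.0   # running average of squared updates
    for _ in range(steps):
        g = 2.0 * x                      # gradient of x^2
        eg = rho * eg + (1 - rho) * g * g
        dx = -math.sqrt(ed + eps) / math.sqrt(eg + eps) * g
        ed = rho * ed + (1 - rho) * dx * dx
        x += dx
    return x

x = adadelta(1.0, 100)
# x has crept toward the minimum at 0, with no learning rate to tune
```

Note there is no learning rate: the ratio of the two running averages sets the step size, which is exactly what makes it feel Hessian-like.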

Interestingly, the ‘learning rate of 1’ scatters throughout the space a bit more than you would expect. Eventually it does not reach the same minimum as a learning rate of 0.1.

In the above example we also delineated epochs every 20 steps. That is, when step%20==0 we cleared all the gradients. It feels a bit odd that we have to do so. In any case, without the delineation into epochs the results are not that good, and I do not entirely understand why. It is clear that each epoch allows the optimizer to explore a ‘new’ direction by forgetting the garbage trail it was on, and in a certain way it regularizes how far each epoch can walk away from its original position. Yet _why_ the optimizer does not decide for itself that it might be time to ditch the gradients is something I find interesting.
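The effect of clearing (or not clearing) gradients can be illustrated with a toy of my own, not the actual example code: on f(x) = x², a gradient that is never cleared acts like undamped momentum and the iterate keeps oscillating, while clearing before every step converges.

```python
# Toy illustration (my own, not the pytorch example) of why stale
# accumulated gradients hurt: on f(x) = x^2, clearing the gradient
# before every step converges, while accumulating it forever behaves
# like undamped momentum and keeps oscillating around the minimum.

def run(steps, clear_every):
    x, g_acc = 1.0, 0.0
    history = []
    for t in range(steps):
        if clear_every and t % clear_every == 0:
            g_acc = 0.0                # the equivalent of zero_grad()
        g_acc += 2.0 * x               # gradients accumulate by default
        x -= 0.1 * g_acc
        history.append(x)
    return history

cleared = run(100, clear_every=1)      # fresh gradient each step
stale = run(100, clear_every=0)        # never cleared
tail = lambda h: sum(v * v for v in h[-10:])
# the cleared run ends essentially at the minimum; the stale run does not
```

Clearing every 20 steps, as in the post, sits between these two extremes: within an epoch the gradients go stale, at the epoch boundary the trail is ditched.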

Orthogonal weight initialization in PyTorch seems kinda weird

I recently gave deep learning another go. This time I looked into pytorch. At least the thing lets you program in a synchronous fashion. One of the examples, however, did not work as expected.

I was looking into the superresolution example (https://github.com/pytorch/examples) and printed out the weights of the second convolution layer. It turned out these were ‘kinda weird’ (similar to the attached picture). So I looked into them and found that the orthogonal weight initialization that was used would leave a large section of the weights of a 4-dimensional tensor uninitialized. Yes, I know the documentation states that ‘dimensions beyond 2’ are flattened. That does not mean, though, that the values of a large portion of the matrix should be empty.

The orthogonal initialisation seems to have become a standard (for good reason; see the paper https://arxiv.org/pdf/1312.6120.pdf), yet it is one that does not work well together with convolution layers, where a simple input->output matrix is not straight away available. Better is to use the xavier_uniform initialisation. That is, in the file model.py you should have an _initialize_weights as follows:

```
def _initialize_weights(self):
    # assumes model.py already does: import torch.nn.init as init
    # (in newer PyTorch versions the function is named xavier_uniform_)
    init.xavier_uniform(self.conv1.weight, init.calculate_gain('relu'))
    init.xavier_uniform(self.conv2.weight, init.calculate_gain('relu'))
    init.xavier_uniform(self.conv3.weight, init.calculate_gain('relu'))
    init.xavier_uniform(self.conv4.weight)
```

With this, I trained a model on the BSDS300 dataset (for 256 epochs) and then tried to upsample a small image by a factor of 2. The upper image is the small image (upsampled using a bicubic filter). The bottom one is the small picture upsampled using the neural net.

The weights we now get at least use the full matrix.

The output when initialized with “orthogonal” weights has some sharp ugly edges:

A demonstration of the difference between two crossfades. The first is a straightforward crossfade. The second is a crossfade in which the partials of both tracks are detected, selected (the strongest wins) and then resynthesized.

To hear the difference, listen to the middle of the two tracks (around 16″). While the normal crossfade sounds muddier, the second one retains the same volume and clarity as either track (at least to my ears).
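The mid-point dip of a plain crossfade can be shown with a little arithmetic. The sketch below is my own toy, not BpmDj code: at the middle of a linear crossfade both gains are 0.5, and for two uncorrelated signals the powers add rather than the amplitudes, leaving about 1/√2 (roughly -3 dB) of either track’s RMS, which is the muddiness you hear.

```python
# Why a plain linear crossfade dips at the midpoint: at gains 0.5/0.5
# two uncorrelated signals add in power, not amplitude, so the mix sits
# at about 1/sqrt(2) (~ -3 dB) of either track's RMS.  Toy sketch.
import math

SR = 8000
t = [n / SR for n in range(SR)]            # one second of audio
track_a = [math.sin(2 * math.pi * 440 * x) for x in t]
track_b = [math.sin(2 * math.pi * 550 * x) for x in t]

def rms(sig):
    return math.sqrt(sum(s * s for s in sig) / len(sig))

# the midpoint of a linear crossfade: both gains are 0.5
mid = [0.5 * a + 0.5 * b for a, b in zip(track_a, track_b)]
ratio = rms(mid) / rms(track_a)            # ~0.707, i.e. about -3 dB
```

The partial-based resynthesis avoids this dip because for each partial only the strongest of the two tracks survives at full level.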

A talk on timestretching at a hackerscamp

Time stretching of audio tracks can be easily done by either interpolating missing samples (slowing down the track), or by throwing away samples (speeding up the track). A drawback is that this results in a pitch change. In order to overcome these issues, we created a time stretcher that would not alter the pitch when the playback speed changed. In this talk we discuss how we created a fast, high quality time stretcher, which is now an integral part of BpmDj. We explain how a sinusoidal model is extracted from the input track, its envelope modeled and then used to synthesize a new audio track. The synthesis timestretches the envelope of all participating sines, yet retains the original pitch. The resulting time stretcher uses only a frame overlap of 4, which reduces the amount of memory access and computation compared to other techniques.

We assume the listener will have a notion about Fourier analysis. We do however approach the topic equally from an educational as well as from a research perspective.

High resolution slides are available at http://werner.yellowcouch.org/Papers/sha2017/index.html

Sleep debt, Oxygen deprivation and Resmed

I apologize at the start of this post: I never wanted to sound like someone who documents his snoring. Anyway, I do so because I feel I have some important things to share. I will try to stick to some facts that might help you, without going too much into my personal situation.

For about half a year now I have been trying to get my snoring under control. To that end I got a Resmed Airsense 10 from my doctors/insurers, together with a nasal mask.

Nostrils

The first big obstacle was learning to sleep with my mouth closed. Not much to do but to actually do it. This took some weeks.

Then getting up to speed was problematic because my nostrils would be more closed than open during the night. That led to painful lungs and a not-so-optimal ‘therapy’. Two things were necessary to resolve this:

a- I got rid of the air filter in the machine. The filter that was installed would actually pollute the incoming air (it hadn’t been changed in at least 6 months and the provider didn’t feel in a hurry to change it).

b- I started using the humidifier at position 4. Every morning I would take it out of the machine and leave it open during the day. That would allow the water to breathe. Once in a while I would replace it completely.

With those two tricks I got my nostrils somewhat under control.

Lack of oxygen

Although the headaches during the day vanished immediately after starting to use the machine, I now got the symptoms of someone lacking oxygen: I felt really tired in the afternoon. Talking to my doctor did not help much. He suggested that I could not be lacking oxygen because there was a positive pressure at the inlet.

It took me a couple of months to realize that he was wrong. To understand why, just imagine the mask with no breathing holes at all. If you exhale, you will fill up the long tube, the humidifier and the rest of the machine with used air. The next breath you take will consist first of the old air, then the new. Now imagine only one or two holes: the machine will still be able to generate the necessary pressure, yet no real air exchange will take place.

To analyze this further I set up a simulation in which the patient would inhale all the air he just exhaled. Here is what happens to the oxygen level then:

The above plot shows the initial oxygen concentration at about 21%. Each breath removes a quarter of the oxygen, leaving you after 4 breaths with only a third of the necessary oxygen! That is quite staggering.
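The rebreathing thought experiment is a three-liner:

```python
# Full rebreathing, as in the thought experiment above: you inhale only
# your own exhaled air, and every breath burns a quarter of the oxygen.
o2 = 0.21                      # normal atmospheric oxygen fraction
for breath in range(4):
    o2 *= 0.75                 # each breath removes 1/4 of the O2
fraction_left = o2 / 0.21      # (3/4)^4 ~ 0.316: about a third remains
```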

Of course, the resmed machines do not have closed holes. The positive pressure replaces some of the old air with new air. The question now is: how much? This can be expressed as the percentage of the air that is ‘swapped’ per breath cycle. For each percentage, we can calculate the amount of oxygen (compared to normal air) that would be available to you during the night.

From the above plot we can see that if the machine is able to swap out 80% of the air during one breath cycle, you will have 93% of normal-air oxygen. That is 7% less than you need.
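A simple steady-state model of my own (not necessarily the exact model behind the plot) gives figures in the same ballpark: assume each cycle replaces a fraction s of the mask air with fresh air, and each breath consumes 25% of the inhaled oxygen.

```python
# Steady-state oxygen availability as a function of the air fraction
# swapped out per breath cycle.  Toy model of my own: the inhaled air
# is fraction s of fresh air (relative O2 = 1.0) plus fraction 1-s of
# rebreathed air carrying 75% of what was inhaled last time.
def steady_state_oxygen(s, cycles=200):
    x = 1.0                        # inhaled O2 relative to fresh air
    for _ in range(cycles):
        x = s * 1.0 + (1 - s) * 0.75 * x
    return x

avail = steady_state_oxygen(0.80)  # ~0.94 for an 80% swap
```

With an 80% swap this model settles around 94% of normal-air oxygen, close to the 93% read off the plot.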

Clearly we had found the culprit. I must have had an air exchange percentage that was sufficiently low, leading to a low oxygen availability. The question now was: what to do about it ?

Solution #1: turn off the ‘autoset’ feature of the resmed machine, or increase its minimum pressure substantially – Initially my machine was set to autoset mode, which means it tries to determine the best setting for your situation automatically. It navigates between the minimum and maximum boundaries to minimize the number of blockages. Of course, that minimum might lower the average pressure that pushes air out of the mask. Thus: although it might minimize the apnoeas, it may no longer vent properly. An easy fix is to use a continuous positive pressure, or to increase the minimum pressure. That there is some truth to this can be read in online reports of people who went from a fixed-pressure machine (resmed 8) to one with automatic settings (resmed 10) and complained that they felt worse with the new machine.

How does the mathematics look? Generally, the air speed through a hole can be modeled as the square root of the pressure difference divided by the air density (ignoring friction and so on). We thus have sqrt(0.016666 P) describing the air velocity through the holes. Thus if we raise the minimum pressure by a factor x, we will push out around sqrt(x) more air.
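In code, the sqrt(x) claim follows directly from that Bernoulli-style relation (the constant k is the 0.016666 from above; its exact value cancels out of the ratio):

```python
# Bernoulli-style estimate, friction ignored: vent air speed scales as
# sqrt(pressure), so raising the pressure by a factor x pushes out
# about sqrt(x) more air per unit time.
import math

def vent_speed(pressure, k=0.016666):
    return math.sqrt(k * pressure)     # same form as sqrt(0.016666 P)

ratio = vent_speed(2.0) / vent_speed(1.0)   # doubling P -> sqrt(2) flow
```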

Solution #2: turn off Exhalation Pressure Release (the EPR setting) – Online you often read that people sleep easier when the machine lets them exhale more easily (it does that by lowering the pressure when you exhale), yet in doing so it drastically reduces the amount of air swapped out.

Solution #3: use a larger mask – Try closing the main hole of the mask and exhaling through the little vent holes. Measure how long that takes and compare it against a normal exhalation. If you cannot generate enough pressure to get all the air out within one breath cycle, then your mask is too small.

Submissions to SHA2017 are out

Although SHA 2017 is too expensive for what it is (OHM 2013 was not particularly well organized), I decided that if they pay my entrance ticket I will still participate. I proposed two or three events.

First a talk on Zathras titled: “Time Stretching BpmDj – 8 Secrets the Audio Industry does not want you to know. Nr 5 will shock you.”

Time stretching of audio tracks can be easily done by either interpolating missing samples (slowing down the track), or by throwing away samples (speeding up the track). A drawback is that this results in a pitch change. In order to overcome these issues, we created a time stretcher that would not alter the pitch when the playback speed changed.

In this talk we discuss how we created a fast, high quality time stretcher, which is now an integral part of BpmDj. We explain how a sinusoidal model is extracted from the input track, its envelope modeled and then used to synthesize a new audio track. The synthesis timestretches the envelope of all participating sines, yet retains the original pitch. The resulting time stretcher uses only a frame overlap of 4, which reduces the amount of memory access and computation compared to other techniques.

Demos of the time stretcher can be heard at http://werner.yellowcouch.org/log/zathras/
The paper that accompanies this talk is at http://werner.yellowcouch.org/Papers/zathras15/

We assume the listener will have a notion about Fourier analysis. We do however approach the topic equally from an educational as well as from a research perspective.

Then I proposed to play two DJ sets with BpmDj. In itself interesting, because I have not played anything in the past 10 years. So I had to sell myself somehow.

Dr. Dj. Van Belle, a psyparty DJ who has his roots in the BSG Party Hall (Brussels/Belgium, 1998). After playing popular tunes for them students, he decided to throw in some psytrance… And absolutely no new style was born. He started his trend to be as inconspicuous as possible. In October 2006 he surfaced at the northern Norwegian Insomnia Festival. As an experiment, he played all songs at 85% of their normal speed. Every time he saw a camera, he inconspicuously hid behind the mixing desk. Since then he has done absolutely nothing. His career is as much a standstill as psytrance was between 2000 and 2016. And this makes him the perfect DJ. Bring in some of them good old beats. Some nostalgia for y’all. An academic approach to the real challenge on how to entertain them phone junkies.

Nowadays he plays anything he can get his hands on, mainly to test the DJ software he made. Some of his mixes can be found at https://www.mixcloud.com/5dbb/

I’m curious what they will accept (if anything).

Synthesizing waves

This image represents one of the remaining problems with the Zathras timestretcher I wrote. When synthesizing a new wave, we want to do that fast, so we use an FFT to generate on average 632 sines at the same time. The problem is that whenever a wave has a frequency that does not match any Fourier bin, we need to ‘fix’ it. That is done by applying a phase modulation. Yet because Fourier synthesis requires circular waves, the endpoints must match (that is, the phase must advance by a multiple of 2π). When the wave is modulated with a non-2π multiple, this requirement is not satisfied. The result is that we set out a frequency path (the blue line, with only 8 points), and then assume that the final synthesized wave will be equally linear. The red line shows how this is not the case.
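The circularity requirement is easy to demonstrate numerically (a toy of my own, not Zathras code): a sine on an exact Fourier bin repeats after N samples, while a sine half a bin off comes back phase-inverted, so its endpoints cannot line up.

```python
# Why off-bin frequencies break circular Fourier synthesis: a sine on
# an exact bin is N-periodic; a sine half a bin off satisfies
# x[n+N] = -x[n], so its endpoints cannot match.
import math

N = 256

def mismatch(freq_bins):
    # max |x[n+N] - x[n]|: zero iff the wave wraps circularly over N
    x = [math.sin(2 * math.pi * freq_bins * n / N) for n in range(2 * N)]
    return max(abs(x[n + N] - x[n]) for n in range(N))

on_bin = mismatch(10.0)    # exact Fourier bin: wraps perfectly
off_bin = mismatch(10.5)   # half a bin off: phase-inverted after N
```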

At the moment, to solve this we overlap a lot of these windows so the error fades away in the background. Yet a metallic ring remains.

A second solution is to apply an appropriate window (e.g. a Kaiser-Bessel window), which pushes the entire error into the endpoints.

Autoencoder identity mapping

Can an autoencoder learn the identity mapping? To test that, I went to the extreme: let an optimisation algorithm (SGD) find the best mapping when the visible units are 0-dimensional (a scalar) and the hidden units are as well.

The first remarkable thing is that there is no solution that gives a perfect mapping! There simply does not exist a relation that maps the input straight to the output when tied weights and sigmoids are used.

Anyway, because the problem is so low-dimensional, we can calculate the cost over an area and plot it as a surface. Two things are worth noting.

1. The minimum can be found as soon as we get into a very narrow valley… from the right angle… If we were to enter it from the back (B>40) then the valley floor is not sufficiently steep to guide us quickly to the minimum.
2. If we were dropped on this surface at (W:40; B:-20) then the search algorithm would go down from one plateau to the next, blissfully unaware of that nice crevasse that we laid out for it.
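The no-perfect-mapping claim can be checked by brute force. The parametrization below is my guess at the setup (a tied scalar weight W, a shared bias B, sigmoids on both layers); a grid search over the (W, B) plane never drives the cost to zero.

```python
# Grid search over the (W, B) surface of a scalar tied-weight sigmoid
# autoencoder: x_hat = sigmoid(W * sigmoid(W*x + B) + B).  The exact
# parametrization (shared bias, tied weight) is my assumption.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(W, B, xs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    err = 0.0
    for x in xs:
        x_hat = sigmoid(W * sigmoid(W * x + B) + B)
        err += (x_hat - x) ** 2
    return err / len(xs)

grid = [i / 2.0 for i in range(-100, 101)]        # W, B in [-50, 50]
best = min(cost(W, B) for W in grid for B in grid)
# best stays well above zero: no (W, B) yields the identity mapping
```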

A small step into denoising autoencoders. Optimizers.

The following chart shows some of the results of creating a denoising autoencoder.

The idea is that a neural network is trained to map a 1482-dimensional input space to a 352-dimensional space in such a way that it can recover the 30% of the data that was randomly removed. Once that first stage is trained, its output is used to train the second stage, which maps the data to 84 dimensions. The last stage brings it further down to 21 dimensions. The advantage of this method is that such denoising autoencoders pick up patterns in the input, which are then combined at a higher level into higher-level patterns.
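A single stage of this scheme looks roughly as follows. This is a heavily scaled-down sketch of my own (8 -> 4 units instead of 1482 -> 352, plain SGD, each input zeroed with probability 0.3), not the BpmDj pipeline:

```python
# Scaled-down sketch of one denoising-autoencoder stage: corrupt the
# input, encode, decode, and backpropagate the reconstruction error
# against the *clean* input.  Toy code, not the BpmDj training code.
import math, random

random.seed(0)
N_IN, N_HID, LR = 8, 4, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
W2 = [[random.uniform(-0.5, 0.5) for _ in range(N_HID)] for _ in range(N_IN)]
b2 = [0.0] * N_IN

data = [[1,0,1,0,1,0,1,0], [0,1,0,1,0,1,0,1], [1,1,1,1,0,0,0,0]]

def forward(x):
    h = [sigmoid(sum(W1[i][k] * x[k] for k in range(N_IN)) + b1[i])
         for i in range(N_HID)]
    y = [sigmoid(sum(W2[j][i] * h[i] for i in range(N_HID)) + b2[j])
         for j in range(N_IN)]
    return h, y

def train_step(x):
    xc = [0.0 if random.random() < 0.3 else v for v in x]  # corrupt input
    h, y = forward(xc)
    d2 = [2.0 * (y[j] - x[j]) * y[j] * (1 - y[j]) for j in range(N_IN)]
    d1 = [sum(d2[j] * W2[j][i] for j in range(N_IN)) * h[i] * (1 - h[i])
          for i in range(N_HID)]
    for j in range(N_IN):
        for i in range(N_HID):
            W2[j][i] -= LR * d2[j] * h[i]
        b2[j] -= LR * d2[j]
    for i in range(N_HID):
        for k in range(N_IN):
            W1[i][k] -= LR * d1[i] * xc[k]
        b1[i] -= LR * d1[i]

def loss():
    return sum(sum((forward(x)[1][j] - x[j]) ** 2 for j in range(N_IN))
               for x in data)

before = loss()
for epoch in range(500):
    for x in data:
        train_step(x)
after = loss()   # reconstruction error drops well below `before`
```

Stacking then means: freeze this stage, feed its hidden activations to the next 352 -> 84 stage, and so on down to 21.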

I have been testing various optimizers. The results below show how much of a signal can be recovered. To do that, we take the 1482-dimensional dataset, map it down to 21 dimensions and then map it back up to 1482 dimensions. After that we compare the original and the recovered signal. The error we get is then compared against the simplest predictor, namely the average of the signal.

Now, the first thing we noticed is that although rmsprop-style approaches go extremely fast, they do result in an average signal (literally: they just decode the signal by producing the average). Secondly, stochastic data corruption should of course not be combined with an optimizer that compensates for such noise (which the rmsprop and momentum methods do to a certain extent).

In the end, sgd turns out to retain the most ‘local patterns’, yet converges too slowly. Using adam improves the convergence speed. In this case, because mean-normalizing the data messes up the results, we actually modified adam to calculate the variance correctly.

This is of course all very beginner style stuff. Probably in a year or so I will look back at this and think: what the hell was I thinking when I did that ?

How did we come to these particular values ?

BpmDj represents each song with a 1482-dimensional vector. So I already had ~200000 of those and wanted to play with them. Broken down: the rhythm patterns contain 384 ticks per frequency band and we have 3 frequency bands (3×384 = 1152). Aside from that we have 11 loudness quantiles for each of 30 frequency bands (11×30 = 330). That sums to exactly 1482 dimensions.

Then the second decision was made to stick to three layers, mainly because google deepdream already finds high-level patterns at the 3rd level. No reason to go further than that, I thought.

Then I aimed for roughly the same reduction factor in each stage (÷~4, as you noticed). So I ballparked the 21. That was mainly because initially I started with an autoencoder that went 5 stages deep. It so happened I was happy with 21 dimensions and kept it like that so I could compare results between the two.

Now, that 21-dimensional space might still be way too large. Theoretically, we can represent 2^21 classes with it if each neuron simply said yes/no. However, there is also a certain amount of redundancy between neurons: they sometimes find the same patterns, and so on.