MP3net

MP3net is a deep convolutional GAN which leverages techniques from MP3/Vorbis audio compression to produce long, high-quality audio tracks with long-range coherence.

The model uses a Modified Discrete Cosine Transform (MDCT) data representation, which includes all phase information. Phase generation is hence an integral part of the model. We leverage auditory masking and the psychoacoustic perception limits of the human ear to widen the true distribution and stabilize the training process.
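To illustrate why an MDCT representation carries the phase information, here is a minimal NumPy sketch of a textbook windowed MDCT and its inverse. It is not the MP3net implementation: the function names, the sine window, the zero padding, and the block size are assumptions made for this example. The coefficients are real-valued, and overlap-adding the inverse frames recovers the waveform exactly, so no separate phase-reconstruction step is needed.

```python
import numpy as np

def mdct_basis(n_bins):
    # Cosine basis of shape (n_bins, 2 * n_bins) for one frame of 2 * n_bins samples.
    n = np.arange(2 * n_bins)
    k = np.arange(n_bins)
    return np.cos(np.pi / n_bins * np.outer(k + 0.5, n + 0.5 + n_bins / 2))

def sine_window(n_bins):
    # Sine window; satisfies the Princen-Bradley condition w[n]^2 + w[n + N]^2 = 1.
    n = np.arange(2 * n_bins)
    return np.sin(np.pi * (n + 0.5) / (2 * n_bins))

def mdct(signal, n_bins):
    # Assumes len(signal) is a multiple of n_bins.
    # One block of zeros on each side so every sample is covered by two frames.
    padded = np.concatenate([np.zeros(n_bins), signal, np.zeros(n_bins)])
    n_frames = len(padded) // n_bins - 1
    frames = np.stack([padded[t * n_bins:(t + 2) * n_bins] for t in range(n_frames)])
    # Result has shape (n_frames, n_bins): time along axis 0, frequency along axis 1.
    return (frames * sine_window(n_bins)) @ mdct_basis(n_bins).T

def imdct(coeffs, n_bins):
    # Inverse transform of each frame, windowed again, then overlap-added with hop n_bins.
    frames = (2.0 / n_bins) * (coeffs @ mdct_basis(n_bins)) * sine_window(n_bins)
    out = np.zeros((coeffs.shape[0] + 1) * n_bins)
    for t, frame in enumerate(frames):
        out[t * n_bins:(t + 2) * n_bins] += frame
    return out[n_bins:-n_bins]  # drop the zero padding

# Round trip on a random mono signal (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal(8 * 16)
coeffs = mdct(x, n_bins=16)
x_rec = imdct(coeffs, n_bins=16)
```

Each frame overlaps its neighbours by half, and the aliasing introduced by one frame is cancelled by the next, which is why the 2D array `coeffs` alone is enough to recover `x`.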

The model architecture is a deep 2D convolutional network, where each subsequent generator model block increases the resolution along the time axis and adds a higher octave along the frequency axis. The deeper layers are connected with all parts of the output and have the context of the full track. This enables generation of samples which exhibit long-range coherence. An additional benefit of the CNN-based model architecture is that generation of new songs is almost instantaneous.
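The growth of the output across generator blocks can be sketched as simple shape arithmetic. The snippet below is only an illustration: the base shape, the number of blocks, and the time upsampling factor are hypothetical values, not the configuration used in MP3net. The one fixed relationship is that adding a higher octave doubles the covered frequency range, and hence the number of linearly spaced frequency bins.

```python
def progressive_shapes(base_time, base_freq, n_blocks, time_factor=2):
    """Return the (time, frequency) shape after each generator block.

    All arguments are hypothetical illustration values.
    """
    shapes = [(base_time, base_freq)]
    for _ in range(n_blocks):
        t, f = shapes[-1]
        # Finer time resolution, plus one extra octave of frequency bins.
        shapes.append((t * time_factor, f * 2))
    return shapes

shapes = progressive_shapes(base_time=4, base_freq=8, n_blocks=3)
for depth, (t, f) in enumerate(shapes):
    print(f"after block {depth}: {t} time steps x {f} frequency bins")
```

Because the earliest blocks operate on a very small grid, each of their activations influences a large portion of the final output, which is consistent with the long-range coherence described above.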

We used MP3net to create 95s stereo tracks of classical piano music with a 22kHz sample rate after training for 250h on a single Cloud TPUv2. We also used the same model structure with a higher number of features to generate shorter, 5s samples which exhibit better audio quality and a clearer piano timbre.

Links: paper; source code

Short samples (120h training)	Long samples (250h training)	Long samples after Xh of training	Real training samples
5s sample 1	95s sample 1	sample 207h	real sample 1
5s sample 2	95s sample 2	sample 200h	real sample 2
5s sample 3	95s sample 3	sample 192h	real sample 3
5s sample 4	95s sample 4	sample 109h	real sample 4
5s sample 5	95s sample 5	sample 42h	real sample 5
5s sample 6	95s sample 6	sample 28h	real sample 6