Autolume: Automating Live Music Visualisation - Technical Report
Autolume is an interactive music visualizer based on Generative Adversarial Networks (GANs). The program explores the latent space of a trained model and fits its movement to the music. The algorithm uses simple audio features, such as amplitude, onset timing and notes, to influence the images generated by the model. Additionally, we incorporate a MIDI controller as an interactive tool for editing and manipulating the generation process with Network Bending and GANSpace, allowing artists to create fitting imagery.
Prior work has explored ways to use GANs to generate music videos offline, as well as techniques for editing their output. Autolume incorporates these approaches and adapts them to automatically create suitable visuals for a live performance, while using a common VJing interface to facilitate a new way of interacting with the medium.
We accommodate live performances by reducing the video's resolution. Especially in audio visualization, the delay between the music and the visuals has to be low enough to be unnoticeable. With an NVIDIA Quadro RTX 5000, generating a single frame at a resolution of 512x512 takes 34 milliseconds (24 fps). VJing software aims to stay close to 16 milliseconds per frame (60 fps) to keep the visuals responsive and smooth.
It is possible to run the GAN at a lower resolution (256 px at 40 fps, 128 px at 60 fps) or to use a less expensive model, and then rely on VJing software (Touchdesigner or Resolume) to up-sample the stream in post-processing or to stretch the frames with effects.
We use StyleGAN2-ada, as it has seen the most research and exploration. Our VJing interface builds on its disentangled latent space and on architecture-specific manipulations to make the program interactive.
Although our implementation is based on the StyleGAN2 architecture, it is possible to extend it to other models.
Future work includes finding faster generative approaches or distilling the model for more efficient inference.
Audio-Reactive Visualization
Using GANs for audio visualization allows for different ways to create audio-reactive images. Currently, there are two trends:
First, traversing the latent space, using audio features to define the step in the high-dimensional space. Second, mapping latent vectors to notes and weighting the individual vectors by the notes' amplitudes. Beyond this main difference, both approaches can use additional musical features to create more complex reactions.
Latent Space Traversal
GANs learn a mapping from a lower-dimensional latent space into the image space. During training, neighbourhoods in the lower-dimensional space are mapped to similar images, which leads to smooth morphing between images when interpolating from one vector to another. By mapping the audio to a walk in the latent space, the images appear to react to the music. We couple the step size of a random walk to the change in amplitude and the overall magnitude of the current signal.
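The following is a minimal sketch of how such an amplitude-driven random walk could look; the generator call, the RMS-based magnitude and the scaling constants are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

# Amplitude-driven random walk in the latent space (illustrative sketch).
Z_DIM = 512
rng = np.random.default_rng()

def next_latent(audio_chunk, z, prev_amplitude, base_step=0.02, reactivity=0.5):
    """Advance the walk; louder or quickly changing audio takes larger steps."""
    amplitude = float(np.sqrt(np.mean(audio_chunk ** 2)))   # overall magnitude (RMS)
    delta = abs(amplitude - prev_amplitude)                  # change in amplitude
    step_size = base_step + reactivity * (amplitude + delta)

    direction = rng.standard_normal(Z_DIM)
    direction /= np.linalg.norm(direction)                   # unit-length random direction

    return z + step_size * direction, amplitude

# Per audio frame (generator call assumed):
#   z, prev_amplitude = next_latent(chunk, z, prev_amplitude)
#   frame = generator(z)
```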
Chroma-based Interpolations
Another technique for creating audio-reactive videos is to map positions in the latent space to notes. The algorithm interpolates between these positions according to the notes being played. The idea of motifs in music is better represented with these presets, because the visuals stay in the same subspace of the latent space and return to previous states when the same notes are played.
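A minimal sketch of this idea, assuming twelve stored latent vectors (one per pitch class) and using librosa's chromagram to weight them; the preset vectors and blending scheme are illustrative.

```python
import numpy as np
import librosa

Z_DIM = 512
rng = np.random.default_rng(0)
note_latents = rng.standard_normal((12, Z_DIM))   # one latent preset per pitch class

def chroma_latent(audio_chunk, sr=44100):
    """Blend the note presets according to the chroma energy of the current chunk."""
    chroma = librosa.feature.chroma_stft(y=audio_chunk, sr=sr)  # shape (12, frames)
    weights = chroma.mean(axis=1)                               # energy per pitch class
    weights /= weights.sum() + 1e-8                             # normalise to sum to 1
    return weights @ note_latents                               # blended latent, shape (Z_DIM,)
```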
Furthermore, we incorporate the audio-reactive techniques described at https://wavefunk.xyz/audio-reactive-stylegan: mapping onset strength to the noise inputs, and using audio decomposition to route the harmonic content to certain layers of the model and the percussion to others.
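The decomposition step could look like the sketch below, using librosa's harmonic-percussive separation and onset-strength envelope; how the two parts are routed to generator layers depends on the model wrapper and is omitted here.

```python
import numpy as np
import librosa

def audio_components(y, sr=44100):
    """Split the signal and compute a normalised onset-strength envelope."""
    harmonic, percussive = librosa.effects.hpss(y)                  # harmonic / percussive parts
    onset_env = librosa.onset.onset_strength(y=percussive, sr=sr)   # onsets of the percussion
    onset_env = onset_env / (onset_env.max() + 1e-8)                # normalise to [0, 1]
    return harmonic, percussive, onset_env
```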
To run these techniques live, we keep a buffer of past frequency windows and compute the spectrogram, chromagram and other features locally. For onset detection, we track the average amplitude over a period of time and check whether the current amplitude exceeds a threshold, indicating a significant peak.
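A minimal sketch of this peak-picking scheme; the history length and threshold factor are illustrative values, not the ones used in Autolume.

```python
from collections import deque
import numpy as np

class LiveOnsetDetector:
    """Flag windows whose amplitude clearly exceeds the recent average."""

    def __init__(self, history=43, threshold=1.5):
        self.amplitudes = deque(maxlen=history)   # rolling amplitude history
        self.threshold = threshold

    def update(self, window):
        amplitude = float(np.sqrt(np.mean(window ** 2)))
        is_onset = (
            len(self.amplitudes) > 0
            and amplitude > self.threshold * np.mean(self.amplitudes)
        )
        self.amplitudes.append(amplitude)
        return is_onset
```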
With smaller windows and shorter sequences for audio feature extraction, the resulting features exhibit noisiness that was not present in the previous offline approaches. To reduce this noise, we apply a running Gaussian filter to the frequencies used to compute our features.
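One way to implement such a running Gaussian is sketched below, smoothing a short buffer of frequency frames along the time axis; the buffer length and sigma are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

class RunningGaussian:
    """Smooth the most recent feature frame against a short history."""

    def __init__(self, n_bins, history=16, sigma=2.0):
        self.buffer = np.zeros((history, n_bins))
        self.sigma = sigma

    def smooth(self, frame):
        self.buffer = np.roll(self.buffer, -1, axis=0)   # drop the oldest frame
        self.buffer[-1] = frame                          # append the newest one
        smoothed = gaussian_filter1d(self.buffer, sigma=self.sigma, axis=0)
        return smoothed[-1]                              # newest frame after smoothing
```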
VJing Interface
In addition to extending the offline techniques into a live visualization tool, we draw on recent research on interpretable GANs and Network Bending to create an interactive tool for artists.
GANSpace
GANSpace identifies interpretable directions in the latent space using PCA. The method is lightweight and requires no supervision: it finds a basis of the latent space along which the generated images show high variance. We map these directions to the faders of the controller, allowing the user to adjust the visuals. We chose GANSpace over random directions because we want the manipulations to be understandable and to have a visible influence on the visuals.
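A minimal sketch of how the principal directions could be computed and tied to fader values, assuming the usual StyleGAN2-ada mapping-network interface (G.mapping, G.z_dim); the sample count and scaling are illustrative.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

@torch.no_grad()
def compute_directions(G, n_samples=10000, n_components=8, device="cuda"):
    """Estimate GANSpace directions by PCA over sampled mapping-network outputs."""
    z = torch.randn(n_samples, G.z_dim, device=device)
    w = G.mapping(z, None)[:, 0, :].cpu().numpy()   # one intermediate latent per sample
    pca = PCA(n_components=n_components).fit(w)
    return pca.components_                          # shape (n_components, w_dim)

def apply_faders(w, directions, fader_values, scale=3.0):
    """Offset the current latent by each fader's scaled principal direction."""
    offset = (np.asarray(fader_values)[:, None] * directions).sum(axis=0)
    return w + scale * offset
```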
Network Bending
Network bending is the idea of manipulating the network itself to influence the generated images. By translating, rotating or zooming the intermediate feature maps of a layer, we can perform these transformations on the visualisation. Furthermore, we allow the user to adjust the truncation value, which determines how the latent space is sampled: a low truncation value reduces the diversity of the images, while a truncation value greater than one results in more abstract images. We map these transformations to the controller's faders, where pressing a slider activates the transformation and the fader adjusts its strength.
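Two of these manipulations are sketched below under common StyleGAN2 conventions: the truncation adjustment follows the standard formulation w' = w_avg + psi * (w - w_avg), and the translation is applied to an intermediate feature map through a forward hook; the layer chosen for the hook is an assumption.

```python
import torch

def truncate(w, w_avg, psi=0.7):
    """psi < 1 pulls samples toward the average latent; psi > 1 pushes past it."""
    return w_avg + psi * (w - w_avg)

def make_translation_hook(shift_x=0, shift_y=0):
    """Shift a layer's feature maps spatially, wrapping around at the borders."""
    def hook(module, inputs, output):
        return torch.roll(output, shifts=(shift_y, shift_x), dims=(2, 3))
    return hook

# e.g. attach to some convolution of the synthesis network (layer choice assumed):
#   handle = conv_layer.register_forward_hook(make_translation_hook(8, 0))
#   ... generate frames ...
#   handle.remove()
```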
Checkpoints
It is common for VJs to store presets and positions in a video that they want to return to later. To support this, we assign every button of the controller to a latent vector that can be overwritten. It is also possible to take the average of multiple stored latent vectors, mixing and matching saved presets.
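A minimal sketch of this checkpoint mechanism, with hypothetical button identifiers and in-memory storage:

```python
import numpy as np

class Checkpoints:
    """Store one latent vector per controller button and blend saved presets."""

    def __init__(self):
        self.slots = {}                          # button id -> latent vector

    def store(self, button, latent):
        self.slots[button] = np.array(latent)    # overwrite any previous preset

    def recall(self, button):
        return self.slots[button]

    def blend(self, buttons):
        """Average the latents stored on the given buttons."""
        return np.mean([self.slots[b] for b in buttons], axis=0)
```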