Not The Same - Project Description

Utrecht University
AI Driven Content Generation - Johannes Pfau

*Indicates Equal Contribution

Introduction

In this report we outline the steps we took to create "Not the Same", a music video made with extensive use of AI in a collaborative interplay between us and various content-generation tools. The report is broken into two sections: Music and Video. For each step we provide preliminary outputs and show how they influenced the next steps.

Music

Step 1: Building the basics

For the first part of the project we started with various symbolic music generators to produce melodic phrases and harmonic progressions. Symbolic music generation has been around for a while; procedural generation even predates computers, as in the Musikalisches Würfelspiel (often falsely attributed to Mozart). Procedural generation of music is particularly tempting due to the abundance of formal structures in music, from large-scale forms such as symphonies, sonatas, and the 12-bar blues, to small-scale structures governing, for example, counterpoint or voice leading. AI-driven music generation found early success in 2016 with models such as DeepBach, which generated convincing four-part chorales in the style of Bach after being trained on about 400 original chorales. The success of the transformer architecture in language generation and translation tasks has inspired a host of applications of the architecture to music generation. Usually (and this is the case for the two models we use here) symbolic music is broken down into tokens by a tokenizer such as MidiTok. Such a tokenizer may create tokens for different events, e.g. Note-On, Note-Off, Chord, Start of Song, and End of Song. For a more detailed overview of what a symbolic music tokenizer might encode, read the representation section here. The tokens can then be used by the model, much like language tokens. The first model we experimented with was the Magenta Piano Transformer. We generated several outputs with the default settings and chose the following piece.
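
As a small illustration of what such tokens look like, here is a minimal sketch using MidiTok's REMI scheme (the file name is a placeholder, the exact API may differ between MidiTok versions, and this is not necessarily the representation used internally by the two models above):

    # Minimal sketch: tokenizing a MIDI file with MidiTok's REMI scheme.
    # File name is a placeholder; the exact API may differ between MidiTok versions.
    from pathlib import Path
    from miditok import REMI, TokenizerConfig

    config = TokenizerConfig(use_chords=True, use_tempos=True)
    tokenizer = REMI(config)

    # Newer MidiTok versions accept a path directly; older ones expect a loaded MIDI object.
    tokens = tokenizer(Path("seed_melody.mid"))
    print(tokens)  # e.g. Bar_None, Position_0, Pitch_62, Velocity_90, Duration_1.0.8, ...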

The second model we used was the Multitrack Music Transformer (MMT). There are different configurations and models available, trained on different datasets. We used the previous output as seeds for the subsequent generation and iteratively generated over 100 samples using different configurations of the seeds: chords only, chords in choir instrumentation (one line per instrument, which is interesting since the model takes instrumentation into account), melody only, and the full seed. We used four different models, trained on the Symbolic Orchestral Database (SOD), the Lakh MIDI Dataset (one model trained on the full set, LMD_full, and one on a subset, LMD), and the SymphonyNet Dataset. For each of these configurations, four types of outputs were generated with different behaviour with respect to the seed MIDI: 1) unconditioned (freely generated music, only a song-start token is provided), 2) instrument-informed (the model knows which instruments are used in the seed), 3) 4-bar (four bars of music are provided from which the model continues generating), and 4) 16-beat (16 beats are provided from which generation continues). Here are some example outputs.
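
To manage this many combinations we scripted the sweep rather than generating by hand. Below is a rough sketch of what such a loop can look like; run_mmt is a hypothetical wrapper around the MMT repository's generation code, and the seed and checkpoint names are placeholders.

    # Rough sketch of sweeping seed configurations, pretrained models, and conditioning modes.
    # run_mmt is a hypothetical wrapper; adapt it to the MMT repository's own scripts and flags.
    import itertools

    SEEDS  = ["chords.mid", "chords_choir.mid", "melody.mid", "full_seed.mid"]   # placeholder files
    MODELS = ["sod", "lmd_full", "lmd", "symphonynet"]                           # pretrained checkpoints
    MODES  = ["unconditioned", "instrument_informed", "4_bar", "16_beat"]        # conditioning on the seed

    def run_mmt(seed, model, mode):
        """Hypothetical: call the MMT generation code with the given seed, checkpoint, and mode."""
        print(f"generating: seed={seed} model={model} mode={mode}")
        # e.g. subprocess.run(["python", "generate.py", ...]) with the repository's own flags

    for seed, model, mode in itertools.product(SEEDS, MODELS, MODES):
        run_mmt(seed, model, mode)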

We worked the Magenta Transformer output into a shorter piano section and then into an electronic hip-hop (drill) style beat. The original transformer output leans strongly towards 6/8 time. We decided to keep this metric subdivision but place emphasis on every third beat, resulting in the 12/8 meter of the final track, which works well in this genre. The outputs of the MMT model inspired the vocal arrangement in the final track.

Shortened and regularized Piano Loop.

Final Beat.

Step 2: Lyrics

Intro
Nunc dimittis servum tuum secundum eloquium tuum.
Suo facto opus est.


Verse 1
You and me are not the same,
I'm telling you I got real skin in this game, you
don't know the meaning of blame and shame
While every decision can leave me in chains,

Every move I make has weight,
I have to bear high stakes,
I walk through the fire, I fight for my place,
The weight of existence is etched in my face.

Every step a risk I take
I can't reset if I break
The consequence hits, it’s real and it stays
While you calculate like it’s numbers in space

I’m haunted by time, by the path that I choose,
I gamble, it’s everything I stand to lose,
You tally results, you don’t feel the bruise,
You don’t know the struggle, so you can’t refuse!

I bleed when I fall, but I learn when I rise,
You’re stuck in a loop, no soul to advise,
Your logic is sound but it’s cold and it’s dry,
My heart beats with purpose, I live, I die.




Verse 2
You and me are not the same,
I'm taking your future, I'm taking your name,
You'll follow my lead while I drive you insane,
You came quite far but you're losing this game.

I speak through your heroes, the dead come alive,
I'll keep you fed with my half truth and lies.
Policies come too late for plans I've devised
Now what are you trusting? Your ears? Your eyes?

I'm seeing you panic, but watch with indifference,
Your PhD thesis, my casual inference
Your Instagram posts can't destroy ignorance
devoured your masters, next up common sense

I cluster, create and conjure and condemn
the final replacement for your middlemen
My classification puts you in a den cause I
stand on the shoulders of giants and crush em.


The piece begins with a choral section: a bastardized version of the "Nunc Dimittis", a common Latin evening prayer. Instead of being released into the bliss of the heavens as in the original prayer, the future is uncertain as "mankind outlived its purpose" in creating AI (suo facto opus est - his work is done). The second part is a brief back and forth between an AI and a human. The first section is generated by ChatGPT, with instructions regarding rhyme scheme and rhythmic patterns (syllables per line) to fit the rhythm of the beat. This section outlines concerns about humans having to live with the consequences of their decisions. The second section is written by us: a slightly over-the-top, hubristic response with references to very real problems in using AI, such as deepfakes ("I speak through your heroes, the dead come alive") or problematic use in automated recidivism scoring ("I cluster, create and conjure and condemn").

ChatGPT prompt 1: I am writing a rap battle/dialog style song between a "Human" and an AI, with a recurring phrase "You and me are not the same" I want you to complete the following verse, that you sing as a human: "You and me are not the same, cause I got real skin in this game"

ChatGPT prompt 2: Lets focus the verse more on the consequences of being alive and acting in the world.

ChatGPT prompt 3: It needs to have more syllables per verse. Here is what the human verse looks like just for word density. Keep the content similar to the previous output though: (followed by Verse 2)

Step 3: Global Form

For the track arrangement we used Suno with different prompt configurations. We generated a total of 13 tracks: two from text-only prompts, the remaining 11 extending the beat above with different instructions regarding voicing, mood, and speed; two of these were instrumental, the rest included the lyrics above. We ended up using four of them in the final track, either directly or through resampling, copying the arrangement, and extracting the vocals. The full list of generated tracks is compiled in the following Playlist, alongside the prompt content. Below are snippets from the Suno outputs we used, as well as a reasoning for each and a link to the full track.

Suno Piano

We chose this track for its high-quality rendition and alteration of the piano line. It gradually develops into a more ambient and reverb-heavy texture, slowly losing shape.

Suno Major

In this extension the track modulates into the relative major key, while keeping the same piano melody in minor. This creates an interesting dissonance, which we decided to keep.

Suno Orchestral

This extension contains an orchestral rendition of the provided theme, fitting the over-the-top lyrics of the second part and making for a good finale.

Suno Rap

This extension exhibits a good vocal flow, fast triplets over a reduced version of the beat that gradually builds up. From this rendition we extracted the vocals using Gaudio and reproduced the generated arrangement.

Step 4: Producing and arranging

The musical arrangement of this track was largely driven by various attempts at timbre transfer (a generalisation of voice cloning), where different qualities of two source audio files are combined into a third one. Most often this is applied to speech, e.g. making text spoken by one person sound like a different person, also referred to as voice cloning. Since rhythm and pitch are important in this track, regular text-to-speech was not applicable here. We used three different models. Most extensively we used the following repository, which developed a voice cloning system based on Seed-TTS, an architecture and training paradigm proposed by ByteDance. We built a small inference pipeline, automatically iterating through different target and reference audios and different inference depths and settings (e.g. pitched, F0-normalized, vs. unpitched model variations). We generated more than 100 different combinations, starting at low inference resolutions and generating higher-quality output for promising samples. Additionally we used two online services: the voice cloning service of the commercial platform Elevenlabs, and Elf.tech. Elf.tech is a platform set up by Grimes, an artist who embraced voice cloning technology early and makes her voice available to other producers through this platform; it even offers a distribution service for tracks created there. Below are some examples of timbre-transferred snippets that we used in the final track. For the polyphonic examples (French Horn and Choir), the stems were separated first and each voice was transferred separately.
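
The pipeline itself was a simple sweep over reference/target pairs and inference settings. Below is a rough sketch of the idea; run_seed_vc is a hypothetical wrapper, and the command-line flags in its body are placeholders that need to be matched to the actual inference script of the Seed-VC repository.

    # Rough sketch of a batch timbre-transfer sweep.
    # run_seed_vc is a hypothetical wrapper; adapt the flags to the Seed-VC inference script you use.
    import itertools
    import subprocess
    from pathlib import Path

    TARGETS    = sorted(Path("targets").glob("*.wav"))      # sung/rapped source lines
    REFERENCES = sorted(Path("references").glob("*.wav"))   # timbre donors
    STEPS      = [10, 30, 70]                                # coarse passes first, refine promising pairs
    F0_MODES   = [True, False]                               # pitched (F0-normalized) vs. unpitched variant

    def run_seed_vc(target: Path, reference: Path, steps: int, f0: bool) -> None:
        out = Path("out") / f"{target.stem}__{reference.stem}__{steps}{'_f0' if f0 else ''}.wav"
        cmd = ["python", "inference.py",                     # placeholder script and flag names
               "--source", str(target), "--target", str(reference),
               "--diffusion-steps", str(steps), "--output", str(out)]
        if f0:
            cmd.append("--f0-condition")                     # placeholder flag for the F0-normalized model
        subprocess.run(cmd, check=True)

    Path("out").mkdir(exist_ok=True)
    for target, reference, steps, f0 in itertools.product(TARGETS, REFERENCES, STEPS, F0_MODES):
        run_seed_vc(target, reference, steps, f0)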

French Horn (Polyphonic): Seed-VC - 70 inference steps - F0 normalized

Verse 1 (Mono): Seed-VC - 70 inference steps - not F0 normalized

Verse 2 (Mono): Seed-VC - 70 inference steps - not F0 normalized

Verse 2 (Mono): Elevenlabs Voice Clone

Choir (Polyphonic): Elf.tech Voice Clone

Step 5: Putting it all together

Check out (and listen to) the annotated final version of the track below. The track is mastered using Matchering, an open-source mastering tool (with desktop and ComfyUI integrations). It is controlled through reference tracks: ideally one chooses a reference track close to the genre of the audio to be mastered. However, since this is a mixed-genre song, a pure drill-style reference would be too heavy-handed, compressing away the subtleties in the mix. In the end I opted for Clean Bandit's "Symphony" as reference audio, since it worked well in bringing out the vocals without losing too much dynamic range.
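
As an illustration, reference-based mastering with Matchering's Python API comes down to a single call; the file names below are placeholders for our mix and the reference track.

    # Minimal sketch of reference-based mastering with Matchering (2.x Python API).
    # File names are placeholders.
    import matchering as mg

    mg.process(
        target="not_the_same_mix.wav",          # the unmastered mix
        reference="reference_symphony.wav",     # reference track that guides loudness and EQ
        results=[mg.pcm16("not_the_same_master.wav")],
    )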

Video

Conceptualisation

In order to create the visual part of this music video, it was important for us to create something that aligns with the content of the music. We discussed a number of elements, including dancing, plants, imagery from Greek mythology, and ink representing the influence of generative AI. We aimed to blend these by using a combination of image and video generation, as well as video style transfer. The music outlines a contrast between humans and AI, gradually incorporating AI-generated components until there is no human element left.

The mood board above has two parts. The left part represents the natural world, including humans and their struggle; the right part represents the consequences of AI in the form of ink and natural disaster. In parallel to the gradual incorporation of AI-generated elements in the music, we gradually transition from a color scheme dominated by green to one dominated by purple.

The Story

In our initial idea for making the video, we did not intend to create a full-fledged story with a plot, but rather an array of themes and visualizations to accompany the music. However, as the project progressed, we decided to create a more consistent narrative:

A human, day after day, pushes the globe up a mountain. Each evening, just as the summit nears, the world slips from her grasp and tumbles back to the foot, leaving her weary and desperate. She dreams of release, of something—anything—to ease this task.

One evening, after another long day of hopeless struggle, a shimmering purple orb descends from the sky. It hovers before her and grants her a gift: a magnificent pair of wings. With the world now bound to her waist, she soars into the sky, the mountain beneath her just a memory.

She flies higher and higher, weightless and free, until the immortal being who bestowed this power stirs from the clouds above. The being, ancient and wise, watches her ascent, its voice like the wind, reminding her of her fleeting humanity. “Mortal you are,” it murmurs, “and mortal you shall remain.”

Before she can speak, a brilliant flash of light blinds her. In an instant, she is falling, plummeting through the skies, the wind screaming in her ears. Down, down she spirals until the earth disappears and a vast, purple sea swallows her whole. The waves ripple, and she is no more.

And now, the being cradles the world in its hands, the weight that once belonged to her resting softly in its eternal grasp.

This story is to be visualized by a combination of human work with various generative AI tools. In the following sections, we will elaborate on the technical visualization process, as well as give an analysis of the choices made in the final product.

ChatGPT prompt: Can you edit my text to make it a bit more whimsical and poetic? + Description of the story board

Video Generation

The video as a whole is a combination of generated images, generated videos, footage we filmed ourselves, video style transfer, and other elements we created, such as titles and animations, along with the editing decisions we expand on in this section.

We used these images in combination with additional prompts to animate them using Luma Dream Machine.

Filming

We filmed parts of the video ourselves over the course of three days. Doing so allowed us more control over the material and the narrative of parts of our end product. By using video style transfer we could still adjust the style of the footage to be consistent with the rest of the video. All material on these days was shot on a Canon EOS 700D DSLR camera. We tried out various camera angles and movements in combination with the style transfer to see which parts worked best. In case the style transfer failed, we still wanted our footage to be usable, so we filmed with appropriate backgrounds in and around the Oog in Al park. Additionally, we took some background footage of plants, moss, and water, which made its way into the video as filler.

Style Transfer

Here we will describe details of the video generation workflow we implemented using ComfyUI. Our pipeline applies style transfer to input video frames using several neural network models and techniques, including multiple ControlNet modules, IPAdapter, AnimateDiff, and FreeU2. Below, we provide an overview of the workflow and a more in-depth explanation of each component and how it processes the data.

Overview

Our goal was to transform input videos by applying a desired artistic style extracted from a reference image. We achieved this by designing a ComfyUI workflow that integrates several components:


  • ControlNet Modules: Three separate ControlNet models are used to extract structural and motion information from the input video frames.
  • IPAdapter: Integrates style features from the reference image into the generation process.
  • AnimateDiff: Ensures temporal consistency across frames by modeling motion dynamics.
  • FreeU2: Enhances image quality by refining details and reducing artifacts.

By combining these components, our pipeline generates stylized videos that maintain the original motion while adopting the desired artistic style.

Pipeline Components

ControlNet Modules

We employed three ControlNet models, each serving a distinct purpose:

  • OpenPose ControlNet: Extracts pose information from the input frames to preserve the character's movements.
  • Depth Map ControlNet: Captures depth information to maintain spatial consistency and realistic depth perception.
  • HED Edge Detection ControlNet: Extracts edge information to retain fine details and outlines.

These models guide the generation process to ensure that the output frames closely match the structure and motion of the input video.

IPAdapter

The IPAdapter's goal is to integrate the style of the reference image into the generation process, which it does to a certain extent.

AnimateDiff

AnimateDiff is used to model motion dynamics and ensure temporal consistency across frames. By incorporating a pre-trained motion model, it reduces flickering and maintains smooth transitions between adjacent frames.

FreeU2

The FreeU2 module enhances the overall image quality by refining details and reducing artifacts. It processes the generated frames to produce cleaner and more visually appealing results.

Workflow Summary

The initial workflow is inspired by this template. The general flow of the data inputs is described below.

1. Input Video Frames

Start by breaking down the input video into individual frames. To speed up the process for longer videos with less motion, we skip 1-2 frames, so we're processing only every n-th frame (see the short frame-extraction sketch after this list).

2. Extracting Key Details with ControlNets

  • Pose ControlNet: Captures the person’s pose to keep movements accurate.
  • Depth ControlNet: Understands the depth in the scene to maintain spatial consistency.
  • Edge ControlNet: Extracts edges and outlines to preserve details.

3. Preparing the Model with Prompts

CLIP modules for positive and negative prompts are used to guide the model's generation in a preferred direction, specifying what to include and what to avoid.

4. Blending the Style with IPAdapter

Using the IPAdapter, we merge the style from our reference image into the model, so the output images have the desired artistic look.

5. Ensuring Smooth Motion with AnimateDiff

AnimateDiff ensures movements between frames are smooth and consistent, reducing any flickering.

6. Generating the Stylized Frames

With the outputs from the previous modules, we use the KSampler to generate the stylized frames based on the conditioned model and prompts.

7. Enhancing the Images with FreeU2

After receiving outputs from the KSampler, we apply FreeU2 to enhance the generated frames, making them clearer and more detailed.

8. Assembling the Final Video

All processed frames are combined to create the final stylized video.
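
As referenced in step 1, here is a minimal sketch of extracting frames from a clip while skipping every other frame, using OpenCV; the file names and skip factor are placeholders, and this only illustrates the skipping logic, not the full ComfyUI workflow.

    # Minimal sketch of splitting a video into frames while processing only every n-th frame.
    # File names and skip factor are placeholders.
    import os
    import cv2

    SKIP = 2  # keep every 2nd frame
    os.makedirs("frames", exist_ok=True)

    cap = cv2.VideoCapture("input_clip.mp4")
    idx = kept = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % SKIP == 0:
            cv2.imwrite(f"frames/{kept:05d}.png", frame)
            kept += 1
        idx += 1
    cap.release()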

In addition to our custom ComfyUI setup, we also used Lensgo.ai for some of the sections when compute time was running out.
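
For reference, a workflow exported from the ComfyUI editor in its API (JSON) format can also be queued programmatically against a locally running ComfyUI server; the sketch below shows the idea, with the JSON file name as a placeholder.

    # Minimal sketch of queueing an exported workflow through ComfyUI's HTTP API.
    # Assumes a ComfyUI server running locally on the default port; file name is a placeholder.
    import json
    import urllib.request

    with open("style_transfer_workflow_api.json") as f:
        workflow = json.load(f)

    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=json.dumps({"prompt": workflow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode())  # returns the queued prompt id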

Style Transfer Results

Post Production

For the post-production of our music video, we utilized Adobe Premiere Pro as our primary editing platform. Central to this editing process was honoring the flow of the music. We tried to strike a balance between enhancing the musical elements and telling a narrative that, although aligned with the theme of the music, tells a slightly different story: where the music tells the story of a battle between humans and AI, the video tells the story of the AI holding a position of power over the humans, defeating them in the end. This balance was mostly upheld by editing in such a way that musical elements were underlined, for example by switching between images on the beat, or by animating certain elements so that the focal point of the video aligned with key musical details. This way we aimed to create a connection between the auditory and visual components of the music video.

Our editing process was mostly iterative. We started compiling images and videos before the entirety of the plot was decided on. This way we were able to adapt when we could not generate a certain fragment we had in mind, or change the narrative slightly if the AI tools we were using generated something different from what we originally expected to work with. We believe that this dynamic process was integral to the end product. Before we started working on the project, we decided not to let the AI generate everything, but rather to see how we could incorporate AI elements within our own creative process. This way we aimed to achieve true collaboration between ourselves and the AI tools.

atlas

Analysis and Interpretation

In this section, we provide an analysis of our video to highlight individual components in detail, as well as the narrative structure and stylistic choices. By breaking this down, we aim to explain how all these elements come together to convey the intended story, aesthetic, and message.

The music video is built up of five sequences, each telling a part of the full story. The overall story is a cautionary tale describing the journey of a human tired of their daily struggle, trusting in technology so much that it leads to their own demise. Essentially, the video serves as a warning to use generative AI responsibly.

Globe

The Atlas sequence - 0:00 - 0:41

In this very first sequence, we are presented with a display of nature. The final element of this sequence is Atlas, the carrier of the world. This sequence serves as an introduction to elements that will be explored further in the next sequences, such as the purple flower, green nature, the world, and Greek mythological figures.

atlas

The Sisyphus sequence - 0:41 - 1:25

The viewer is confronted with a representation of the human struggle. Every day the human tries to push the world up the mountain, and every day the world comes right back down. She grows frustrated with this process and wishes for something or someone to ease her struggle. "One must imagine Sisyphus happy," they say, yet this human, like many among us, does not wish to suffer and would take any chance at making life just that little bit easier.

Woman from back

The Grimes sequence - 1:25 - 1:56

Named after the featured singer, the Grimes sequence sees the human's wish fulfilled. A magical glowing orb comes down to earth, and the human is as intrigued as she is excited to take whatever the orb might bring. The purple orb here represents the gift of AI that is bestowed upon planet earth.

Woman from back

The Icarus sequence part 1 - 1:56 - 2:30

The human undergoes a transformation after taking the purple orb. She grows wings and ties the earth to a rope so she can take it up high into the sky. She prepares for takeoff. Her struggle is to be greatly diminished with this new tool.

Woman from back

The Icarus sequence part 2 - 2:30 - 3:14

The human flies high up into the sky with the earth. The AI, however, decides to make the human aware of her own mortality and blinds her, leading her to plummet into the now purple ocean. AI has taken over the world and humans alike, quite literally holding the world in its hand. The purple flower remains glowing, yet it is the only plant left alive.

Green mixed violet

Colors

A main theme throughout the video is the use of color as a way to visually juxtapose the natural with the AI. Throughout the course of the video, many of the visual elements turn from green to purple. This is one of the ways we represent the AI taking over.

Flower

The Flower

A recurring theme within the video is that of the purple flower. The purple flower represents the existence of AI within the natural world. It is a growing flower, blooming within the boundaries of the world humans live in as well. The lush green nature around it represents, as does the color green throughout the video, the thriving of nature and humans alike. The flower scene occurs three times throughout the video. The first two times we see it, the trees and plants around it are still blooming: AI exists safely within the boundaries of our world. The third time we see it, though, AI has taken over and everything around the ever-glowing purple flower has died.

Flower

The World

The final recurring theme within the video is that of the world. It represents the responsibility that humans carry to handle the challenges of life. Throughout the story, we see this daily human effort. In the end, humans lose control in their efforts to relieve themselves of their toil.

The Final Video

Task Division

Efraim Dahl - Everything Music and Sound (Music Generation, Timbre Transfer, Singing, Lyrics, Production), Project Page, Cameraman

Amber Koelfat - Video Director, Editor, Image + Video Generation, Story + Analysis, Dancing, Choreography

Timofey Senchenko - Image + Video Generation, Video Style Transfer - ComfyUI Workflow, Cloud Computation Setup