
I have been experimenting with various AI-based music generation systems for over a year now, and I hate to say it, but most of the music generated by AI sounds like junk. It’s either too complicated and rambling or too simple and repetitive. It almost never has a pleasing melody with a global structure to the composition.
We have witnessed recent strides in AI-based content creation for other forms of media, like images and text. So it is confounding that the quality of AI-generated music seems so far behind.
When I heard that OpenAI released an API for fine-tuning their GPT-3 text processing models [1], my first thought was: aha, maybe I can use it to create music that doesn’t suck. It took a little work, but I managed to wrangle some training data from music in the public domain and used it to teach GPT-3 how to compose music. I will explain fully what I did in this article, but first, here is a sample that I cherry-picked from five generated songs.
Generated Song — Mr. Clean by The Dishes, Music by GPT-3, Trained by the Author
OK, it’s not great, but at least to my ears, it sounds interesting and, dare I say, good?
Up next is an overview of the system, which I dubbed AI-Tunes.
Overview
Here is a high-level diagram for AI-Tunes. After a brief discussion of each component, I will explain the processing steps in greater detail in the sections below.

Component Diagram of AI-Tunes, Image by Author
I started by downloading the OpenEWLD [2] database of over 500 songs in the public domain. Each song has a melody with corresponding chords. The songs are in a format called MusicXML, which is plain text but carries a lot of formatting overhead. So I used an open-source script called xml2abc [3] to convert the songs to ABC format [4], which is much more compact and therefore better suited to training a text-based Machine Learning (ML) system.
I then used a library from MIT called music21 [5] to process the songs and transpose them to the key of C Major, making it easier for the machine to understand the music. The formatted songs were saved in a training file and uploaded to OpenAI’s fine-tuning service. I trained their GPT-3 Curie transformer model to generate a song after being prompted by a song title and band name.
Although the Curie model generates the music, I am using its big brother, Davinci, to automatically generate a song title and band name. I use these two pieces of data to prompt the fine-tuned Curie model to generate five candidate songs.
I use a package called music_geometry_eval [6] to analyze the tonal qualities of the five songs. The analysis is based on the music theory described in the book “A Geometry of Music” by Dmitri Tymoczko [7]. I then choose the song that has the statistically closest tonal quality to the songs in the training set. Finally, I show the songs as a visual piano-roll and play them.
System Details
The sections below describe details of the components and processes used in AI-Tunes. Be sure to check out more generated songs in the Appendix below.
The OpenEWLD Dataset
One of the keys to getting good results from an ML system is having good training data. In researching music generating systems, I found that authors of papers often do not post their training datasets due to copyright restrictions. Although there are many sites that host user-generated MIDI files of popular songs, the copyrights to these songs are still owned by the authors/publishers. And using copyrighted material for training data is in a legal gray area if the results are to be used commercially.
I found a dataset called OpenEWLD that can be used freely. It is a pared-down version of the Enhanced Wikifonia Leadsheet Dataset (EWLD) [8], a collection of over 5,000 songs in MusicXML format [9]. Note that a “leadsheet” is a simplified song score with just the melody and chords. The OpenEWLD is an extraction of 502 songs in the EWLD that are in the public domain and can be used for training ML systems.
Fair notice: I did not check the provenance of all the songs in the OpenEWLD to make sure that they are actually in the public domain. But a quick perusal of the titles/composers shows a bunch of old-time tunes, like these:
- “Ain’t Misbehavin’ ” by Andy Razaf, Fats Waller, and Harry Brooks
- “Makin’ Whoopee!” by Gus Kahn and Walter Donaldson
- “Maple Leaf Rag” by Scott Joplin
- “Oh! Susanna” by Stephen Foster
- “We’re in the Money” by Al Dubin and Harry Warren
Here is an example of one of the songs in the OpenEWLD collection. You can see a “piano roll” of the song, which is a graph with pitch values of the notes on the y-axis and time on the x-axis. And you can play the song on SoundCloud below.

Real Song — “We’re in the Money” by Al Dubin and Harry Warren, Source: OpenEWLD
The ABC Format
As I mentioned above, songs in the OpenEWLD [8] are in MusicXML format, which works fine, but has a lot of extra formatting text. Although an ML system could learn all of the formatting commands and generate songs in this format, I found that it was better to reduce the musical notation to the bare minimum. The ABC format by Chris Walshaw [4] is a good match. For example, here is the first part of the song “We’re in the Money” in MusicXML and ABC formats.


We’re in the Money by Al Dubin and Harry Warren in MusicXML (left) and ABC formats (right), Source: OpenEWLD
You can see that the song is dramatically simplified when converted to the ABC format. The XML format requires 503 characters for the header and 443 for the first two notes. The ABC format specifies the same information using only 85 characters for the header and 8 characters for the first two notes.
MusicXML to ABC Conversion
I use an open-source script called xml2abc by Wim Vree [3] to convert the songs from MusicXML to ABC format. Here is the command I use for the conversion.
python xml2abc.py were-in-the-money.xml -u -d 4
This will read the file “were-in-the-money.xml”, convert it, and save it as “were-in-the-money.abc”. The -u option will “unroll” any repeated measures, and the -d 4 option will set the default note duration to be a quarter note. Using both of these options helps the machine learning process by standardizing the music scores.
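To convert the whole dataset, you can loop over the files with a few lines of Python. This is just a minimal sketch: it assumes xml2abc.py sits in the working directory and that the MusicXML files live in a folder named OpenEWLD, so adjust the paths to your local layout.

import subprocess
from pathlib import Path

# Convert every MusicXML lead sheet in the dataset folder to ABC.
# -u unrolls repeated measures, -d 4 makes a quarter note the default duration.
# Each .abc file is written next to its source, as described above.
for xml_file in sorted(Path("OpenEWLD").rglob("*.xml")):
    subprocess.run(
        ["python", "xml2abc.py", str(xml_file), "-u", "-d", "4"],
        check=True,
    )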
Here’s the entire “We’re in the Money” song in ABC format.
X:1
T:We're In The Money
C:Harry Warren
L:1/4
M:2/2
I:linebreak $
K:C
V:1 treble
"C" z E G3/2 E/ |"Dm7" F"G7" G3 |"C" z E G3/2 E/ |"Dm7" F"G7" G3 |"C" z e"C+" e3/2 c/ | %5
"F" d c d"Ab7" c |"C" e c"Dm7" c"G7" d |"C" c2 z2 |"C" z E G3/2 E/ |"Dm7" F"G7" G3 | %10
"C" z E G3/2 E/ |"Dm7" F"G7" G3 |$"C" z e"C+" e3/2 c/ |"F" d c d"Ab7" c |"C" e c"Dm7" c"G7" d | %15
"C" c2 z2 |"C" c2 z2 |$ z"Cmaj7" e d/c/B/A/ |"Cmaj7" B B z/ c A/ |"Cmaj7" B B2 c |"Cmaj7" B4 |$ %21
"Cmaj7" z e d/c/B/A/ |"Cmaj7" B B z/ B B/ |"Bb" _B B"A7" A A |"Ab7" _A A"G7" G z |$ %25
"C" z E G3/2 E/ |"Dm7" F"G7" G3 |"C" z E G3/2 E/ |"Dm7" F"G7" G3 |$"C" z e"C+" e3/2 c/ | %30
"F" d c d"Ab7" c |"C" e c2"G7" d |"C" c3 z | %33
Prepping the Songs
I prepare the songs to be used as training data by performing these processing steps:
- Filter out songs that don’t use the 4/4 or 2/2 time signatures
- Transpose songs to the key of C Major
- Strip out lyrics and other unnecessary data
- Replace newlines with a “$” symbol
Unifying the time and key signatures makes it easier for the ML system to learn about the melody notes and timing. A majority of songs use either 4/4 or 2/2 time (381 of 502) so it is not worth the effort to get the system to understand other meters. And most of the songs are already in the key of C Major (303 of 502). I use a function in the music21 package [5] to transpose the songs to a unified key.
I strip out the lyrics to help the system focus on just the notes. I convert the newlines to “$” to accommodate a quirk of the GPT-3 fine-tuning system: otherwise, it converts newlines to commas, which would be a problem for songs in ABC format because commas are already used to indicate notes in lower octaves.
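For the curious, here is a minimal sketch of the filtering and transposing steps using music21. It is simplified (file handling, error handling, and the treatment of minor keys are glossed over) and is not the exact code in the repository.

from music21 import converter, interval, pitch

def prep_song(path):
    # Load a lead sheet, keep it only if it is in 4/4 or 2/2, and transpose it to C.
    score = converter.parse(path)

    time_sigs = list(score.recurse().getElementsByClass('TimeSignature'))
    if not time_sigs or time_sigs[0].ratioString not in ('4/4', '2/2'):
        return None  # skip songs in other meters

    song_key = score.analyze('key')  # estimate the song's key
    shift = interval.Interval(song_key.tonic, pitch.Pitch('C'))
    return score.transpose(shift)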
Fine-tuning GPT-3
In order to fine-tune OpenAI’s GPT-3 models, the first step is to compile a text file that has pairs of prompts and completions in JSON format. For the prompt, I use the song title and the composer’s name. The completion is the prepped song in ABC format.
Here is what two of the prompt-completion pairs look like:
{"prompt": "X: 1 $ T: We're In The Money $ C: Harry Warren $ <song>", "completion": " `C` z E G3/2 E/ |`Dm7` F`G7` G3 |`C` z E G3/2 E/ |`Dm7` F`G7` G3 |`C` z e`C+` e3/2 c/ | $ `F` d c d`G#7` c |`C` e c`Dm7` c`G7` d |`C` c2 z2 |`C` z E G3/2 E/ |`Dm7` F`G7` G3 | $ `C` z E G3/2 E/ |`Dm7` F`G7` G3 |`C` z e`C+` e3/2 c/ |`F` d c d`G#7` c |`C` e c`Dm7` c`G7` d | $ `C` c2 z2 |`C` c2 z2 | z`Cmaj7` e d/c/B/A/ |`Cmaj7` B B z/ c A/ |`Cmaj7` B B2 c |`Cmaj7` B4 | $ `Cmaj7` z e d/c/B/A/ |`Cmaj7` B B z/ B B/ |`Bb` _B B`A7` A A |`G#7` ^G G`G7` =G z | $ `C` z E G3/2 E/ |`Dm7` F`G7` G3 |`C` z E G3/2 E/ |`Dm7` F`G7` G3 |`C` z e`C+` e3/2 c/ | $ `F` d c d`G#7` c |`C` e c2`G7` d |`C` c3 z | $ <end>"} {"prompt": "X: 1 $ T: They Didn't Believe Me $ C: Jerome Kern $ <song>", "completion": " z G A c |`Dm7` d2 d2- |`G7` d e d/>e/c/>d/ |`C` e4- |`Em7b5` e e`A7b9` (3e f g |`Dm7` d2 d2- | $ `G7` d d (3d e f |`C` G2 G2- |`Am7` G G A c |`Dm7` f A B c |`G7` d3/2 B/ c`G7/F` d | $ `Em7` g B c d |`Am` e2 e ^f |`Em` g e/>^f/ g2- |`B7` g ^f =e _e |`Em7``Bb9` e4- | $ `A7` e g`A7b9` f e |`Dm7` d2 d2- |`G7` d d/>e/ (3d/e/d/ c/>d/ |`C6` e2 e2- |`Am7` e c d e | $ `Dm` f2 f2 |`G7` f f e _e |`C6``Bb9` e4- |`A7` e e e/f/ g |`Dm` d2 d2- |`G7` d d d/e/ f | $ `C` G2`Dm7` G2- |`Em7` G G`A7` ^G A |`Dm7` f A B c |`G7` d2`G7b9` e2 |`C` c4- | c z z2 | $ <end>"}
OpenAI provides a utility that checks the validity of the training file. Here is the command I use to check my file.
openai tools fine_tunes.prepare_data -f songs.jsonl
Here is the result.
Analyzing...

- Your file contains 374 prompt-completion pairs
- More than a third of your `completion` column/key is uppercase.
- All prompts end with suffix ` $ <song>`
- All prompts start with prefix `X: 1 $ T: `
- All completions end with suffix ` | $ <end>`

Based on the analysis we will perform the following actions:

- [Recommended] Lowercase all your data in column/key `completion` [Y/n]: n
It only flags one issue, which really isn’t a problem. The prepare_data script noticed that a lot of the text in the completions is uppercase. That’s because chord roots and many of the note names are written with uppercase letters. The warning is probably meant for conversational text, not music, so I ignored it.
Once the training file is in good shape, it’s easy to fine-tune GPT-3. Here is the command:
openai api fine_tunes.create -t songs.jsonl -m curie --n_epochs 5
I chose Curie, the largest GPT-3 model currently available for fine-tuning. It’s not as big as Davinci, but it seems to work well. I also set the number of training epochs to 5, which is the number of passes through the dataset during training.
Here is the result of the training.
Created fine-tune: ft-Vk1UCsXpd65sXXayafTGAY0m
Streaming events until fine-tuning is complete... (Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2021-08-29 12:10:50] Fine-tune enqueued. Queue number: 0
[2021-08-29 12:10:53] Fine-tune started
[2021-08-29 12:12:55] Completed epoch 1/5
[2021-08-29 12:13:41] Completed epoch 2/5
[2021-08-29 12:14:27] Completed epoch 3/5
[2021-08-29 12:15:13] Completed epoch 4/5
[2021-08-29 12:15:59] Completed epoch 5/5
[2021-08-29 12:17:09] Fine-tune succeeded
Job complete! Status: succeeded 🎉
As you can see, it only took about six minutes to run the training. The team at OpenAI gets extra credit for showing me a celebration emoji when the training is finished!
Before we check out how well the model creates songs, I will show you how I create new song titles and band names as prompts, using GPT-3 Davinci.
Generating New Song Titles and Band Names
As you saw in the training data samples above, each line has a prompt with the song title and the composer name, followed by the song in ABC format. In order to generate new songs, I create a new prompt with a new song title and band name. Any text would do, but I think it’s fun to see if the system can create a song when prompted with new information, like a fresh song title and a fake band name. Here’s an example prompt.
"prompt": "X: 1 $ T: Expensive to Maintain $ C: Shaky Pies $ <song>"
In order to make a lot of prompts like this, I put the GPT-3 Davinci model to work. No fine-tuning is necessary; Davinci happily continues a prompt like the one below. Note that I filled in fake band names and song titles to establish the pattern.
Create a new song title and a new band name. Be creative!

Band name: The Execs
Song title: Company Meeting
###
Band name: The One Chords
Song title: Moving Down to Burlington
###
And here are some sample results from GPT-3 Davinci.
Band name: The Wizards
Song title: I’ll Get There When I Get There
###
Band name: The Undergrads
Song title: I'm Just a Kid
###
Band name: The Fortunes
Song title: What Do You Want from Me?
###
Looks like some fun songs! By the way, some of these generated song titles and/or band names might exist out in the real world, but if so, it’s OK. I am just using these to prompt the song-writing model.
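Here is roughly how a prompt like that can be sent to Davinci with OpenAI’s Python client. The API key and the sampling parameters (max_tokens, temperature) are placeholders; tune them to taste.

import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

few_shot_prompt = """Create a new song title and a new band name. Be creative!

Band name: The Execs
Song title: Company Meeting
###
Band name: The One Chords
Song title: Moving Down to Burlington
###
"""

# Ask Davinci to continue the pattern; stop at the ### separator.
response = openai.Completion.create(
    engine="davinci",
    prompt=few_shot_prompt,
    max_tokens=30,
    temperature=0.9,
    stop="###",
)
print(response.choices[0].text)  # e.g. a new "Band name: ... / Song title: ..." pair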
Creating New Songs
Now that we have some prompts, let’s see what the model can do. I generated five versions of the first song and chose the best one.
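Each batch of candidates comes from prompting the fine-tuned model. The sketch below shows the general idea; the model ID is a placeholder for the one returned by fine_tunes.create, and the sampling parameters are illustrative.

import openai

prompt = "X: 1 $ T: I'll Get There When I Get There $ C: The Wizards $ <song>"

# Generate five candidate songs from the fine-tuned Curie model.
response = openai.Completion.create(
    model="curie:ft-your-org-2021-08-29-12-17-09",  # placeholder fine-tuned model ID
    prompt=prompt,
    n=5,
    max_tokens=500,
    temperature=0.75,
    stop="<end>",
)
candidates = [choice.text for choice in response.choices]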

Generated Song — “I’ll Get There When I Get There” by The Wizards, Music by AI-Tunes
OK, the melody is fairly simple and it sounds pretty good. Note the interesting structure which seems to be ABABCB. You can see and hear more generated songs in the Appendix below.
Evaluating Song Tonality
For that last test, I acted as the critic: I listened to all five generated versions of “I’ll Get There When I Get There” and chose the best one. Note that there were some clunkers in the batch. Some of them started off OK but veered into playing odd notes, and others simply repeated a phrase over and over without much variation.
Given that the system can crank out many versions of these tunes, I looked into using statistics to help weed out the clunkers automatically. It turns out that a lot of research has been done on measuring the tonal qualities of music.
The book I mentioned above by Dmitri Tymoczko has the full title, “A Geometry of Music: Harmony and Counterpoint in the Extended Common Practice” [7]. In the book, Tymoczko discusses five features of music tonality.
[These] five features are present in a wide range of [music] genres, Western and non-Western, past and present, and … they jointly contribute to a sense of tonality:
1. Conjunct melodic motion. Melodies tend to move by short distances from note to note.
2. Acoustic consonance. Consonant harmonies are preferred to dissonant harmonies and tend to be used at points of musical stability.
3. Harmonic consistency. The harmonies in a passage of music, whatever they may be, tend to be structurally similar to one another.
4. Limited macroharmony. I use the term “macroharmony” to refer to the total collection of notes heard over moderate spans of musical time. Tonal music tends to use relatively small macroharmonies, often involving five to eight notes.
5. Centricity. Over moderate spans of musical time, one note is heard as being more prominent than the others, appearing more frequently and serving as a goal of musical motion.
– Dmitri Tymoczko in “A Geometry of Music”
I found an open-source project on GitHub called music-geometry-eval [6] that has Python code to assess three of Tymoczko’s tonal features, conjunct melodic motion, limited macroharmony, and centricity.
I ran all 374 songs in my training data through the code to find the average and standard deviation of the three metrics. Here are the results:
Conjunct Melodic Motion (CMM): 2.2715 ± 0.4831
Limited Macroharmony (LM)    : 2.0305 ± 0.5386
Centricity (CENT)            : 0.3042 ± 0.0891
And here are the stats from the five generated versions of “I’ll Get There When I Get There.” I also calculated a Normalized Distance to the Mean (NDM) value for each of the five songs, comparing the metrics from each generated song to the average metrics of the songs in the training dataset.
Generating Song Version 0
CMM : 2.3385
LM  : 3.5488
CENT: 0.5213
NDM : 8.1677

Generating Song Version 1
CMM : 3.828
LM  : 2.3396
CENT: 0.2677
NDM : 10.7161

Generating Song Version 2
CMM : 3.124
LM  : 1.5614
CENT: 0.2244
NDM : 3.8996  <-- Closest tonality to the training data

Generating Song Version 3
CMM : 2.0206
LM  : 3.4195
CENT: 0.4869
NDM : 7.0639

Generating Song Version 4
CMM : 3.2644
LM  : 1.4132
CENT: 0.2436
NDM : 5.5533
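One straightforward way to compute a score like this is to measure how far a song’s metrics sit from the training-set mean, scaling each metric by its standard deviation. The sketch below is illustrative only; see the repository for the exact NDM calculation, so these numbers will not necessarily match the table above.

import math

# Training-set statistics from above: metric -> (mean, standard deviation)
TRAIN_STATS = {"CMM": (2.2715, 0.4831), "LM": (2.0305, 0.5386), "CENT": (0.3042, 0.0891)}

def distance_to_training_mean(metrics):
    # Sum the squared z-scores of the three metrics and take the square root.
    total = 0.0
    for name, (mean, std) in TRAIN_STATS.items():
        z = (metrics[name] - mean) / std
        total += z * z
    return math.sqrt(total)

# Usage: distance_to_training_mean({"CMM": ..., "LM": ..., "CENT": ...})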
And sure enough, the song I chose to be “the best” of the batch, Version 2, also happens to have the best NDM score. Note that this happens often, but it’s not always the case. After running this experiment about a dozen times, I find that sometimes the song with the second or third closest NDM score actually sounds the best.
Discussion
The AI-Tunes system works fairly well. Not every piece is good, but it often produces interesting music with recognizable themes and variations.
If you know about music composition, you may notice that some of the pieces are in need of a little “clean up”. For example, sometimes the system does not strictly adhere to 4/4 time due to extra eighth notes inserted here and there. (Hint: Try tapping your foot to the beat when listening to the generated music.)
The good news is that you can download the generated compositions as MIDI files and fix them up in notation software fairly easily. For example, here is a cleaned-up version of the example song.

Cleaned up Example, Image by Author
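If you want to do the ABC-to-MIDI conversion yourself, music21 can handle it in a couple of lines. The file names here are just examples, and the ABC file is assumed to have its “$” markers converted back to newlines.

from music21 import converter

# generated_song.abc holds one generated song in standard ABC notation.
score = converter.parse("generated_song.abc")
score.write("midi", fp="generated_song.mid")  # export to MIDI for editing in notation software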
As for the general quality of the compositions, there is definitely room for improvement. For example, if OpenAI makes fine-tuning available for their larger Davinci transformer, the resulting music would probably improve.
Also, training the system on a larger dataset of music would definitely help. And it would probably update the style of music to something from this millennium 😄.
Implementing Tymoczko’s other two tonal features, acoustic consonance and harmonic consistency, would also help in assessing the generated results.
Next Steps
It would be fairly easy to extend the AI-Tunes model to include features like musical phrase completion and chord-conditioned generation.
For phrase completion, the training set would need to contain song parts, with one or more measures from the original song in the prompt, and the response would contain one or more measures that pick up from that point in the song. When running the system to generate new parts, a set of preceding chords and melody would be passed in, and the system would complete the musical phrase.
For chord conditioning, the prompt for training would contain just the chords of the original song in ABC format, and the expected response would be the original melody. When generating music, just the chords would be passed in, and the system would generate a melody that would match the chords.
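For illustration, a chord-conditioned training pair might look something like this (a made-up example in the same format as the training data above, not one of the actual training songs):

{"prompt": "X: 1 $ T: Expensive to Maintain $ C: Shaky Pies $ `C` | `Am7` | `Dm7` | `G7` | <song>", "completion": " `C` E G c G |`Am7` c B A G |`Dm7` F A d c |`G7` B4 | $ <end>"}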
Source Code and Colab
All source code for this project is available on GitHub. You can experiment with the code using this Google Colab. This Colab only works if you have an account with OpenAI. If you don’t have an account you can sign up here. I released the source code under the CC BY-SA license.

Acknowledgments
I would like to thank Jennifer Lim and Oliver Strimpel for their help with this project. And I would like to thank Georgina Lewis at the MIT Libraries for helping me track down a copy of “A Geometry of Music.”
References
[1] OpenAI, GPT Fine-tunes API (2021)
[2] F. Simonetta, OpenEWLD (2017)
[3] W. Vree, xml2abc (2012)
[4] C. Walshaw, ABC Notation (1997)
[5] MIT, music21 (2010)
[6] S. G. Valencia, music-geometry-eval (2017)
[7] D. Tymoczko, A Geometry of Music: Harmony and Counterpoint in the Extended Common Practice (2010), Oxford Studies in Music Theory
[8] F. Simonetta, Enhanced Wikifonia Leadsheet Dataset, (2018)
[9] M. Good, MusicXML (2004)
Appendix
Here you can find some more examples of songs generated by AI-Tunes.
Original post: https://towardsdatascience.com/ai-tunes-creating-new-songs-with-artificial-intelligence-4fb383218146