Bedtime stories generated by AI


While reading Hacker News, I stumbled upon this article (link) in which the author describes how he used OpenAI’s GPT-3 API to generate random stories and then used Microsoft Cognitive Services to narrate them. Those are all great services, but at the end of the day you have to pay to use them, and they just work. So I thought: why not try to build something similar using TensorFlow pre-trained models? And here we are today, telling how it all came together.

Enter GPT-Neo

To generate the stories for the podcast, I used a freely available pre-trained model with 1.3B parameters called GPT-Neo, an implementation of a GPT-3-style model that uses the mesh-tensorflow library for training. This model was developed and pre-trained by EleutherAI, which seems to be working on an even bigger model with 20B parameters, pre-trained on a massive 800 GB corpus.

Because GPT-Neo is integrated with the transformers library, all I needed to do was install the library with

$ pip install -q transformers

Then, load the GPT-Neo model in a text-generation pipeline like this

from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')

This will take a few minutes to download and cache the massive model

Downloading: 100%  1.32k/1.32k [00:00<00:00, 26.6kB/s]
Downloading: 100%  4.95G/4.95G [02:39<00:00, 40.8MB/s]
Downloading: 100%  200/200 [00:00<00:00, 5.35kB/s]
Downloading: 100%  779k/779k [00:00<00:00, 648kB/s]
Downloading: 100%  446k/446k [00:00<00:00, 667kB/s]
Downloading: 100%  90.0/90.0 [00:00<00:00, 2.18kB/s]
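
By the way, if you are on a GPU runtime (e.g. Colab with a GPU), you can place the model on the GPU when building the pipeline, which speeds up generation considerably. A minimal tweak, assuming a single GPU at index 0:

# device=0 places the model on the first GPU; omit it to stay on CPU.
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B', device=0)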

And, finally, find a text prompt and let the model generate the rest of the story.

prompt = "It is raining heavily today."
story = generator(prompt, do_sample=True, min_length=50, max_length=2000)

On Colab, this will take around 15 minutes to return.
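
The pipeline returns a list with one dict per generated sequence; the text itself lives under the generated_text key:

# Extract the generated story from the pipeline output.
text = story[0]['generated_text']
print(text)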

Unfortunately, getting the model to generate good-quality long text is no easy task. I could not get it to generate more than 2000 tokens because of this issue #273. Also, the model very often gets stuck in a loop where it keeps generating the same sentence over and over

I knew what I had to do. I had to go out there.
I knew I had to get wet.
I knew I had to get wet.
I knew I needed to get wet.
I knew I had to get wet for myself.
I knew I had to get wet for all of my fears.
I knew I had to get wet.
I knew I had to get wet.
I had to get wet.
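
Sampling parameters can dampen this looping behaviour. Here is a sketch of options worth trying (the values are illustrative, not what I tuned for the episodes); the pipeline forwards them to the underlying generate() call:

story = generator(
    prompt,
    do_sample=True,
    max_length=2000,
    temperature=0.9,          # flatten the next-token distribution a bit
    top_p=0.95,               # nucleus sampling
    repetition_penalty=1.2,   # discourage tokens that were already generated
    no_repeat_ngram_size=3,   # hard-block any repeated trigram
)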

In another case, the model generated what seems to be an exact copy of text it had seen during training, like this paragraph

\n\nThe shelter is closed on Wednesday mornings between 12:00 and 16:00, so those who would like to stay there can, as long as we are at the venue, and then go to the nearby supermarket to buy a package of food.\n\nIt would also be great if we could get a place with all the facilities of the shelter so that we could start our preparation.\n\nTo follow our Facebook “I am a Victim” page you can:\n\nhttps://www.facebook.com/pages/IAMAWITHAVICTIM/42651820851097\n\nAnd to sign up for our newsletter you can:\n\nhttps://www.facebook.com/groups/IAMAWITHAVICTIM/\n\nAnd to sign up for our newsletter you can:\n\nhttps://www.facebook.com/groups/IAMAWITHAVICTIM/\n\nAnd for more information you can:\n\nhttp://www.imawithacheap.com\n\nhttps://www.facebook.com/groups/IAMAWITHAVICTIM/\n\nhttps://plus.google.com/+IAMAWITHACHOP/\n\nThanks very much for your help, all those who are ready to help me to stay warm and dry in Naples! And thanks to everyone who signed up for our newsletter.\n\nPlease visit our website:\n\nhttps://www.imawithacheap.com/\n\nAnd our Facebook page:\n\nhttps://www.facebook.com/pages/IAMAWITHAVICTIM/\n\nAnd our Twitter page:\n\nhttps://twitter.com/IAMAWITHAVP\n\nWe will do our best to try to keep that shelter open.\n\nI am very happy that my shelter is not closed. In fact, the weather is not the best, but so far everything is fine.\n\nI am thankful to the people who have already arrived.
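
Since both failure modes (loops and verbatim regurgitation) produce long runs of near-identical sentences, a simple check can flag bad generations before wasting time on speech synthesis. This is a hypothetical helper, not something the pipeline strictly needs:

def looks_degenerate(text, max_dupe_ratio=0.3):
    """Flag stories where too many sentences are exact duplicates."""
    sentences = [s.strip().lower() for s in text.split('.') if s.strip()]
    if not sentences:
        return True
    duplicates = len(sentences) - len(set(sentences))
    return duplicates / len(sentences) > max_dupe_ratio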

Enter TensorFlowTTS

Once the story is generated, the next step is speech synthesis. I found a great repository called TensorFlowTTS. It hosts implementations of many text-to-speech algorithms in TensorFlow and also provides ready-to-use pre-trained models. One caveat: the only voice available is a single female voice. Hence the idea of naming the narrator Ex Machina.

This library is easy to use; it can be installed as follows:

pip install git+https://github.com/TensorSpeech/TensorFlowTTS.git
pip install git+https://github.com/repodiac/german_transliterate.git#egg=german_transliterate

Import the packages

import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import AutoProcessor

Load the pre-trained models

tacotron2 = TFAutoModel.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en", name="tacotron2")
melgan = TFAutoModel.from_pretrained("tensorspeech/tts-melgan-ljspeech-en", name="melgan")
processor = AutoProcessor.from_pretrained("tensorspeech/tts-tacotron2-ljspeech-en")

I define a helper function that takes text and outputs an audio stream

def text2speech(input_text, text2mel_model, vocoder_model):
    # Convert the raw text into a sequence of character/phoneme ids.
    input_ids = processor.text_to_sequence(input_text)
    # text2mel part: Tacotron2 takes input ids, input lengths, and speaker ids.
    _, mel_outputs, stop_token_prediction, alignment_history = text2mel_model.inference(
        tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        tf.convert_to_tensor([len(input_ids)], tf.int32),
        tf.convert_to_tensor([0], dtype=tf.int32),  # speaker id 0 (single-speaker model)
        )
    # vocoder part: MelGAN turns the mel spectrogram into a waveform.
    audio = vocoder_model(mel_outputs)[0, :, 0]
    return mel_outputs.numpy(), alignment_history.numpy(), audio.numpy()

Then finally, I could generate the audio for the story as follows:

story = "One night, . . ."
_, _, audios = text2speech(story, tacotron2, melgan)
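
Note that Tacotron2 tends to struggle with very long inputs, so it is safer to synthesize the story sentence by sentence and concatenate the audio chunks. A minimal sketch, assuming a naive split on periods:

import numpy as np
import soundfile as sf

# Split the story into sentences and synthesize each one separately.
sentences = [s.strip() + '.' for s in story.split('.') if s.strip()]
chunks = [text2speech(s, tacotron2, melgan)[2] for s in sentences]
audio = np.concatenate(chunks)

# The LJSpeech models generate audio at a 22,050 Hz sampling rate.
sf.write('voice.wav', audio, 22050)

The resulting WAV can then be converted to the MP3 expected by the next step, for example with pydub: AudioSegment.from_wav('voice.wav').export('voice.mp3', format='mp3').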

Putting it together

Once the narration is generated, I combine it with a background sound that matches the topic.

I use freesound.org, a website with a great collection of royalty-free sounds that can be used as background.

For instance, for a story about a storm and thunder, I use this audio stream link.

To combine the two audio streams, I use pydub together with the ffmpeg command line, which I call from a helper function

import os
import subprocess
import tempfile

from pydub import AudioSegment


def add_background_track(episode_file, background_file, output, background_volume_diff=20):
    # Temporary files for the processed background and padded episode.
    tempbg = tempfile.mkstemp()[1]
    tempepisode = tempfile.mkstemp()[1]

    episode = AudioSegment.from_mp3(episode_file)
    background = AudioSegment.from_mp3(background_file)

    # Pad the narration with a few seconds of silence on both ends.
    padded_episode = AudioSegment.silent(duration=7000) + episode + AudioSegment.silent(duration=8000)
    padded_episode.export(tempepisode, format='mp3')

    # Cut the background to the episode length, with a fade in and out.
    cut_bg = background[: padded_episode.duration_seconds * 1000].fade_in(3000).fade_out(5000)
    # Lower the background track volume (pydub overloads `-` as dB attenuation).
    lower_volume_cut_bg = cut_bg - background_volume_diff
    lower_volume_cut_bg.export(tempbg, format='mp3')

    # amerge merges the two inputs into one stream; acompressor evens out the levels.
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            tempbg,
            "-i",
            tempepisode,
            "-filter_complex",
            "amerge,acompressor=threshold=-21dB:ratio=12:attack=100:release=500",
            "-ac",
            "2",
            "-c:a",
            "libmp3lame",
            "-q:a",
            "4",
            output,
        ]
    )
    os.unlink(tempbg)
    os.unlink(tempepisode)

Now I can combine both tracks like this

add_background_track('voice.mp3', 'background.mp3', 'episode.mp3')

That’s all folks

You can give the podcast a try; all episodes are published here: https://anchor.fm/exmachina

I would love to hear any feedback, suggestions, or ideas for improvement, so feel free to leave a comment or reach out on Twitter @bachiirc.