By Travis

Audiobook Read-Along: Program Creates Videos From Audiobooks


Due to the COVID-19 pandemic, I found myself with more free time on my hands (especially in the months between graduating and my job's start date). Instead of wasting that time binge-watching Netflix, I wanted to read more books. I like listening to audiobooks when I'm doing something else like driving or cooking, but I prefer reading an actual book/eBook when I'm not multitasking (i.e., right before bed). However, it's hard to keep track of what page I'm on, since my audiobook and my physical book/eBook are not synced up.

Amazon now offers a feature called "Whispersync" in the Kindle app: if you buy both the Kindle eBook and the Audible audiobook, you can sync your progress between the two. This can get pretty pricey, since Kindle eBooks are often priced around $15 and the narration, if it's offered, can be an additional $10. I especially didn't want to pay ~$25/book for books in the public domain, since those should be free.

End Product:

(note: the video renderer is kinda wonky when the video is played non-full-screen, so you should full-screen it for better quality)

My goal was to take free eBooks offered as part of Project Gutenberg and free audiobook recordings created by LibriVox volunteers and merge them into a video so that the viewer can:

  • Read-along with the video and the audio

  • Treat it like an audiobook by turning off their screens

  • Treat it like an eBook by muting the video

  • Synchronize their audiobook and eBook reading

Also, I was thinking about starting a YouTube channel and just posting a bunch of these videos on it so everyone can have access to free audiobook read-along videos.

But I wanted a program to scrape, merge, and create the videos autonomously, so that I didn't have to intervene manually.

How I Made It:

NOTE: I know the coding practices shown below aren't the best. I was just trying to get this done as soon as possible and I didn't spend time cleaning it up.

Libraries used:

  • BeautifulSoup - Webscraping Gutenberg and Librivox

  • Selenium - To take a HUGE zoomed-in screenshot of the eBook webpages.

  • PyTesseract - Optical Character Recognition for the webpage screenshots.

  • Deepspeech - Speech-to-text for the Librivox audio

  • Difflib - Fuzzy string matching

  • MoviePy - Programmatically create videos

1. Webscrape the top 100 eBooks of the past month on Gutenberg

To make videos that people would actually watch/listen to, I needed to find which eBooks are the most popular. Luckily enough, Gutenberg has a list of them near the bottom of this page.

I wrote a tiny Python script to scrape the ebook_id from each hyperlink and store them in a list.
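The scraper amounted to a few lines of BeautifulSoup. Here's a minimal sketch; the URL is Gutenberg's "Top 100 EBooks (last 30 days)" page, and the exact anchor format (`/ebooks/<id>`) is an assumption about the page's markup:

```python
import re
import urllib.request

from bs4 import BeautifulSoup

TOP_URL = "https://www.gutenberg.org/browse/scores/top"

def extract_ebook_ids(html):
    """Pull the numeric ebook_id out of every /ebooks/<id> hyperlink."""
    soup = BeautifulSoup(html, "html.parser")
    return [int(a["href"].rsplit("/", 1)[1])
            for a in soup.find_all("a", href=re.compile(r"^/ebooks/\d+$"))]

def top_ebook_ids():
    with urllib.request.urlopen(TOP_URL) as resp:
        return extract_ebook_ids(resp.read().decode("utf-8"))
```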

2. Get information about the book

Luckily enough, Librivox has a beta REST API that accepts queries using the title of the book.

Then, I stored the returned book metadata in a database.
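Querying the API is a single GET request. A rough sketch with just the standard library (the `books` key and the field names in my fake payload below are assumptions about the response shape):

```python
import json
import urllib.parse
import urllib.request

API = "https://librivox.org/api/feed/audiobooks"

def query_librivox(title):
    """Ask the LibriVox beta API for audiobooks matching a title."""
    url = API + "?format=json&title=" + urllib.parse.quote(title)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def first_match(payload):
    """Results come back nested under "books"; take the first hit, if any."""
    books = payload.get("books", [])
    return books[0] if books else None
```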

3. Download audiobook zip file from Librivox and unzip it

Python has built-in packages for zipping and unzipping files.
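With `urllib` and `zipfile` from the standard library, this step is short. A sketch (the archive file name and directory layout are my own choices):

```python
import os
import urllib.request
import zipfile

def download_and_unzip(zip_url, dest_dir):
    """Fetch the audiobook archive and extract it with the stdlib
    zipfile module. The archive name "audiobook.zip" is arbitrary."""
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, "audiobook.zip")
    urllib.request.urlretrieve(zip_url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
    return dest_dir
```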

4. Screenshot the Gutenberg eBook webpage at VERY high resolution

At this point, I had audiobook files and I needed to synchronize the speaking in the audio with the text on the screen. I had two options:

  • I could use the Gutenberg API to get the text and then render frames of what the viewer would see in the video. This would require me to figure out how to format the text, which is a non-trivial task since the API returns plain text (for example, there would be no way to tell where indents or italicized text should go).

(Image caption: text formatting like this would be impossible because the API returns plain text)

  • I could simply screenshot the webpage. Although this would fix the text-formatting issue, it would introduce two new ones: 1. Since the text is stored as an image, I cannot synchronize the scrolling of the video with the audio, because I do not know where the phrase the speaker is saying is located on the screenshot. 2. Since books can be very long, the full-page screenshot would be VERY large and very difficult to work with, since many programs break if a PNG is larger than 2^16 pixels in any direction.

Despite its drawbacks, I went with the second option and dealt with the side effects later.

I probably should have split this into multiple functions, but again, this was just a quick side project. My script uses Selenium to open the webpage, removes the table of contents with a short two-line JavaScript snippet, and zooms in to 250%. From there, it goes into a loop: take a screenshot, scroll down, take another screenshot, append it to the previous one, and repeat. However, Tesseract, the OCR engine I use later, only supports images up to 2^15 - 1 pixels in any direction (hence the constant TESSERACT_MAX_HEIGHT = 32767). So I had to save the stitched image whenever it was close to exceeding that maximum height.
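A sketch of that capture-and-stitch loop. The Firefox driver, the `.toc` CSS selector, and the scroll math are all assumptions, not my original code; the `stitch` helper is the part that enforces Tesseract's height limit:

```python
from PIL import Image

TESSERACT_MAX_HEIGHT = 32767  # Tesseract rejects images taller than 2**15 - 1 px

def stitch(chunks, max_height=TESSERACT_MAX_HEIGHT):
    """Stack screenshot chunks vertically, starting a new page whenever
    adding the next chunk would exceed Tesseract's height limit."""
    pages, current, height = [], [], 0
    for img in chunks:
        if current and height + img.height > max_height:
            pages.append(_paste(current, height))
            current, height = [], 0
        current.append(img)
        height += img.height
    if current:
        pages.append(_paste(current, height))
    return pages

def _paste(chunks, total_height):
    width = max(c.width for c in chunks)
    page = Image.new("RGB", (width, total_height), "white")
    y = 0
    for c in chunks:
        page.paste(c, (0, y))
        y += c.height
    return page

def capture_ebook(url):
    from io import BytesIO
    from selenium import webdriver
    driver = webdriver.Firefox()
    driver.get(url)
    # Drop the table of contents, then zoom way in for sharper OCR.
    driver.execute_script(
        "var toc = document.querySelector('.toc'); if (toc) toc.remove();")
    driver.execute_script("document.body.style.zoom = '250%';")
    viewport = driver.execute_script("return window.innerHeight;")
    total = driver.execute_script("return document.body.scrollHeight;")
    chunks, offset = [], 0
    while offset < total:
        chunks.append(Image.open(BytesIO(driver.get_screenshot_as_png())))
        offset += viewport
        driver.execute_script(f"window.scrollTo(0, {offset});")
    driver.quit()
    return stitch(chunks)
```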

5. Use Optical Character Recognition to locate the text

I used the pytesseract package (a Python wrapper around Google's Tesseract OCR engine) to find the pixel locations of the text within the screenshots captured above. I stored this information as JSON in a file. Was this the best storage mechanism? Definitely not, but I didn't feel like adding another column to my database.
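pytesseract's `image_to_data` returns word-level bounding boxes as a dict of parallel lists (`text`, `left`, `top`, `width`, `height`). A sketch of how I could have turned that into JSON; the record field names are my own:

```python
import json

def words_from_data(data):
    """Convert Tesseract's column-oriented dict (parallel lists) into a
    list of word records with pixel bounding boxes, skipping blanks."""
    return [
        {"text": t.strip(), "x": x, "y": y, "w": w, "h": h}
        for t, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                 data["width"], data["height"])
        if t.strip()
    ]

def locate_words(png_path, out_json):
    import pytesseract  # wraps the tesseract binary
    from PIL import Image
    data = pytesseract.image_to_data(
        Image.open(png_path), output_type=pytesseract.Output.DICT)
    words = words_from_data(data)
    with open(out_json, "w") as f:
        json.dump(words, f)
    return words
```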

6. Use speech-to-text on the audio to figure out when words are spoken.

Now that I knew where the text was located in the screenshots, I needed to know when each word was spoken in the audio. To do that, I needed to know what the speaker was saying. I could have used a cloud solution for this, which would provide more accurate results, but I had a lot of audio files and didn't want to pay to use an API.

I first checked out CMUSphinx, an open source speech recognition toolkit. Unfortunately, the results were pretty inaccurate.

Eventually, I found Mozilla's Project DeepSpeech, which uses Google's TensorFlow to train deep neural networks that convert speech to text. It was much more accurate than the Sphinx implementation.
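DeepSpeech's `sttWithMetadata` returns per-character tokens, each with a `start_time`, so word timings have to be reassembled from them. A sketch under the DeepSpeech 0.9 API; the model file name is an assumption, and `word_timings` is my own helper:

```python
import wave

def word_timings(tokens):
    """Collapse per-character (text, start_time) tokens into
    (word, start_time_of_first_character) pairs."""
    words, chars, start = [], [], None
    for text, t in tokens:
        if text == " ":
            if chars:
                words.append(("".join(chars), start))
                chars, start = [], None
        else:
            if start is None:
                start = t
            chars.append(text)
    if chars:
        words.append(("".join(chars), start))
    return words

def transcribe(wav_path, model_path="deepspeech-0.9.3-models.pbmm"):
    # DeepSpeech expects 16 kHz mono 16-bit PCM audio.
    import numpy as np
    from deepspeech import Model
    model = Model(model_path)
    with wave.open(wav_path) as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), np.int16)
    meta = model.sttWithMetadata(audio, 1)
    tokens = [(tok.text, tok.start_time)
              for tok in meta.transcripts[0].tokens]
    return word_timings(tokens)
```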

7. Match the speech-to-text results with the OCR results

Now that I had where the eBook text is located within a screenshot and when the audiobook words are said, I needed to synchronize the two, so that the video scrolls down with the words the speaker is saying roughly in the middle of the screen.

This was another tricky task, because the OCR has its own errors associated with it (especially where the formatting is weird, like italics) and the speech-to-text has errors too (especially when the speaker used a bad microphone). I was thinking about using the Levenshtein distance to match strings that are alike, but that compares two whole strings against each other; instead, I needed to find the closest substring match within a larger corpus of text.

I ended up using an algorithm I found on Stack Overflow.
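The gist of fuzzy substring matching can be sketched with the stdlib's difflib. This is not the exact Stack Overflow algorithm I used, just a simple windowed version of the same idea: slide a query-sized window over the corpus and keep the highest-scoring window:

```python
from difflib import SequenceMatcher

def best_substring_match(query, corpus):
    """Slide a query-sized window (in words) across the corpus and keep
    the window with the highest SequenceMatcher ratio. O(n*m), but each
    audio chunk's transcript is short, so it's fast enough in practice."""
    words = corpus.split()
    n = max(1, len(query.split()))
    best_score, best_start = -1.0, 0
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        score = SequenceMatcher(None, query, window).ratio()
        if score > best_score:
            best_score, best_start = score, i
    return best_start, best_score
```

This tolerates both the OCR's character-level mistakes and the speech-to-text's misspellings, since `ratio()` rewards long common subsequences rather than exact equality.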

8. Create the video

There are surprisingly few utilities to programmatically create videos. I ended up using moviepy, which is well-documented and fairly fast.

My rendering code takes the word timings from the speech-to-text and the text locations from the OCR, and scrolls to each position within the screenshot over the corresponding span of time.
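A sketch of that renderer with MoviePy 1.x. The function names and the `(seconds, pixel_y)` keyframe format are my assumptions; the core idea is a `make_frame(t)` callback that crops the big stitched screenshot at an interpolated scroll position:

```python
import numpy as np

def scroll_y(t, times, ys, frame_h, max_y):
    """Top edge of the visible frame at time t: linearly interpolate the
    spoken word's pixel position between keyframes, then offset by half
    the frame height so the word sits roughly mid-screen."""
    return int(np.clip(np.interp(t, times, ys) - frame_h // 2, 0, max_y))

def make_scrolling_clip(screenshot, timings, audio_path, size=(1280, 720)):
    """screenshot: HxWx3 numpy array of the stitched page.
    timings: sorted (seconds, pixel_y) pairs from matching the
    speech-to-text words against the OCR words."""
    from moviepy.editor import AudioFileClip, VideoClip
    w, h = size
    times = [t for t, _ in timings]
    ys = [y for _, y in timings]
    max_y = screenshot.shape[0] - h

    def make_frame(t):
        y = scroll_y(t, times, ys, h, max_y)
        return screenshot[y:y + h, :w]

    audio = AudioFileClip(audio_path)
    return VideoClip(make_frame, duration=audio.duration).set_audio(audio)
```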


This project was much more difficult than I expected: before I started, I didn't realize I would need things like OCR and string-matching algorithms just to make a video where the text scrolls with what the speaker is saying. There's a lot more code that wasn't included in this blog post (mostly the boring stuff: creating an intro image with PIL, adding background music, the database access object, etc.).

Overall, I'm pretty happy with the results. If you take a look at the video I posted above, you can see that the words the speaker says stay mostly within the center of the screen.

I think I want to clean the code up before I post the entire code base.