How programming helped me learn German

Recently I got serious about improving my German language skills, so I started following a YouTube channel named Easy German. I like their style: they talk with people on the street about different topics, which I found really helpful for improving my listening, vocabulary, and speaking from real-life conversation. But the best thing about their videos is that they show the conversation as subtitles in both German and English, which makes it easy to see the German words next to their English equivalents.

I felt it would be a great help to have those subtitles with me while watching the videos, and that they would also be useful for speaking practice on my own. I tried to download the auto-generated captions from YouTube by copy-pasting as well as with different third-party online tools, but all I got was a huge blob of text without any punctuation or line breaks. Also, the auto-generated captions were only available in German.

So, I decided to write a small script to get the subtitles in both languages side by side, so that they would be useful for my practice. This is what I did to extract them:

  1. Download the YouTube video.
  2. Extract images from the video.
  3. Extract text from each image using OCR.

Step 1: Download the YouTube video

I used the Python tool youtube-dl for this purpose; their GitHub repo is here. To download a video, I used this command:

youtube-dl -o '%(title)s.%(ext)s' --restrict-filenames https://www.youtube.com/watch?v=ZqObBG-NYPI

This downloads the video and names the file after the video title, without any special characters or spaces, which is what the --restrict-filenames parameter is for.
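The same download can also be done from Python through youtube-dl's embedded API. This is a minimal sketch, assuming the options below mirror the two command-line flags above:

import youtube_dl

# same options as the command line: -o '%(title)s.%(ext)s' --restrict-filenames
ydl_opts = {
    'outtmpl': '%(title)s.%(ext)s',
    'restrictfilenames': True,
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://www.youtube.com/watch?v=ZqObBG-NYPI'])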

Step 2: Extract images from the video

I wrote the following script to extract one image for every second of the downloaded video file:

import cv2
import math
import os

filename_template = 'easy_german_138'

ROOT_PATH = '/Users/sparrow/Learning/python/tesseract-ocr'
IMAGE_DIR = os.path.join(ROOT_PATH, 'images')
VIDEO_DIR = os.path.join(ROOT_PATH, 'videos')
video_file = 'How_to_learn_a_new_language_with_Luca_from_The_Polyglot_Dream_Easy_German_138.mkv'

cap = cv2.VideoCapture(os.path.join(VIDEO_DIR, video_file))
frame_rate = cap.get(cv2.CAP_PROP_FPS) #frame rate

i = 0
while cap.isOpened():
    frame_id = cap.get(cv2.CAP_PROP_POS_FRAMES)  # index of the frame about to be read
    status, frame = cap.read()
    if not status:
        break

    # keep roughly one frame for every second of video
    if frame_id % math.floor(frame_rate) == 0:
        filename = os.path.join(IMAGE_DIR, filename_template+'_'+str(i)+'.jpg')
        print('writing file', filename)
        cv2.imwrite(filename, frame)
        i += 1

cap.release()
print(i, 'files written')

An example extracted image
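As a quick sanity check, the number of images written should be close to the length of the video in seconds. This is a small sketch that reuses the variables from the script above to compare the two:

# rough check: one image per second should roughly match the video duration
cap = cv2.VideoCapture(os.path.join(VIDEO_DIR, video_file))
total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

print('video is about', int(total_frames / fps), 'seconds long,', i, 'images written')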

Step 3: Extract text from images using OCR

Now that I have the images extracted from the video, I need to extract the text from them. For this purpose I used a Python wrapper for the Tesseract OCR engine called pytesseract. The GitHub repo is here.

Then I wrote the following script to extract text from the images and export it to a CSV file.

from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image_path", required=True, help="path to input images to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh", help="type of preprocessing to be done")
args = vars(ap.parse_args())

image_files = [f for f in os.listdir(args["image_path"]) if (os.path.isfile(os.path.join(args["image_path"], f)) and ('.jpg' in f))]

# Use this dictionary to avoid saving the same
# sentence twice, since a subtitle can stay visible
# in the video for more than one second
conversations = {}
csv_conversations = ''

# How many conversations to parse at most (0 means no limit)
limit = 0

for image_file in image_files:
    # load the example image and convert it to grayscale
    image = cv2.imread(os.path.join(args["image_path"], image_file))
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # check to see if we should apply thresholding to preprocess the
    # image
    if args["preprocess"] == "thresh":
        gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

    # make a check to see if median blurring should be done to remove
    # noise
    elif args["preprocess"] == "blur":
        gray = cv2.medianBlur(gray, 3)

    # write the grayscale image to disk as a temporary file so we can
    # apply OCR to it
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)

    # load the image as a PIL/Pillow image, apply OCR, and then delete
    # the temporary file
    text = pytesseract.image_to_string(Image.open(filename))
    os.remove(filename)
    lines = text.split('\n')

    if len(lines) < 2:
        continue

    for index, line in enumerate(lines):
        line = line.strip()  # trim surrounding whitespace from the OCR output
        # If the line is too short, it is not part
        # of the conversation.
        if len(line) < 16:
            continue

        if (index < len(lines) - 1) and (line not in conversations):
            next_line = lines[index+1].strip()
            if len(next_line) < 16:
                continue

            conversations[line] = next_line
            csv_conversations += line + "|" + next_line + "\n"
            break

    if limit > 0 and len(conversations) >= limit:
        break

    if len(conversations) % 10 == 0:
        print('Total', len(conversations), 'conversation lines have been parsed')

with open('output/How_to_learn_a_new_language_with_Luca_from_The_Polyglot_Dream_Easy_German_138.csv', 'w') as csv_file:
    csv_file.write(csv_conversations)
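One note on the OCR call: by default Tesseract uses its English model, and pytesseract's image_to_string also accepts a lang argument. Assuming Tesseract's German language data (deu.traineddata) is installed, a call like this might handle umlauts and ß better (I kept the default in the script above):

# assumes Tesseract's German language data (deu.traineddata) is installed
text = pytesseract.image_to_string(Image.open(filename), lang='deu')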

These are some sample subtitles I got after the script finished executing:

German | English
auf der StraBe und bei uns ist Luca. Hallo Luca! — Hallo! | on the street and with us is Luca. Hello Luca! - Hello!
die man in der Schule lernt. - Englisch, ja? | that you learn in school. - English, yeh?
Englisch, Franzésisch hab ich gelernt und Spanisch. | I‘ve learned English, French and Spanish.
Hast du vor auch andere Fremdsprachen zu Iernen | Are you planning to learn other foreign languages too

The OCR is not always perfect because the background of the images is quite noisy, but that is okay. Luckily I have enough German proficiency to notice when something is missing or a word does not make sense, and looking those words up myself turned out to be helpful for my learning as well.
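Since the output file is just pipe-separated German|English pairs, a few extra lines can turn it into a simple self-quiz for speaking practice. A rough sketch, reading the CSV file written by the script above:

# show the German line, wait for Enter, then reveal the English line
csv_path = 'output/How_to_learn_a_new_language_with_Luca_from_The_Polyglot_Dream_Easy_German_138.csv'

with open(csv_path) as csv_file:
    for row in csv_file:
        if '|' not in row:
            continue
        german, english = row.rstrip('\n').split('|', 1)
        print('DE:', german)
        input('Press Enter for the English line...')
        print('EN:', english)
        print()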

I got huge help from this blog post by Adrian Rosebrock.
