Find movie overview similarity using gensim


I am working on building a movie recommendation system that will understand my taste in movies and recommend accordingly. The typical movie recommendation systems available now do not work very well for me. I guess it is because they mostly match the genres of the movies I watched and recommend movies from those same genres.

I feel that looking at genres alone is not enough to recommend movies. A recommender should include multiple features, such as:

  1. Director and cast
  2. Studios
  3. Genres
  4. Overview (short plot)
  5. ...

I think the overview can play an important role in predicting what kind of movie a person might like. Let's have a look at the short overviews of three different movies, copied from IMDb:

A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival. (Interstellar)

An astronaut becomes stranded on Mars after his team assume him dead, and must rely on his ingenuity to find a way to signal to Earth that he is alive. (The Martian)

Two astronauts work together to survive after an accident which leaves them stranded in space. (Gravity)

Any intelligent movie recommendation system should be able to look at those short movie overviews and recommend me the movie Europa Report, which has this overview:

An international crew of astronauts undertakes a privately funded mission to search for life on Jupiter's fourth largest moon.

I don't know how far I will get, but it sounds like real fun! So, as a part of that movie recommendation system, I have built the part that finds the similarity between movie overviews using gensim (https://radimrehurek.com/gensim/).

Before this, I had downloaded some movie information from The Movie DB (https://www.themoviedb.org/), which offers a free API.
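To give an idea of how such data can be fetched, here is a minimal sketch using the requests library against TMDB's movie search endpoint; the API key is a placeholder, and this is not the exact download script I used:

import requests

TMDB_API_KEY = 'your-api-key'  # placeholder; register at themoviedb.org to get one

def fetch_movie(title):
    # Search TMDB by title and return the first matching movie, if any
    resp = requests.get('https://api.themoviedb.org/3/search/movie',
                        params={'api_key': TMDB_API_KEY, 'query': title})
    resp.raise_for_status()
    results = resp.json().get('results', [])
    return results[0] if results else None

movie = fetch_movie('Interstellar')
if movie:
    print(movie['original_title'], '-', movie['overview'])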

The official gensim tutorials helped a lot in understanding the whole process.

Import all necessary modules

import os
import gensim
import pandas as pd
import traceback
from datetime import datetime

Import and initialize pymongo and the MongoDB connection

We will use MongoDB to save the training movies' info in the watched and favorites collections for future reference.

import pymongo
from pymongo import MongoClient

client = MongoClient('localhost', 27017)

db = client.tmdb

Some MongoDB helper methods

def select_movies(conditions):
    # Fetch movie documents matching the given conditions
    return db.movies.find(conditions)

def insert_favs(favs):
    # Upsert favorite movies keyed by movie_id: created_on is set only on
    # the first insert, updated_on is refreshed on every call
    try:
        for fav in favs:
            fav['created_on'] = datetime.now()

            keys = {'movie_id': fav['movie_id']}
            values = {
                '$set': {'updated_on': datetime.now()},
                '$setOnInsert': fav
            }
            db.favs.update_one(keys, values, upsert=True)
    except KeyError:
        print(fav)
        traceback.print_exc()

def insert_watched(watched):
    # Same upsert pattern as insert_favs, for the watched collection
    try:
        for watch in watched:
            watch['created_on'] = datetime.now()

            keys = {'movie_id': watch['movie_id']}
            values = {
                '$set': {'updated_on': datetime.now()},
                '$setOnInsert': watch
            }
            db.watched.update_one(keys, values, upsert=True)
    except KeyError:
        print(watch)
        traceback.print_exc()
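Because of upsert=True, it is safe to call these helpers repeatedly with the same movie_id: the document is inserted only once (with created_on set), and later calls just refresh updated_on. A quick sanity check, with a made-up movie_id:

insert_favs([{'movie_id': 603}])
insert_favs([{'movie_id': 603}])  # no duplicate is created, only updated_on changes

print(db.favs.count_documents({'movie_id': 603}))  # 1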

Text cleanup helper method

import string
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once
stop_words = set(stopwords.words('english'))
stoplist = ['–']  # extra tokens that string.punctuation does not cover

def text_process(text):
    # Strip punctuation, lowercase everything and drop English stopwords
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [word.lower() for word in text.split()
             if word.lower() not in stop_words and word.lower() not in stoplist]
    return ' '.join(words)
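For example, running the Interstellar overview from above through text_process should print something like:

print(text_process("A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."))
# team explorers travel wormhole space attempt ensure humanitys survival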

Initialize necessary paths and variables

PROJECT_DIR = '/Users/sparrow/Learning/machine-learning/tf-movie-detection-from-trailer'
ROOT_DIR = os.path.join(PROJECT_DIR, 'tmdb-api')
DATASET_DIR = '/srv/downloads/moshfiqur-ml-datasets/tmdb'
MODELS_DIR = os.path.join(ROOT_DIR, 'models')

The genre list was collected from the TMDB API

genres_list = [{'id': 28, 'name': 'Action'},
               {'id': 12, 'name': 'Adventure'},
               {'id': 16, 'name': 'Animation'},
               {'id': 35, 'name': 'Comedy'},
               {'id': 80, 'name': 'Crime'},
               {'id': 99, 'name': 'Documentary'},
               {'id': 18, 'name': 'Drama'},
               {'id': 10751, 'name': 'Family'},
               {'id': 14, 'name': 'Fantasy'},
               {'id': 36, 'name': 'History'},
               {'id': 27, 'name': 'Horror'},
               {'id': 10402, 'name': 'Music'},
               {'id': 9648, 'name': 'Mystery'},
               {'id': 10749, 'name': 'Romance'},
               {'id': 878, 'name': 'Science Fiction'},
               {'id': 10770, 'name': 'TV Movie'},
               {'id': 53, 'name': 'Thriller'},
               {'id': 10752, 'name': 'War'},
               {'id': 37, 'name': 'Western'}]

genre_dict = {genre['id']: genre['name'] for genre in genres_list}

List of genre IDs of my favorite movies

fav_genres = [12, 16, 14, 27, 9648, 878, 53]

Prepare training dataset

I created a separate list for each genre I like on IMDb: https://www.imdb.com/user/ur67961488/lists

Later, I exported those lists and used them as references to get data from MongoDB. In the sections below, we will process those exported lists, fetch the corresponding data from MongoDB, and prepare the training documents.

# Columns present in the exported IMDb list CSVs
csv_headers = ['Position', 'Const', 'Created', 'Modified', 'Description', 'Title',
               'URL', 'Title Type', 'IMDb Rating', 'Runtime (mins)', 'Year',
               'Genres', 'Num Votes', 'Release Date', 'Directors', 'Your Rating', 'Date Rated']

for genre_id in fav_genres:
    genre = genre_dict[genre_id]

    df_favs = pd.read_csv(os.path.join(DATASET_DIR, 'favs', genre + '.csv'), encoding='latin-1')
    movie_names = df_favs['Title'].tolist()

    movie_overviews = ''

    for movie_name in movie_names:
        conditions = {'original_title': movie_name}
        # Materialize the cursor; Cursor.count() is deprecated in pymongo
        movie_info = list(select_movies(conditions))
        if len(movie_info) == 0:
            print('No info found for movie:', movie_name, 'genre:', genre)
            continue

        insert_favs([{'movie_id': movie_info[0]['_id']}])
        insert_watched([{'movie_id': movie_info[0]['_id']}])

        overview = text_process(movie_info[0]['overview'])
        movie_overviews += ' ' + overview

    if not os.path.isdir(os.path.join(DATASET_DIR, 'data', 'train', genre)):
        os.mkdir(os.path.join(DATASET_DIR, 'data', 'train', genre))

    with open(os.path.join(DATASET_DIR, 'data', 'train', genre, 'overview.txt'), 'wb') as f:
        f.write(movie_overviews.strip().encode('utf-8'))

No info found for movie: Spirited Away genre: Animation
No info found for movie: My Neighbor Totoro genre: Animation
No info found for movie: Howl's Moving Castle genre: Animation
No info found for movie: Castle in the Sky genre: Animation
No info found for movie: Harry Potter and the Sorcerer's Stone genre: Fantasy
No info found for movie: Tenkû no shiro Rapyuta genre: Fantasy
No info found for movie: Busanhaeng genre: Horror
No info found for movie: Poirot genre: Mystery
No info found for movie: Star Trek: Discovery genre: Science Fiction
No info found for movie: Artificial Intelligence: AI genre: Science Fiction

Load the training documents

movie_overviews = []
genre_doc_id = []
for genre_id in fav_genres:
    genre = genre_dict[genre_id]

    with open(os.path.join(DATASET_DIR, 'data', 'train', genre, 'overview.txt'), 'rb') as f:
        movie_overviews.append(f.read().decode('utf-8').strip())
        genre_doc_id.append(genre)

Let's have a look at the generated movie_overviews; from now on, we will call them documents.

if False:
    print(movie_overviews)

We also saved the genre-document mapping, so that we know which document id belongs to which genre:

genre_doc_id
['Adventure',
 'Animation',
 'Fantasy',
 'Horror',
 'Mystery',
 'Science Fiction',
 'Thriller']

Tokenize the documents

texts = [document.split() for document in movie_overviews]

# remove words which appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

Let's have a look at the generated tokens

if False:
    print(texts)

Convert the tokens into a dictionary

Here we convert the generated tokens into a gensim Dictionary (a mapping from tokens to integer ids) and save it to disk for future use.

dictionary = gensim.corpora.Dictionary(texts)
dictionary.save(os.path.join(MODELS_DIR, 'movie_overviews.dict'))
print(dictionary)
Dictionary(762 unique tokens: ['17thcentury', 'able', 'aboard', 'accepts', 'achieved']...)

If interested, we can have a look at the generated tokens and their associated ids

if False:
    print(dictionary.token2id)

Convert tokenized documents into vectors

Here we simply convert the documents into a bag-of-words representation: each document becomes a vector whose elements pair a word's id (see dictionary.token2id above) with the number of times that word appeared in the document.

The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector (a list of (word_id, count) tuples).
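A tiny illustration with a throwaway dictionary (not the one we built above):

toy_dict = gensim.corpora.Dictionary([['space', 'crew', 'space']])
print(toy_dict.token2id)                             # {'crew': 0, 'space': 1}
print(toy_dict.doc2bow(['space', 'space', 'mars']))  # [(1, 2)]; unknown words are ignored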

corpus = [dictionary.doc2bow(text) for text in texts]

If interested, let's have a look at the generated corpus

if False:
    print(corpus)

Save the corpus on disk for future reference

gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, 'movie_overviews.mm'), corpus)
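Both the dictionary and the corpus can be loaded back later without re-processing the raw text:

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 'movie_overviews.dict'))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, 'movie_overviews.mm'))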

Further vector transformation

We have a vectorized (bag-of-words) representation of our documents loaded in corpus above. Now we will apply a further vector transformation to it. The reasons for this, as explained on the gensim website, are:

  1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
  2. To make the document representation more compact. This both improves efficiency (new representation consumes less resources) and efficacy (marginal data trends are ignored, noise-reduction).

Transform BOW corpus into TF-IDF vector representation

Here we convert our bag-of-words corpus into a TF-IDF corpus. We do this simply by initializing (i.e. training) a TF-IDF model with our BOW corpus. The "training" consists simply of going through the supplied corpus and computing the TF-IDF weights of all its features.

tfidf = gensim.models.TfidfModel(corpus)
tfidf
<gensim.models.tfidfmodel.TfidfModel at 0x11a2cfbe0>

Here we apply the transformation to the whole corpus

corpus_tfidf = tfidf[corpus]

Transform TF-IDF corpus into LSI vector representation

As a target dimensionality of 200-500 is recommended as a "golden standard" on the gensim website, I used 300. Note, however, that LSI can extract at most as many topics as there are training documents, so with our seven genre documents the resulting vectors have only seven components (as the query vector below shows).

lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf]
lsi.save(os.path.join(MODELS_DIR, 'movie_overviews.lsi'))
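If curious, we can inspect what the model has learned with print_topics, which lists the highest-weighted words per LSI dimension:

if False:
    for topic in lsi.print_topics(num_topics=5, num_words=5):
        print(topic)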

To prepare for similarity queries, we need to enter all the documents which we want to compare against subsequent queries, and index them:

index = gensim.similarities.MatrixSimilarity(lsi[corpus_tfidf])

Save the index on disk.

index.save(os.path.join(MODELS_DIR, 'movie_overviews.index'))

Now we are ready to query

We prepare a test document (simply a movie overview) and query it against our LSI model to see which of our documents it matches best.

test_overview = "The Dragon Warrior has to clash against the savage Tai Lung as China's fate hangs in the balance: However, the Dragon Warrior mantle is supposedly mistaken to be bestowed upon an obese panda who is a tyro in martial arts."
test_overview = text_process(test_overview)
test_overview
'dragon warrior clash savage tai lung chinas fate hangs balance however dragon warrior mantle supposedly mistaken bestowed upon obese panda tyro martial arts'

Convert our query document into a BOW vector

test_overview_bow_vec = dictionary.doc2bow(test_overview.split())

Convert the BOW vector into a TF-IDF vector

test_overview_tfidf_vec = tfidf[test_overview_bow_vec]

Convert the TF-IDF vector into an LSI vector

test_overview_lsi_vec = lsi[test_overview_tfidf_vec]
test_overview_lsi_vec
[(0, 0.04310030812220524),
 (1, -0.12306087771725557),
 (2, -0.09102392668367769),
 (3, 0.15222073367305888),
 (4, 0.052704251788643394),
 (5, -0.024454669609764647),
 (6, -0.011606140471226973)]

Load the index from disk

index = gensim.similarities.MatrixSimilarity.load(os.path.join(MODELS_DIR, 'movie_overviews.index'))
index
<gensim.similarities.docsim.MatrixSimilarity at 0x11a2cf240>

To query the given movie overview above, we do:

sims = index[test_overview_lsi_vec]
print(list(enumerate(sims)))
[(0, 0.05690968), (1, 0.9924693), (2, 0.12976697), (3, 3.8184226e-08), (4, 0.030632496), (5, 0.08933523), (6, 9.313226e-09)]

So, the document with id 1 has the highest similarity.

Let's see which document has id 1. We will get the document index and the similarity measure as a tuple:

result = (list(sims).index(max(sims)), max(sims))
result
(1, 0.9924693)
print(genre_doc_id[result[0]])
Animation
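Instead of looking only at the top hit, we can also rank all genres by similarity to get a fuller picture of the query:

for doc_id, score in sorted(enumerate(sims), key=lambda item: -item[1]):
    print(genre_doc_id[doc_id], round(float(score), 4))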

As the queried document

The Dragon Warrior has to clash against the savage Tai Lung as China's fate hangs in the balance: However,
the Dragon Warrior mantle is supposedly mistaken to be bestowed upon an obese panda who is a tyro in 
martial arts.

is from the movie Kung Fu Panda, which is without doubt an Animation movie, we can see that the similarity matching works quite well.
