Sunday, August 7, 2022

Colab Word2Vec Using Google News dataset

.

Import the required libraries:

import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

.

Download the pretrained Word2Vec Google News 300 model using the Gensim downloader:

import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']
print(vec_king.shape)

>>>>(300,)

.

Reload the downloaded file, limiting the vocabulary to the first 50,000 words it contains:

EMBEDDING_FILE = '/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'
word_vectors = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True, limit=50000)
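
To illustrate what limit does, here is a minimal sketch of a reader for the plain-text word2vec format (a header line with the vocabulary size and dimension, then one word and its values per line). This is not Gensim's implementation, and the Google News file is binary rather than text; the helper name read_word2vec_text and the toy data are hypothetical. The point is that reading stops after the first limit entries, so which words survive depends on their order in the file.

```python
def read_word2vec_text(lines, limit=None):
    """Parse word2vec text-format lines, keeping at most `limit` entries."""
    lines = iter(lines)
    # Header: "<vocab_size> <vector_dimension>"
    vocab_size, dim = (int(x) for x in next(lines).split())
    max_entries = vocab_size if limit is None else min(limit, vocab_size)

    vectors = {}
    for line in lines:
        if len(vectors) >= max_entries:
            break  # stop early: later words are never read
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors


# Toy stand-in for a real embedding file (hypothetical data).
toy_file = [
    "3 2",
    "the 0.1 0.2",
    "apple 0.3 0.4",
    "mango 0.5 0.6",
]

limited = read_word2vec_text(toy_file, limit=2)
print(list(limited))  # ['the', 'apple']
```

In this toy example "mango" is dropped because it comes third in the file, so looking it up in the limited result would fail.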

.

Find cosine similarity:

v_apple = word_vectors["apple"]
v_mango = word_vectors["mango"]
print(v_apple.shape)
print(v_mango.shape)
cosine_similarity([v_mango],[v_apple])

>>>>(300,)
(300,)
array([[0.57518554]], dtype=float32)
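
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula; the function name cosine is hypothetical, and it is meant to show the arithmetic rather than replace scikit-learn's function.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because it only iterates over the values, the same function should also accept the NumPy vectors returned by Gensim, for example cosine(v_mango, v_apple), and agree with the result above up to floating-point rounding. Gensim's KeyedVectors also provides a similarity method, so word_vectors.similarity("mango", "apple") is another way to get this value.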

.

Unfortunately, the model cannot infer vectors for words that are not in its vocabulary. This is a limitation of Word2Vec. If it matters to you, check out the FastText model, which builds vectors from subword n-grams and can therefore handle unseen words.

.

try:
    vec_cameroon = wv['cameroon']
except KeyError:
    print("The word 'cameroon' does not appear in this model")

>>>The word 'cameroon' does not appear in this model
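
As an alternative to try/except, membership can be checked before the lookup. The sketch below assumes a hypothetical helper named get_vector and uses a plain dict as a stand-in for the model so that it runs without downloading anything; Gensim's KeyedVectors supports the same in and [] operations.

```python
def get_vector(vectors, word, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary."""
    return vectors[word] if word in vectors else default


# Stand-in for a loaded model (hypothetical toy data, not real embeddings).
toy_vectors = {"king": [0.1, 0.2], "queen": [0.3, 0.4]}

words = ["king", "cameroon", "queen"]
# Keep only the words the model actually knows.
known = [w for w in words if get_vector(toy_vectors, w) is not None]
print(known)  # ['king', 'queen']
```

Filtering a word list this way before computing similarities avoids a KeyError partway through a batch.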


.

Reference:

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

