What is it?

Term Frequency-Inverse Document Frequency, or TF-IDF for short, is a way of representing text documents as vectors, widely used in Natural Language Processing and as input features for Machine Learning models.


How does it work?

TF-IDF builds on standard Term-Frequency (TF) vectorization. With raw term frequencies, very common words (such as "the") receive large counts and end up dominating the model, even though they carry little information. TF-IDF corrects this by down-weighting tokens that appear in many documents and giving higher weights to rare ones, using a logarithm to compute the weight of any given token, or word.

Let N be the total number of documents (observations) and n_t the number of documents in which token t appears. The IDF weight of t is:

IDF(t) = log(N / n_t)

Now, to calculate TF-IDF, one simply multiplies the known TF value of a token in a document d by its IDF:

TF-IDF(t, d) = TF(t, d) × IDF(t)

The resulting value can then be used as the weight of the corresponding token in the vector representation.
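As a sketch, the two formulas above translate directly into plain Python. Note that this uses the raw definition; libraries such as scikit-learn apply additional smoothing and normalization, so their values will differ slightly. The toy corpus below is made up for illustration:

```python
import math

# A toy corpus: each document is a list of tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def tf(token, doc):
    # Raw count of the token in one document.
    return doc.count(token)

def idf(token, docs):
    # log(N / n_t): N documents total, n_t documents containing the token.
    n_t = sum(token in doc for doc in docs)
    return math.log(len(docs) / n_t)

def tf_idf(token, doc, docs):
    return tf(token, doc) * idf(token, docs)

# "the" appears in every document, so its IDF (and thus TF-IDF) is 0.
print(tf_idf("the", docs[0], docs))  # 0.0
# "cat" appears in only 2 of 3 documents, so it gets a positive weight.
print(tf_idf("cat", docs[0], docs))  # ≈ 0.405
```

This illustrates the key property: a word present in every document contributes nothing, while rarer words receive higher weights.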


Applying it in Python

One can apply TF-IDF in Python using the Machine Learning library scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Learn the vocabulary and IDF weights, then transform the corpus
# into a matrix of TF-IDF vectors (one row per document).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data)

print(f"Vector features: {vectorizer.get_feature_names_out()}\n")
print(f"Vectorized data:\n{X.toarray()}")