TF-IDF with TextVectorization
TextVectorization is an experimental layer for raw text preprocessing: text normalization/standardization, tokenization, n-gram generation, and vocabulary indexing.
This layer can also be used to calculate the TF-IDF matrix of a corpus.
TF-IDF is a score intended to reflect how important a word is to a document in a collection or corpus.
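To make the score concrete, here is a minimal pure-Python sketch of one common textbook variant (raw term count multiplied by log(N / df)). The corpus and the function name `tf_idf` are made up for illustration, and the exact weighting that TextVectorization applies may differ from this formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: score} dict per document.

    Uses tf = raw count in the document and idf = log(N / df).
    This is one common variant, not necessarily the formula
    used inside TextVectorization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return scores

corpus = ["the cat sat", "the dog sat", "the bird flew"]  # hypothetical corpus
scores = tf_idf(corpus)
```

In this toy corpus, a word that appears in every document ("the") scores zero, while a word found in only one document ("cat") scores higher than one shared by two documents ("sat").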
First, import the TextVectorization class, which is in an experimental package for now.
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
Second, define an instance that will calculate the TF-IDF matrix by setting output_mode to 'tf-idf'.
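# MAX_TOKENS caps the vocabulary size; it must be defined beforehand.
tfidf_calculator = TextVectorization(
    standardize='lower_and_strip_punctuation',  # lowercase, drop punctuation
    split='whitespace',                         # tokenize on whitespace
    max_tokens=MAX_TOKENS,
    output_mode='tf-idf',  # newer TensorFlow releases spell this 'tf_idf'
    pad_to_max_tokens=False)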
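The standardize and split settings describe what happens to each string before any counting takes place. A rough pure-Python approximation is sketched below. Here `string.punctuation` is an assumption standing in for the layer's own punctuation pattern, and the helper name `standardize_and_split` is hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def standardize_and_split(text):
    """Lowercase, strip punctuation, then split on whitespace."""
    return text.lower().translate(_STRIP_PUNCT).split()

tokens = standardize_and_split("Hello, World! It's TF-IDF.")
```

Note that punctuation is removed rather than replaced by a space, so "It's" becomes "its" and "TF-IDF" becomes "tfidf" in this sketch.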
Third, build the vocabulary by adapting the layer to the text.
tfidf_calculator.adapt(text_input)
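Conceptually, adapting scans the corpus, counts tokens, and keeps the most frequent ones up to the max_tokens limit. The sketch below imitates that idea in plain Python. The function name `build_vocab`, the toy corpus, and the alphabetical tie-breaking are assumptions for illustration. The real layer also reserves a slot for out-of-vocabulary tokens and, in TF-IDF mode, gathers the document statistics needed for the IDF weights, both of which are left out here.

```python
from collections import Counter

def build_vocab(docs, max_tokens):
    """Return the max_tokens most frequent tokens, most frequent first."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:max_tokens]]

# Hypothetical corpus: "the" occurs 3 times, "sat" twice, the rest once.
vocab = build_vocab(["the cat sat", "the dog sat", "the bird flew"], max_tokens=3)
```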
Finally, we call the layer on the text to get a dense TF-IDF matrix.
tfidfs = tfidf_calculator(text_input)
Example notebook here.