TF-IDF with TextVectorization

TextVectorization is an experimental layer for raw text preprocessing: text normalization/standardization, tokenization, n-gram generation, and vocabulary indexing.

This layer can also be used to calculate the TF-IDF matrix of a corpus.

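TF-IDF (term frequency-inverse document frequency) is a score intended to reflect how important a word is to a document in a collection or corpus. It grows with the number of times a word appears in the document and shrinks as the word appears in more documents of the corpus.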
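To make the score concrete, here is a minimal pure-Python sketch of the textbook definition, tf * log(N / df). The function name `tf_idf` and the toy corpus are made up for illustration, and the weighting is an assumption: libraries, TextVectorization included, typically use smoothed variants of the IDF term, so the exact numbers will differ from what the layer returns.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {token: score} dict per document, using tf * log(N / df)."""
    # Naive tokenization for illustration: lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        scores.append({
            token: count * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores

# Hypothetical toy corpus.
corpus = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(corpus)
```

In this toy corpus "the" appears in every document, so its score is zero, while "sat" appears in only one document and scores highest in the first document.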

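First, import the TextVectorization class, which lives in an experimental package for now (newer TensorFlow releases also expose it directly as tensorflow.keras.layers.TextVectorization).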

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

Second, create an instance that will calculate the TF-IDF matrix by setting output_mode to 'tf-idf'.

MAX_TOKENS = 5000  # example vocabulary cap; pick a value that suits your corpus

tfidf_calculator = TextVectorization(
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    max_tokens=MAX_TOKENS,
    output_mode='tf-idf',
    pad_to_max_tokens=False)
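As a rough picture of what the standardize and split settings do before any counting happens, the following pure-Python sketch approximates them. It is not the layer's implementation: the helper names `standardize` and `split_whitespace` are hypothetical, and using `string.punctuation` as the set of stripped characters is an assumption that may not match the layer's exact rules.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation characters."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split_whitespace(text):
    """Split on runs of whitespace."""
    return text.split()

tokens = split_whitespace(standardize("Hello, World! TF-IDF is fun."))
# tokens == ['hello', 'world', 'tfidf', 'is', 'fun']
```

Note that stripping punctuation joins hyphenated words, so "TF-IDF" becomes the single token "tfidf" in this sketch.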

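Third, build the vocabulary by adapting the layer to the corpus. This step also collects the document statistics needed for the IDF weights.

# text_input holds the corpus, e.g. a NumPy array of strings or a tf.data.Dataset
tfidf_calculator.adapt(text_input)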

Finally, call the layer on the text to get a dense TF-IDF matrix with one row per document and one column per vocabulary token.

tfidf = tfidf_calculator(text_input)

Example notebook here.
