TF-IDF:
TF-IDF stands for Term Frequency and Inverse Document Frequency and is one of the most popular and effective Natural Language Processing techniques. This technique allows you to estimate the importance of a term (word) in a text relative to all the other texts in a collection.
CORE IDEA: If a term appears frequently in some text, and rarely in any other text, this term has more importance for this text.
This technique combines two quantities, TF (Term Frequency) and IDF (Inverse Document Frequency):
- TF – shows how frequently the term occurs in the text, compared with the total number of words in the text.
- IDF – is the inverse document frequency of the term. It reflects how informative the term is, and is calculated as the logarithm of the number of texts divided by the number of texts containing this term.
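The two quantities above can be sketched directly from their definitions. This is a minimal illustration, not a library implementation; the function names `tf` and `idf` and the token-list representation of texts are assumptions made for the example.

```python
import math

def tf(term, text_tokens):
    # Term frequency: occurrences of the term divided by
    # the total number of words in the text.
    return text_tokens.count(term) / len(text_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: logarithm of the number of texts
    # divided by the number of texts containing the term.
    containing = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / containing)
```

For example, with a corpus of three tokenized texts where "dog" appears in only one of them, idf("dog", corpus) is log(3/1); a term appearing in two of the three texts gets the smaller value log(3/2).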
TF-IDF algorithm:
- Evaluate the TF-value for each term (word).
- Evaluate the IDF-value for each of these terms.
- Get the TF-IDF value for each term by multiplying TF by IDF.
- The result is a dictionary with a calculated TF-IDF weight for each term.
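The steps above can be put together into one small function. This is a rough sketch under the same assumptions as before (texts are lists of lowercase tokens); it returns one dictionary of TF-IDF weights per text, as the last step describes.

```python
import math

def tf_idf(corpus):
    # corpus: list of texts, each text given as a list of lowercase tokens.
    n_docs = len(corpus)

    # Document frequency: in how many texts each term appears.
    df = {}
    for tokens in corpus:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    # One dictionary per text, mapping each term to TF * IDF.
    result = []
    for tokens in corpus:
        scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            idf = math.log(n_docs / df[term])
            scores[term] = tf * idf
        result.append(scores)
    return result
```

Note that this plain formulation gives a weight of exactly zero to any term that occurs in every text, since log(N/N) = 0; practical implementations often smooth the IDF to avoid this.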
The algorithm for calculating TF-IDF for one word is shown in the diagram.
The results of running the same algorithm on three simple sentences are shown below.
The advantages of this vectorization technique:
- Unimportant terms receive a low TF-IDF weight (because they occur in almost all texts), while important terms receive a high one.
- It is easy to identify important terms and stop-words in a text.
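Both advantages can be seen on a toy example. The corpus below is a hypothetical one chosen for illustration: "the" appears in every text and behaves like a stop-word, so its IDF (and therefore its TF-IDF weight) collapses to zero, while a distinctive term like "cats" keeps a positive weight.

```python
import math

# Hypothetical toy corpus: "the" occurs in all three texts,
# "cats" occurs in only one.
corpus = [
    ["the", "cats", "sleep"],
    ["the", "dogs", "bark"],
    ["the", "birds", "sing"],
]

def tf_idf_weight(term, tokens, corpus):
    # TF * IDF for a single term within one text of the corpus.
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df)

# "the" is in all three texts: IDF = log(3/3) = 0, so its weight is 0.
print(tf_idf_weight("the", corpus[0], corpus))   # 0.0
# "cats" is in one text: weight = (1/3) * log(3), a clearly positive value.
print(tf_idf_weight("cats", corpus[0], corpus))
```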