Once the term-document matrix is built, it is cleaned of stop words (pronouns, articles, and other function words), and some word forms are truncated to their stems, a process known as stemming, although this may not be necessary for every language. The terms are then represented in a bag-of-words model.
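As a minimal sketch of this step, assuming scikit-learn and a tiny illustrative corpus (neither of which comes from the original, and with stemming omitted for brevity), the counts might be built like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "A dog chased the cat.",
    "Dogs and cats make good pets.",
]

# stop_words='english' drops articles, pronouns, and other function words
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)    # documents x terms; transpose for
                                           # a term-by-document orientation
print(vectorizer.get_feature_names_out())  # the retained vocabulary
print(counts.toarray())                    # raw bag-of-words counts
```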
The entries in the term-document matrix are typically converted into weights according to their estimated importance, e.g., by the TF-IDF method, which is described further below.
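Continuing the sketch above, the raw counts can be reweighted with TF-IDF; the use of scikit-learn's TfidfTransformer here is an assumption, not the author's code:

```python
from sklearn.feature_extraction.text import TfidfTransformer

# Rescale raw counts so that terms frequent in one document but rare
# across the collection receive the highest weights.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))
```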
Then SVD is performed on the matrix, decomposing it into three other matrices. Each term and each document receives a vector representation in one of the two orthogonal matrices, while the diagonal matrix holds the singular values in descending order. Only the k largest values are retained, and the remaining values are set to zero. The choice of the factor k for the reduction is empirical and related to the size of the collection. SVD thus reduces the matrix while preserving its main semantic structure.
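A hedged illustration of this reduction step, using scikit-learn's TruncatedSVD with an arbitrary k = 2 (both the class and the value of k are assumptions for the tiny corpus above):

```python
from sklearn.decomposition import TruncatedSVD

k = 2
svd = TruncatedSVD(n_components=k)
doc_vectors = svd.fit_transform(tfidf)   # documents in the k-dimensional space
term_vectors = svd.components_.T         # terms mapped into the same space
print(svd.singular_values_)              # the k largest singular values,
                                         # in descending order
```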
The data are then compared by taking the cosine of the angle between the vectors formed by any two columns (there are other ways to compare, e.g., by Euclidean distance).
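For instance, pairwise cosine similarities between the reduced document vectors could be computed as follows (cosine_similarity from scikit-learn is again an assumption, continuing the sketch):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Each entry [i, j] is the cosine of the angle between documents i and j;
# the diagonal is 1.0 since every vector is identical to itself.
sims = cosine_similarity(doc_vectors)
print(sims.round(2))
```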
These calculations identify co-occurrence patterns in the body of text and help reveal common concepts across multiple documents in a collection. In effect, LSI transforms a very sparse term-document matrix into a low-rank approximation that exposes this shared structure. The disadvantage of LSI is that it is computationally expensive.