Topic Models (TopicModel)


http://blog.csdn.net/pipisorry/article/details/45307369

Applications of LDA

One of the very best ways to sort large databases of unstructured text is to use a technique called latent Dirichlet allocation (LDA).


Limitations of LDA and improvements

LDA limitations: what’s next?

Although LDA is a great algorithm for topic modelling, it still has some limitations, mainly because it has only recently become popular and widely available.

One major limitation is perhaps given by its underlying unigram text model: LDA doesn’t consider the relative position of the words in the document. Documents like “Man, I love this can” and “I can love this man” are probably modelled the same way. It’s also true that for longer documents, mismatched topics are less likely. To overcome this limitation, at the cost of almost squaring the complexity, you can use 2-grams (or N-grams) along with 1-grams.
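The effect can be checked on the two example sentences. The following is a minimal sketch using only the Python standard library; the helper names `tokenize` and `ngram_counts` are made up for illustration and do not come from any LDA package.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

doc_a = tokenize("Man, I love this can")
doc_b = tokenize("I can love this man")

# With 1-grams both documents collapse to the same bag of words.
unigrams_a, unigrams_b = ngram_counts(doc_a, 1), ngram_counts(doc_b, 1)
# With 2-grams the word order starts to matter.
bigrams_a, bigrams_b = ngram_counts(doc_a, 2), ngram_counts(doc_b, 2)

print(unigrams_a == unigrams_b)  # True
print(bigrams_a == bigrams_b)    # False
```

The price is a much larger feature space: a vocabulary of V words can produce up to V² distinct 2-grams, which is where the near-squared complexity comes from.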

Another weakness of LDA lies in the composition of its topics: they overlap. In fact, you can find the same word in multiple topics (the example above, of the word “can”, is an obvious one). The generated topics are therefore not independent and orthogonal like a PCA-decomposed basis, for example. This implies that you must pay close attention when dealing with them (e.g. don’t use cosine similarity).
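To make the overlap concrete, here is a small sketch with two hypothetical topic-word distributions. The vocabulary and the probabilities are invented for illustration and are not the output of a fitted model. Because every entry of a topic distribution is non-negative, any word shared by two topics makes their inner product strictly positive, so the topics cannot be orthogonal.

```python
# Hypothetical topic-word distributions over a toy vocabulary (made-up numbers).
vocab = ["can", "love", "man", "tin", "soda"]
topic_people = [0.20, 0.40, 0.40, 0.00, 0.00]
topic_drinks = [0.30, 0.00, 0.00, 0.30, 0.40]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Words that carry probability mass in both topics.
shared = [w for w, p, q in zip(vocab, topic_people, topic_drinks) if p > 0 and q > 0]
overlap = dot(topic_people, topic_drinks)

print(shared)       # ['can']
print(overlap > 0)  # True: the two topics are not orthogonal
```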

For a more structured approach, especially if the topic composition is very misleading, you might consider the hierarchical variation of LDA, named H-LDA (or simply Hierarchical LDA). In H-LDA, topics are joined together in a hierarchy by using a Nested Chinese Restaurant Process (NCRP). This model is more complex than LDA, and its description is beyond the scope of this blog entry; the source post shows an example of the possible output. Don’t forget that we’re still in the probabilistic world: each node of the H-LDA tree is a topic distribution.
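As a rough illustration of the nested Chinese restaurant process only (not of full H-LDA inference), the sketch below draws a root-to-leaf path through a topic tree for each document. At every node a document follows an existing child with probability proportional to the number of earlier documents that took that branch, or opens a new child with probability proportional to a concentration parameter. The names `Node`, `choose_child`, `sample_path` and `gamma`, as well as the fixed depth of 3, are assumptions made for this example.

```python
import random

class Node:
    """A node in the topic tree; `customers` counts documents routed through it."""
    def __init__(self):
        self.customers = 0
        self.children = []

def choose_child(node, gamma, rng):
    """One CRP step: existing child with weight = its count, new child with weight = gamma."""
    weights = [child.customers for child in node.children] + [gamma]
    idx = rng.choices(range(len(weights)), weights=weights)[0]
    if idx == len(node.children):
        # The last slot stands for "open a new branch".
        node.children.append(Node())
    return node.children[idx]

def sample_path(root, depth, gamma, rng):
    """Draw a path of `depth` nodes, starting at the root, for one document."""
    path = [root]
    root.customers += 1
    for _ in range(depth - 1):
        child = choose_child(path[-1], gamma, rng)
        child.customers += 1
        path.append(child)
    return path

rng = random.Random(0)
root = Node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(20)]
print(len(root.children), "branches below the root after 20 documents")
```

In H-LDA each node on such a path would additionally hold its own topic distribution; that part is omitted here.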

[http://engineering.intenthq.com/2015/02/automatic-topic-modelling-with-lda/]


Big data text analysis: inconsistent, inaccurate

LDA is also inaccurate enough at some tasks that the results of any topic model created with it are essentially meaningless, according to Luis Amaral.

Applied to messy, inconsistently scrubbed data from many sources in many formats (the kind of data that big data is often praised for its ability to manage), the results would be far less accurate and far less reproducible.

"Our systematic analysis clearly demonstrates that current implementations of LDA have low validity," the paper reports.

Improvement: TopicMapping

1. The algorithm first breaks words down to their base forms (treating "stars" and "star" as the same word), then eliminates conjunctions, pronouns and other "stop words" that modify the meaning but not the topic, using a standardized list.

2. The algorithm then builds a model identifying words that often appear together in the same document, and uses the Infomap community-detection algorithm to assign those clusters of words to groups, each identified as a "community" that defines a topic. Words can appear in more than one topic area.
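The two steps above can be sketched as follows. This is a toy illustration, not the TopicMapping implementation: `crude_stem` is a deliberately naive stand-in for a real stemmer, `STOP_WORDS` is a tiny made-up list rather than a standardized one, and the Infomap clustering of the resulting word network is not implemented here.

```python
from collections import Counter
from itertools import combinations
import re

STOP_WORDS = {"the", "a", "and", "of", "in", "it", "is", "are"}  # tiny illustrative list

def crude_stem(word):
    """Naive stand-in for a real stemmer: strip a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(text):
    """Step 1: tokenize, drop stop words, reduce words to a base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def cooccurrence(docs):
    """Step 2 (first half): for each word pair, count the documents containing both."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(preprocess(doc)))
        counts.update(combinations(words, 2))
    return counts

docs = [
    "The stars and the galaxy",
    "A star in the galaxy",
    "The bank raises interest rates",
]
edges = cooccurrence(docs)
print(edges[("galaxy", "star")])  # 2
```

The `edges` counter is a weighted word network. A community-detection algorithm would then group densely connected words, such as "galaxy" and "star" here, into candidate topics.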

The new approach delivered results that were 92 percent accurate and 98 percent reproducible, though, according to the paper, it only moderately improved the likelihood that any given result would be accurate.

The best way to improve those analyses is to apply techniques common in community detection algorithms – which identify connections among specific variables and use those to help categorize or verify the classification of those that aren't clearly in one group or another.

[Test shows big data text analysis inconsistent, inaccurate]
