Data-driven overview of the CEMO research publications contents
Just for fun, I made a data-driven overview of the contents of the CEMO articles published in international journals, using some simple text-mining algorithms and models.
A total of 45 PDF articles form the so-called corpus (Latin for "body") on which the overview is based. Each of these PDFs was converted to a plain-text document. Numbers, common stopwords, and sparsely distributed words were removed from the text, and the remaining words were stemmed (so that, for instance, "students" and "student" are not counted as different content entities). Analyzing this corpus of research articles gives some interesting insights into what my CEMO colleagues and I are working on and writing about! The following 21 word stems, for example, appear in all 45 papers:
Assess - Data - Develop - Discuss - Effect - First - General - Import - Level - Measur - Method - Model - Provid - Refer - Relat - Research - Result - Studi - Test - Univers - Use.
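The preprocessing steps described above can be sketched in a few lines of Python. This is only a toy illustration: the suffix-stripping "stemmer" is a crude stand-in for a real algorithm such as Porter's, the stopword list is heavily truncated, and the example sentence is invented.

```python
import re

# Toy stand-in for a real stemmer (e.g. Porter's algorithm): it only
# strips a few common English suffixes, enough for this illustration.
def crude_stem(word):
    for suffix in ("ing", "ment", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Heavily truncated stopword list, purely for illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "that"}

def preprocess(text):
    """Lowercase, keep letters only (so numbers vanish), drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def stems_in_all(docs):
    """Stems that occur in every document of the corpus,
    like the 21 stems listed above."""
    common = set(docs[0])
    for doc in docs[1:]:
        common &= set(doc)
    return sorted(common)

print(preprocess("The students and the student are testing 2 models"))
# -> ['student', 'student', 'test', 'model']
```

Note how both "students" and "student" end up as the same stem, which is exactly why stemming is applied before counting.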
A simple wordcloud can give an idea of which words are central to our work (the larger the font, the more frequently that word stem appears in the collection of papers).
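The frequencies driving those font sizes are simple corpus-wide term counts. A minimal sketch of computing them is below; the example documents are invented, and rendering an actual cloud would typically be delegated to a dedicated package such as `wordcloud`.

```python
from collections import Counter

def stem_frequencies(documents):
    """Total count of each word stem across the whole corpus; in a
    wordcloud, font size is proportional to this frequency."""
    counts = Counter()
    for doc in documents:  # each doc is a list of stemmed tokens
        counts.update(doc)
    return counts

# Invented example documents (lists of stemmed tokens):
docs = [["measur", "model", "test"], ["model", "data", "model"]]
freqs = stem_frequencies(docs)
print(freqs["model"])  # -> 3, so "model" would get the largest font
```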
Of course, given that I work with latent variable models, I thought I should include one such model here as well. Probabilistic topic models try to "write" each document in the collection as a combination of latent underlying topics, where each topic is characterized by words that have a high chance of occurring together. The results of such a latent topic model are summarized in the figure below, where larger dots indicate a higher "relevance" of the word for the specific topic.
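In practice a topic model like this would be estimated with a library such as gensim or scikit-learn. Purely to illustrate the underlying idea, here is a toy collapsed Gibbs sampler for latent Dirichlet allocation; the documents, topic count, and hyperparameter values are all made up for the example.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of (stemmed) tokens.
    Returns the vocabulary, topic-word counts, and doc-topic counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    z = []                                      # topic of each token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][w2i[w]] += 1
            nk[k] += 1
        z.append(zs)

    # Resample each token's topic given all other assignments.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], w2i[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    return vocab, nkw, ndk

# Invented mini-corpus of stemmed tokens:
docs = [["model", "data", "model", "test"],
        ["student", "assess", "student", "assess"]]
vocab, nkw, ndk = lda_gibbs(docs, n_topics=2)
print(vocab)  # -> ['assess', 'data', 'model', 'student']
```

The sampler repeatedly reassigns each token to a topic with probability proportional to how common that topic is in the document and how common the word is in the topic; the resulting `nkw` counts are what a "relevance" plot like the one above would be built from.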
Quantitative analyses of words: it's something different, but then again perhaps not that different :)