Data-driven overview of the CEMO research publications contents

Just for fun, I used some simple text mining algorithms and models to make a data-driven overview of the contents of the CEMO articles published in international journals.

Johan Braeken

A total of 45 PDF articles form the so-called corpus (Latin for "body") on which the overview is based. Each of these PDFs was converted to a plain-text document. Numbers, common stopwords, and sparsely distributed words were removed from the text, and the remaining words were stemmed (so that, for instance, "students" and "student" are not treated as different content entities). Analyzing this corpus of research articles gives some interesting insights into what my CEMO colleagues and I are working on and writing about! The following 21 word stems, for example, are used in all 45 papers:

Assess - Data - Develop - Discuss - Effect - First - General - Import - Level - Measur - Method - Model - Provid - Refer - Relat - Research - Result - Studi - Test - Univers - Use.
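For readers who like to see the steps spelled out, here is a minimal Python sketch of this kind of preprocessing, followed by the lookup of stems that occur in every document. It is an illustration under stated assumptions, not the code behind this overview: the stopword list, the three toy texts, the sparsity threshold, and the crude toy_stem function (a stand-in for a real stemmer such as Porter's) are all made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "are", "is", "to"}

def toy_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, keep alphabetic tokens only (drops numbers),
    remove stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

# Hypothetical three-document corpus standing in for the 45 papers.
texts = [
    "The 45 students are testing models on data",
    "A student tests the model and the data in 2016",
    "Models of infant data",
]
docs = [preprocess(t) for t in texts]

# Document frequency: in how many documents does each stem occur?
doc_freq = Counter(stem for doc in docs for stem in set(doc))

# Stems that occur in every document.
in_all = sorted(s for s, n in doc_freq.items() if n == len(docs))

# Sparse stems: here, those found in fewer than 2 documents
# (the threshold is an assumption).
sparse = sorted(s for s, n in doc_freq.items() if n < 2)

print(in_all)  # ['data', 'model']
print(sparse)  # ['infant']
```

With a real corpus, the same document-frequency count serves both purposes: stems below the chosen threshold are dropped as sparse, and stems whose count equals the number of documents are the ones shared by all papers.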

A simple wordcloud can give an idea about which words are central to our work (i.e., the larger the font, the more frequently this word stem appeared in the collection of papers).
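The font-size idea can be sketched in a few lines of Python. This is only an illustration of the principle: the linear scaling, the point sizes, and the function name font_sizes are assumptions, and actual word cloud software may scale frequencies differently.

```python
from collections import Counter

def font_sizes(stems, min_pt=10, max_pt=48):
    """Map each stem's frequency linearly onto a font size in points.
    Expects a non-empty list of stems."""
    counts = Counter(stems)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero if all counts are equal
    return {
        stem: round(min_pt + (n - lo) / span * (max_pt - min_pt))
        for stem, n in counts.items()
    }

# Hypothetical stem counts: "model" 5 times, "data" 3 times, "infant" once.
stems = ["model"] * 5 + ["data"] * 3 + ["infant"]
print(font_sizes(stems))  # {'model': 48, 'data': 29, 'infant': 10}
```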

Of course, given that I work with latent variable models, I thought I should include one such model here as well. Probabilistic topic models try to "write" each document in the collection as a combination of latent underlying topics, where each topic is characterized by words that have a high chance of occurring together. The results of such a latent topic model are summarized in the figure below, where larger dots indicate a higher "relevance" of the word for the specific topic.

Topic1    Topic2    Topic3    Topic4    Topic5    Topic6    Topic7    Topic8    Topic9    Topic10
estim     languag   teacher   patient   use       model     problem   student   infant    educ
data      time      mathemat  time      teacher   order     assess    model     parent    develop
variabl   abil      teach     item      ict       data      compet    school    mother    learn
condit    task      knowledg  anxieti   factor    effect    solv      achiev    behavior  selfefficaci
use       factor    educ      partner   comput    hypothes  model     relat     preterm   effect
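To make the "combination of topics" idea concrete, here is a small Python sketch of the mixture at the heart of a probabilistic topic model: the probability of a word in a document is the topic-weighted sum of that word's probability within each topic. All numbers and the topic labels are invented for illustration; they are not estimates from the CEMO corpus, and the sketch does not fit a model, it only shows how fitted quantities would combine.

```python
# Hypothetical p(word | topic) for two topics over a four-word vocabulary.
topic_word = {
    "teaching": {"teacher": 0.5, "student": 0.3, "model": 0.1, "infant": 0.1},
    "infancy":  {"teacher": 0.1, "student": 0.1, "model": 0.2, "infant": 0.6},
}

def word_probs(doc_topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    vocab = next(iter(topic_word.values())).keys()
    return {
        w: sum(weight * topic_word[t][w] for t, weight in doc_topics.items())
        for w in vocab
    }

# A hypothetical document that is 75% "teaching" and 25% "infancy".
probs = word_probs({"teaching": 0.75, "infancy": 0.25})
print({w: round(p, 3) for w, p in probs.items()})
# {'teacher': 0.4, 'student': 0.25, 'model': 0.125, 'infant': 0.225}
```

Fitting a topic model runs this logic in reverse: given only the observed word counts per document, it estimates both the topic-word probabilities and each document's topic weights.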

Quantitative analysis of words: it's something different, but then again perhaps not that different :)

Published Feb. 2, 2016 9:24 AM - Last modified Apr. 20, 2017 9:14 AM