The argument for a “Data Cube” for large-scale psychometric data
Alina von Davier
Thursday, September 13, 2018; 8:30 - 9:30, Room: FORUM
In recent years, work with educational testing data has changed because of the affordances of technology, the availability of large data sets, and advances in data mining and machine learning. Consequently, data analysis has moved from traditional psychometrics to advanced psychometrics to computational psychometrics. In the computational psychometrics framework, psychometric theory is blended with data-driven knowledge discovery. Despite the advances in methodology and the availability of the large data sets collected at each administration, the way testing organizations collect, store, and analyze data (from multiple tests at multiple times) is not conducive to the real-time, data-intensive computational psychometrics and analytics methods that can reveal new patterns and information about students.
In this presentation, I propose a new way to label, collect, and store data from large-scale educational learning and assessment systems (LAS) using the concept of the “data cube,” introduced by data scientists about 10 years ago to deal with stratification problems in big data in marketing contexts. However, applying the concept to educational data is quite challenging. The challenges stem from the lack of coherence in traditional content tagging, the lack of identity management across testing instruments, the lack of collaboration between psychometricians and data scientists, and, most recently, the lack of validity of the newly proposed machine learning methods for measurement. Currently, data for psychometrics are stored and analyzed as a two-dimensional matrix, item by examinee. The items’ content and the standards or taxonomies are usually stored as narratives in various systems of varying sophistication, from Excel spreadsheets to OpenSalt. In the era of Big Data, the expectation is not only that one has access to large volumes of data, but also that the data can be aligned and analyzed on different dimensions in real time, including item features such as content standards.
I am proposing that we rewrite the taxonomies and standards as mathematical vectors, and that we add these vectors as dimensions to the “data cube.” Similarly, we should vectorize the items’ metadata and/or item models and align them on different dimensions of the “cube.” The idea of a “data cube” has evolved over time, but the paradigm is easy to communicate and describe. Psychometricians and data scientists can interactively navigate their data and visualize the results through slicing, dicing, drilling down, rolling up, and pivoting.
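As a minimal sketch of what rewriting standards as vectors could look like, the Python fragment below encodes each item's content tags as a multi-hot vector over a small taxonomy and then slices along that dimension. The standard codes, item identifiers, and function names are all hypothetical, invented for illustration; this is not the encoding used at ACT.

```python
# Hypothetical taxonomy of content standards; the codes are invented.
STANDARDS = ["ALG.1", "ALG.2", "GEO.1", "STAT.1"]
INDEX = {code: i for i, code in enumerate(STANDARDS)}


def vectorize(tags):
    """Return a multi-hot vector marking the standards an item is tagged with."""
    vec = [0] * len(STANDARDS)
    for tag in tags:
        vec[INDEX[tag]] = 1
    return vec


# Hypothetical item metadata: item id -> list of standard codes.
ITEM_TAGS = {
    "item_01": ["ALG.1"],
    "item_02": ["ALG.1", "GEO.1"],
    "item_03": ["STAT.1"],
}

# Each item becomes a vector aligned on the same "standards" dimension.
ITEM_VECTORS = {item: vectorize(tags) for item, tags in ITEM_TAGS.items()}


def items_for_standard(code):
    """Slice along the standards dimension: items aligned to one standard."""
    col = INDEX[code]
    return sorted(item for item, vec in ITEM_VECTORS.items() if vec[col] == 1)
```

Because every item shares the same vector layout, items from different instruments could in principle be compared or grouped on the standards dimension, provided they are tagged against a common taxonomy.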
Obviously, the “data cube” is not a cube, given that the different data vectors are of different lengths. A data cube is designed to organize the data by grouping them into different dimensions, indexing the data, and precomputing frequently used queries. Because all the data are indexed and precomputed, a data cube query often runs significantly faster than a standard query. Once a data cube is built and precomputed, intuitive data projections can be applied to it through a number of operations. Also, traditional psychometric models can be applied at scale and in real time in ways that were not possible before.
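The precomputation idea can be illustrated with a toy example. The sketch below, which assumes a hypothetical record layout of (examinee, item, standard, administration, score) and made-up data, aggregates scores once for every combination of kept and rolled-up dimensions, so that a later slice or roll-up is a single dictionary lookup. It is a simplification for illustration, not a description of any production system.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker for a dimension that has been rolled up

# Hypothetical response records:
# (examinee, item, standard, administration, score)
RESPONSES = [
    ("s1", "item_01", "ALG.1", "2018-04", 1),
    ("s1", "item_02", "GEO.1", "2018-04", 0),
    ("s2", "item_01", "ALG.1", "2018-04", 1),
    ("s2", "item_02", "GEO.1", "2018-06", 1),
    ("s3", "item_01", "ALG.1", "2018-06", 0),
]


def build_cube(records):
    """Precompute [sum, count] of scores for every combination of kept or
    rolled-up dimensions (item, standard, administration)."""
    cube = defaultdict(lambda: [0, 0])
    for _examinee, item, standard, admin, score in records:
        dims = (item, standard, admin)
        # Each mask decides which dimensions are kept and which are rolled up.
        for mask in product((True, False), repeat=len(dims)):
            key = tuple(d if keep else ALL for d, keep in zip(dims, mask))
            cube[key][0] += score
            cube[key][1] += 1
    return dict(cube)


def mean_score(cube, item=ALL, standard=ALL, admin=ALL):
    """Answer a slice or roll-up query with one lookup in the cube."""
    total, count = cube[(item, standard, admin)]
    return total / count


CUBE = build_cube(RESPONSES)
```

For example, `mean_score(CUBE, standard="ALG.1")` rolls up over items and administrations, while `mean_score(CUBE, item="item_02", admin="2018-04")` is a finer slice. The trade-off is the usual one: the cube costs more memory and build time in exchange for fast queries.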
At ACT we are building a Learning Analytics Platform (LEAP), for which I propose an updated version of this data structure: in-memory database technology that allows newer interactive visualization tools to query a larger number of data dimensions interactively. In this presentation, I will use large-scale examples to illustrate possible alignments, based on machine learning tools, across multiple testing instruments taken by millions of students.