library(dcmdata)
We are so happy to announce the release of a new package, dcmdata. The goal of dcmdata is to provide easy access to data sets that can be used for demonstrating and testing diagnostic classification models (DCMs; also called cognitive diagnostic models [CDMs]).
You can install dcmdata from CRAN with:
install.packages("dcmdata")
This blog post will highlight the major features and plans for future development.
Data sets
dcmdata contains both real and simulated data sets. All data sets include both response data and a Q-matrix. The real data sets include the MacReady and Dayton (1977) multiplication data (MDM) and the Examination for the Certificate of Proficiency in English (ECPE), as described by Templin & Hoffman (2013).
The MDM data are a small data set of four items that measure a single attribute, multiplication. As such, these data are useful when you need a fairly short estimation time. For example, they could be used to quickly iterate while testing model code, or in training workshops where time is limited.
mdm_data
#> # A tibble: 142 × 5
#> respondent mdm1 mdm2 mdm3 mdm4
#> <fct> <int> <int> <int> <int>
#> 1 m8qre 1 1 1 1
#> 2 8wMPc 1 1 1 1
#> 3 xdbT8 1 1 1 1
#> 4 Ee9ob 1 1 1 1
#> 5 0tyTA 1 1 1 1
#> 6 L4bzq 1 1 1 1
#> 7 QTW1v 1 1 1 1
#> 8 w4NOH 1 1 1 1
#> 9 t9sIe 1 1 1 1
#> 10 FDa7I 1 1 1 1
#> # ℹ 132 more rows
mdm_qmatrix
#> # A tibble: 4 × 2
#> item multiplication
#> <chr> <int>
#> 1 mdm1 1
#> 2 mdm2 1
#> 3 mdm3 1
#> 4 mdm4 1
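Because the MDM data are so small, a quick descriptive check runs almost instantly. The following is a minimal sketch, assuming dcmdata is installed and that the item columns are named `mdm1` through `mdm4` as in the printed output above; `item_cols` and `prop_correct` are names introduced here for illustration only.

```r
library(dcmdata)

# Item columns as shown in the printed tibble above
item_cols <- c("mdm1", "mdm2", "mdm3", "mdm4")

# Proportion of respondents answering each item correctly
# (na.rm = TRUE is a precaution in case any responses are missing)
prop_correct <- colMeans(mdm_data[, item_cols], na.rm = TRUE)
prop_correct
```

A summary like this can serve as a sanity check before fitting a model in a workshop or while iterating on model code.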
In contrast, the ECPE data are perhaps more representative of data you might gather in practice. These data consist of 28 items measuring 3 attributes and taken by 2,922 respondents. Because multiple attributes are measured, these data can be used to demonstrate how different attributes interact on a single item when using different compensatory and noncompensatory DCMs. Additionally, Templin & Bradshaw (2014) demonstrated that the three attributes follow a linear hierarchy, such that respondents typically demonstrate proficiency on lexical, cohesive, and morphosyntactic rules, in that order. That is, the earlier skills represent precursor knowledge necessary for proficiency on the later skills. The hierarchy makes the ECPE data excellent for demonstrating the effect of different structural model specifications in a DCM.
ecpe_data
#> # A tibble: 2,922 × 29
#> resp_id E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 0 1 1 1 1 1 1 1
#> 2 2 1 1 1 1 1 1 1 1 1 1 1
#> 3 3 1 1 1 1 1 1 0 1 1 1 1
#> 4 4 1 1 1 1 1 1 1 1 1 1 1
#> 5 5 1 1 1 1 1 1 1 1 1 1 1
#> 6 6 1 1 1 1 1 1 1 1 1 1 1
#> 7 7 1 1 1 1 1 1 1 1 1 1 1
#> 8 8 0 1 1 1 1 1 0 1 1 1 0
#> 9 9 1 1 1 1 1 1 1 1 1 1 1
#> 10 10 1 1 1 1 0 0 1 1 1 1 1
#> # ℹ 2,912 more rows
#> # ℹ 17 more variables: E12 <int>, E13 <int>, E14 <int>, E15 <int>, E16 <int>,
#> # E17 <int>, E18 <int>, E19 <int>, E20 <int>, E21 <int>, E22 <int>,
#> # E23 <int>, E24 <int>, E25 <int>, E26 <int>, E27 <int>, E28 <int>
ecpe_qmatrix
#> # A tibble: 28 × 4
#> item_id morphosyntactic cohesive lexical
#> <chr> <int> <int> <int>
#> 1 E1 1 1 0
#> 2 E2 0 1 0
#> 3 E3 1 0 1
#> 4 E4 0 0 1
#> 5 E5 0 0 1
#> 6 E6 0 0 1
#> 7 E7 1 0 1
#> 8 E8 0 1 0
#> 9 E9 0 0 1
#> 10 E10 1 0 0
#> # ℹ 18 more rows
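To make the linear hierarchy concrete, the sketch below enumerates every possible proficiency profile for the three ECPE attributes and keeps only those consistent with the ordering lexical, then cohesive, then morphosyntactic. It uses base R only and is an illustration of the idea, not how any particular DCM package specifies a structural model; `profiles` and `allowed` are names introduced here.

```r
# All 2^3 = 8 possible profiles for the three ECPE attributes
profiles <- expand.grid(
  lexical = 0:1,
  cohesive = 0:1,
  morphosyntactic = 0:1
)

# Linear hierarchy: lexical -> cohesive -> morphosyntactic.
# A later attribute can be mastered only if the earlier one is.
allowed <- with(profiles, cohesive <= lexical & morphosyntactic <= cohesive)

# Profiles that respect the hierarchy
profiles[allowed, ]
```

Under a strict linear hierarchy, only 4 of the 8 profiles remain, which is why the choice of structural model can matter for these data.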
Finally, one simulated data set is included. This data set is based on the diagnosing teachers’ multiplicative reasoning (DTMR) data presented by Bradshaw et al. (2014), and consists of 990 responses to 27 items that collectively measure 4 attributes. Consistent with the results presented by Bradshaw et al. (2014), the data was generated using the loglinear cognitive diagnostic model (LCDM; Henson et al., 2009; Henson & Templin, 2019), and item and attribute names are included as reported in their Table 1.
dtmr_data
#> # A tibble: 990 × 28
#> id `1` `2` `3` `4` `5` `6` `7` `8a` `8b` `8c` `8d` `9`
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 0008… 1 1 0 1 0 0 1 1 0 1 1 0
#> 2 0009… 0 1 0 0 0 0 0 1 1 1 0 1
#> 3 0024… 0 1 0 0 0 0 1 1 1 1 0 0
#> 4 0031… 0 1 0 0 1 0 1 1 1 0 0 0
#> 5 0061… 0 1 1 0 0 0 0 0 0 1 0 0
#> 6 0087… 0 1 1 1 0 0 0 1 1 1 1 0
#> 7 0092… 0 1 1 1 1 0 0 1 1 1 0 0
#> 8 0097… 0 0 0 1 0 0 0 1 0 1 0 0
#> 9 0111… 0 1 1 0 0 0 0 1 0 1 1 0
#> 10 0121… 0 1 0 0 0 0 0 1 1 1 1 0
#> # ℹ 980 more rows
#> # ℹ 15 more variables: `10a` <int>, `10b` <int>, `10c` <int>, `11` <int>,
#> # `12` <int>, `13` <int>, `14` <int>, `15a` <int>, `15b` <int>, `15c` <int>,
#> # `16` <int>, `17` <int>, `18` <int>, `21` <int>, `22` <int>
dtmr_qmatrix
#> # A tibble: 27 × 5
#> item referent_units partitioning_iterating appropriateness
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 0 0
#> 2 2 0 0 1
#> 3 3 0 1 0
#> 4 4 1 0 0
#> 5 5 1 0 0
#> 6 6 0 1 0
#> 7 7 1 0 0
#> 8 8a 0 0 1
#> 9 8b 0 0 1
#> 10 8c 0 0 1
#> # ℹ 17 more rows
#> # ℹ 1 more variable: multiplicative_comparison <dbl>
Because the data is simulated, we have access to the “true” values for respondents and items. The class probabilities of respondents belonging to a given proficiency pattern are reported in Izsák et al. (2019) and are stored in dtmr_true_structural. Using these probabilities, we generated a class for each of the 990 respondents, which is available in dtmr_true_profiles.
dtmr_true_structural
#> # A tibble: 16 × 5
#> referent_units partitioning_iterating appropriateness multiplicative_compar…¹
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0
#> 2 1 0 0 0
#> 3 0 1 0 0
#> 4 0 0 1 0
#> 5 0 0 0 1
#> 6 1 1 0 0
#> 7 1 0 1 0
#> 8 1 0 0 1
#> 9 0 1 1 0
#> 10 0 1 0 1
#> 11 0 0 1 1
#> 12 1 1 1 0
#> 13 1 1 0 1
#> 14 1 0 1 1
#> 15 0 1 1 1
#> 16 1 1 1 1
#> # ℹ abbreviated name: ¹multiplicative_comparison
#> # ℹ 1 more variable: class_probability <dbl>
dtmr_true_profiles
#> # A tibble: 990 × 5
#> id referent_units partitioning_iterating appropriateness
#> <fct> <dbl> <dbl> <dbl>
#> 1 039517 0 0 0
#> 2 500170 0 0 0
#> 3 104795 1 1 1
#> 4 113558 1 1 1
#> 5 564266 0 1 1
#> 6 039726 1 1 1
#> 7 075968 0 1 1
#> 8 375846 0 0 0
#> 9 032129 0 0 0
#> 10 138501 0 1 0
#> # ℹ 980 more rows
#> # ℹ 1 more variable: multiplicative_comparison <dbl>
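The step of drawing a class for each respondent can be sketched with base R's `sample()`. This is only an illustration of the idea, assuming dcmdata is installed and that `dtmr_true_structural` has the `class_probability` column shown above. The seed is arbitrary, so the result will not reproduce `dtmr_true_profiles`, and `class_index`, `attribute_cols`, and `sim_profiles` are hypothetical names.

```r
library(dcmdata)

set.seed(123)  # arbitrary seed; will NOT reproduce dtmr_true_profiles

# Draw one of the 16 classes for each of 990 simulated respondents,
# weighting each class by its class probability
class_index <- sample(
  nrow(dtmr_true_structural),
  size = 990,
  replace = TRUE,
  prob = dtmr_true_structural$class_probability
)

# Look up the attribute pattern for each sampled class
attribute_cols <- setdiff(names(dtmr_true_structural), "class_probability")
sim_profiles <- dtmr_true_structural[class_index, attribute_cols]
sim_profiles
```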
The item parameters for the LCDM are reported in Table 1 of Bradshaw et al. (2014). Using the item parameters and the true profiles, we can calculate the probability that each simulated respondent provides a correct response to each item. These probabilities are then used to generate the simulated item responses in dtmr_data.
dtmr_true_items
#> # A tibble: 27 × 7
#> item intercept referent_units partitioning_iterating appropriateness
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 -1.12 2.24 NA NA
#> 2 2 0.59 NA NA 1.27
#> 3 3 -2.07 NA 1.7 NA
#> 4 4 -1.19 0.65 NA NA
#> 5 5 -1.67 1.52 NA NA
#> 6 6 -3.81 NA 2.08 NA
#> 7 7 -0.73 1.2 NA NA
#> 8 8a -0.62 NA NA 4.25
#> 9 8b -0.09 NA NA 2.16
#> 10 8c 0.28 NA NA 0.87
#> # ℹ 17 more rows
#> # ℹ 2 more variables: multiplicative_comparison <dbl>,
#> # referent_units__partitioning_iterating <dbl>
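As a rough illustration of how item parameters and profiles combine, the sketch below works through item 1, which measures only referent_units. For a single-attribute item, the LCDM puts the log-odds of a correct response equal to the intercept plus the main effect times the attribute indicator. The parameter values are copied from the first row of the output above; the respondent vector and the seed are hypothetical, and this is not necessarily the exact code used to build `dtmr_data`.

```r
# Item 1 parameters, taken from the first row of dtmr_true_items
intercept <- -1.12
main_effect <- 2.24

# Hypothetical respondents: 1 = proficient on referent_units, 0 = not
referent_units <- c(0, 1, 1, 0, 1)

# LCDM for a single-attribute item:
# logit(P(correct)) = intercept + main_effect * attribute
prob_correct <- plogis(intercept + main_effect * referent_units)
prob_correct

# Simulate scored responses (0/1) from those probabilities
set.seed(2025)  # arbitrary seed for reproducibility of this sketch
responses <- rbinom(length(prob_correct), size = 1, prob = prob_correct)
responses
```

Items that measure two attributes would add a second main effect and an interaction term to the log-odds, such as the `referent_units__partitioning_iterating` column in the output above.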
For a complete description of how the data was simulated, see ?dtmr.
Future work
Future work will focus on providing tools for simulating data from a variety of DCMs. The goal is to provide a tool that will make it easier to quickly simulate a large number of data sets, as is often required for simulation studies.
We also plan to continue adding more real data sets (e.g., from the Item Response Warehouse). If you know of any data sets that would be a good fit, or have a data set you’d like to contribute yourself, please open an issue!
Acknowledgments
The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305D210045 and R305D240032 to the University of Kansas Center for Research, Inc., ATLAS. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
Featured photo by Scott Graham on Unsplash.
References
Citation
@online{thompson2025,
author = {Thompson, W. Jake},
title = {Dcmdata 0.1.0},
date = {2025-08-21},
url = {https://r-dcm.org/blog/2025-08-dcmdata-0.1.0/},
doi = {10.59350/gmy4x-20093},
langid = {en}
}