dcmdata 0.1.0 | r-dcm

We are so happy to announce the release of a new package, dcmdata. The goal of dcmdata is to provide easy access to data sets that can be used for demonstrating and testing diagnostic classification models (DCMs; also called cognitive diagnostic models [CDMs]).

You can install dcmdata from CRAN with:

install.packages("dcmdata")

This blog post will highlight the major features and plans for future development.

library(dcmdata)

Data sets

dcmdata contains both real and simulated data sets. Every data set includes both response data and a Q-matrix. The real data sets are the MacReady and Dayton (1977) multiplication data (MDM) and the Examination for the Certificate of Proficiency in English (ECPE), as described by Templin & Hoffman (2013).

The MDM data are a small data set of four items that measure a single attribute, multiplication. As such, these data are useful when you need a fairly short estimation time. For example, they could be used to quickly iterate while testing model code, or in training workshops where time is limited.

mdm_data
#> # A tibble: 142 × 5
#>    respondent  mdm1  mdm2  mdm3  mdm4
#>    <fct>      <int> <int> <int> <int>
#>  1 m8qre          1     1     1     1
#>  2 8wMPc          1     1     1     1
#>  3 xdbT8          1     1     1     1
#>  4 Ee9ob          1     1     1     1
#>  5 0tyTA          1     1     1     1
#>  6 L4bzq          1     1     1     1
#>  7 QTW1v          1     1     1     1
#>  8 w4NOH          1     1     1     1
#>  9 t9sIe          1     1     1     1
#> 10 FDa7I          1     1     1     1
#> # ℹ 132 more rows

mdm_qmatrix
#> # A tibble: 4 × 2
#>   item  multiplication
#>   <chr>          <int>
#> 1 mdm1               1
#> 2 mdm2               1
#> 3 mdm3               1
#> 4 mdm4               1

In contrast, the ECPE data are perhaps more representative of data you might gather in practice. These data consist of 28 items measuring 3 attributes, taken by 2,922 respondents. Because multiple attributes are measured, these data can be used to demonstrate how different attributes interact on a single item when using different compensatory and noncompensatory DCMs. Additionally, Templin & Bradshaw (2014) demonstrated that the three attributes follow a linear hierarchy, such that respondents typically demonstrate proficiency on lexical, cohesive, and morphosyntactic rules, in that order. That is, the earlier skills represent precursor knowledge necessary for proficiency on the later skills. The hierarchy makes the ECPE data excellent for demonstrating the effect of different structural model specifications in a DCM.

ecpe_data
#> # A tibble: 2,922 × 29
#>    resp_id    E1    E2    E3    E4    E5    E6    E7    E8    E9   E10   E11
#>      <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1       1     1     1     1     0     1     1     1     1     1     1     1
#>  2       2     1     1     1     1     1     1     1     1     1     1     1
#>  3       3     1     1     1     1     1     1     0     1     1     1     1
#>  4       4     1     1     1     1     1     1     1     1     1     1     1
#>  5       5     1     1     1     1     1     1     1     1     1     1     1
#>  6       6     1     1     1     1     1     1     1     1     1     1     1
#>  7       7     1     1     1     1     1     1     1     1     1     1     1
#>  8       8     0     1     1     1     1     1     0     1     1     1     0
#>  9       9     1     1     1     1     1     1     1     1     1     1     1
#> 10      10     1     1     1     1     0     0     1     1     1     1     1
#> # ℹ 2,912 more rows
#> # ℹ 17 more variables: E12 <int>, E13 <int>, E14 <int>, E15 <int>, E16 <int>,
#> #   E17 <int>, E18 <int>, E19 <int>, E20 <int>, E21 <int>, E22 <int>,
#> #   E23 <int>, E24 <int>, E25 <int>, E26 <int>, E27 <int>, E28 <int>
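To make the linear hierarchy concrete, here is a minimal base R sketch (it does not use dcmdata itself) that enumerates all eight possible profiles for the three ECPE attributes and keeps only those consistent with the ordering lexical, then cohesive, then morphosyntactic. The object names `profiles`, `permitted`, and `hierarchy_profiles` are made up for this illustration.

```r
# Enumerate all 2^3 attribute profiles for the three ECPE attributes.
profiles <- expand.grid(
  lexical = 0:1,
  cohesive = 0:1,
  morphosyntactic = 0:1
)

# Under a linear hierarchy (lexical -> cohesive -> morphosyntactic), a later
# attribute can only be mastered if every earlier attribute is mastered too.
permitted <- with(profiles, lexical >= cohesive & cohesive >= morphosyntactic)

# Keep only the profiles that respect the hierarchy.
hierarchy_profiles <- profiles[permitted, ]
hierarchy_profiles
```

Only four of the eight profiles survive, which is why a hierarchical structural model needs far fewer class parameters than an unconstrained one.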

ecpe_qmatrix
#> # A tibble: 28 × 4
#>    item_id morphosyntactic cohesive lexical
#>    <chr>             <int>    <int>   <int>
#>  1 E1                    1        1       0
#>  2 E2                    0        1       0
#>  3 E3                    1        0       1
#>  4 E4                    0        0       1
#>  5 E5                    0        0       1
#>  6 E6                    0        0       1
#>  7 E7                    1        0       1
#>  8 E8                    0        1       0
#>  9 E9                    0        0       1
#> 10 E10                   1        0       0
#> # ℹ 18 more rows
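As a rough illustration of how attributes can interact on a single item, the base R sketch below compares a noncompensatory rule (DINA, where all measured attributes are required) with a rule often described as compensatory (DINO, where any one measured attribute suffices) for item E1, which the Q-matrix above shows as measuring morphosyntactic and cohesive rules. The `guess` and `slip` values are hypothetical and chosen only for illustration; they are not estimates from the ECPE data.

```r
# Q-matrix row for item E1, copied from the ecpe_qmatrix output above.
q_e1 <- c(morphosyntactic = 1, cohesive = 1, lexical = 0)

# Hypothetical item parameters, chosen only for illustration.
guess <- 0.2  # P(correct) without the required attribute(s)
slip  <- 0.1  # P(incorrect) despite having the required attribute(s)

# All profiles on the two attributes E1 measures (lexical held at 0).
profiles <- expand.grid(morphosyntactic = 0:1, cohesive = 0:1, lexical = 0)

# DINA (noncompensatory): all measured attributes are required.
has_all <- apply(profiles, 1, function(alpha) all(alpha[q_e1 == 1] == 1))
# DINO (compensatory): any one measured attribute is sufficient.
has_any <- apply(profiles, 1, function(alpha) any(alpha[q_e1 == 1] == 1))

# Map the latent response indicator to a probability of a correct response.
item_prob <- function(eta) ifelse(eta, 1 - slip, guess)

e1_probs <- cbind(profiles, dina = item_prob(has_all), dino = item_prob(has_any))
e1_probs
```

The two rules agree for respondents with neither or both attributes and differ only for respondents who have exactly one of the two.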

Finally, one simulated data set is included. This data set is based on the Diagnosing Teachers’ Multiplicative Reasoning (DTMR) data presented by Bradshaw et al. (2014), and consists of responses from 990 simulated respondents to 27 items that collectively measure 4 attributes. Consistent with the results presented by Bradshaw et al. (2014), the data were generated using the loglinear cognitive diagnostic model (LCDM; Henson et al., 2009; Henson & Templin, 2019), and item and attribute names are included as reported in their Table 1.

dtmr_data
#> # A tibble: 990 × 28
#>    id      `1`   `2`   `3`   `4`   `5`   `6`   `7`  `8a`  `8b`  `8c`  `8d`   `9`
#>    <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1 0008…     1     1     0     1     0     0     1     1     0     1     1     0
#>  2 0009…     0     1     0     0     0     0     0     1     1     1     0     1
#>  3 0024…     0     1     0     0     0     0     1     1     1     1     0     0
#>  4 0031…     0     1     0     0     1     0     1     1     1     0     0     0
#>  5 0061…     0     1     1     0     0     0     0     0     0     1     0     0
#>  6 0087…     0     1     1     1     0     0     0     1     1     1     1     0
#>  7 0092…     0     1     1     1     1     0     0     1     1     1     0     0
#>  8 0097…     0     0     0     1     0     0     0     1     0     1     0     0
#>  9 0111…     0     1     1     0     0     0     0     1     0     1     1     0
#> 10 0121…     0     1     0     0     0     0     0     1     1     1     1     0
#> # ℹ 980 more rows
#> # ℹ 15 more variables: `10a` <int>, `10b` <int>, `10c` <int>, `11` <int>,
#> #   `12` <int>, `13` <int>, `14` <int>, `15a` <int>, `15b` <int>, `15c` <int>,
#> #   `16` <int>, `17` <int>, `18` <int>, `21` <int>, `22` <int>

dtmr_qmatrix
#> # A tibble: 27 × 5
#>    item  referent_units partitioning_iterating appropriateness
#>    <chr>          <dbl>                  <dbl>           <dbl>
#>  1 1                  1                      0               0
#>  2 2                  0                      0               1
#>  3 3                  0                      1               0
#>  4 4                  1                      0               0
#>  5 5                  1                      0               0
#>  6 6                  0                      1               0
#>  7 7                  1                      0               0
#>  8 8a                 0                      0               1
#>  9 8b                 0                      0               1
#> 10 8c                 0                      0               1
#> # ℹ 17 more rows
#> # ℹ 1 more variable: multiplicative_comparison <dbl>

Because the data are simulated, we have access to the “true” values for respondents and items. The class probabilities of respondents belonging to a given proficiency pattern are reported in Izsák et al. (2019) and are stored in the class_probability column of dtmr_true_structural. Using these probabilities, we generated a class for each of the 990 respondents, which is available in dtmr_true_profiles.

dtmr_true_structural
#> # A tibble: 16 × 5
#>    referent_units partitioning_iterating appropriateness multiplicative_compar…¹
#>             <dbl>                  <dbl>           <dbl>                   <dbl>
#>  1              0                      0               0                       0
#>  2              1                      0               0                       0
#>  3              0                      1               0                       0
#>  4              0                      0               1                       0
#>  5              0                      0               0                       1
#>  6              1                      1               0                       0
#>  7              1                      0               1                       0
#>  8              1                      0               0                       1
#>  9              0                      1               1                       0
#> 10              0                      1               0                       1
#> 11              0                      0               1                       1
#> 12              1                      1               1                       0
#> 13              1                      1               0                       1
#> 14              1                      0               1                       1
#> 15              0                      1               1                       1
#> 16              1                      1               1                       1
#> # ℹ abbreviated name: ¹multiplicative_comparison
#> # ℹ 1 more variable: class_probability <dbl>

dtmr_true_profiles
#> # A tibble: 990 × 5
#>    id     referent_units partitioning_iterating appropriateness
#>    <fct>           <dbl>                  <dbl>           <dbl>
#>  1 039517              0                      0               0
#>  2 500170              0                      0               0
#>  3 104795              1                      1               1
#>  4 113558              1                      1               1
#>  5 564266              0                      1               1
#>  6 039726              1                      1               1
#>  7 075968              0                      1               1
#>  8 375846              0                      0               0
#>  9 032129              0                      0               0
#> 10 138501              0                      1               0
#> # ℹ 980 more rows
#> # ℹ 1 more variable: multiplicative_comparison <dbl>
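The step of generating a class for each respondent from the class probabilities could look roughly like the base R sketch below. This is not the package's actual simulation code (see ?dtmr for that). The uniform `class_probability` values are placeholders, not the probabilities reported by Izsák et al. (2019), the seed is arbitrary, and the object names `structural`, `class_id`, and `true_profiles` are made up for the example.

```r
set.seed(1234)  # arbitrary seed so the sketch is reproducible

# All 2^4 = 16 possible profiles for the four DTMR attributes.
structural <- expand.grid(
  referent_units = 0:1,
  partitioning_iterating = 0:1,
  appropriateness = 0:1,
  multiplicative_comparison = 0:1
)

# Placeholder class probabilities (uniform). The values reported by
# Izsák et al. (2019) are NOT reproduced here.
structural$class_probability <- rep(1 / 16, 16)

# Draw one class per respondent, then look up that class's attribute pattern.
n_resp <- 990
class_id <- sample(nrow(structural), size = n_resp, replace = TRUE,
                   prob = structural$class_probability)
true_profiles <- structural[class_id, 1:4]
rownames(true_profiles) <- NULL
head(true_profiles)
```

Note that `expand.grid()` orders the 16 profiles differently from the dtmr_true_structural output above; the order does not matter for sampling as long as each probability stays attached to its profile.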

The item parameters for the LCDM are reported in Table 1 of Bradshaw et al. (2014). Using the item parameters and the true profiles, we can calculate the probability that each simulated respondent provides a correct response to each item. These probabilities are then used to generate the simulated item responses in dtmr_data.

dtmr_true_items
#> # A tibble: 27 × 7
#>    item  intercept referent_units partitioning_iterating appropriateness
#>    <chr>     <dbl>          <dbl>                  <dbl>           <dbl>
#>  1 1         -1.12           2.24                  NA              NA   
#>  2 2          0.59          NA                     NA               1.27
#>  3 3         -2.07          NA                      1.7            NA   
#>  4 4         -1.19           0.65                  NA              NA   
#>  5 5         -1.67           1.52                  NA              NA   
#>  6 6         -3.81          NA                      2.08           NA   
#>  7 7         -0.73           1.2                   NA              NA   
#>  8 8a        -0.62          NA                     NA               4.25
#>  9 8b        -0.09          NA                     NA               2.16
#> 10 8c         0.28          NA                     NA               0.87
#> # ℹ 17 more rows
#> # ℹ 2 more variables: multiplicative_comparison <dbl>,
#> #   referent_units__partitioning_iterating <dbl>
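To illustrate how item parameters and true profiles combine into response probabilities, here is a minimal base R sketch for item 1 using the intercept and referent_units main effect printed above. It assumes item 1 measures only referent_units (the multiplicative_comparison column is truncated in the printed output), and the five respondents in `referent_units` are hypothetical. The seed is arbitrary, and this is a sketch of the general LCDM logic rather than the package's own simulation code.

```r
set.seed(2025)  # arbitrary seed for reproducibility

# Item 1 parameters, copied from the dtmr_true_items output above.
intercept   <- -1.12
main_effect <- 2.24  # main effect for referent_units

# Hypothetical respondents: referent_units mastery (1) or non-mastery (0).
referent_units <- c(0, 1, 1, 0, 1)

# LCDM for a single-attribute item:
# logit(P(correct)) = intercept + main_effect * alpha
p_correct <- plogis(intercept + main_effect * referent_units)

# Simulate scored responses as Bernoulli draws with those probabilities.
responses <- rbinom(length(p_correct), size = 1, prob = p_correct)

round(unique(p_correct), 3)
```

Under these parameters, a respondent who has not mastered referent_units answers item 1 correctly with probability of roughly 0.25, compared with roughly 0.75 for a respondent who has.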

For a complete description of how the data was simulated, see ?dtmr.

Future work

Future work will focus on providing tools for simulating data from a variety of DCMs. The goal is to make it easier to quickly simulate a large number of data sets, as is often required for simulation studies.

We also plan to continue adding more real data sets (e.g., from the Item Response Warehouse). If you know of any data sets that would be a good fit, or have a data set you’d like to contribute yourself, please open an issue!

Acknowledgments

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305D210045 and R305D240032 to the University of Kansas Center for Research, Inc., ATLAS. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.

Featured photo by Scott Graham on Unsplash.

References

Bradshaw, L., Izsák, A., Templin, J., & Jacobson, E. (2014). Diagnosing teachers’ understandings of rational numbers: Building a multidimensional test within the diagnostic classification framework. Educational Measurement: Issues and Practice, 33(1), 2–14. https://doi.org/10.1111/emip.12020

Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74(2), 191–210. https://doi.org/10.1007/s11336-008-9089-5

Henson, R., & Templin, J. L. (2019). Loglinear cognitive diagnostic model (LCDM). In M. von Davier & Y.-S. Lee (Eds.), Handbook of diagnostic classification models (pp. 171–185). Springer International Publishing. https://doi.org/10.1007/978-3-030-05584-4_8

Izsák, A., Jacobson, E., & Bradshaw, L. (2019). Surveying middle-grades teachers’ reasoning about fraction arithmetic in terms of measured quantities. Journal for Research in Mathematics Education, 50(2), 156–209. https://doi.org/10.5951/jresematheduc.50.2.0156

MacReady, G. B., & Dayton, C. M. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational Statistics, 2(2), 99–120. https://doi.org/10.2307/1164802

Templin, J., & Bradshaw, L. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317–339. https://doi.org/10.1007/s11336-013-9362-0

Templin, J., & Hoffman, L. (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement: Issues and Practice, 32(2), 37–50. https://doi.org/10.1111/emip.12010

Citation

BibTeX citation:

@online{thompson2025,
  author = {Thompson, W. Jake},
  title = {Dcmdata 0.1.0},
  date = {2025-08-21},
  url = {https://r-dcm.org/blog/2025-08-dcmdata-0.1.0/},
  doi = {10.59350/gmy4x-20093},
  langid = {en}
}

For attribution, please cite this work as:

Thompson, W. J. (2025, August 21). dcmdata 0.1.0. https://doi.org/10.59350/gmy4x-20093