Guide to Statistics and Methods
Surgical Education Research
January 3, 2024
Practical Guide to Assessment Tool Development for Surgical Education Research
Mohsen M. Shabahang, Todd A. Schwartz, Liane S. Feldman
JAMA Surg. Published online January 3, 2024. doi:10.1001/jamasurg.2023.6696
Introduction
Rigorous assessment tools are a crucial part of the learning process.1,2 Assessment is the process of collecting evidence to evaluate a learner’s performance and establish their level of competency. An assessment tool needs to specify the context and conditions of assessment, the tasks to be administered, and the criteria used to judge performance. A competency-based education framework requires assessment methods that can determine residents’ acquisition and demonstration of specific competencies.3,4 Competency-based education therefore places new emphasis on rigorous assessment tools and on establishing their validity. These assessments can be formative and used for feedback, or summative and used for high-stakes evaluation (Box). Surgical training has long relied on such tools; examples include the Fundamentals of Laparoscopic Surgery and the Fundamentals of Endoscopic Surgery.
Box.
Summary
- Learner assessment is essential to establish competency.
- Validity of an instrument was previously established in a binary fashion; it is now framed as a hypothesis about the instrument, with evidence gathered to support or refute it. The process is not a finite one.
- The Messick and Kane frameworks both provide structured ways to gather validity evidence.
- Assessment tools can combine quantitative and qualitative elements.
- As we move toward competency-based education, psychometrically sound learner assessments take on a new level of significance.
- The psychometric rigor of the assessment tool is essential, and it is advisable to use the services of a consulting expert.
Using the Methodology
Establishing the validity of an assessment tool is an involved process of gathering evidence and applying it to the tool’s use in the context in which it is deployed. The purpose is to collect evidence establishing the appropriateness of the interpretations and uses of its scores.5 Validity evidence provides information not only about the trustworthiness of the tool but also about its feasibility, cost, and practicality. In most cases, there is a checklist, completed by the assessor, that measures competency based on the elements of the task; this may be accompanied or enhanced by a global rating (a minimal sketch of such a record follows).
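As a minimal illustration of such a record, in Python and with a hypothetical checklist and rating scale (nothing here is prescribed by the frameworks discussed in this article), one observed performance might be captured as follows:

```python
from dataclasses import dataclass

# Hypothetical structure for one observed performance: binary checklist
# items covering the elements of the task, plus an overall global rating.
@dataclass
class AssessmentRecord:
    examinee_id: str
    checklist: dict[str, bool]   # task element -> performed correctly?
    global_rating: int           # e.g., 1 (fail) to 5 (excellent)

    def checklist_score(self) -> float:
        """Percentage of checklist items performed correctly."""
        return 100.0 * sum(self.checklist.values()) / len(self.checklist)

record = AssessmentRecord(
    examinee_id="R01",
    checklist={"port placement": True, "dissection": True, "hemostasis": False},
    global_rating=3,
)
print(f"{record.checklist_score():.0f}% of items completed")  # 67%
```

Scoring conventions (binary vs weighted items, the anchors of the rating scale) are design choices that themselves require validity evidence.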
It is also noteworthy that the validity of an assessment tool can vary among different groups of learners. This is one reason validation is seen not as a binary outcome (ie, the assessment tool is either valid or not) but as a process and pathway of gathering evidence regarding all of these factors. In fact, validation of assessment tools should be viewed as a repeating loop that may need to be revisited periodically. In the past, overall validity was addressed separately through content validity (ie, the test is relevant to and representative of the domains of the task), criterion validity (ie, the test score correlates with a standard score), and construct validity (ie, the test score correlates with another measure of the same construct). The more contemporary view of validation is based on the framework proposed by Messick, as described by Cook and Hatala.5 Under this view, the claim that a tool measures a task is treated as a hypothesis, and validation is the process of gathering evidence to support or refute that hypothesis. The evidence is divided into 5 sources (Table). The Standards for Educational and Psychological Testing support this focus on sources of evidence rather than types of validity; the sources have some commonality with the prior framework. Notably, validity evidence should show not only that the assessment measures what it is supposed to measure but also that it does not measure what it is not supposed to measure.
The framework for validity proposed by Kane complements that of Messick and is based on 4 premises.5 These differ from the sources of evidence described earlier and comprise scoring, generalization, extrapolation, and implication:
- Scoring: the score or written narrative from the observation captures key aspects of the performance.
- Generalization: the total score reflects performance across the test domain (internal consistency, interrater reliability).
- Extrapolation: the test score reflects meaningful performance in a real-life setting (factor analysis, correlation with tests measuring similar constructs); a sketch of such evidence follows this list.
- Implication: the measurement provides a rational basis for meaningful decisions.
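As a minimal sketch of gathering one kind of extrapolation evidence, the correlation between scores on a new tool and scores on an established instrument measuring a similar construct, using hypothetical data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for 10 residents: the new tool vs an established
# instrument measuring a similar construct (extrapolation evidence).
new_tool = np.array([62, 71, 55, 88, 74, 69, 91, 58, 77, 83])
established = np.array([60, 75, 50, 90, 70, 72, 94, 55, 80, 85])

r, p = pearsonr(new_tool, established)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```

A strong correlation supports, but does not by itself establish, the extrapolation premise; the other premises still require their own evidence.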
We have thus far emphasized some of the evolution in validity science. Other considerations regarding assessment tools in surgical education include the following. A balance of quantitative and qualitative elements can be used: checklists, global ratings, and subjective comments are all methods of assessment, and each provides a different kind of value in measuring the desired performance domain. Another consideration is the establishment of a cutoff score in a quantitative assessment. The cutoff score has implications for remediation and may be essential in a summative assessment, especially if the outcome is binary (ie, pass or fail). In addition, any alteration of an assessment tool must be followed by revalidation. Finally, in surgical education, assessment tools have been used mainly to evaluate technical skills (eg, the Objective Structured Assessment of Technical Skills) but have also been used to evaluate nontechnical skills; for example, the Non-Technical Skills for Surgeons system and the NOTECHS system evaluate learners’ nontechnical skills during trauma evaluation.
Statistical Considerations
In designing an assessment tool, reporting the steps undertaken in the validation process is of great importance. Herein, we have described the process of validation and gathering validity evidence.6 Some of the statistical methods that can contribute to the validation process include the following:
- Borderline regression or logistic regression (via receiver operating characteristic curves) can be used to establish cutoff scores. The borderline regression method fits a linear regression of examinees’ checklist scores on their global ratings and takes the predicted score at the borderline rating as the cutoff above which performance is considered acceptable (see the first sketch after this list).
- Internal consistency reliability refers to the consistency with which different items in an assessment measure the same construct. It is usually quantified with the Cronbach α (second sketch after this list).
- Intraclass correlation can be used to measure interrater reliability, that is, the consistency between scores given by different raters using the same assessment tool (third sketch after this list).
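The following is a minimal sketch of the borderline regression method, assuming hypothetical checklist scores (0-100) and a hypothetical 3-point global rating in which 2 denotes borderline performance:

```python
import numpy as np

# Hypothetical data: checklist scores (0-100) and global ratings
# (1 = fail, 2 = borderline, 3 = pass) for 12 examinees.
scores = np.array([45, 52, 58, 60, 63, 66, 70, 74, 78, 82, 88, 93])
ratings = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])

# Fit checklist score as a linear function of the global rating, then
# take the predicted score at the borderline rating (2) as the cutoff.
slope, intercept = np.polyfit(ratings, scores, deg=1)
cutoff = slope * 2 + intercept
print(f"Borderline regression cutoff: {cutoff:.1f}")
```

In practice the regression is fit on the full examinee cohort, and the resulting cutoff is reported as part of the validation evidence.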
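The Cronbach α can be computed directly from an examinees × items score matrix; a minimal sketch with hypothetical data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach alpha for an examinees x items score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical matrix: 6 examinees x 4 checklist items scored 1-5.
x = np.array([[4, 5, 4, 4],
              [3, 3, 2, 3],
              [5, 5, 5, 4],
              [2, 2, 3, 2],
              [4, 4, 4, 5],
              [3, 2, 3, 3]])
print(f"Cronbach alpha = {cronbach_alpha(x):.2f}")
```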
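Interrater reliability can be summarized with an intraclass correlation. The sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater), from a hypothetical examinees × raters matrix; packaged implementations (eg, pingouin.intraclass_corr in Python) report the full family of ICC forms.

```python
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x is a subjects x raters matrix of scores.
    """
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)  # between subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)  # between raters
    sse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                       # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical data: 5 examinees scored by 3 raters with the same tool.
scores = np.array([[70, 72, 68],
                   [55, 60, 58],
                   [88, 85, 90],
                   [62, 65, 60],
                   [78, 80, 76]])
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```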
Where to Find More Information
Additional information on assessment tool development can be found in articles by Hamstra and Yamazaki7 and Cook et al,8 as well as in resources from the American Educational Research Association, such as Standards for Educational and Psychological Testing.9