How to Develop a TPP Evaluation
Purposes for TPP Evaluations
TPP evaluations serve three basic purposes and the policy challenge is to select the system or approach that is best suited for a defined purpose.
- Ensuring accountability, which involves monitoring program quality and providing reliable information to the general public and policy makers.
- Providing information for consumers, which includes giving prospective teachers data that can help them make good choices from among the broad array of preparation programs, and giving future employers of TPP graduates information to help with hiring decisions.
- Enabling self-improvement by teacher preparation programs, which entails providing institutions with information to help them understand the strengths and weaknesses of their existing programs and to use this information to make improvements.
Guiding Questions for Developing a TPP Evaluation System
A rational approach to designing TPP evaluations is to consider their likely direct and indirect, and positive and negative, impacts on teacher education and the broader educational system. Asking and attempting to answer the following questions in the early stages of evaluation design can increase the likelihood that an evaluation system will be coherent, serve its intended purposes, and lead to valid inferences about TPP quality.
Question 1: What is the primary purpose of the TPP evaluation system?
The TPP evaluation design process should begin with a clear statement about intent: what does the system aim to accomplish? Clearly, evaluation systems will often serve more than one purpose. But designers should be able to articulate the primary purpose, and then perhaps one or two secondary purposes. Is the primary goal accountability? (If so, to whom or what authority is the TPP being held accountable?) Is the system intended primarily to provide consumers with information about the quality of individual TPPs? Or is the primary purpose to provide the program itself with information for self-improvement?
Once the central purpose is determined, can a more specific statement be made about what the system is intended to accomplish? For instance, a federal evaluation for accountability may aim specifically to accomplish its main purpose through public reporting of a large variety of state and national data about the quality of teacher preparation. A national accreditation system, also generally aimed at accountability, may more specifically be intended to spur reform in teacher education by implementing rigorous standards that TPPs must meet to earn accreditation.
Being explicit about the purpose of the evaluation is important for at least two reasons. First, the purpose will guide many of the subsequent design decisions. Second, it will be important to communicate the purpose of the evaluation to end users in order to guard against the tendency to use the evaluation results for purposes for which they were not intended and which may not be appropriate.
Question 2: Which aspects of teacher preparation matter the most?
Evaluators may be interested in attributes of teacher preparation that are not directly observable or readily amenable to direct measurement. These attributes might include the following:
- Qualifications of students admitted
- Quality and substance of the postsecondary instruction provided to TPP students by all faculty in the program
- Quality of student teaching experience
- Expertise of faculty
- Effectiveness in preparing new teachers who are employable and stay in the field
- Success in preparing teachers who are effective in the classroom
This is not an exhaustive list, and its elements may shift as curricular and instructional reforms are implemented. The point is that no single evaluation, given the reality of limited resources, will be sufficient to measure every aspect of teacher preparation. Choices will have to be made about which attributes are of most interest, based on the anticipated uses of the evaluation and the values of the organization doing the evaluating. Evaluators interested in using accountability to spur reform might focus on rigor in the substance of instruction, while those wanting to hold TPPs to a certain minimum standard might focus on course hour requirements and pass rates on licensure tests. Evaluators who are interested in accountability but do not want to prescribe the elements of a high-quality TPP may choose to focus on the extent to which TPPs produce graduates who are employable and demonstrate effectiveness in the classroom.
It is important to maintain a flexible and adaptive approach to evaluation, especially in an era of reforms motivated by changing conceptions about the most valued outcomes of education. Evaluators will face a familiar dilemma: while changing measures to align with new definitions of teaching quality is logical, it reduces the validity of the results as estimates of program improvement over time.
Question 3: What sources of evidence will provide the most accurate and useful information about the aspects of teacher preparation that are of primary interest?
TPP evaluation designers should examine the types of evidence available and decide which will give them the most useful information about the attributes of interest. Because any single type of evidence will give an incomplete picture of TPP quality and because each type of evidence has limitations, a good evaluation system will carefully select the combination of measures that makes the best use of available resources. Evaluators should carefully consider each type of evidence that might be used and what can be learned from that data, while attending to its advantages and disadvantages.
These are the most important considerations in addressing this question:
- How much effort is involved in collecting the data?
- Will double-checking for accuracy be feasible?
- How prone is the measure to gaming and corruption?
- Is there empirical evidence tying the measure to future teacher performance? Or can a rationale for using the measure be based on its face validity—that it subjectively appears to measure an important construct?
Given limited resources, investing in one measure will likely mean giving less attention to TPP attributes that are not included. The question then becomes whether, on balance, the types of evidence included will lead to the desired inferences about overall TPP quality. Review the page on sources of evidence for the strengths and limitations of the most commonly used measures of TPP quality.
Question 4: How will the measures be analyzed and combined to make a judgment about program quality?
Evidence or data do not automatically translate into evaluation results. Decisions must be made about how the data will be analyzed and interpreted. If admissions or licensure tests are used, evaluation designers will need to decide whether to employ average scores and/or pass rates. The latter implies the need to determine cut scores (passing scores), which is a somewhat complex technique in itself. (There is a body of literature from the field of testing about cut scores. See, e.g., Cizek and Bunch, 2007.) For other types of evidence, scoring rubrics (guidelines) will need to be developed. For instance, if course syllabi are collected to assess the substance of instruction, rubrics will be needed to specify the characteristics that must be evident in the syllabus to demonstrate that it meets a certain standard. Here, too, designers need to be aware of the subtleties and complications of establishing the rubrics. In any event, raters will need to be trained and monitored to ensure that they code documents reliably.
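For the test-score case, the difference between reporting an average score and a pass rate can be sketched in a few lines. The scores and the cut score below are invented purely for illustration:

```python
# Hypothetical licensure-test scores for one program's completers.
scores = [152, 168, 141, 175, 160]
CUT_SCORE = 150  # invented cut score; setting one is a substantial task in itself

# Average score: sensitive to how far above or below the cut each completer falls.
average = sum(scores) / len(scores)

# Pass rate: depends only on which side of the cut score each completer lands.
pass_rate = sum(s >= CUT_SCORE for s in scores) / len(scores)
```

Note that the two summaries can diverge: a program whose completers all score just above the cut can post a perfect pass rate with a modest average, which is one reason the choice between them matters.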
If the goal is to come up with a single indicator of TPP quality, as is often the case with evaluations for accountability or consumer information purposes, evaluation designers must make additional decisions about how to weight and combine the various sources of evidence. The single indicator of quality may be a pass/fail decision, a ranking of programs from highest to lowest quality, or some other sort of summary rating. Several questions should be considered. For example, will TPPs be required to meet a certain level on each measure (referred to as a conjunctive model)? Or will a high score on one measure be allowed to compensate for a low score on another (a compensatory model)? Does each piece of evidence carry the same weight, or is one measure more important than another? Or will the measures each be reported on separately, leaving it to users to make their own summary judgments?
In order to earn CAEP accreditation, for example, a TPP must demonstrate to the review team that it meets each of five major standards. Based on documentation and site visits, review teams rate the TPP in one of three levels on each standard: unacceptable, acceptable, or target (Council for the Accreditation of Educator Preparation, 2013b). TPPs must meet at least the acceptable level on each standard to earn accreditation. With its consumer-oriented rankings, NCTQ/U.S. News gives each TPP a score on each standard, while weighting some standards more heavily than others in computing the overall ratings (National Council on Teacher Quality, 2013). The overall score consists of a weighted sum of the component ratings; this is a compensatory model because a high score on one standard can help make up for a low score on another. In contrast, Ohio produces a set of “performance reports” on each of the state’s TPPs. The reports seek to give the public information about how well the state’s TPPs are operating. They report on a number of variables separately and intentionally avoid the assignment of an overall score or grade (Bloom, 2013).
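The difference between conjunctive and compensatory combination rules can be illustrated with a small sketch. The standard names, rating scale, weights, and minimum level below are all invented for the example, not drawn from any actual evaluation system:

```python
# Hypothetical ratings of one TPP on three invented standards (scale 1-3).
scores = {"selectivity": 3, "clinical_practice": 1, "content_knowledge": 3}
weights = {"selectivity": 0.25, "clinical_practice": 0.50, "content_knowledge": 0.25}
MINIMUM = 2  # invented minimum acceptable level

# Conjunctive model: the program must meet the minimum on every standard.
conjunctive_pass = all(score >= MINIMUM for score in scores.values())

# Compensatory model: a weighted sum, so strength on one standard
# can offset weakness on another.
composite = sum(weights[s] * scores[s] for s in scores)
compensatory_pass = composite >= MINIMUM
```

In this made-up case the program fails the conjunctive test (its clinical practice rating falls below the minimum) but passes the compensatory one, because high ratings elsewhere pull the weighted composite up to the threshold. The choice of rule is thus consequential, not merely technical.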
Some flexibility may need to be built into the analysis of data for the sake of equity. Ideally, evidence will be interpreted within the context of program participants, resources, and communities served by the TPPs. This may include, but not be limited to, demographics, ecological/environmental context, and policy climate. To yield an overall judgment about TPP quality, a compensatory model might give TPPs credit for seeking diversity in their candidate population or being located in a disadvantaged community; these can make up for lower scores on other indicators.
Question 5: What are the intended and potentially unintended consequences of the evaluation system for TPPs and education more broadly?
Consequences of evaluation should be determined with the overall goal of improving teacher preparation, rather than punishing or embarrassing low-performing programs. The results of a TPP evaluation aimed at program improvement might be shared and discussed only among internal users to enable them to identify steps for improvement. Systems aimed at producing consumer information will publicize the results more broadly. Evaluations for accountability may be publicized and may also carry higher stakes that could include directives, mandates, or even program closures. If the results trigger serious consequences, then ideally the initial evaluation should be followed up by a more in-depth one to ensure that the TPP was not wrongly identified as low performing. This is especially important when relying on measures like VAMs, which are a few steps removed from the actual training taking place in a TPP and have problems of measurement error.
Decision makers should also try to anticipate unintended negative consequences of the system. Is the evaluation likely to identify a disproportionate number of TPPs in disadvantaged communities as failing? If those TPPs are closed or sanctioned, what impact will that have on the production of minority teachers? And how will this closure affect the supply of teachers in the community where the TPP is located? Can decision makers avoid these negative consequences by thinking early in the process about how the results of an evaluation will be used? If, as we assume, the overarching goal is to improve the quality of teacher preparation, a first step could involve anticipating the likely need to allocate extra resources to TPPs that need them to make improvements.
Question 6: How will transparency be achieved? What steps will be taken to help users understand how to interpret the results and use them appropriately?
Transparency, or open communication, is crucial if users are to trust the results of an evaluation. Those who design and implement TPP evaluations have the responsibility to clearly communicate the purpose of the evaluation and the methods used to collect and analyze the data. It is also important to communicate appropriate interpretations of the results, along with the limitations on what one can infer from the data. One caution, for example, is that while an evaluation system may be adequate for approximating the general quality of an entire program, the result may not pertain to the quality of specific individual graduates. This is one example of what is known as classification error in measurement: good teachers may come from programs that are labeled as poor or substandard, and inferior teachers may come from programs that received an overall high rating. All information about the evaluation should be readily accessible, online or otherwise, and communicated in a way that users and the public can easily understand.
Transparency is especially important for technically sophisticated measures like VAMs. Research in the neurosciences and mathematics suggests that people tend to believe data they do not fully understand, in part because of the way the data are presented (Weisberg, Keil, Goodstein, Rawson, and Gray, 2008; Eriksson, 2012). Sperber (2010) calls this the “guru effect,” which occurs when readers judge technical information as profound without really understanding how it was derived. VAMs, like many contemporary measurement systems, rely on complex statistical models that lead to a heightened perception of their scientific accuracy. Admonitions from psychometricians, who know the most about the potential for error in these systems and who caution against their overuse, are often ignored or dismissed by policy and education communities eager to treat the quantitative data as scientifically valid and therefore trustworthy. Thus, evaluation designers must make special efforts to convey the limitations of VAM results in terms of the validity of interpretations that can be drawn from them.
But transparency is important with all types of measures, quantitative and qualitative, even those that seem more intuitively understandable. Users should be reminded, for instance, that syllabi may not reflect the actual content of instruction as delivered, that licensure tests are not designed to predict future teacher performance, and that hiring and placement results are something TPPs generally have little control over. Developing innovative and effective ways to promote transparency should become a research priority, as discussed below.
Question 7: How will the evaluation system be monitored?
One should not assume that an evaluation system is functioning as envisioned and producing the intended impacts on teacher preparation. Consequences of the system, both intended and unintended, should be studied. For the program improvement purpose of evaluation, for example, key issues are whether the evaluation promotes increased communication among faculty about how they can improve teacher training at their institution; whether evaluation results encourage specific actions to improve the program; the extent to which the evaluation creates incentives for opportunistic behavior that distort the meaning of the results; and whether different groups of teacher educators are affected differently and perhaps unfairly by the application of evaluation results.
In addition to monitoring consequences of the system, evaluation leaders should arrange for ongoing studies of the accuracy and reliability of the measures and analytic methods being used in the system. If documents are being coded, auditing studies can be conducted to check on rater agreement. To the extent possible, validity studies should be conducted to see if the ratings that result from the evaluation correlate with other, independent indicators of TPP quality. Are the results of the evaluation corroborated by other evidence not used in the evaluation? States that rely heavily on VAM results, for instance, might conduct surveys of graduates to see if their perceptions support the conclusions drawn from the VAMs about highest- and lowest-performing TPPs.
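One routine check in an auditing study is an inter-rater agreement statistic such as Cohen's kappa, which adjusts raw percent agreement for agreement expected by chance. The ratings below are invented to illustrate the computation; this is a sketch, not a prescribed auditing procedure:

```python
# Hypothetical judgments of the same six syllabi by two trained raters.
rater_a = ["meets", "meets", "fails", "meets", "fails", "meets"]
rater_b = ["meets", "fails", "fails", "meets", "fails", "meets"]

n = len(rater_a)

# Observed agreement: proportion of documents rated identically.
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: for each category, the product of the two raters'
# marginal proportions, summed over categories.
categories = set(rater_a) | set(rater_b)
expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)

# Cohen's kappa: agreement above chance, scaled to the maximum possible.
kappa = (observed - expected) / (1 - expected)
```

A kappa near 1 indicates the raters are applying the rubric consistently; values near 0 suggest the agreement observed is little better than chance and that retraining or rubric revision may be needed.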
Evaluation systems should be flexible and adaptable. Earlier we noted that changing standards for K-12 STEM education will require changes in TPPs, as they align their programs with the new expectations for teacher training and recruitment. Likewise, evaluations of TPPs will need to adapt to measure how well programs are meeting the new STEM goals, according to an appropriate timeline that allows TPPs adequate time to adjust.
Of course, there is a tension between adaptability and stability in an evaluation system. Keeping measures and analytic methods stable is important to allow results to be compared from one year to the next for purposes of tracking trends in teacher preparation. Thus, decisions will have to be made about whether a certain change to the system will have enough positive impact on teacher preparation to counterbalance some loss of comparability in the data.
Holding evaluation systems accountable is necessary for building trust in the communities most likely to use and be affected by their results (Feuer, 2012b). Ultimately, a major purpose of evaluation is to contribute to the improvement of student learning and other valued educational outcomes. For this goal to be advanced, designers and operators of teacher preparation program evaluations need to consider the extent to which they build or erode trust among the professionals who prepare future educators and among the participants in those programs.
Chief among the principles that should guide evaluation design is validity, i.e., the requirement that an evaluation system’s success in conveying defensible conclusions about a TPP should be the primary criterion for assessing its quality. Validity refers both to the quality of evidence and theory that supports the interpretation of evaluation results and to the effects of using the evaluation results; the consequences of evaluation matter.
Validity is defined in the literature of measurement and testing as “the extent to which evidence and theory support the interpretations of test scores” (Messick, 1989; American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999). There is a vast literature about the concept of test validity that goes back many decades (in addition to Messick, 1989, see, for example, Cronbach and Meehl, 1955; Shepard, 1993).
Evidence and Inferences
Evidence refers to the measures or data collected. For instance, the current federal TPP evaluation system emphasizes results on teacher certification tests, while the consumer information system emphasizes selectivity and academic content.
By inferences, we mean interpretations or findings based on the evidence. For example, in evaluations conducted to meet federal requirements, users of data on certification test pass rates may draw inferences about the degree to which TPPs prepare teacher candidates to pass the tests. Others may infer that pass rates are more of a reflection of the general ability of the students who entered the program.