Measuring Teachers' Instruction with Multilevel Item Response Theory

The purpose of this study was to describe an approach for measuring teachers' uses of instruction as it relates to students' achievement through classroom observations. Despite significant work on the substantive content of observation systems chronicling teachers' instruction, literature has largely relied on simple counts of instructional features or the average of quality indicators to describe teachers' instruction. However, such coarse summaries generally do not reflect current theories of instruction, prior empirical evidence, and the framework of most observation systems. The approach presented in this paper builds on evidence that teachers' instruction varies across lessons and that instructional features or quality indicators do not necessarily contribute equally to our understanding of effective instruction. To align theory, data and methods, this study applied multilevel item response theory to the study of early literacy instruction as it relates to students' achievement. This model provided a more complex, but more precise and theoretically grounded, view of instruction by linking components of instruction theory to model parameters. Empirical results suggested that multilevel item response models encouraged precision in the specification of theory, data collection, and models that is absent in simpler models.


ISSN 2278-7690
1038 | Page, March 20, 2015

Introduction

Valid measurement of teaching is essential to the advancement of quantitative research in education and to understanding teacher effectiveness (Raudenbush & Sadoff, 2008). The recent emphasis on teacher effectiveness by federal education initiatives in the United States (such as Race to the Top), district teacher evaluation and compensation programs, and funding agencies has underscored the centrality of teaching in advancing education (Measures of Effective Teaching, 2012). Although in principle valid assessment of the impact of teachers or educational interventions on students' cognitive development can be obtained without the direct measurement of teaching (e.g., value-added models), the results do little to explain the actions effective teachers take to produce student gains, thereby eliminating inroads to teacher improvement. The measurement of teaching addresses critical questions involving the mechanisms through which teachers' practice shapes students' development. It requires researchers to make their theories of teaching and interventions precise and helps them mount specific, focused empirical investigations of these theories in falsifiable ways (Raudenbush & Sadoff, 2008).
There is a long history of quantitative assessments of classroom processes; however, this line of inquiry has largely been atheoretical and has produced inconsistent relationships among teacher, teaching, and student variables (Hoffman, 1991). For instance, reviews of the literature on literacy instruction have suggested that research in this area has largely been based on theories of literacy (e.g., comprehension of texts) rather than theories of teaching literacy (Hoffman, 1991). Despite consensus that the nature and quality of teachers' instruction should affect their students' achievement, researchers are still far from understanding the characteristics of instruction that distinguish more and less effective teachers.
To this end, direct classroom observation has emerged as a key measurement strategy for understanding teacher effectiveness and has become a central feature in the advancement of teaching (Gitomer, 2009). Classroom observation, carried out for the purpose of studying instructional practices as they relate to students' achievement, offers a promising way to unpack the black box of teaching and arrive at a deeper understanding of effective practice. Recent studies have developed a number of classroom observational systems which measure teaching by recording or rating the quality of features or strategies thought to reflect critical dimensions of effective teaching (e.g., Pianta & Hamre, 2009). Targeted features are thought to be instances of underlying and unifying teacher quality constructs rather than single disjoint features.
Although much attention has been placed on the content development of classroom observation systems, the measurement of the data they produce has received far less attention. Research capturing observed differences in instruction has predominantly relied on simple counts of measured features and has further tended to collapse these counts across multiple observations using simple averages (e.g., Cirino, Pollard-Durodola, Foorman, Carlson, & Francis, 2007). Yet, reviews of classroom observation research have noted that describing observed instruction using simple counts and averages is largely atheoretical and has produced inconsistent relationships among teacher, teaching and learning variables (Hoffman, 1991).
Within most theoretical frameworks guiding these observation systems, instruction is framed as more than simple counts of enacted features. Effective instruction requires skillful blends of strategies, such that teachers' instructional strategies and actions build on each other, are designed for a specific purpose, and are responsive to various contextual factors (McGhie, Underwood, & Jordan, 2007; Pacheco, 2009). Such orchestration suggests that the patterns of instructional features hold information beyond their simple frequencies of use.
Comprehensive theories of teaching also suggest that instruction is a dynamic process that varies intentionally from one lesson to the next and that understanding this variation is central to understanding and improving the quality of teachers' instruction (Pacheco, 2009; Rimm-Kaufman & Hamre, 2010). Theoretical frameworks for instruction have delineated lesson-to-lesson variation in instruction within a teacher to understand how and why a teacher establishes a structure or pattern across all lessons and why, in any given lesson, a teacher's current choice of strategies may deviate from his/her established pattern (Soar & Soar, 1979). For example, research has indicated that teachers vary in their use of instructional strategies depending on the purpose of a lesson (Stodolsky, 1990).
The dynamic nature of instruction suggests that the patterns of instructional strategies teachers choose in a specific lesson, and the persistent use of these strategies across lessons, potentially provide significant insight into the teaching profiles of more and less effective teachers (Carlisle, Kelcey, & Berebitsky, 2013). However, empirical analyses of instruction have largely stood in opposition to such theoretical frameworks because they have typically collapsed indices of instruction across observations, ignoring within-teacher variability. This approach leaves us with no basis for understanding the dynamics and variability of instruction because within-teacher variation is lost. The problem, then, is finding approaches that capture both within-teacher variation and teachers' stable preferences for certain teaching strategies.

Purpose
The purpose of this study was to improve the measurement of instruction as it relates to students' achievement by applying analytical methods better suited to the structure of classroom observations and the theory of instruction in early elementary literacy. Of particular relevance in this study is how teachers choose instructional strategies within and across lessons and how these choices associate with their students' achievement. As a preliminary investigation, this paper applies multilevel item response theory to the measurement of teachers' engagement in teacher-directed instruction (e.g., coaching) in the area of early literacy. Teacher-directed instruction, or active instruction, describes teachers' use of instructional strategies and actions that promote literacy concepts and development in ways that require active processing by students. Teacher-directed instruction characterizes the manner in which teachers deliver literacy content to students and is described by the strategies teachers take to ensure effective learning and practice of literacy skills (Seidel & Shavelson, 2007). In previous work synthesizing research in literacy instruction, teacher-directed instruction has been highlighted as a key component of effective instruction. In particular, instructional regimes that skillfully blend teacher-directed strategies such as modeling or asking evaluation questions are thought to engage students in higher-level thinking so that they make connections between new and prior knowledge (Taylor, Pearson, Peterson, & Rodriguez, 2003). This study looks, in part, to extend this line of inquiry by measuring teachers' uses of teacher-directed instruction in early elementary literacy lessons as they relate to students' literacy achievement.
The paper addresses seven research questions: (1) Do teachers' uses of instructional features contribute equally to their instructional profiles in early elementary literacy lessons? (2) To what extent is the variation in teachers' instruction attributable to persistent differences among teachers versus differences among lessons within teachers? (3) To what extent are instructional strategies invariant across grades two and three? (4) To what extent does the topic of a lesson associate with teachers' lesson-specific deviations in teacher-directed instruction? (5) To what extent does teachers' knowledge associate with teachers' persistent levels of teacher-directed instruction? (6) To what extent does teachers' stable engagement in teacher-directed instructional strategies explain students' achievement gains in literacy? (7) To what extent do multilevel item response models improve our ability to test theories about teachers' choices of instructional strategies over simpler methods? If we are to develop a science of effective teaching, we need measurement tools that allow us to precisely test our theories of teaching. To this end, the utility of multilevel item response models is examined through the lens of being able to precisely test features underlying theories of effective instruction as they relate to students' achievement.
Below, the paper first outlines the sample used in this study and then describes the classroom observation system and measures used in the study. The paper goes on to explore the application of multilevel item response models to the measurement of teacher-directed instruction. In turn, the paper finishes by examining links between teacher-directed instruction and students' literacy achievement.

Sample
The sample used in this study was drawn from classrooms which were in Reading First schools in the Midwest region of the United States; to qualify for funding, participating school districts met criteria for high levels of poverty, and the districts usually selected the schools with high poverty and low literacy achievement to participate in the Reading First program.
The work reported in this study focused on a subpopulation of this original group. In particular, the sample consisted of 1638 lessons taught in 87 early elementary classrooms drawn from 19 different schools across 6 districts. Of the 87 teachers who participated in the study, 44 taught second grade and 43 taught third grade; 19% were non-White, 11% had a master's degree in reading, and the average years of experience was 13. Classrooms had on average 23 students, of which roughly 45 percent were minority, 21 percent were in special education, and over three quarters were eligible for free/reduced lunch.

Instruction.
A major issue in the study of instruction is clarifying the mechanisms through which effective instruction operates. As a result, a formative step in measuring instruction is identifying observable features that are theoretically reflective of a dimension and can be reliably assessed through observations. Our conceptualization draws heavily on earlier work describing the salience of the instructional strategies teachers take to provide effective instruction. There are many other qualities of effective teaching, and the literature has suggested that literacy is achieved from a combination of teacher-directed instruction and scaffolded opportunities; the measured strategies were not intended to be exhaustive. Rather, the measured instructional strategies were chosen to emphasize the structure and delivery of teaching and learning and represent common strategies teachers use to place differing levels of cognitive demand on students (Seidel & Shavelson, 2007).
Although extant literature has demonstrated evidence supporting the implications of individual instructional strategies, there is a need for clearer articulations of higher-level theories of teaching and teacher-directed instruction (Douglas, 2009). For instance, theory suggests that effective instruction involves use of both high and low cognitive demand instructional strategies, but there is little empirical evidence supporting such combinations within teacher-directed instruction (Pressley et al., 2001). For these reasons, there is a need for methods that provide empirical evidence on how strategies differentially relate to the teacher-directed instruction construct. Although prior research has tied these four strategies together and indicated their value, the same research has also suggested that they differ in the amount of cognitive demand they place on students. For instance, modeling and asking questions for evaluation are thought to engage teachers and students interactively in higher-level thinking about text so that students actively make deep connections with its meaning and their prior knowledge (Taylor, Pearson, Peterson, & Rodriguez, 2003). As a result, these types of interactions tend to place increased cognitive demand on students. In contrast, evidence has suggested that simply telling students information requires less cognitive demand because it requires lower levels of active participation and processing by students (Taylor, Pearson, Peterson, & Rodriguez, 2003). Accordingly, theory suggests that the strategies form a scale of involvedness whereby use of more cognitively demanding strategies requires higher levels of teacher-directed instruction.
To carry out the study, observations were conducted using the Automated Classroom Observation System for Reading (Carlisle, Kelcey, Berebitsky, & Phelps, 2011). This system was specifically designed to have a narrow focus on the delivery and structure of early literacy instruction. Observations of classroom instruction were coded in real time by trained observers using a tablet computer. Observers underwent multiple training sessions and practice visits to classrooms, and final inter-rater agreement across all possible fields and combinations exceeded 87%. To capture teachers' uses of teacher-directed instruction, observers recorded teachers' dichotomous use of each targeted instructional strategy for each observed lesson.
Teachers' Reading Knowledge. Research in the area of early literacy has argued that teachers' knowledge about teaching language and literacy is a critical factor in the quality of their instruction (Snow, Griffin, & Burns, 2005). This literature has argued that this knowledge must be anchored not only in language and literacy knowledge but also in an understanding of how students develop literacy skills (Moats, 2009; Snow, Griffin, & Burns, 2005). Similar research in other areas, such as mathematics teaching, has progressively demonstrated that this type of knowledge guides teachers' instruction in ways that make them more effective (Hill et al., 2008). However, the evidence identifying a relationship between teachers' knowledge and practice in literacy has been less consistent than in other substantive areas (Moats, 2009). This paper advances this line of inquiry by drawing on a new-generation literacy knowledge measure designed to assess the types of content problems that teachers encounter in practice.
Teachers' uses of the targeted strategies were modeled with a two-parameter multilevel item response model,

logit[P(Yijk = 1)] = ai(θk + γjk − bi),  (1)

where Yijk indicates whether strategy i was used in lesson j by teacher k, ai is the discrimination parameter for strategy i, θk is teacher k's stable level of teacher-directed instruction across all lessons, γjk is teacher k's lesson-specific deviation for lesson j, and bi is the difficulty parameter for strategy i. Both latent terms are assumed normally distributed, θk ~ N(0, σ²θ) and γjk ~ N(0, σ²γ), where σ²θ is the teacher-level variance and σ²γ is the lesson-level variance.
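As a concrete illustration of the response function in equation (1), the sketch below computes a use probability under the ai(θk + γjk − bi) parameterization; all parameter values shown are hypothetical, chosen only to display the mechanics.

```python
import math

def p_use(a_i, b_i, theta_k, gamma_jk):
    """Two-parameter multilevel IRT response function: probability that
    teacher k uses strategy i in lesson j, with log-odds
    a_i * (theta_k + gamma_jk - b_i)."""
    return 1.0 / (1.0 + math.exp(-a_i * (theta_k + gamma_jk - b_i)))

# A teacher at the average stable level (theta = 0) in a typical lesson
# (gamma = 0) facing a strategy of average difficulty (b = 0):
print(p_use(a_i=1.2, b_i=0.0, theta_k=0.0, gamma_jk=0.0))  # 0.5
```

Higher stable levels or positive lesson deviations raise the use probability, while a larger ai makes the strategy discriminate more sharply between low and high levels of the latent trait.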
Valid comparisons of teachers' use of teacher-directed instructional strategies require the properties of the measured strategies to be invariant across subgroups so as to form a common measurement scale. An important feature of these data is that they are split across grades two and three. Although prior literature has identified the value of our measured strategies as they relate to students' achievement within several early elementary grades, it has not directly compared teachers' use of teacher-directed instruction across grades and has not assessed the extent to which these strategies function similarly by grade. To this end, the invariance of empirical relations between strategies and the latent dimension across grades was examined by contrasting the fit of the model in equation (1) with that of a model that allowed strategy (item) parameters to differ by grade,

logit[P(Yijk = 1)] = aig(θk + γjk − big),

where g indicates the grade level of teachers' classrooms and the remaining notation is the same as in equation (1).
Comprehensive theories of teaching also describe effective instruction as a process that is responsive to the purpose of a lesson and informed by the knowledge a teacher brings to the classroom (Pacheco, 2009; Rimm-Kaufman & Hamre, 2010). One feature potentially relevant to lesson variation in early literacy instruction is the topic of a lesson. Early literacy instruction is generally thought to include five main topics: phonics, fluency, writing, comprehension, and vocabulary (National Reading Panel, 2000). As an exploratory analysis, the extent to which the variation in teachers' use of teacher-directed instructional strategies across lessons is, in part, explained by lesson topic was investigated by linking equation (1) with an explanatory component using latent regression,

γjk = Σs βs Xsjk + ujk,

where Xsjk represents four indicator variables for the five lesson topics for lesson j in teacher k, βs is the coefficient for covariate s, and ujk represents the adjusted lesson deviation for lesson j in teacher k conditional upon X. Similarly, the association between teachers' literacy knowledge and their uses of teacher-directed instruction was assessed by expanding equation (1) so that

θk = πG Gk + πT Tk + rk,

where Tk represents the knowledge level of teacher k with coefficient πT, Gk indicates the grade of teacher k's classroom, and rk represents the adjusted teacher level of teacher-directed instruction for teacher k conditional upon grade and knowledge.
In formulating the analysis as a multilevel item response model, the approach potentially provides specific ways to test the mechanisms of our theory. We can first assess the extent to which patterns of teachers' choices of instructional strategies hold important information by evaluating the equality of the discrimination parameters, ai, across strategies. Disparities among the discrimination parameters suggest that strategies differentially reflect teacher-directed instruction and that patterns of the strategies hold information beyond simple counts of strategies. Similarly, using the difficulty parameters, we can empirically examine the extent to which teachers' use of strategies align with the theoretical expectations concerning the cognitive demand strategies require. If the difficulty parameters, bi, are related to the cognitive demand strategies require, results should demonstrate that strategies requiring higher cognitive demand are more difficult.
Third, we can describe the stability of teachers' uses of teacher-directed instruction by examining the intraclass correlation coefficient. Fourth, we can examine whether we can place grade two and three classrooms on a common scale measuring teacher-directed instruction using the given strategies. Finally, through the latent regression we are able to explore the extent to which teachers' use of teacher-directed instruction varies as a function of lesson topic and their knowledge.

Model for Achievement.
To understand the value of teacher-directed instruction, the associations between teachers' regular uses of teacher-directed instruction and their students' achievement gains in reading comprehension and vocabulary were assessed. Student achievement for each subtest outcome was modeled using separate hierarchical linear models (Raudenbush & Bryk, 2002) with adjustments for students' prior achievement on both subtests,

Yik = γ00 + γ01 θk + β1 A1ik + β2 A2ik + r0k + eik,

where Yik is the achievement of student i in classroom k, A1ik and A2ik are the student's prior achievement scores on the two subtests, γ00 is the average adjusted achievement level, and γ01 is the association between teachers' increased use of teacher-directed instruction (θk) and achievement. Finally, r0k is the random effect of teacher k and has a normal distribution with mean zero and variance τ.

Results
The results from the two-parameter multilevel item response model first suggested that the measured instructional strategies differed in terms of their difficulty (Table 1). The most difficult strategy was modeling/coaching, followed by providing practice or review activities and telling; the easiest was asking evaluation questions. Similarly, the results demonstrated that the discrimination parameters differed among the strategies (Table 1). For instance, the instructional strategy of 'coaching/modeling' loaded more heavily on the teacher-directed instruction dimension than did the 'asking evaluation questions' strategy. The disparity in discrimination parameters suggests that strategies relate differently to teacher-directed instruction in ways that may privilege patterns over counts. At the same time, the low discrimination value for asking evaluation questions may call into question the fit of this strategy to teacher-directed instruction. To more formally assess the extent to which strategies differentially reflected the underlying latent trait, the fits of one- and two-parameter multilevel item response models were contrasted by holding the discrimination parameters, ai, constant across items in equation (1). The likelihood ratio test (LRT) and information criteria indicated that the two-parameter model outperformed the one-parameter model (Table 2), suggesting that strategies discriminate differently.
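The one- versus two-parameter contrast above is an ordinary likelihood ratio test on the deviance (-2LL) values; a sketch with invented deviances (the study's actual fit statistics appear in Table 2):

```python
def lrt_stat(neg2ll_restricted, neg2ll_full):
    """Likelihood ratio statistic: the drop in deviance from the
    restricted (equal-discrimination) model to the full two-parameter
    model; approximately chi-square under the restriction."""
    return neg2ll_restricted - neg2ll_full

# With four strategies, the 2PL frees three extra discrimination
# parameters (df = 3); the 5% chi-square(3) critical value is about 7.81.
stat = lrt_stat(10250.0, 10230.0)  # hypothetical deviances
print(stat > 7.81)  # True: reject the equal-discrimination restriction
```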
The second research question investigated the stability of teachers' instructional choices across the observed lessons. The results of the variance decomposition indicated a strong dependency among teachers' choices of instructional strategies within the same lesson. About 65% of the variation in observed instruction was attributable to differences among lessons within a teacher while only 35% was attributable to stable differences among teachers.
Viewed from the perspective of local item/strategy dependence, this variance decomposition suggested substantial local dependence of strategies within lessons. Comparisons of the model in equation (1) with a model that constrained γjk = 0 also indicated that the model with lesson-specific effects fit the data better (Table 2). Similar results were found in assessing the Q3 statistic (Yen, 1984). The Q3 statistic for equation (1) deviated by 0.08 from its expected value, whereas the Q3 statistic deviated by 0.40 when γjk was dropped from equation (1). Such dependence suggests that simpler models which ignore the fact that teachers' choices of strategies are situated within lessons (i.e., assume γjk = 0 in equation (1)) may violate the local item independence assumption (Yen, 1984). Additional item-fit assessments also suggested that the model fit well (Table 1).
Note: -2LL refers to negative two times the log-likelihood, AIC refers to the Akaike information criterion, and BIC refers to the Bayesian information criterion.
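Yen's Q3 screen can be sketched generically as the average pairwise correlation of item residuals (observed minus model-expected responses); this illustrates the idea, not the study's exact computation, and the residual values below are made up.

```python
import itertools
import math

def _pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def q3_mean(residuals):
    """Average residual correlation over strategy pairs: residuals[i]
    holds observed-minus-expected responses for strategy i across
    lessons; averages far from roughly zero flag local dependence."""
    pairs = list(itertools.combinations(range(len(residuals)), 2))
    return sum(_pearson(residuals[i], residuals[j]) for i, j in pairs) / len(pairs)

# Two strategies whose residuals move in lockstep are locally dependent:
print(q3_mean([[0.4, -0.6, 0.4, -0.2], [0.5, -0.5, 0.3, -0.3]]))
```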
The third research question examined the extent to which teachers' uses of teacher-directed instruction in grades two and three could be put on a common scale given the targeted strategies. The results contrasting the multigroup multilevel item response model with the single-group model suggested that the fit of the single-group model was not improved upon by allowing grades two and three to have different item/strategy parameters (Table 2). In each instance, parameter estimates for strategies resulting from the multigroup model were nearly identical across grades.
Following evidence that teachers' uses of teacher-directed instruction can be placed on a common scale in these grades and that instruction varies both within and across teachers, the fourth and fifth research questions investigated the extent to which this variance was related to systematic differences in lesson topic and teachers' literacy knowledge. Latent regression results demonstrated that teachers tended to use teacher-directed instruction similarly in phonics, writing, comprehension, and vocabulary lessons but tended to use it significantly less in fluency lessons (Table 3). Finally, the results indicated that teachers' levels of literacy knowledge were not significantly associated with their levels of teacher-directed instruction.

Relation of Instruction to Achievement
The sixth research question focused on the extent to which teachers' uses of teacher-directed instruction were associated with students' achievement. After adjusting for baseline differences through the two measures of prior achievement, results suggested that teachers' uses of teacher-directed instruction were significantly and positively associated with both subtest outcomes (Table 4). For example, students in classrooms with teachers who tended to use teacher-directed instruction one standard deviation above average tended to achieve about 0.06 standard deviations more on the ITBS comprehension subtest.
Table 4. Standardized regression coefficients (and standard errors) for students' achievement. Note: TDI is short for teacher-directed instruction.

Comparison to a More Conventional Approach
The complexity added by an item response model and its multilevel formulation raises the question of whether conventional methods that ignore the aforementioned features provide similar conclusions regarding the relationship between teacher-directed instruction and students' achievement. To examine differences, the multilevel item response model was contrasted with the average number of strategies a teacher used in his/her lessons,

Ȳk = (1/J) Σj=1..J Σi=1..I Yijk,  (8)

where Yijk was one if strategy i was used in lesson j for teacher k. Comparative results suggested that in both cases the standardized coefficient and its corresponding t-ratio were smaller when using the simple averages. Moreover, for both of the subtests considered, the relationship between the averages and students' achievement was statistically insignificant, whereas indices from the multilevel item response model were significantly related to each subtest.
Note: t-ratio is the ratio of the standardized coefficient to the standard error.
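For reference, the conventional index in equation (8) is trivial to compute; a sketch with made-up lesson data:

```python
def average_strategy_count(y_k):
    """Equation (8): a teacher's average number of strategies used per
    lesson. y_k[j][i] is 1 if strategy i was used in lesson j, else 0."""
    return sum(sum(lesson) for lesson in y_k) / len(y_k)

# Two observed lessons: 3 of 4 strategies used, then 1 of 4.
print(average_strategy_count([[1, 1, 1, 0], [1, 0, 0, 0]]))  # 2.0
```

By summing across strategies and averaging across lessons, this index discards both the identity of the strategies (the pattern) and the lesson-to-lesson variation that the multilevel item response model retains.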

Discussion
The current study drew attention to several general issues in the measurement of instruction and put forth a proposition; in order to advance our understanding of effective instruction, we need to align theories of instruction, data collection methods and analytic approaches in ways that allow us to systematically investigate theories of teaching. By aligning theory, data and methods we are able to systematically test our theories of instruction to advance and develop a science of teaching. In line with this proposition, the paper proposed multilevel item response models as a flexible methodological tool to align theory, data collection and methods in the measurement of instruction.
The paper offered an initial investigation into this proposition by studying the measurement of teacher-directed instruction in literacy lessons as it relates to students' achievement. The application of a multilevel item response model provided a more complex, but more precise and theoretically grounded, view of instruction. Specifically, the current investigation set out to examine four commonly accepted theories of literacy instruction for which there is little supporting empirical evidence. These theories suggested that strategies differ in complexity, that patterns of strategies matter, that teachers' uses of these strategies vary by lesson and that this variation is associated with lesson and teacher features, and that teachers' engagement in teacher-directed instruction would promote student achievement.
To test the merits of these theories, the paper applied a multilevel item response model. This model provided a more complex, but more precise and theoretically grounded, view of instruction by linking the aforementioned theories to model parameters. The results largely supported these hypotheses. Results suggested that strategies differed in their difficulties, that patterns held information beyond simple counts, and that teacher-directed instruction varied considerably from one lesson to the next. The results also suggested that teachers' use of teacher-directed instruction was fairly similar across most topics but was lower in fluency lessons. Further, contrary to expectation, the results indicated that teachers' knowledge was not significantly associated with teachers' uses of teacher-directed instruction, perhaps suggesting that the type of knowledge captured by this measure was not related to this dimension of instruction. Finally, results indicated that teacher-directed instruction was associated with students' achievement gains but that these gains were relatively small.
In what follows, the practical value of this approach in improving theories of instruction, classroom observation instruments, and the practice of measuring teaching as it relates to students' achievement is discussed. A first practical utility of applying item response theory to the measurement of instruction is the careful assessment of the properties of measured instructional strategies. Theories of instruction frequently suggest that instructional strategies or features differ in their difficulty and that patterns of instruction may be more meaningful than their sum. A key function of item response models is the ability to carefully assess the difficulty and discrimination of items/strategies. Applied to the measurement of instruction, the practical value of such assessments is twofold. First, item parameters allow us to formally test theories of instruction. For instance, although there is general consensus that simple counts of the number of strategies teachers use are insufficient for describing teaching, empirical analyses have not advanced along with this theory (Hoffman, 1991). Second, and more pragmatically, item parameters provide substantial insight to researchers in refining their classroom observation instruments to improve the measurement of instruction (e.g., measure new strategies). For example, consider the asking questions for evaluation strategy in the current investigation. Both the unexpected easiness of this strategy and its weak relationship with the teacher-directed instruction construct might suggest that this strategy did not function as hypothesized and that it provides relatively little information about teachers' uses of teacher-directed instruction. As a result, subsequent investigations might consider replacing this strategy. Without the careful assessment of how the measured strategies relate to the construct of interest (e.g., when relying on averages), this information would be lost.
Second, multilevel item response models add value by allowing researchers to decompose the variation in observed instruction into stable differences across teachers versus variation within a teacher across lessons (Carlisle, Kelcey, & Berebitsky, 2013). Theoretically, this decomposition adds value because it helps us to understand the nature and variability of teachers' instruction and the extent to which teacher effects are potentially due to stable differences among teachers. More practically, this decomposition can be used to test theories linking instruction with key lesson and teacher variables. In the current example, consider the topic of a lesson. It was hypothesized that teachers' uses of teacher-directed instruction would vary by topic. However, our results suggested that the within-teacher variation in instruction was not largely influenced by lesson topic, with the exception of fluency lessons. Further theoretical consideration of the potential differences between fluency and other lessons suggests that these differences might be due to the highly prescribed nature of fluency lessons. Based on these results, subsequent studies might examine not only the topic of the lesson but also the extent to which specific lessons are formulaic. The important point is that collapsing across lessons would have obscured this information.
Third, use of item response models can help us understand the extent to which we can place teachers' instruction across multiple grades on a common measurement scale. Theoretically, such invariance facilitates comparisons among groups and correlations with external variables because it ensures that differences are not an artifact of measurement non-invariance. More practically, such invariance investigations help us to understand the stability of instructional strategies as they relate to theoretical dimensions of teaching across grades. Again, an important point is that analyses which draw upon simple averages to describe the latent construct miss this information and potentially draw misleading conclusions.

Finally, applying multilevel item response models to the study of instruction has the potential to better differentiate among levels of the latent instructional trait. As a result, we can better differentiate among teachers and are more likely to detect relationships between teachers' instruction and external variables if they exist. In the current investigation, for example, analyses were better able to detect relations between teachers' stable levels of teacher-directed instruction and students' achievement as compared to a more conventional approach. Such increased capacity may be particularly important for studies of instruction because, to date, studies of instruction have largely found relatively small effects.
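The kind of cross-grade invariance check described in this section can be sketched in a simplified Rasch-style form: estimate each strategy's difficulty separately by grade from its use rate, center within grade, and inspect the shifts. The use rates below are hypothetical and purely illustrative; a full analysis would instead compare constrained and unconstrained multilevel item response models.

```python
import numpy as np

def rasch_difficulty(use_rates):
    """Rough item difficulties from strategy-use rates under a
    Rasch-like model: b_i ~ -logit(p_i), centered within group so
    the grades share a scale up to a location shift."""
    p = np.clip(np.asarray(use_rates, float), 0.01, 0.99)
    b = -np.log(p / (1 - p))
    return b - b.mean()

# Hypothetical per-strategy use rates in two grades (illustrative only)
grade1 = [0.80, 0.55, 0.40, 0.20]
grade2 = [0.75, 0.50, 0.45, 0.22]

diff = rasch_difficulty(grade1) - rasch_difficulty(grade2)
print("difficulty shifts across grades:", np.round(diff, 2))
```

A large shift for one strategy relative to the others would flag possible non-invariance, that is, a strategy whose meaning relative to the construct changes across grades.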
Although measurement of instruction with the current application of a multilevel item response model showed promise, the study and the application were not without drawbacks. Notably, the current study considered only a single dimension of instruction and did so only within the context of early literacy. More generally, there are also some uncertainties as to the extent to which the measurement of teaching is amenable to the theoretical assumptions underlying item response theory. For instance, applications of item response theory assume that levels of the targeted construct are meaningfully associated with monotonic changes in the probability of employing items. This assumption was operationalized in the current application by outlining a theory that the measured strategies gave rise to a certain cognitive demand or involvedness scale whereby use of more cognitively demanding strategies required higher levels of teacher-directed instruction. The amenability of teaching to the assumptions of item response theory will, in general, depend on the specific teaching theories and dimensions being studied.
In addition, there are also questions concerning estimation methods. Conventionally, dependable estimation of item response models requires a substantial number of respondents because maximum likelihood estimates of two-parameter models are known to be downwardly biased in small samples. Most studies of teaching, however, tend to proceed with small sample sizes because of the steep cost associated with observing teachers multiple times across a year. Even in the largest of studies, samples rarely reach 500 teachers. However, a key difference between conventional applications of item response methods (e.g., to student achievement) and the application to the measurement of teaching is that teachers are generally measured multiple times. Because the current implementation (i.e., equation (1)) forces the teacher dimension, θk, and the lesson dimension, γjk, to share the same discrimination parameter, ai, item/strategy parameters are estimated using both lessons and teachers. For example, the current application estimated item/strategy parameters using over 1500 lessons (but only 87 teachers). In this way, the nesting of items within lessons and lessons within teachers conceptually parallels the nesting of items within students and students within schools. The performance of item response methods under constraints salient to teaching is fundamental to understanding the amenability of item response methods to the measurement of teaching.
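Equation (1) itself falls outside this excerpt, so the following is only a sketch: a multilevel two-parameter formulation consistent with the description above, in which the teacher dimension θk and the lesson dimension γjk share a single discrimination parameter ai, might be written as

```latex
P\left(y_{ijk} = 1 \mid \theta_k, \gamma_{jk}\right)
  = \frac{\exp\left\{ a_i \left( \theta_k + \gamma_{jk} \right) - b_i \right\}}
         {1 + \exp\left\{ a_i \left( \theta_k + \gamma_{jk} \right) - b_i \right\}},
\qquad
\theta_k \sim N\!\left(0, \tau^2\right), \quad
\gamma_{jk} \sim N\!\left(0, \sigma^2\right),
```

where yijk indicates whether strategy i was observed in lesson j of teacher k, bi is the strategy's difficulty, τ2 captures stable between-teacher variation, and σ2 captures lesson-to-lesson variation within teachers. Because ai multiplies both θk and γjk, every observed lesson, not just every teacher, contributes information to the item parameters.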
Similarly, although the current investigation studied teacher-directed instruction through only four strategies/items, this constraint is not unusual: common classroom observation systems likewise tend to represent teaching domains through only a few items.

Author's biography
Ben Kelcey is an assistant professor in the College of Education, Criminal Justice, & Human Services at the University of Cincinnati. His research interests include the development and application of measurement and quantitative research methods to understand effective teaching and teachers.