24 April 2015

Assessment

'Evaluating students’ evaluations of professors' by Michela Braga, Marco Paccagnella and Michele Pellizzari in (2014) 41 Economics of Education Review 71 contrasts measures of teacher effectiveness with the students’ evaluations for the same teachers, using administrative data from Bocconi University. The effectiveness measures are estimated by comparing the performance in follow-on coursework of students who are randomly assigned to teachers. The authors find that teacher quality matters substantially and that their measure of effectiveness is negatively correlated with the students’ evaluations of professors. A simple theory rationalizes this result under the assumption that students evaluate professors on the basis of their realized utility, an assumption supported by additional evidence that the evaluations respond to meteorological conditions.
The authors state:
The use of anonymous students’ evaluations of professors to measure teachers’ performance has become extremely popular in many universities (Becker and Watts, 1999). These evaluations normally include questions about the clarity of lectures, the logistics of the course, and many other aspects of the teaching. They are either administered during a teaching session toward the end of the term or, more recently, filled in on-line.
The university administration uses such evaluations to solve the agency problems related to the selection and motivation of teachers, in a context in which neither the types of teachers nor their effort can be observed precisely. In fact, students’ evaluations are often used to inform hiring and promotion decisions (Becker and Watts, 1999) and, in institutions that put a strong emphasis on research, to discourage strategic behavior in the allocation of time or effort between teaching and research activities (Brown and Saks, 1987; De Philippis, 2013).
The validity of anonymous students’ evaluations rests on the assumption that, by attending lectures, students observe the ability of the teachers and that they report it truthfully when asked. While this view is certainly plausible, there are also many reasons to question the appropriateness of such a measure. For example, the students’ objectives might be different from those of the principal, i.e. the university administration. Students may simply care about their grades, whereas the university cares about their learning, and the two might not be perfectly correlated, especially when the same professor is engaged both in teaching and in grading. Consistent with this interpretation, Krautmann and Sander (1999) show that, conditional on learning, teachers who give higher grades also receive better evaluations. This finding is confirmed by several other studies and is thought to be a key cause of grade inflation (Carrell and West, 2010; Johnson, 2003; Weinberg et al., 2009).
Measuring teaching quality is also complicated because the most common observable teacher characteristics, such as qualifications or experience, appear to be relatively unimportant (Hanushek et al., 2006; Krueger, 1999; Rivkin et al., 2005). Despite such difficulties, there is evidence that teacher quality matters substantially in determining students’ achievement (Carrell and West, 2010; Rivkin et al., 2005) and that teachers respond to incentives (Duflo et al., 2012; Figlio and Kenny, 2007; Lavy, 2009). Hence, understanding how professors should be monitored and incentivized is essential for education policy.
In this paper we evaluate the content of the students’ evaluations by contrasting them with objective measures of teacher effectiveness. We construct such measures by comparing the performance in subsequent coursework of students who are randomly allocated to different teachers in their compulsory courses. We use data on one cohort of students at Bocconi University – the 1998/1999 freshmen – who were required to take a fixed sequence of compulsory courses and who were randomly allocated to a set of teachers for each of these courses.
We find that, even in a setting where the syllabuses are fixed and all teachers in the same course present exactly the same material, professors still matter substantially. The average difference in subsequent performance between students assigned to the best and the worst teacher (on the effectiveness scale) is approximately 23% of a standard deviation in the distribution of exam grades, corresponding to about 3% of the average grade. Moreover, our measure of teaching quality is negatively correlated with the students’ evaluations of the professors: teachers who are associated with better subsequent performance receive worse evaluations from their students. On the other hand, teachers who are associated with high grades in their own exams rank higher in the students’ evaluations.
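As a rough, purely illustrative sketch of this kind of estimation (not the authors' code or specification; the file name and column names below are invented), teacher effectiveness can be recovered as teacher fixed effects in a regression of subsequent-course grades, which random assignment makes credible, and the best-to-worst gap can then be expressed in standard deviations of the grade distribution:

```python
# Illustrative sketch only, not the authors' code. The file name and all
# column names here are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# One row per (student, subsequent course): the grade earned in the later
# course, the teacher the student was randomly assigned to in the earlier
# compulsory course, that earlier course, and the class (used for clustering).
df = pd.read_csv("bocconi_panel.csv")

# Random assignment makes the teacher dummies uncorrelated with unobserved
# student quality; course fixed effects absorb differences across subjects.
model = smf.ols(
    "subsequent_grade ~ C(teacher_id) + C(course_id)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["class_id"]})

# Spread of the estimated teacher effects, scaled by the standard deviation
# of subsequent grades (the paper reports a best-to-worst gap of roughly
# 23% of a standard deviation, or about 3% of the average grade).
teacher_effects = model.params.filter(like="C(teacher_id)")
all_effects = pd.concat([teacher_effects, pd.Series([0.0])])  # reference teacher
gap_in_sd = (all_effects.max() - all_effects.min()) / df["subsequent_grade"].std()
print(f"Best-vs-worst teacher gap: {gap_in_sd:.2f} SD of subsequent grades")
```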
These results question the idea that students observe the ability of the teacher during the class and report it (truthfully) in their evaluations. In order to rationalize our findings it is useful to think of good teachers – i.e. those who provide their students with knowledge that is useful in future learning – as teachers who require effort from their students. Students, especially the least able ones, dislike exerting effort, and when asked to evaluate the teacher they do so on the basis of how much they enjoyed the course. As a consequence, good teachers can get bad evaluations, especially if they teach classes with many low-ability students.
Consistent with this intuition, we also find that the evaluations of classes in which high-skill students are over-represented are more in line with the estimated quality of the teacher. Additionally, in order to provide evidence supporting the intuition that evaluations are based on students’ realized utility, we collected data on the weather conditions observed on the exact days on which students filled in the questionnaires. Assuming that the weather affects utility but not teaching quality, the finding that the students’ evaluations react to meteorological conditions lends support to our intuition. Our results show that students evaluate professors more negatively on rainy and cold days.
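Again as a purely illustrative sketch (not the paper's specification; variable and file names are invented), the weather check amounts to regressing evaluations on same-day weather while holding course and teacher fixed:

```python
# Illustrative sketch only, not the paper's specification. Variable and
# file names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

evals = pd.read_csv("evaluations_with_weather.csv")

# 'rainy' is 1 if it rained on the day the questionnaire was filled in;
# 'temperature' is the same-day temperature. Course and teacher fixed
# effects hold teaching quality fixed, so any weather effect has to work
# through the students' realized utility rather than through what was taught.
weather_model = smf.ols(
    "overall_evaluation ~ rainy + temperature + C(course_id) + C(teacher_id)",
    data=evals,
).fit(cov_type="cluster", cov_kwds={"groups": evals["class_id"]})

print(weather_model.params[["rainy", "temperature"]])
```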
There is a large literature that investigates the role of teacher quality and teacher incentives in improving educational outcomes, although most of the existing studies focus on primary and secondary schooling (Figlio and Kenny, 2007; Jacob and Lefgren, 2008; Kane and Staiger, 2008; Rivkin et al., 2005; Rockoff, 2004; Rockoff and Speroni, 2010; Tyler et al., 2010). The availability of internationally standardized test scores facilitates the evaluation of teachers in primary and secondary schools (Mullis et al., 2009; OECD, 2010). The large degree of heterogeneity in subjects and syllabuses in universities makes it very difficult to design common tests that would allow comparisons of the performance of students exposed to different teachers, especially across subjects. At the same time, the large increase in college enrollment that has occurred over the past decades (OECD, 2008) calls for a specific focus on higher education.
Only a few papers investigate the role of students’ evaluations in universities, and we improve on existing studies in various dimensions. First of all, the random allocation of students to teachers differentiates our approach from most other studies (Beleche et al., 2012; Johnson, 2003; Krautmann and Sander, 1999; Weinberg et al., 2009; Yunker and Yunker, 2003), which cannot purge their estimates of the potential bias due to the best students selecting the courses of the best professors. Correcting this bias is pivotal to producing reliable measures of teaching quality (Rothstein, 2009; Rothstein, 2010).
The only other study that exploits a setting where students are randomly allocated to teachers is Carrell and West (2010). That paper documents (as we do) a negative correlation between the students’ evaluations of professors and harder measures of teaching quality. We improve on their analysis in two important dimensions. First, we provide additional empirical evidence consistent with an interpretation of this finding based on the idea that good professors require students to exert more effort and that students evaluate professors on the basis of their realized utility. Second, Carrell and West (2010) use data from the U.S. Air Force Academy, while our empirical application is based on a more standard institution of higher education. The vast majority of the students in our sample enter a standard labor market upon graduation, whereas the cadets in Carrell and West (2010) are required to serve as officers in the U.S. Air Force for five years after graduation and many pursue a longer military career. There are many reasons why the behavior of teachers, students and the university or academy might vary depending on the labor market they face. For example, students may put more effort into subjects or activities that are particularly important in the military setting, at the expense of other subjects, and teachers and administrators may do the same.
More generally, this paper is also related, and contributes, to the wider literature on performance measurement and performance pay. One concern with the students’ evaluations of teachers is that they might divert professors away from activities that have a higher learning content for the students (but that are more demanding in terms of students’ effort) and towards classroom entertainment (popularity contests) or changes in grading policies. This interpretation is consistent with the view that teaching is a multi-tasking job, which makes the agency problem more difficult to solve (Holmstrom and Milgrom, 1994). Subjective evaluations can be seen as a means of addressing such a problem and, given the very limited extant empirical evidence (Baker et al., 1994; Prendergast and Topel, 1996), our results can certainly also inform this area of the literature.
The paper is organized as follows. Section 2 describes the data and the institutional setting. Section 3 presents our strategy to estimate teacher effectiveness and shows the results. In Section 4 we correlate teacher effectiveness with the students’ evaluations of professors. Robustness checks are reported in Section 5. In Section 6 we discuss the interpretation of our results and we present additional evidence supporting such an interpretation. Finally, Section 7 concludes.