Assessment and classroom learning
By Paul Black and Dylan Wiliam, Assessment in Education: Principles, Policy & Practice
March 1998, Vol. 5, Issue 1
This article is a review of the literature on classroom formative assessment. Several studies show firm evidence that innovations designed to strengthen the frequent feedback that students receive about their learning yield substantial learning gains. The perceptions of students and their role in self-assessment are considered alongside analysis of the strategies used by teachers and the formative strategies incorporated in such systemic approaches as mastery learning. There follows a more detailed and theoretical analysis of the nature of feedback, which provides a basis for a discussion of the development of theoretical models for formative assessment and of the prospects for the improvement of practice.
One of the outstanding features of studies of assessment in recent years has been the shift in the focus of attention, towards greater interest in the interactions between assessment and classroom learning and away from concentration on the properties of restricted forms of test which are only weakly linked to the learning experiences of students. This shift has been coupled with many expressions of hope that improvement in classroom assessment will make a strong contribution to the improvement of learning. So one main purpose of this review is to survey the evidence which might show whether or not such hope is justified. A second purpose is to see whether the theoretical and practical issues associated with assessment for learning can be illuminated by a synthesis of the insights arising amongst the diverse studies that have been reported.
The purpose of this Introduction is to clarify some of the key terminology that we use, to discuss some earlier reviews which define the baseline from which our study set out, to discuss some aspects of the methods used in our work, and finally to introduce the structure and rationale for the subsequent sections.
Our primary focus is the evidence about formative assessment by teachers in their school or college classrooms. As will be explained below, the boundary for the research reports and reviews that have been included has been loosely rather than tightly drawn. The principal reason for this is that the term formative assessment does not have a tightly defined and widely accepted meaning. In this review, it is to be interpreted as encompassing all those activities undertaken by teachers, and/or by their students, which provide information to be used as feedback to modify the teaching and learning activities in which they are engaged.
Two substantial review articles, one by Natriello (1987) and the other by Crooks (1988) in this same field serve as baselines for this review. Therefore, with a few exceptions, all of the articles covered here were published during or after 1988. The literature search was conducted by several means. One was through a citation search on the articles by Natriello and Crooks, followed by a similar search on later and relevant reviews of component issues published by one of us (Black, 1993b), and by Bangert-Drowns and the Kuliks (Kulik et al., 1990; Bangert-Drowns et al., 1991a,b). A second approach was to search by key-words in the ERIC data-base; this was an inefficient approach because of a lack of terms used in a uniform way which define our field of interest. The third approach was the `snowball' approach of following up the reference lists of articles found. Finally, for 76 of the most likely journals, the contents of all issues were scanned, from 1988 to the present in some cases, from 1992 for others because the work had already been done for the 1993 review by Black (see Appendix for a list of the journals scanned).
Natriello's review covered a broader field than our own. The paper spanned a full range of assessment purposes, which he categorised as certification, selection, direction and motivation. Only the last two of these are covered here. Crooks used the term `classroom evaluation' with the same meaning as we propose for `formative assessment'. These two articles gave reference lists containing 91 and 241 items respectively, but only 9 items appear in both lists. This illustrates the twin and related difficulties of defining the field and of searching the literature.
The problems of composing a framework for a review are also illustrated by the differences between the Natriello and the Crooks articles. Natriello reviews the issues within a framework provided by a model of the assessment cycle, which starts from purposes, then moves to the setting of tasks, criteria and standards, then through to appraising performance and providing feedback and outcomes. He then discusses research on the impact of these evaluation processes on students. Perhaps his most significant point, however, is that in his view, the vast majority of the research into the effects of evaluation processes is irrelevant because key distinctions are conflated (for example by not controlling for the quality as well as the quantity of feedback). He concludes by suggesting how the weaknesses in the existing research-base might be addressed in future research.
Crooks' paper has a narrower focus--the impact of evaluation practices on students--and divides the field into three main areas--the impact of normal classroom testing practices, the impact of a range of other instructional practices which bear on evaluation, and finally the motivational aspects which relate to classroom evaluation. He concludes that the summative function of evaluation--grading--has been too dominant and that more emphasis should be given to the potential of classroom assessments to assist learning. Feedback to students should focus on the task, should be given regularly and while still relevant, and should be specific to the task. However, in Crooks' view the `most vital of all the messages emerging from this review' (p. 470) is that the assessments must emphasise the skills, knowledge and attitudes perceived to be most important, however difficult the technical problems that this may cause.
Like Natriello's review, the research cited by Crooks covers a range of styles and contexts, from curriculum-related studies involving work in normal classrooms by the students' own teachers, to experiments in laboratory settings by researchers. The relevance of work that is not carried out in normal classrooms by teachers can be called in question (Lundeberg & Fox, 1991), but if all such work were excluded, not only would the field be rather sparsely populated, but one would also be overlooking many important clues and pointers towards the difficult goal of reaching an adequately complex and complete understanding of formative assessment. Thus this review, like that of Natriello and more particularly that of Crooks, is eclectic. In consequence, decisions about what to include have been somewhat arbitrary, so that we now have some sympathetic understanding of the lack of overlap between the literature sources used in the two earlier reviews.
The processes described above produced a total of 681 publications which appeared relevant, at first sight, to the review. The bibliographic details for those identified by electronic means were imported (in most cases, including abstracts) into a bibliographic database, and the others were entered manually. An initial review, in some cases based on the abstract alone, and in some cases involving reading the full publication, identified an initial total of about 250 of these publications as being sufficiently important to require reading in full. Each of these publications was then coded with labels relating to its focus--a total of 47 different labels being used, with an average of 2.4 labels per reference. For each of the labelled publications, existing abstracts were reviewed and, in some cases, modified to highlight aspects of the publication relevant to the present review, and abstracts were written where none existed in the database. Based on a preliminary reading of the relevant papers, a structure of seven main sections was adopted.
The writing for each section was undertaken by first allocating each label to a section. All but one of the labels were allocated to a unique section (one was allocated to two sections). Abstracts of publications relevant to each section were then printed out together and each section was allocated to one of the authors so that initial drafts could be prepared, which were then revised jointly. The seven sections which emerged from this process may be briefly described as follows.
The approach in the section on Examples in evidence is pragmatic, in that an account is given first of a variety of selected pieces of research about the effectiveness of formative assessment, and then these are discussed in order to identify a set of considerations to be borne in mind in the succeeding--more analytic--sections. The next section on Assessment by teachers adds to the empirical background by presenting a brief account of evidence about the current state of formative assessment practice amongst teachers.
There follows a more structured account of the field. The next two sections deal respectively with the student perspective and the teachers' role. Whilst the section on Strategies and tactics for teachers focuses on tactics and strategies in general, the next section on Systems follows by discussing some specific and comprehensive systems for teaching in which formative assessment plays an important part. The section on Feedback is more reflective and theoretical, presenting an account, grounded in evidence, of the nature of feedback, a concept which is central to formative assessment. This prepares the ground for a final section, on Prospects for the theory and practice of formative assessment, in which we attempt a synthesis of some of the main issues in the context of an attempt to review the theoretical basis, the research prospects and needs, and the implications for practice and for policy of formative assessment studies.
Examples in Evidence
In this section we present brief accounts of pieces of research which, between and across them, illustrate some of the main issues involved in research which aims to secure evidence about the effects of formative assessment.
The first is a project in which 25 Portuguese teachers of mathematics were trained in self-assessment methods on a 20-week part-time course, methods which they put into practice as the course progressed with 246 students of ages 8 and 9 and with 108 older students with ages between 10 and 14 (Fontana & Fernandes, 1994). The students of a further 20 Portuguese teachers who were taking another course in education at the time served as a control group. Both experimental and control groups were given pre- and post- tests of mathematics achievement, and both spent the same times in class on mathematics. Both groups showed significant gains over the period, but the experimental group's mean gain was about twice that of the control group's for the 8 and 9-year-old students--a clearly significant difference. Similar effects were obtained for the older students, but with a less clear outcome statistically because the pre-test, being too easy, could not identify any possible initial difference between the two groups. The focus of the assessment work was on regular--mainly daily--self-assessment by the pupils. This involved teaching them to understand both the learning objectives and the assessment criteria, giving them opportunity to choose learning tasks and using tasks which gave them scope to assess their own learning outcomes.
This research has ecological validity, and gives rigorously constructed evidence of learning gains. The authors point out that more work is required to look for long-term outcomes and to explore the relative effectiveness amongst the several techniques employed in concert. However, the work also illustrates that an initiative can involve far more than simply adding some assessment exercises to existing teaching--in this case the two outstanding elements are the focus on self-assessment and the implementation of this assessment in the context of a constructivist classroom. On the one hand it could be said that one or other of these features, or the combination of the two, is responsible for the gains, on the other it could be argued that it is not possible to introduce formative assessment without some radical change in classroom pedagogy because, of its nature, it is an essential component of the pedagogic process.
The second example is reported by Whiting et al. (1995), the first author being the teacher and the co-authors university and school district staff. The account is a review of the teacher's experience and records, with about 7000 students over a period equivalent to 18 years, of using mastery learning with his classes. This involved regular testing and feedback to students, with a requirement that they either achieve a high test score--at least 90%--before they were allowed to proceed to the next task, or, if the score were lower, they study the topic further until they could satisfy the mastery criterion. Whiting's final test scores and the grade point averages of his students were consistently high, and higher than those of students in the same course not taught by him. The students' learning styles were changed as a result of the method of teaching, so that the time taken for successive units was decreased and the numbers having to retake tests decreased. In addition, tests of their attitudes towards school and towards learning showed positive changes.
Like the previous study, this work has ecological validity--it is a report of work in real classrooms about what has become the normal method used by a teacher over many years. The gains reported are substantial; although the comparisons with the control are not documented in detail, it is reported that the teacher has had difficulty explaining his high success rate to colleagues. It is conceded that the success could be due to the personal excellence of the teacher, although he believes that the approach has made him a better teacher. In particular he has come to believe that all pupils can succeed, a belief which he regards as an important part of the approach. The result shows two characteristic and related features--the first being that the teaching change involves a completely new learning regime for the students, not just the addition of a few tests, the second being that precisely because of this, it is not easy to say to what extent the effectiveness depends specifically upon the quality and communication of the assessment feedback. It differs from the first example in arising from a particular movement aimed at a radical change in learning provision, and in that it is based on different assumptions about the nature of learning.
The third example also had its origin in the idea of mastery learning, but departed from the orthodoxy in that the authors started from the belief that it was the frequent testing that was the main cause of the learning achievements reported for this approach. The project was an experiment in mathematics teaching (Martinez & Martinez, 1992), in which 120 American college students in an introductory algebra course were placed in one of four groups in a 2 X 2 experimental design for an 18-week course covering seven chapters of a text. Two groups were given one test per chapter, the other two were given three tests per chapter. Two groups were taught by a very experienced and highly rated teacher, the other two by a relatively inexperienced teacher with average ratings. The results of a post-test showed a significant advantage for those tested more frequently, but the gain was far smaller for the experienced teacher than for the newcomer. Comparison of the final scores with the larger group of students in the same course but not in the experiment showed that the experienced teacher was indeed exceptional, so that the authors could conclude that the more frequent testing was indeed effective, but that much of the gain could be secured by an exceptional teacher with less frequent testing.
By comparison with the first study above, this one has similar statistical measures and analyses, but the nature of the two regimes being compared is quite different. Indeed, one could question whether the frequent testing really constitutes formative assessment--a discussion of that question would have to focus on the quality of the teacher-student interaction and on whether test results constituted feedback in the sense of leading to corrective action taken to close any gaps in performance (Ramaprasad, 1983). It is possible that the superiority of the experienced teacher may have been in his/her skill in this aspect, thus making the testing more effectively formative at either frequency.
Example number four was undertaken with 5-year-old children being taught in kindergarten (Bergan et al., 1991). The underlying motivation was a belief that close attention to the early acquisition of basic skills is essential. It involved 838 children drawn mainly from disadvantaged home backgrounds in six different regions in the USA. The teachers of the experimental group were trained to implement a measurement and planning system which required an initial assessment input to inform teaching at the individual pupil level, consultation on progress after two weeks, new assessments to give a further diagnostic review and new decisions about students' needs after four weeks, with the whole course lasting eight weeks. The teachers used mainly observations of skills to assess progress, and worked with open-style activities which enabled them to differentiate the tasks within each activity in order to match to the needs of the individual child. There was emphasis in their training on a criterion-referenced model of the development of understanding drawn up on the basis of results of earlier work, and the diagnostic assessments were designed to help locate each child at a point on this scale. Outcome tests were compared with initial tests of the same skills. Analysis of the data using structural equation modelling showed that the pre-test measures were a strong determinant of all outcomes, but the experimental group achieved significantly higher scores in tests in reading, mathematics and science than a control group. The criterion tests used, which were traditional multiple-choice, were not adapted to match the open child-centred style of the experimental group's work. Furthermore, of the control group, on average 1 child in 3.7 was referred as having particular learning needs and 1 in 5 was placed in special education; the corresponding figures for the experimental group were 1 in 17 and 1 in 71.
The researchers concluded that the capacity of children is under-developed in conventional teaching so that many are `put down' unnecessarily and so have their futures prejudiced. One feature of the experiment's success was that teachers had enhanced confidence in their powers to make referral decisions wisely. This example illustrates again the embedding of a rigorous formative assessment routine within an innovative programme. What is more salient here is the basis, in that programme, of a model of the development of performance linked to a criterion based scheme of diagnostic assessment.
In example number five (Butler, 1988), the work was grounded more narrowly in an explicit psychological theory, in this case about a link between intrinsic motivation and the type of evaluation that students have been taught to expect. The experiment involved 48 11-year-old Israeli students selected from 12 classes across 4 schools, half of those selected being in the top quartile of their class on tests of mathematics and language, the other half being in the bottom quartile. The students were given two types of task in pairs, not curriculum related, one of each pair testing convergent thinking, the other divergent. They were given written tasks to be tackled individually under supervision, with an oral introduction and explanation. Three sessions were held, with the same pair of tasks used in the first and third. Each student received one of three types of written feedback with returned work, both on the first session's work before the second, and on the second session's work before the third. The second and third sessions, including all of the receipt and reflection on the feedback, occurred on the same day. For feedback, one-third of the group were given individually composed comments on the match, or not, of their work with the criteria which had been explained to all beforehand. A second group were given only grades, derived from the scores on the preceding session's work. The third group were given both grades and comments. Scores on the work done in each of the three sessions served as outcome measures. For the `comments only' group the scores increased by about one-third between the first and second sessions, for both types of task, and remained at this higher level for the third session. 
The `comments with grade' group showed a significant decline in scores across the three sessions, particularly on the convergent task, whilst the `grade only' group declined on both tasks between the first and last sessions, but showed a gain on the second session, in the convergent task, which was not subsequently maintained. Tests of pupils' interest also showed a similar pattern: however, the only significant difference between the high and the low achieving groups was that interest was undermined for the low achievers by either of the regimes involving feedback of grades, whereas high achievers in all three feedback groups maintained a high level of interest.
The results were discussed by the authors in terms of cognitive evaluation theory. A significant feature here is that even if feedback comments are operationally helpful for a student's work, their effect can be undermined by the negative motivational effects of the normative feedback, i.e. by giving grades. The results are consistent with literature which indicates that task-involving evaluation is more effective than ego-involving evaluation, to the extent that even the giving of praise can have a negative effect with low-achievers. They also support the view that pre-occupation with grade attainment can lower the quality of task performance, particularly on divergent tasks.
This study carries three significant messages for this general review. The first is that, whilst the experiment lacks ecological validity because it was not part of or related to normal curriculum work and was not carried out by the students' usual teachers, it nevertheless might illustrate some important lessons about ways in which formative evaluation feedback might be made more or less effective in normal classroom work. The second is the possibility that, in normal classroom work, the effectiveness of formative feedback will depend upon several detailed features of its quality, and not on its mere existence or absence. The third is that close attention needs to be given to the differential effects, between low and high achievers, of any type of feedback.
The sixth example is in several ways similar to the fifth. In this work (Schunk, 1996), 44 students in one USA elementary school, all 9 or 10 years of age, worked over seven days on seven packages of instructional materials on fractions under the instructions of graduate students. Students worked in four separate groups subject to different treatments--for two groups the instructors stressed learning goals (learn how to solve problems) whilst for the other two they stressed performance goals (merely solve them). For each set of goals, one group had to evaluate their problem-solving capabilities at the end of each of the first six sessions, whereas the other was asked instead to complete an attitude questionnaire about the work. Outcome measures of skill, motivation and self-efficacy showed that the group given performance goals without self-evaluation came out lower than the other three on all measures. The interpretation of this result suggested that the effect of the frequent self-evaluation had out-weighed the differential effect of the two types of goal. This was confirmed in a second study in which all students undertook the self-evaluation, but on only one occasion near the end rather than after all of the first six sessions. There were two groups who differed only in the types of goal that were emphasised--the aim being to allow the goal effects to show without the possible overwhelming effect of the frequent self-evaluation. As expected, the learning goal orientation led to higher motivation and achievement outcomes than did the performance goal.
The work in this study was curriculum related, and the instructions given in all four `treatments' were of types that might have been given by different teachers, although the high frequency of the self-evaluation sessions would be very unusual. Thus, this study comes closer to ecological validity but is nevertheless an experiment contrived outside normal class conditions. It shares with the previous (fifth) study the focus on goal orientation, but shows that this feature interacts with evaluative feedback, both within the two types of task, and whether or not the feedback is derived from an external source or from self-evaluation.
The seventh example involved work to develop an inquiry-based middle school science curriculum (Frederiksen & White, 1997). The teaching course was focused on a practical inquiry approach to learning about force and motion, and the work involved 12 classes of 30 students each in two schools. Each class was taught to a carefully constructed curriculum plan in which a sequence of conceptually based issues was explored through experiments and computer simulation, using an inquiry cycle model that was made explicit to the students. All of the work was carried out in peer groups. Each class was divided into two halves: a control group used some periods of time for a general discussion of the module, whilst an experimental group spent the same time on discussion, structured to promote reflective assessment, with both peer assessment of presentations to the class and self-assessment. This experimental work was structured around students' use of tools of systematic and reasoned inquiry, and the social context of writing and other communication modes. All students were given the same basic skills test at the outset. The outcome measures were of three types: one a mean score on projects throughout the course, one a score on two chosen projects which each student carried out independently, and one a score on a conceptual physics test. On the mean project scores, the experimental group showed a significant overall gain; however, when the students were divided into three groups according to low, medium or high scores on the initial basic skills test, the low scoring group showed a superiority, over their control group peers, of more than three standard deviations, the medium group just over two, and the high group just over one. A similar pattern, of superiority of the experimental group which was more marked for low scoring students on the basic skills test, was also found for the other two outcomes.
Amongst the students in the experimental group, those who showed the best understanding of the assessment process achieved the highest scores.
This science project again shows a version of formative assessment which is an intrinsic component of a more thorough-going innovation to change teaching and learning. Whilst the experimental-control difference here lay only in the development of `reflective assessment' amongst the students, this work was embedded in an environment where such assessment was an intrinsic component. Two other distinctive features of this study are first, the use of outcome measures of different types, but all directly reflecting the aims of the teaching, and second the differential gains between students who would have been labelled `low ability' and `high ability' respectively.
The eighth and final example is different from the others, in that it was a meta-analysis of 21 different studies, of children ranging from pre-school to grade 12, which between them yielded 96 different effect sizes (Fuchs & Fuchs, 1986). The main focus was on work for children with mild handicaps, and on the use of the feedback to and by teachers. The studies were carefully selected--all involved comparison between experimental and control groups, and all involved assessment activities with frequencies of between 2 and 5 times per week. The mean effect size obtained was 0.70. Some of the studies also included children without handicap: these gave a mean effect size of 0.63 over 22 sets of results (not significantly different from the mean of 0.73 for the handicapped groups). The authors noted that in about half of the studies teachers worked to set rules about reviews of the data and actions to follow, whereas in the others actions were left to teachers' judgments. The former produced a mean effect size of 0.92 compared with 0.42 for the latter. Similarly, those studies in which teachers undertook to produce graphs of the progress of individual children as a guide and stimulus to action reported larger mean gains than those where this was not done (mean effect size 0.70 compared with 0.26).
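For readers unfamiliar with the metric, the effect sizes quoted in this meta-analysis can be read as standardised mean differences: the gap between experimental and control group means, expressed in units of the pooled standard deviation. A minimal sketch of that calculation follows; the function name and all numerical values are illustrative only and are not taken from any of the studies reviewed.

```python
import math

def effect_size(mean_exp, mean_ctrl, sd_exp, sd_ctrl, n_exp, n_ctrl):
    """Standardised mean difference between two groups, using the
    pooled standard deviation (the usual Cohen's d formula)."""
    pooled_var = (((n_exp - 1) * sd_exp ** 2 + (n_ctrl - 1) * sd_ctrl ** 2)
                  / (n_exp + n_ctrl - 2))
    return (mean_exp - mean_ctrl) / math.sqrt(pooled_var)

# Invented example: an experimental group scoring 7 points above its
# control, with a pooled standard deviation of 10, yields an effect
# size of 0.7 -- of the same order as the mean reported by the
# meta-analysis.
d = effect_size(mean_exp=57.0, mean_ctrl=50.0,
                sd_exp=10.0, sd_ctrl=10.0,
                n_exp=30, n_ctrl=30)
print(round(d, 2))  # 0.7
```

On this scale, an effect size of 0.7 means the average student in the experimental group outperformed roughly three-quarters of the control group.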
Three features of this last example are of particular interest here. The first is that the authors compare the striking success of the formative approach with the unsatisfactory outcomes of programmes which had attempted to work from a priori prescriptions for individualised learning programmes for children, based on particular learning theories and diagnostic pre-tests. Such programmes embodied a deductive approach in contrast with the inductive approach of formative feedback programmes. The second feature is that the main learning gains from the formative work were only achieved when teachers were constrained to use the data in systematic ways which were new to them. The third feature is that such accumulation of evidence should have given some general impetus to the development of formative assessment--yet this paper appears to have been overlooked in most of the later literature.
Some General Issues
The studies chosen thus far are all based on quantitative comparisons of learning gains, six of them, and those reviewed in the eighth, being rigorous in using pre- and post-tests and comparison of experimental with control groups. We do not imply that useful information and insights about the topic cannot be obtained by work in other paradigms.
As mentioned in the Introduction, the ecological validity of studies is clearly important in determining the applicability of the results to normal classroom work. However, we shall assume that, given this caution, useful lessons can be learnt from studies which lie at various points between the `normal' classroom and the special conditions set up by researchers. In this respect all of the studies exhibit some degree of movement away from `normal' classrooms. The study (by Whiting et al., 1995) which is most clearly one of normal teaching within the everyday classroom is, inevitably, the one for which quantitative comparison with a strictly equivalent control was not possible. More generally, caution must be exercised for any studies where those teaching any experimental groups are not the same teachers as those for any control groups.
Given these reservations, however, it is possible to summarise some general features which these examples illustrate and which will serve as a framework for later sections of this article.
It is hard to see how any innovation in formative assessment can be treated as a marginal change in classroom work. All such work involves some degree of feedback between those taught and the teacher, and this is entailed in the quality of their interactions which is at the heart of pedagogy. The nature of these interactions between teachers and students, and of students with one another, will be key determinants for the outcomes of any changes, but it is difficult to obtain data about this quality from many of the published reports. The examples do exhibit part of the variety of ways in which enhanced formative work can be embedded in new modes of pedagogy. In particular, it can be a salient and explicit feature of an innovation, or an adjunct to some different and larger scale movement--such as mastery learning. In both cases it might be difficult to separate out the particular contribution of the formative feedback to any learning gains. Another evaluation problem that arises here is that almost all innovations are bound to be pursuing innovations in ends as well as in means, so that the demand for unambiguous quantitative comparisons of effectiveness can never be fully satisfied.
Underlying the various approaches are assumptions about the psychology of learning. These can be explicit and fundamental, as in the constructivist basis of the first and the last of the examples, or in the diagnostic approach of Bergan et al. (1991), or implicit and pragmatic, as in the mastery learning approaches.
For assessment to be formative the feedback information has to be used--which means that a significant aspect of any approach will be the differential treatments which are incorporated in response to the feedback. Here again assumptions about learning, and about the structure and nature of learning tasks which will provide the best challenges for improved learning, will be significant. The different varieties and priorities across these assumptions create the possibility of a wide range of experiments involving formative assessment.
The role of students in assessment is an important aspect, hidden because it is taken for granted in some reports, but explicit in others, particularly where self and peer assessments by and between students are an important feature (with some arguing that it is an inescapable feature--see Sadler, 1989).
The effectiveness of formative work depends not only on the content of the feedback and associated learning opportunities, but also on the broader context of assumptions about the motivations and self-perceptions of students within which it occurs. In particular, feedback which is directed to the objective needs revealed, with the assumption that each student can and will succeed, has a very different effect from that feedback which is subjective in mentioning comparison with peers, with the assumption--albeit covert--that some students are not as able as others and so cannot expect full success.
However, the consistent feature across the variety of these examples is that they all show that attention to formative assessment can lead to significant learning gains. Although there is no guarantee that it will do so irrespective of the context and the particular approach adopted, we have not come across any report of negative effects following on an enhancement of formative practice. In this respect, one general message of the Crooks review has been further supported.
One example, the kindergarten study of Bergan et al. (1991), brings out dramatically the importance that may be attached to the achievement of such gains. This particular innovation has changed the life chances of many children. This sharp reality may not look as important as it really is when a result is presented dryly in terms of effect sizes of (say) 0.4 standard deviations.
To glean more from the published work, it is necessary to change gear and move away from holistic descriptions of selected examples to a more analytic form of presentation. This will be undertaken in the next five sections.