Research in Extensive Reading
Published as Waring, R. 2001. Research in Extensive Reading. Kiyo, Notre Dame Seishin University: Studies in Foreign Languages and Literature. 25 (1):44-67
Over the last 15 years a considerable amount of experimental research has been published that deals with some aspect of second language Extensive Reading (ER). There have been studies that ask whether subjects can learn from ER (including many incidental learning from reading experiments), other studies that compare ER approaches with other treatments (such as 'normal' approaches or 'translation' approaches), and yet others that have looked at the effect of ER on other aspects of language learning (such as on writing, confidence, and motivation).
Research in ER has been undertaken to demonstrate that language gains of many types occur from exposure to simplified second language texts. Studies by Elley (1991), Hafiz and Tudor (1990), Krashen and Cho (1994), Lai (1993), Lituanas, Jacobs and Renandya (to appear) and Renandya, Rajan and Jacobs (1999), among others, report linguistic gains as a result of ER. Writing ability is said to improve as a result of extensive reading (Elley and Mangubhai, 1983; Hafiz and Tudor, 1990; Hedgcock and Atkinson, 1998; Janopoulos, 1986; Mason and Krashen, in press; Robb and Susser, 1989; Tsang, 1996), as is spelling (Polak and Krashen, 1988). Reading extensively has also been reported to increase motivation to read and to foster a positive attitude to reading in the second language (Cho and Krashen, 1994, 1995; Constantino, 1994; Evans, 1999; Hayashi, 1999; Mason and Krashen, 1997). Oral proficiency was (anecdotally) said to have improved after reading large amounts of text (Cho and Krashen, 1994). There are a considerable number of vocabulary studies that report gains in vocabulary from ER (Day, Omura and Hiramatsu, 1991; Dupuy and Krashen, 1993; Ferris, 1988; Grabe and Stoller, 1997; Hayashi, 1999; Mason and Krashen, 1997; Mason and Krashen, in press; Pitts, White and Krashen, 1989; and Yamazaki, 1996 are just a few examples).
Almost all of this research has been done by researchers who wish to show ER in a good light, and there is considerable cross-citation within this literature which is used as evidence to support the claims made in the research. However, rarely does one find in these citations any critique of this literature; most often it is accepted as fact and cited without comment. The ER research literature (as a body of research) has been severely criticised by many researchers. For example, Coady (1997: 226), referring to some oft-cited ER research, says that "there appears to be a serious methodological problem with these studies". Nation (1999: 124) says that "second language studies ... generally lacked careful control of the research design". Horst, Cobb and Meara (1998: 210) also point out that some of the incidental learning from exposure experiments that are often cited as supporting an ER approach are 'methodologically flawed'. Unfortunately, a detailed examination of these 'flaws' is not made apparent in the papers, and researchers need to be made aware of what problems exist.
As the popularity of ER, both as an approach to learning and as a research topic, has boomed in recent years, it seems timely and pertinent to look carefully and in detail at a broad range of ER studies to determine what we do know about ER, what problems these researchers have had while undertaking this research, and how that can inform future research so that we can learn from past mistakes. Unless we have a solid foundation of research upon which to form our ideas about what ER can do effectively and what it cannot, it will be that much more difficult to promote the need for ER within foreign language learning contexts. Thus, my concern is with the quality, reliability and accountability of much, but not all, of the research surveyed below (and the claims emanating from it) as a canonical base upon which the house of ER rests.
It should be mentioned at the outset that I am not at all against ER. Indeed, I have been actively promoting the development of ER for a number of years now (e.g. Waring, 1997; Waring and Takahashi, 2000). However, my own reading of the ER literature has raised nagging doubts about the quality of the research base upon which claims have been made for ER. My aim in looking at this research carefully and in detail is to assess what has been found and to ascertain what seems reliable and what does not. This survey also seeks to identify what we do not know about ER but need to know. This paper is not primarily meant as a review of findings (although many of these will be mentioned in passing); the central focus is on issues concerning research design, assessment methodology and other issues related to researching ER.
This survey reviews 28 pieces of research (sometimes there are three experiments in one paper) into various aspects of ER in second languages. Not all these experiments directly assess the effectiveness of ER, nor do all of them claim to be ER based, but they have been included because they have been used by researchers as providing support for claims that ER is beneficial to second language learners. If we wish to know something about ER, and the claims made about ER, we need to examine these studies as part of the body of second language ER research, precisely because they are cited as evidence for ER.
A critique of the L2 ER studies
It needs to be stated plainly that we know very little about reading and about the assessment of reading. Alderson (2000) makes the point that in order for us to be able to say something about reading we have to know what reading is. He goes on to say that
in order to devise a test or assessment procedure for reading, we must surely appeal, if only intuitively, to some concept of what it means to read texts and understand them. How can we possibly test whether somebody has understood a text if we do not know what we mean by 'understand'? How can we possibly diagnose somebody's 'reading problems' if we have no idea what might constitute a problem and what the possible 'causes' might be? How can we possibly decide what 'level' a reader is 'at' if we have no idea what 'levels of reading' might exist, and what it means to be 'reading at a particular level'? In short those who need to test reading clearly need to develop some idea of what reading is, and yet that is an enormous task (pp. 6-7).
This should not stop us trying to find out about reading in second languages, but research into second language extensive reading will always be fraught with problems. Firstly, the volume of reading that subjects have to undertake in order that the research can be labelled 'Extensive Reading' means that it will take a lot of time. This necessitates that the reading be done over many sessions, often out of class and in non-controlled environments, which naturally brings up cries of contamination due to outside influence during the period of the experiment. Secondly, research in ER has to be done in real classrooms under real conditions, which means that conditions for investigating the nature of ER will be less than ideal and it will not always be possible to control all variables. It is just not practical in most circumstances to cleanly control all variables other than the ones we are looking at when researching ER, so we must learn to be happy with doing our best. However, the research surveyed here does not always show that the researchers did their best to control variables. Indeed, quite a number of studies are deeply flawed either methodologically or in execution. In this section I shall review some of the areas of concern that have become apparent in this research.
The range of questions that have been investigated
Research into language gains and gains in affect (e.g. confidence and motivation) from ER in second languages is still in its infancy, is quite fragmented, and is rather difficult to interpret when looking for concrete ‘evidence’. However, a number of points are clear.
Firstly, this research has mostly been conducted with the learning of English and on Asian and Oceanian learners. In particular, there is very little widely known research into the second language learning of Mandarin, French, Spanish, Arabic and other major world languages. Only two of the studies surveyed here did not look at the learning of English (Dupuy and Krashen, 1993 looked at the learning of French, and Pitts, White and Krashen, 1989 investigated the learning of pseudo-words). Secondly, quite a number of these studies seem to have used convenience populations (i.e. those available in the researcher's own classes) and/or were conducted on highly educated individuals (those at college) rather than a more 'normal' profile of the population in general. Thirdly, there is also a tendency to use populations of individuals who are already proficient at learning second languages (e.g. English majors) and a tendency to use upper-streamed students rather than lower-streamed ones (see Evans, 1999 for one example). Fourthly, there is also a very narrow ability range of subjects who have been investigated, most of whom might be considered 'intermediate' level. Fifthly, most research has been conducted with adults rather than children, and in foreign language environments rather than second language environments. This narrow range is rather troubling. We cannot hope to have much to say about ER in general until we have extensive amounts of research into language learning from ER across a whole range of languages, ages, educational backgrounds and so on.
When we look at the methodology of the research that has been done, quite a different picture appears, and almost the full range of experimental variables has been explored. There are experiments that have been conducted on individuals as case studies (Cho and Krashen, 1994; Grabe and Stoller, 1997) and on very large populations (e.g. Elley and Mangubhai, 1983; Lai, 1993). There are experiments that have investigated learning from short texts (e.g. only 1032 words in Day, Omura and Hiramatsu, 1991) to very large amounts of text (e.g. a graded reader a day in Lai, 1993; 1500 pages over several months in Mason and Krashen, in press; 18 graded readers in 9 weeks in Yamazaki, 1996; and 161,000 words by the Korean student Jin-hee in Cho and Krashen, 1994). There are experiments that have lasted up to two years (e.g. Elley's 'book flood' experiments, 1991) and some that were over in minutes (e.g. Day, Omura and Hiramatsu, 1991). Some research into ER has investigated language development in children, and other research in adults. Some studies were with monolingual groups while others were with subjects of varied backgrounds.
There is also a wide range of testing instruments that have been used. Some studies (e.g. Laufer-Dvorkin, 1981) used a battery of in-house general proficiency tests while others used standardised commercially available tests (Evans, 1999 used the KET; Hafiz and Tudor, 1989 used the NFER tests; Hayashi, 1999 used the TOEFL; and Lituanas, Jacobs and Renandya, to appear, used the Informal Reading Inventory and the Gray Standardised Oral Reading Test). Sometimes essays were written pre- and post-treatment and assessed for gains in writing ability (e.g. Hafiz and Tudor, 1990; Mason and Krashen, in press), and sometimes in-house research-specific tests were used (e.g. Day, Omura and Hiramatsu, 1991; Pitts, White and Krashen, 1989).
From the survey of 28 pieces of experimental research several types of study are apparent. There are studies which compare ER with other treatments, and others which seek to show how ER benefits other language skills (e.g. the effect of ER on writing or on vocabulary building). Others only wish to determine whether ER can lead to gains in language development from exposure to text. These categories are by no means clearly defined and some studies can fit the profile of two or more. Each of these areas will be surveyed.
Studies comparing ER to other treatments
The focus of these studies is to compare ER to other treatments or approaches. This type of study (together with the 'gains from exposure' literature) makes up the majority of studies in L2 ER research. There are two sub-groups: studies comparing ER with another treatment (such as a ‘normal’ class), and those that compare ER under different conditions (such as ER with book reports written in the L1, compared with ER with book reports written in English that were corrected, and ER with book reports written in English that were not corrected). There are several concerns with much of this research.
Firstly, in several studies (e.g. Evans, 1999; Mason and Krashen, all three experiments, 1997; Robb and Susser, 1989; and Yamazaki, 1996) extra time for contact with English was given to the experimental (ER) group. For example, in Robb and Susser (1989) the experimental group had to read 500 pages out of class during the year whilst the control group only had a short extra assigned reading per week. In Evans (1999) the ER group had extra reading while the controls did not. With this design we cannot see the comparative benefit of ER over other methods, as the greater exposure in one group will bias the results towards that group; we should therefore be cautious in interpreting these studies as evidence for the effectiveness of ER over other methods.
Secondly, the data for a considerable number of these studies were probably affected by outside influences where the tuition variable was not controlled (this also applies to the 'gains from exposure' literature); some of this contamination was reported in the studies and some was not (see below). The most common factor influencing the study was the presence of concurrent classes or tuition that were not part of the study (Evans, 1999; Hayashi, 1999; Lai, 1993; all three experiments in Mason and Krashen, 1997; Mason and Krashen, in press; Renandya, Rajan and Jacobs, 1999; Robb and Susser, 1989; Tsang, 1996; and Yamazaki, 1996). In one study (Hafiz and Tudor, 1989), which is probably the most cited ER study, the data were collected in the UK despite the subjects living in a Punjabi community. The effect of outside exposure in the community at large and from their other classes at school was hardly mentioned as influential in the study. This makes it extremely difficult to attribute the gains directly to ER, or to determine how much of the gains were due to ER alone rather than to the other tuition. As was mentioned earlier, the nature of assessing ER is that it will take time, and practicalities demand that it be done with real classes. It is therefore vital to try to minimise the effect of external influences, and to report as fully as possible how they may have affected the results so that correct interpretation is possible.
Thirdly, ER is typically compared with instructional approaches which do not have the benefit of the ‘rich’ environment of the ER approach (Coady, 1997). Comparisons are made with ‘audiolingual approaches’ (Elley, 1991), or ‘translation’ (Yamazaki, 1996), or ‘regular classes’ (Mason and Krashen, 1997, experiment 2), or classes which were ‘taught in the conventional way’ (Lituanas, Jacobs and Renandya, to appear). The question of how ER compares to other rich environments has yet to be resolved.
‘Gains in writing' experiments
This research asks whether writing ability can be affected by ER (Elley and Mangubhai, 1983; Hafiz and Tudor, 1990; Hedgcock and Atkinson, 1998; Janopoulos, 1986; Mason and Krashen, in press; Robb and Susser, 1989; and Tsang, 1996 are but a few). A typical design is as follows. Students are given an essay test, they read something, and they are given another essay test (most often with the same title, but not always). The essays are then scored on a variety of measures to check for differences pre- and post-ER. Some studies (e.g. Mason and Krashen, in press) used only quantitative data such as the number of words used, the number of clauses, the number of error-free clauses and so on. Other studies used a holistic evaluation (e.g. Mason and Krashen, 1997, experiment 2) and yet others evaluated factors such as coherence, cohesion, organization, logical progression and overall impression (Tsang, 1996). It is important to note clearly when citing this research that different procedures were used in the ‘effects of ER on writing’ research, because the analyses are looking at different things. The advantage of quantitative measures is that they can easily be analysed by computer, but the disadvantage is that they do not indicate the quality of the writing. Thus in these types of analysis it may be best to combine all of these factors, as Tsang (1996) did.
‘Gains in affect’ experiments
This research looked at whether an ER approach has a positive effect on motivation, confidence and the general perception of the usefulness of ER. The term 'pleasure' that is attached to this type of reading research is used in two ways. The first refers to reading that is not done as part of school work but of the reader's own free will. The second meaning occurs in research that asks about the reader's subjective reaction to ER.
The positive effect of ER on motivation and attitude to reading is very commonly reported and is probably the strongest finding in all the papers reviewed here (e.g. Constantino, 1994; Evans, 1999; Elley, 1991; Mason and Krashen, 1997, in press; Hayashi, 1999; Yamazaki, 1996). Some of these data come from formal post-reading interviews, but much of this evidence is anecdotal. While there are measures of motivation (e.g. Smith, 1973) and ways to measure reading confidence, none of these has yet been used to provide quantitative data.
Quite a number of studies have asked what readers feel about their reading and whether it was ‘pleasurable’. McQuillan (1994) and Dupuy (1997) found that ER is preferred to grammar instruction and practice and to assigned readings. However, it should be noted that preferences for other types of ‘pleasurable’ language instruction, such as listening to music, watching videos, free conversation, surfing the Internet and so on, were not investigated, which leaves open the question of a preference for ER over these other ‘pleasurable’ language pursuits.
'Gains from exposure' experiments
Experiments that have assessed gains from exposure to ER texts (most often they are called 'incidental learning' experiments) seek to demonstrate how much (usually vocabulary) has been learned. (There have been no studies that I know of that have directly researched the acquisition of grammar or syntax from being exposed to ER, although some of the 'gains in writing' experiments have been suggested as evidence for this.) This survey has found numerous problems with the 'gains from exposure' experiments and a few points will be made below.
Lack of quality control in test construction
It is very common in this research for the vocabulary or cloze tests to be written by the authors (e.g. Day, Omura and Hiramatsu, 1991; Pitts, White and Krashen, 1989; Mason and Krashen, 1997; Mason and Krashen, in press; Yamazaki, 1999). Some of these in-house tests were subjected to extensive piloting and review (e.g. Yamazaki, 1999) while most were not (or at least were not reported to have been piloted and trialled, nor assessed for their quality). Some tests appear to be either of poor quality, or insufficient care seems to have been taken in their construction. The apparent lack of quality control (and even the lack of any mention of quality control procedures) in some of these tests is a matter of grave concern, since the data gathered rest entirely on the quality of these tests. In addition, only two of the 12 experiments that used their own test instrument published the test with the report.
Problems with the most commonly used test format for assessing gains from ER
The most common vocabulary test used for ER 'gains from exposure' research is the multiple-choice test (e.g. Day, Omura and Hiramatsu, 1991; Dupuy and Krashen, 1993; Pitts, White and Krashen, 1989). There are numerous reasons why this test may not be the most appropriate for assessing gains from exposure to ER texts. Firstly, the multiple-choice test is very limited in its ability to assess gains from reading, as it ignores many of the other potential gains or benefits from the reading of an extended text. This test attempts to assess prompted recognition, but other potential linguistic benefits that are largely ignored by multiple-choice tests include lexical access speed gains; the noticing of collocations, colligations or patterns within text; the learning of new word forms and the meanings of new words; the recognition of new word forms yet to be learned; an increase in the ability to guess from context; a (dis)confirmation that a previously guessed word's meaning is probably correct; recognition of new word associations; the raising of the ability to recognize discourse and text structure; an increase in the ability to 'chunk' text; the development of saccadic eye movements; and so on. Thus many 'gains' from ER are ignored by the multiple-choice test and many potential benefits of ER are underestimated.
Secondly, in addition to the inability of the multiple-choice test to capture many aspects of reading, the design of the test compounds the problem because of the nature of the test's criteria for successful completion. Multiple-choice tests are designed to assess receptive understanding and are either correctly answered or not, and as such they both ignore and underestimate language gains at the same time. First, we need a little background. It is widely stated that all words are not equal, as some are more frequent than others and some are 'easier' to learn (e.g. Laufer, 1997). There is also general agreement that most words are not learned in one meeting, but need many meetings for the sight-sound correspondences to be made and for receptive understanding of the word to take place (Nation, 1990, 1999). Research seems to indicate that it takes an average of about 10-20 meetings before a word is known receptively, with each meeting adding to the knowledge about the word until a certain threshold of knowledge is gathered that allows successful understanding, or successful completion of a test item (Saragi, Nation and Meister, 1978; Nation, 1990, 1999). The threshold for success on multiple-choice vocabulary tests is little understood, but these tests (and other tests with right/wrong criteria for success) are severely limited in that they can only reflect knowledge of the words that have met the 'success threshold' as a result of the reading. For example, if a learner has met the word abominable two times before the reading and meets it once more during the reading, then although the learner has gained a little piece of knowledge about the word (such as a greater awareness of its general meaning or its spelling), he would not have enough knowledge to tip it over the 10 to 20 meetings threshold into success.
Thus, the learner's gain from reading abominable once is ignored by the strict criteria for success of the multiple choice test and he gets zero on the test. Conversely, if a learner knew enough about abominable to meet the criteria for success before reading the text, and by reading the text her knowledge of abominable increased, this increase in knowledge also cannot be measured by the multiple-choice test and thus it will underestimate her gains.
Thirdly, this threshold is not uniform across all multiple-choice tests. A test with distractors with similar meanings (anger, irritate, annoy, and frustrate) will be more difficult than one in which the distractors are dissimilar (boat, tree, cat, and hospital). It is likely that a learner will have more trouble determining the correct answer from a set of similar words, as more knowledge is required to separate them. This means that results will vary considerably between multiple-choice tests with different distractor sets. Thus the results of an experiment can only be properly interpreted when the test is published with the research. In addition, there is no common agreement on the number of choices to be used in multiple-choice tests in ER research. Dupuy and Krashen (1993) used 3; Pitts, White and Krashen (1989) and Day, Omura and Hiramatsu (1991) used 4. Fortunately, all three used a 'don't know' option to reduce guessing.
It is therefore clear that the full nature of vocabulary learning from the reading is not captured by the use of a multiple choice test and more sensitive measures (Joe, Nation and Newton, 1996) than multiple choice tests are necessary to capture the full nature of learning from exposure (see Waring, 1999 for a fuller discussion of these matters).
Lack of control for guessing
Some studies that used multiple-choice tests did not correct for guessing (e.g. Dupuy and Krashen, 1993) while others did (e.g. Day, Omura and Hiramatsu, 1991; and Pitts, White and Krashen, 1989). The guessing factor is important because uncorrected raw scores inflate estimates of true knowledge and leave misleading data. If a 40-item test has 4 choices (three distractors and a correct item) and the test taker knows none of the words, then wild guessing will mean a score of 40/4 = 10. If the test taker knows 20 items and guesses at a further 16, then her uncorrected score is likely to be 20 + (16 items / 4 choices) = 24. Although the guessing factor reduces with ability level, as there are fewer items to guess at, it is a major factor for lower ability learners or for learners who have a tendency to guess. Thus it is crucial that guessing be controlled for in multiple-choice tests, and correcting scores for guessing is most often better than not adjusting the scores at all.
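The expected effect of wild guessing, and a standard formula-scoring correction, can be sketched as follows. This is an illustrative sketch only (the studies cited above did not necessarily use this exact formula); the figures mirror the hypothetical 40-item example above.

```python
# Formula-scoring correction for blind guessing on a k-choice test:
#   corrected = R - W / (k - 1)
# where R = right answers, W = wrong answers, k = choices per item.

def correct_for_guessing(right, wrong, choices):
    """Adjust a raw multiple-choice score for blind guessing."""
    return right - wrong / (choices - 1)

# A test taker who truly knows 20 of 40 items and guesses wildly at 16
# more gets, on average, 16/4 = 4 extra items right by chance:
raw_right = 20 + 16 // 4   # = 24, the inflated raw score
raw_wrong = 16 - 16 // 4   # = 12 guessed items answered wrongly

corrected = correct_for_guessing(raw_right, raw_wrong, choices=4)
print(corrected)  # 20.0 -- the correction recovers the true knowledge
```

On average the correction removes exactly the score inflation produced by chance success, which is why corrected scores are usually a better estimate of true knowledge than raw scores.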
These matters become especially relevant when the tests contain very few items. For example, Day, Omura and Hiramatsu (1991) tested only 17 items, and Dupuy and Krashen (1993) tested 30. At the other end of the spectrum we have Cho and Krashen (1994), who assessed each of their case studies on several hundred words that they underlined from their reading. Nation (1993) points out that the sample size is a crucial factor in determining whether a test is reliable. If the sample size is too small there is a high chance of statistical error. He says that statisticians have determined the confidence interval within which an observed score should be seen, pointing out that "if a learner's observed score on the test was 50 out of 100 (50%) we could be 90% sure that the true value of his or her score lay between 42 (42%) and 58 (58%) out of 100 (i.e. a range of plus or minus 8)" (pp. 35-36). In other words, a 50% score on a test means that we can only be 90% sure that the subject's true score is between 42% and 58%, not that it is exactly 50%. Nation points out that if a test of 100 items has a 16% confidence window (42 to 58), then a test with a much smaller sample size will have a much greater confidence window, which makes the test less reliable. A test with only 17 items would most probably be quite unreliable from this point of view.
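Nation's figures can be reproduced, approximately, with a standard normal-approximation confidence interval for a proportion. The sketch below is illustrative and assumes this approximation (Nation may have used a slightly different method), but it shows how sharply the window widens as the number of items shrinks.

```python
import math

def confidence_window(score, n_items, z=1.645):
    """Approximate 90% confidence interval (in percentage points)
    for an observed proportion-correct score on n_items test items,
    using the normal approximation to the binomial."""
    p = score / n_items
    margin = z * math.sqrt(p * (1 - p) / n_items)
    return ((p - margin) * 100, (p + margin) * 100)

# Nation's example: 50/100 gives roughly 42%-58% (plus or minus 8).
print(confidence_window(50, 100))

# The same 50% score on a 17-item test gives a far wider window,
# roughly plus or minus 20 percentage points:
print(confidence_window(8.5, 17))
```

The margin shrinks only with the square root of the number of items, so a 17-item test is more than twice as imprecise as a 100-item test, not six times more precise than one might hope.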
There are other equally serious factors that are affected by item sample size. It is a common finding in the L2 ER experiments surveyed here that the gains from reading are low. Horst, Cobb and Meara (1998) report an average of 10 to 20% gains in short experiments, and much lower figures for longer texts (but no retention data are given) (see below for other reasons why even these low estimates may be inflated by the careful selection of tested words). One possible reason for this apparently low intake in these experiments with multiple-choice tests is the relationship between the opportunity for success and the number of chances to demonstrate the learner's knowledge. We have seen that each word takes time to pass the 'success threshold', and we know that it takes between 10-20 meetings for this threshold to be met. Thus if a test has 60 items and each word is met only once, we can expect only 1/10 to 1/20 of these 60 words to pass the 'success threshold', or a maximum of 6 (60/10) or 3 (60/20) words to be gained. If the test item sample is only 20, then we can expect only one word to pass the threshold, and that is not enough to provide reliable data.
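The threshold arithmetic above can be made explicit with a small sketch. This is purely illustrative of the reasoning in the text: it assumes that each tested word is met exactly once during the reading and that, on average, one word in every 10 to 20 sits just one meeting short of the threshold.

```python
# If a word needs 'meetings_needed' meetings in total to cross the
# 'success threshold', and each tested word is met once more during
# the reading, only about 1/meetings_needed of the tested words can
# newly register as 'known' on the post-test.

def expected_new_successes(n_items, meetings_needed):
    """Expected number of tested words tipped over the threshold."""
    return n_items / meetings_needed

for n in (60, 20):
    low = expected_new_successes(n, 20)   # pessimistic: 20 meetings
    high = expected_new_successes(n, 10)  # optimistic: 10 meetings
    print(f"{n}-item test: expect {low:.0f} to {high:.0f} new 'known' words")
```

With a 60-item test this gives 3 to 6 words, and with a 20-item test only 1 to 2, which is consistent with the low gains commonly reported and far too few successes to yield reliable data.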
Few data on retention
Another very common element in the 'gains from exposure' research is the lack of concern for the retention of what was learned. Only one of the studies under investigation attempted to systematically gather retention data (Yamazaki, 1999). Retention data are important because they give us an idea of the quality, not only the quantity, of the learning that occurs from exposure to the reading texts. Further, as most of the tests were given immediately after treatment, there is a very high probability that the subjects scored higher on the test than they would have if the test had been given even a few hours or days later, due to the nature of short-term memory loss (Baddeley, 1997). Thus the 'real' and lasting gains demonstrated in the research have probably been over-estimated. This result was found in Yamazaki (1999) and is common throughout the second language vocabulary learning literature (see Weltens and Grendel, 1993 for a discussion). We should therefore be cautious about accepting the reported gains in this kind of research at face value, as we can expect a certain level of over-estimation due to the nature of language loss.
Controls not exposed to the target vocabulary
In some of the research that looked at how much can be learned from exposing subjects to a text, the controls were not exposed to the tested vocabulary. The assumption is that if the controls do not see the vocabulary, the learning attributable to the reading can be measured. This design was used in the Pitts, White and Krashen (1989) replication of the Saragi, Nation and Meister (1978) Clockwork Orange study, in which the subjects met 30 nadsat words (special vocabulary that only occurs in that book). Other studies that did not expose their controls to the tested vocabulary include Day, Omura and Hiramatsu (1991), Ferris (1988) and Hafiz and Tudor (1990). (In two 'gains from exposure' studies under review (Evans, 1999 and Lai, 1993), comparison groups were mentioned and tested but, confusingly, were not compared with the experimental groups, which raises questions as to whether the authors understood the design.)
'Gains from exposure' designs where the controls were not exposed to the tested vocabulary can tell us how many words were learned from exposure to an ER text. However, it is important to note that these studies cannot tell us anything important about whether ER ‘works better’ than any other treatment for language gains such as vocabulary. This is because the same language gains found in these studies might have been made more effectively through another treatment (say, direct vocabulary learning, or working on improving dictionary skills). Thus these studies are basically saying ‘we gave the subjects something to read and they learned something’ or ‘subjects can learn X amount from reading’, and nothing more. This crucial point seems to have been missed by many researchers, because it is very common for these studies to be cited as examples of how effective ER is, when in fact no such conclusion could or should be drawn, as no comparisons were made in the studies and, by definition, things can only be considered effective when they are compared to something else.
Other general concerns
Several other types of concern are evident in this body of research.
Quite a number of these studies were probably influenced by contaminating factors, and some examples have already been mentioned. Sometimes the contamination was faithfully reported (e.g. Elley, experiment 1, 1991; Evans, 1999; Robb and Susser, 1989; Yamazaki, 1996) and in other studies it went unreported (e.g. Horst, Cobb and Meara, 1998; Mason and Krashen, 1997, in press).
Several types of contamination were evident. Firstly, the subjects did not finish all their reading (Pitts, White and Krashen, 1989), or the same children were used as both the experimental and the control group (Elley, experiment 1, 1991). Secondly, contamination was evident when the instruction given to the control and treatment groups was very similar. For example, in Robb and Susser (1989) both the treatment group and the control group received reading strategy instruction, and in Lituanas, Jacobs and Renandya (to appear) 45% of the experimental class's instruction was the same as the control group's. Thirdly, in Dupuy and Krashen (1993), for example, the subjects were told to expect a test at the end of the reading and viewing; in an academic setting, students who are told they will be tested can be expected to try extra hard to do well, and this may have inflated the results above a 'natural' acquisition level. Fourthly, in other studies Hawthorne contamination effects were in evidence. These effects occur when a new element is introduced to the study. For example, in the REAP study in Elley (1991), some of the teachers taught both control and experimental groups, and new materials were introduced.
Another factor that needs to be discussed is pre-treatment ability level and the importance of controlling for it. In some studies the pre-treatment ability levels were controlled or matched with similarly performing pairs in other groups (e.g. Elley and Mangubhai, 1983; Lituanas, Jacobs and Renandya, to appear; and Robb and Susser, 1989), or the subjects were randomly assigned to groups (Day, Omura and Hiramatsu, 1991), while in other studies ability levels were not controlled (Dupuy and Krashen, 1993; Lai, 1993) or there was no randomisation of individuals because intact classes were used (e.g. Dupuy and Krashen, 1993; Mason and Krashen, in press).
The lack of control for ability level can have adverse effects on the experiment because there is a definite advantage to the lower ability learner, who in normal circumstances can be expected to learn more in a given time than an advanced learner. From a vocabulary perspective, Nation (1997) has demonstrated that because a beginner meets many more unknown words when reading than an advanced learner does, she has more opportunities to pick up new language than an advanced learner, who has to read much more to meet the same number of unknown words. Thus in experiments where the pre-treatment ability levels of the subjects are not controlled for, we can expect greater gains to be shown for beginners than for intermediate or advanced subjects, provided both groups read the same amount of graded material. Similarly, in experiments where beginners and advanced learners read the same text, we can expect beginners to have more chances to pick up language than more advanced learners. This implies that controlling for pre-treatment ability level is crucial to getting reliable results.
There are two qualifications to this position. Firstly, if motivation is lacking, then the weaker students might not make many gains despite the presence of much unknown language. In Lai (1993), where all three groups were given a book a day as summer reading, one group (S2, an initially stronger group) made far larger gains on the reading test than the other two. Lai suggests that motivation may have been a factor in explaining why the weaker learners did not gain as much as the more advanced learners. Secondly, there is probably a threshold below which learners may not be able to take advantage of being exposed to more unknown language. This was hinted at in several pieces of research (e.g. Laufer-Dvorkin, 1981; Lai, 1993). If there is too much new input and it is not comprehensible, then there are likely to be few gains. Conversely, if the input contains little new language there will be few chances to learn and few chances to demonstrate learning. Laufer (1989) and Liu and Nation (1985) have shown that unless 95% or more of the words in a text are known, the probability of successfully guessing unknown words (learning) will be severely reduced; Nation (1999) suggests coverage should be at least 98%. Thus, if the text is too difficult the weaker subjects will not be able to guess (learn) successfully, while the advanced ones will be limited by knowing most of the words anyway and will thus meet fewer unknown words and structures. In addition, the beginning level subjects may not be able to learn much because they cannot comprehend the surrounding text well enough to take advantage of all the new language. Therefore, if research is conducted where a mixed ability class all read the same text, the subjects' chances of taking advantage of that text are limited by their ability.
Both these points imply that more accurate results of the effect of a reading text on learners of a particular ability level will be obtained by finding learners with similar pre-treatment abilities, and that mixing learners of different abilities may confuse the issue.
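The coverage thresholds discussed above can be made concrete with a small sketch (the figure of 300 running words per page is an illustrative assumption, not taken from any of the studies reviewed):

```python
def unknown_words_per_page(coverage, words_per_page=300):
    """Unknown running words met per page, given the proportion of
    known words (coverage). 300 words/page is an assumed figure."""
    return round((1 - coverage) * words_per_page, 1)

# At the 95% coverage threshold a reader meets about 15 unknown words
# per page; at the suggested 98% coverage, only about 6.
print(unknown_words_per_page(0.95))  # 15.0
print(unknown_words_per_page(0.98))  # 6.0
```

At this assumed page length, the guideline of two or three unknown words per page corresponds to coverage of around 99%.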
In some studies there was excellent reporting and in others there was very little detail. For example, we know very little about the effect of the subjects' background in learning French in the Dupuy and Krashen (1993) study. In other studies the amount of reading done was left unreported (e.g. Elley and Mangubhai, 1983; Elley, 1991, experiment 2; and Constantino, 1994), or there was insufficient reporting on how much was read. Not knowing how much was read makes interpretation almost impossible, and a lack of detail can also affect interpretation. A common problem is for the researcher to report how many books were read rather than how many pages or how many words. If both advanced and beginning learners read the same number of books, the beginner would read easier, more heavily illustrated books, which are usually shorter than those an advanced learner would read, so the page counts differ. Reporting page counts is better than simply counting books, but it is preferable to report the number of words read (though, as publishers do not indicate the length of their books, this may be too troublesome for researchers to calculate).
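The difference between counting books and counting words can be illustrated with hypothetical figures (all the page and word counts below are invented for illustration, not data from any study):

```python
def words_read(books, pages_per_book, words_per_page):
    """Rough estimate of total running words read. All inputs here
    are hypothetical illustration figures, not data from any study."""
    return books * pages_per_book * words_per_page

# Ten 'books' each, yet a tenfold difference in the amount actually read:
beginner = words_read(10, 16, 150)   # short, heavily illustrated readers
advanced = words_read(10, 80, 300)   # longer, denser readers
print(beginner, advanced)  # 24000 240000
```

Two learners could thus be reported identically ('ten books read') while differing by an order of magnitude in exposure.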
Full reporting is also needed so that studies can be replicated (Waring, 1997). The nature of ER research means that replication will be difficult; however, this is not to say that procedures should not be put in place to ensure that replication can be done. Unfortunately, much of the L2 ER research cannot be replicated because it was specific to a particular group, group-specific tests were used, and there was insufficient reporting to allow careful reconstruction.
Do findings for children apply to adults?
Many of these studies assessed the effectiveness of ER with children (i.e. those under about 15) learning second languages. This research on children is widely cited as relevant to L2 ER without the qualification that children learn differently from adults, and it is not altogether obvious that this research necessarily applies to adults, or vice versa. There are crucial differences that may give us pause before assuming that they are the same. Firstly, children are characterized as learning freely and naturally, without much apparent analysis, whereas adults learning second languages are characterized as requiring much more effort. Secondly, the testing procedures used in some of this research correspond much more closely to forms of assessment used with L1 children than with adults. Thirdly, many of the younger children in some of these studies would not yet have developed many of the cognitive strategies necessary for dealing with longer texts in second languages, and may not be as able as adults to benefit from ER. This has been little explored.
Another concern centres on the applicability of L1 tests to L2 subjects. Hafiz and Tudor (1990) and Lituanas, Jacobs and Renandya (to appear) both used assessment instruments that were designed for L1 rather than L2 subjects, and their applicability to L2 subjects has not yet been explored.
A longer term is needed to internalise
Laufer-Dvorkin (1981) concluded that the nature of the treatment meant that there was unlikely to have been sufficient exposure to the target vocabulary to make a difference. Lai (1993) also hinted at this as an explanation of why the weaker group did not progress as well as the others. Tsang also suggested that the 'lack of gains ... may be caused by insufficient input' (1996: 227). This raises the question of what we mean by 'extensive' reading. Susser and Robb (1990), reviewing various applications of 'extensive', found that they ranged from a page per day to at least two books a week. If we are to label a piece of research as relevant to ER then we need a common understanding of what we mean by 'extensive'. Further work in defining 'extensive reading', and standardisation of this definition, is necessary if we are to compare like with like. Nation and Wang (1999) suggest that 'a book a week at the student's ability level' is sufficient for enough vocabulary recycling to take place for learning to be possible. This amount of reading seems an adequate benchmark for reading to be called 'extensive'.
Citing the work of others
It is common practice within research to cite the work of others to defend or add weight or evidence to one's argument. This is very much in evidence in this literature. Some of the citations have been very clear about the research and have mentioned shortcomings and qualifications where necessary (e.g. Tsang). However, there are also papers which cite the ER research literature as fact, with little regard for the problematic nature of much of the research. More worrying are the occasions when results are cited that bear little relation to what the research actually said. Indeed, on occasion a piece of research is so mis-cited that it is almost unrecognisable from the original. It is to be hoped that there will be thoroughness and accuracy in the reporting of this literature, particularly for research of which copies are difficult to locate.
Finally, it needs to be mentioned that rather odd statistical data are reported in some of the most widely-cited studies. The source of these errors may be typographical, on the part of either the author or the publisher, or may be a more serious problem of incorrect statistical procedures, but either way it raises concerns about the thoroughness of the research or the level of care taken in presenting the work.
The review has raised more questions about L2 ER research than it answered. Despite the problems mentioned above it is almost certain that measurable gains for learners reading extensively can be found. However, the extent and type of these gains is unknown for various input conditions and we do not know which conditions ‘make a difference’.
The premise behind some of these studies is to demonstrate that learners can learn from ER when, in fact, it is rather a moot point whether learners can learn from reading. Of course learners learn new language from input; how else would they learn? Meara (1997) suggests this is like putting seeds in a pot only to confirm that they will grow into flowers. Thus this avenue of research is somewhat of a dead end once the initial studies have confirmed the truism. The important question is 'how much is learned, and how well is it learned?'
It is important to make a distinction between studies looking at gains from extensive reading and those looking at incidental learning from reading. When assessing gains from exposure to extensive reading we should expect low gains, and several reasons for this have already been presented. Commonly, it is recommended that students read at a level where 95% to 98% of the words on the page are understood (Nation, 1999), or where there are two or three unknown words on a page (Waring and Takahashi, 2000). It is precisely because ER is done at levels where few new items are introduced that little vocabulary will be 'learned'. Thus massive amounts of reading will be needed to provide enough input to make a difference to vocabulary gains, and many of the shorter studies reported here may not have been able to reflect natural gains from extensive reading. This is not to say that gains in new language are the only important reason second language learners should read extensively. Other excellent reasons include building reading fluency and reading speed, developing lexical access speed, building reading confidence and motivation, and so on. However, if research is looking at incidental gains from reading and, for the purposes of the research, the researcher does not consider it important that the text is long and graded to the student's level, then we can expect results to vary depending on the ability level of the learner, the amount of grading and the amount of reading done. Thus when citing this research it is important to understand that the one type of study can be expected to show different gains from the other, and to cite them together without mentioning the difference may be misleading.
Horst, Cobb and Meara, looking from an 'incidental' learning perspective, suggest that "one way of improving the methodology of this kind of study would be to test much larger numbers of potentially learnable words in order to ensure that the subjects have ample opportunity to demonstrate incidental gains" (1998: 219). The assumption is that we need to understand incidental vocabulary learning from ER, so if we create conditions where more of it occurs, then we will be better able to understand both the process and the product. This could be done either by testing more words that are likely to be tipped over the 'success threshold' (i.e. those words which the students are expected to partially know already), or by having words repeated many times in the text, thus raising the chance of success. Several researchers have already tested only words that the learners are likely to know, or modified the input to make the words more available to the subjects (e.g. Day, Omura and Hiramatsu, 1991), and this also needs to be considered when interpreting 'gains'.
This reviewer is not convinced by this avenue of research for ER, because if researchers test only items which are likely to be learned, then greater gains will be shown than would have occurred naturally from exposure, and these results will cloud rather than clear the picture of natural gains. This means one needs to be cautious when comparing studies and looking at language gains from this kind of research design, as it may greatly over-estimate natural gains (depending on how much manipulation occurred). It is therefore important, in experiments that seek to ask how much is naturally learned from exposure to ER, that the words selected for assessing gains be randomly and naturally selected, and that the tests be published with the research.
This has implications for the comparison of ER studies that look at vocabulary gains. Some studies (as part of the research design) have looked at the effect of frequency on the acquisition of vocabulary from reading and have controlled the frequency of the test words in order to ascertain the effect of repetition (e.g. Horst, Cobb and Meara, 1998; and Yamazaki, 1996). The assumption is that the more repetitions there are in a text, the greater the gains are likely to be (there is some limited evidence for this position from these two studies; in Horst, Cobb and Meara, 1998, gains are reported for words met eight times or more). Thus caution must be exercised when comparing studies which have carefully controlled frequency (larger gains can be expected) with studies that have not purposely controlled input frequency (smaller gains can be expected). Special care should be taken when citing the 'gains from exposure' research to identify which studies modified the input to increase the potential for gains and which did not. Care should also be taken when citing research with small populations or case studies (e.g. Cho and Krashen, 1994; Grabe and Stoller, 1997) because the gains made in these studies are much more likely to reflect what those individuals did than to provide a picture of the larger population they represent.
In the review numerous problems have been found with this research, and some of them are quite serious (e.g. contamination, poor tests and test methods, and poor research design). Of the 25 studies that investigated 'gains from exposure' or 'gains from writing', or compared ER with other treatments, a full 100% were contaminated: either outside tuition or exposure was present, or the controls were not exposed to the tested vocabulary, or the ER group had longer exposure to English. Some studies suffered from all of these forms of contamination. This lack of experimental control, mostly a result of the use of convenience populations, means that while circumstantial evidence supporting ER abounds, the presence of contaminating factors undermines the research, as it cannot provide unequivocal evidence of the effectiveness of ER. This is hardly a strong research foundation upon which the house of ER can rest. However, as was mentioned before, it is extremely difficult to find or create experimental conditions when the nature of ER means that its effectiveness can only be measured over time; nevertheless, stringent efforts must be made to find and create these conditions.
This reviewer thus concludes, in the same vein as Coady (1997), Horst, Cobb and Meara (1998) and Nation (1999), that the L2 ER research body is a severely troubled one from an experimental point of view, as it is hard to find problem-free studies. These troubles are much more in evidence in the 'gains' and 'comparison' research than in research on 'ER and affect'. While not all the research is of equal concern, there are several oft-cited studies that may not be able to live up to the claims made for them as relevant to L2 pedagogy.
This review therefore suggests that we should treat the findings for the effectiveness of ER from the 'gains' and 'comparison' ER research with more than considerable pause. It also suggests that we should be extremely cautious in proposing that there is 'strong evidence for the value of ER' (Lituanas, Jacobs and Renandya, to appear: 1), and that we should take Krashen's very strong claim about the effect of reading with a ton of salt. The research certainly does not give us enough evidence to support his position, because much of the evidence we have comes from troubled research. Thus we are a very long way from being able to answer Alderson's questions.
This review has not turned me into a disbeliever. I believe very strongly that ER has an important place (not the only place) in second language learning. I sincerely hope that a relatively trouble-free research base will emerge in the future that pays heed to some of the problems that have been found here which can relieve me of my nagging doubts about the present quality of much L2 ER research. I also hope it will allow us to develop a reliable base upon which those of us who care about ER can rest our case. Until then, I will finish by saying that
ER is good for second language learners (especially for affect); the research does not yet support a stronger conclusion, however. Reading is probably one way, and only one way, we become good readers. It seems that through ER we can develop a good writing style, an adequate vocabulary and advanced grammar, and it may help us to become good spellers ... but we still do not have the evidence to be sure.
Alderson, C. Assessing Reading. Cambridge: Cambridge University Press. 2000.
Baddeley, A. Human Memory. Theory and Practice. Hove: Psychology Press. 1997.
Cho, K. and S. Krashen. Acquisition of vocabulary from the Sweet Valley Kids series: Adult ESL Acquisition. Journal of Reading, 37: 662-667. 1994.
Choppin, B. Correction for guessing. In Keeves, J. (Ed.) Educational Research, Methodology and Measurement. Oxford : Pergamon Press. 1988.
Coady, J. Extensive reading. In Coady, J. and T. Huckin. Second language Vocabulary Acquisition: A rationale for Pedagogy. Cambridge: Cambridge University Press. 1997.
Constantino, R. Pleasure Reading Helps, Even If Readers Don't Believe It. Journal of Reading, 37 (6): 504-505. 1994.
Day, R., and J. Bamford. Extensive reading in the second language classroom. Cambridge: Cambridge University Press. 1998.
Day, R., C. Omura and M. Hiramatsu. Incidental EFL vocabulary learning and reading. Reading in a Foreign Language. 7 (2): 541-551. 1991.
Dupuy, B. Voices from the classroom: Students favor extensive reading over grammar instruction and practice, and give their reasons. Applied Language Learning, 8 (2): 253-261. 1997.
Dupuy, B. and S. Krashen. Incidental vocabulary acquisition in French as a foreign language. Applied Language Learning, 4 (1): 55-64. 1993.
Elley, W., and F. Mangubhai. The impact of reading on second language learning. Reading Research Quarterly, 19: 53-67. 1983.
Elley, W. Acquiring Literacy in a Second Language: The Effect of Book-Based Programs. Language Learning, 41(3): 375-411. 1991.
Ellis, R. Modified Oral Input and the Acquisition of Word Meanings. Applied Linguistics, 16 (4): 409-441. 1995.
Evans, S. Extensive Reading: A preliminary investigation in a Japanese Senior High School. MA Thesis: Columbia University (Tokyo). 1999.
Grabe, W. and F. Stoller. Reading and vocabulary development in a second language: a case study. In Coady, J. and T. Huckin. Second language Vocabulary Acquisition: A rationale for Pedagogy. Cambridge: Cambridge University Press, 98-122. 1997.
Hafiz, F. and I. Tudor. Extensive reading and the development of language skills. English Language Teaching Journal 43 (1): 4-11. 1989.
Hayashi, K. Reading strategies and extensive reading in EFL classes. RELC Journal, 30 (2): 114-132. 1999.
Hedgcock, J. and D. Atkinson. Differing Reading Writing Relationships in L1 and L2 Literacy Development? TESOL Quarterly, 27 (2): 329-333. 1993.
Horst, M., T. Cobb and P. Meara. Beyond a Clockwork Orange: Acquiring second language vocabulary through reading. Reading in a Foreign Language. 11 (2): 207-223. 1998.
Janopoulos, M. The relationship of pleasure reading and second language writing proficiency. TESOL Quarterly, 20 (4): 763-768. 1986.
Joe, A., P. Nation and J. Newton. Sensitive Vocabulary Tests. Draft paper. Victoria University of Wellington, New Zealand. 1996.
Krashen, S. The power of reading. Insights from the research. Englewood, Co.: Libraries Unlimited. 1993.
Lai, F. The Effect of a Summer Reading Course on Reading and Writing Skills. System, 21 (1): 87-100. 1993.
Laufer, B. What percentage of text-lexis is essential for comprehension? In: C. Lauren and M. Nordmann (Eds.). Special language: from humans thinking to thinking machines. Clevedon: Multilingual Matters. 1989.
Laufer, B. What’s in a word that makes it hard or easy: some intralexical factors that affect the learning of words. In Schmitt, N. and M. McCarthy (Eds.): Vocabulary: Description, Acquisition and Pedagogy: Cambridge, Cambridge University Press. 140-155. 1997.
Laufer-Dvorkin, B. "Intensive" versus "Extensive" Reading for Improving University Students' Comprehension in English as a Foreign Language. Journal of Reading, 25 (1): 40-43. 1981.
Lituanas, P., G. Jacobs and W. Renandya. A study of Extensive reading with remedial students. http://www.geocities.com/Athens/Thebes/1650/Philippines.html. To appear.
Mason, B. and S. Krashen. Extensive Reading in English as a foreign language. System, 25 (1): 91-102. 1997.
Mason, B. and S. Krashen. Can we increase the Power of Reading by adding more output and/or more correction? Texas Papers in Foreign Language Education. In press.
McQuillan, J. Reading versus grammar: What students think is pleasurable and beneficial for language acquisition. Applied Language Learning, 5 (2): 95-100. 1994.
Meara, P. Towards a new approach to modelling vocabulary acquisition. In Schmitt, N. and M. McCarthy (Eds.): Vocabulary: Description, Acquisition and Pedagogy: Cambridge, Cambridge University Press. 109-121. 1997.
Nagy, W., P. Herman. and R. Anderson. Learning words from context. Reading Research Quarterly, 20: 233-253. 1985.
Nation, P. and M. Wang. Graded Readers and Vocabulary. Reading in a Foreign Language, 12 (2): 355-380. 1999.
Nation, P. Teaching and Learning Vocabulary. Boston, Ma.: Heinle and Heinle. 1990.
Nation, P. Using dictionaries to estimate vocabulary size: essential, but rarely followed procedures. Language Testing, 10 (1): 27-40. 1993.
Nation, P. Learning Vocabulary in Another Language. English Language Institute Occasional Publication 19. Victoria University of Wellington, New Zealand. 1999.
Nation, P. The language learning benefits of extensive reading. The Language Teacher, 21(5): 13-16. 1997.
Pilgreen, J. and S. Krashen. Sustained silent reading with English as a second language high school students: impact on reading comprehension, reading frequency, and reading enjoyment. School Library Media Quarterly, 22: 21-23. 1993.
Pitts, M., H. White, and S. Krashen. Acquiring second language vocabulary through reading: a replication of the Clockwork Orange study using second language acquirers. Reading in a Foreign Language. 5 (2): 271-275. 1989.
Polak, J., and S. Krashen. Do we need to teach spelling? The relationship between spelling and vocabulary reading among community college ESL students. TESOL Quarterly, 22: 141-146. 1988.
Renandya, W., B. Rajan, and G. Jacobs. Extensive reading with adult learners of English as a second language. RELC Journal. 30 (1): 39-61. 1999.
Robb, T. N. and B. Susser. Extensive reading vs Skills Building in an EFL Context. Reading in a Foreign Language. 5 (2): 239-251. 1989.
Saragi, T., P. Nation and G. Meister. Vocabulary Learning and Reading. System, 6 (2): 72-78. 1978.
Smith, J. A quick measure of achievement motivation. British Journal of Social and Clinical Psychology; 12(2): 137-143. 1973.
Susser, B. and T. Robb. EFL extensive reading instruction: research and procedure. JALT Journal, 12 (2): 161-185. 1990.
Tsang, W. Comparing the effects of reading and writing on writing performance. Applied Linguistics, 17: 210-233. 1996.
Tudor, I. and F. Hafiz. Extensive reading as a means of input to L2 learning. Journal of Research in Reading. 12 (2): 164-178. 1989.
Tudor, I. and F. Hafiz. Graded readers as an input medium in L2 learning. System, 18 (1): 31-42. 1990.
Waring, R. The Negative Effects of Learning Words in Semantic Sets: a Replication. System, 25 (2): 261-274. 1997.
Waring, R. Guest editor. “Special edition on Extensive Reading”. The Language Teacher, 21 (5). 1997.
Waring, R. Tasks for Assessing Receptive and Productive Second Language Vocabulary. Ph.D. Thesis. University of Wales.
Waring, R. and S. Takahashi. The Oxford University Press Guide to the ‘Why and ‘How’ of Using Graded Readers. Tokyo: Oxford University Press. 2000.
Weltens, B. and M. Grendel. Attrition of vocabulary knowledge. In: R. Schreuder and B. Weltens (Eds.) The Bilingual Lexicon. Amsterdam: Benjamins. 1993.
Yamazaki, A. Vocabulary Acquisition through Extensive Reading. Unpublished Dissertation, Temple University. 1996.
 In this paper ER is equated with 'Pleasure Reading', 'Sustained Silent Reading' and other forms of reading where the texts are considered to be 'Extensive Reading' texts (see Day and Bamford, 1998, p. 6-8 for a discussion of approaches to reading that can be considered 'Extensive Reading').
 For the purposes of this survey children are defined as those under high school age or about 15 to 16.
 For example, some items on Dupuy and Krashen's (1993) test seem to have been fairly easy to guess intelligently; the test did not contain only the supposed 'colloquial' test items, and there are also three spelling mistakes in the test items. This reviewer managed a score of 14 out of 30 with only a minimal amount of schoolboy French (as there are only three choices, wild guessing will yield a score of about 10). Another example of poor quality control is found in Cho and Krashen (1994), where one subject was tested on 161 words met by the other students, which makes interpretation of what she had gained from her reading troublesome, as she had not met the words she was tested on.
 Here are two 'standard correction' equations:

equation 1: S = (cR - N) / (c - 1)
equation 2: S = R - E / (c - 1)

where
S = the corrected score
R = the raw score (the number of correct items)
E = the number of incorrect items
c = the number of choices per item
N = the number of test items
Using these two equations, our hypothetical learner who knew 20 items would be awarded a corrected score of 21.33 by equation 1 and 18.67 by equation 2. Neither equation is perfect, as neither predicted the 20 items our learner knew. These standard methods of correction have been criticised ever since they were introduced in the 1920s because they can lead to negative scores and because they ignore the possibility that a subject may have eliminated one or more of the choices (see Choppin, 1988 for a fuller discussion and other, more complex equations).
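As a sketch of how such a correction behaves, the 'rights minus wrongs' form (S = R - E / (c - 1)) can be applied to a hypothetical 30-item, three-choice test taken by a learner who knows 20 items and guesses blindly on the rest:

```python
def corrected_score(raw, wrong, choices):
    """Standard 'rights minus wrongs' correction: S = R - E / (c - 1)."""
    return raw - wrong / (choices - 1)

items, choices, known = 30, 3, 20
raw = known + (items - known) / choices   # expected raw score: about 23.33
wrong = items - raw                       # expected wrong: about 6.67
print(round(corrected_score(raw, wrong, choices), 2))  # 20.0
```

On average the correction recovers the 20 known items, but any individual learner's guessing luck varies, so an individual corrected score can over- or under-shoot the true number known.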
 Ellis (1995: 424), for example, cites the Dupay (sic) and Krashen (1993) study as testing 42 (in fact 15) L2 learners of French learning from 'Trois Hommes et un coffin' (sic) (couffin) after 80 (in fact 40) minutes of exposure to reading.
 In Dupuy and Krashen (1993), for example, a t-test was used to compare 15 experimental subjects with two control groups of 9 and 13 (i.e. comparisons of 15 + 9 = 24 and 15 + 13 = 28 subjects in the two analyses). The degrees of freedom were reported as 14 and 14 (the dfs for a matched t-test), when the standard way of calculating degrees of freedom in a t-test involving two independent groups is n - 2; thus the dfs should be 22 and 26. If inappropriate procedures were applied to the data, this may have compromised the findings and the claims based upon them. Similarly confusing data are found in all three of the Mason and Krashen (1997) experiments and elsewhere in the L2 ER literature.
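The degrees-of-freedom check in this note is simple to reproduce; for a standard equal-variance independent-samples t-test, df = n1 + n2 - 2:

```python
def independent_t_df(n1, n2):
    """Degrees of freedom for an equal-variance independent-samples t-test."""
    return n1 + n2 - 2

# Dupuy and Krashen's (1993) comparisons: 15 experimental subjects
# against control groups of 9 and 13 respectively.
print(independent_t_df(15, 9))   # 22 (not the reported 14)
print(independent_t_df(15, 13))  # 26 (not the reported 14)
```

A reported df of 14 would instead correspond to a matched-pairs test on 15 subjects (n - 1), which does not fit a comparison of independent groups of unequal size.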