Extensive Reading in Second Languages: A Critique of the Research

Rob Waring waring_robert @ yahoo.com

Notre Dame Seishin University

October 2000. Draft 1.

For comment only. Do not cite.



This paper surveys 28 pieces of research into Extensive Reading in second languages to ascertain whether at the start of the 21st century we have good evidence upon which to justify the claims that Extensive Reading 'works'. The findings of this survey reveal that there is little solid evidence either for or against the benefits of ER for increasing language knowledge over other forms of input and practice. Most evidence is either circumstantial, comes from poorly designed or executed research, or was contaminated by outside factors.



Over the last 15 years a considerable amount of experimental research has been published that deals with some aspect of second language Extensive Reading (ER)1. There have been studies that ask whether subjects can learn from ER (including many incidental learning from reading experiments), other studies that compare ER approaches with other treatments (such as with 'normal' approaches or 'translation' approaches), and yet others that have looked at the effect of ER on other aspects of language learning (such as on writing, confidence and motivation and so on).

Research in ER has been undertaken to demonstrate that language gains of many types occur from exposure to simplified second language texts. Research by Elley (1991), Hafiz and Tudor (1990), Cho and Krashen (1994), Lai (1993), Lituanas, Jacobs and Renandya (1999) and Renandya, Rajan and Jacobs (1999), among others, reports linguistic gains as a result of ER. Writing ability is said to improve as a result of extensive reading (Elley and Mangubhai, 1981; Hafiz and Tudor, 1990; Hedgcock and Atkinson, 1998; Janopoulos, 1986; Mason and Krashen, in press; Robb and Susser, 1989; Tsang, 1996) as is spelling (Polak and Krashen, 1988). Reading extensively has also been reported to increase motivation to read and to develop a positive attitude to reading in the second language (Cho and Krashen, 1994, 1995; Constantino, 1994; Evans, 1999; Hayashi, 1999; Mason and Krashen, 1997). Oral proficiency was (anecdotally) said to have improved after reading large amounts of text (Cho and Krashen, 1994). There are a considerable number of vocabulary studies that report gains in vocabulary from ER (Day, Omura and Hiramatsu, 1991; Dupuy and Krashen, 1993; Ferris, 1988; Grabe and Stoller, 1997; Hayashi, 1999; Mason and Krashen, 1997, in press; Pitts, White and Krashen, 1989; and Yamazaki, 1996 are just a few examples).

Almost all of this research has been done by researchers who wish to show ER in a good light, and there is considerable cross-citation within this literature which is used as evidence to support the claims made in the research. However, rarely does one find in these citations any critique of this literature; most often it is accepted as fact and cited without comment. The ER research literature (as a body of research) has been severely criticised by many researchers. For example, Coady (1997: 226), referring to some oft-cited ER research, says that "there appears to be a serious methodological problem with these studies". Nation (1999: 124) says that "second language studies ... generally lacked careful control of the research design". Horst, Cobb and Meara (1998) also point out that some of the incidental learning from exposure experiments that are often cited as supporting an ER approach are 'methodologically flawed' (1998: 210). Unfortunately, these papers do not examine the 'flaws' in detail, and researchers need to be made aware of what problems exist.

As the popularity of ER, both as an approach to learning and as a research topic, has boomed in recent years, it seems timely and pertinent to look carefully and in detail at a broad range of ER studies to determine what we do know about ER, what problems these researchers have had while undertaking this research, and how that can inform future research so that we can learn from past mistakes. Unless we have a solid foundation of research upon which to form our ideas about what ER can do effectively and what it cannot, it will be that much more difficult to promote the need for ER within foreign language learning contexts. Thus, my concern is with the quality, reliability and accountability of much, but not all, of the research surveyed below (and the claims emanating from it) as a canonical base upon which the house of ER rests.

It should be mentioned at the outset that I am not at all against ER. Indeed, I have been actively promoting the development of ER for a number of years now (e.g. Waring, 1997; Waring and Takahashi, 2000). However, my own reading of the ER literature has raised nagging doubts about the quality of the research base upon which claims have been made for ER. My aim in looking at this research carefully and in detail is to assess what has been found and to ascertain what seems reliable and what does not. This survey also seeks to identify what we do not know about ER but need to know, and I shall therefore conclude this survey with a research agenda that might inform future research into the nature of ER and its role in second language learning. This paper is not primarily meant as a review of findings (although many of these will be mentioned in passing and the Appendix summarises many of them); the central focus is on issues concerning research design, assessment methodology and other issues related to researching ER.

The survey

Stephen Krashen is probably the most famous proponent of the need for reading in second languages and especially Sustained Silent Reading (SSR), Pleasure Reading (PR) and Extensive Reading (ER). Krashen has stated that:

"Reading is good for you. The research supports a stronger conclusion, however. Reading is the only way, the only way we become good readers, develop a good writing style, an adequate vocabulary, advanced grammar, and the only way we become good spellers". (1993:23)

While Krashen was mainly talking about first languages, his work with second languages suggests that he means this would hold for second languages too.

The Appendix provides an overview of 28 pieces of research (sometimes there are three experiments in one paper) into various aspects of ER in second languages. For each piece of research the method and findings have been summarised and notes about the research have been made. The notes seek to identify areas of concern or raise questions left unanswered by the research. This survey investigates a wide range of experiments. Not all these experiments directly assess the effectiveness of ER, nor do all of them claim to be ER based, but they have been included because they have been used by researchers as providing support for claims that ER is beneficial to second language learners. If we wish to know something about ER, and the claims made about ER, we need to look at them and see them as within the body of second language ER research precisely because they are cited as evidence for ER. It might be worthwhile for the reader to consult this appendix before reading on.


A critique of the L2 ER studies


It needs to be stated plainly that we know very little about reading and about the assessment of reading. Alderson (2000) makes the point that in order for us to be able to say something about reading we have to know what reading is. He goes on to say that


in order to devise a test or assessment procedure for reading, we must surely appeal, if only intuitively, to some concept of what it means to read texts and understand them. How can we possibly test whether somebody has understood a text if we do not know what we mean by 'understand'? How can we possibly diagnose somebody's 'reading problems' if we have no idea what might constitute a problem and what the possible 'causes' might be? How can we possibly decide what 'level' a reader is 'at' if we have no idea what 'levels of reading' might exist, and what it means to be 'reading at a particular level'? In short those who need to test reading clearly need to develop some idea of what reading is, and yet that is an enormous task (pp. 6-7).

This should not stop us trying to find out about reading in second languages, but research into second language extensive reading will always be fraught with problems. Firstly, the volume of reading that subjects have to do in order that the research can be labeled 'Extensive Reading' means that it will take a lot of time. This necessitates that the reading be done over many sessions, often out of class and in non-controlled environments, which naturally brings up cries of contamination due to outside influence during the period of the experiment. Secondly, research in ER has to be done in real classrooms under real conditions, which means that conditions for investigating the nature of ER will be less than ideal and it will not always be possible to control all variables. It is just not practical in most circumstances to cleanly control all variables other than the ones we are looking at when researching ER, so we must learn to be happy with doing our best.

However, the research survey presented in the Appendix does not always show that the researchers did their best to control variables. Indeed, quite a number of studies are deeply flawed either methodologically or in execution. In this section I shall review some of the areas of concern that have become apparent in this research.


The range of questions that have been investigated

Research into language gains and gains in affect (e.g. confidence and motivation) from ER in second languages is still in its infancy, is quite fragmented, and is rather difficult to interpret when looking for concrete 'evidence'. However, a number of points are clear.

Firstly, this research has mostly been conducted on the learning of English and on Asian and Oceanian learners. In particular, there is very little widely known research into the second language learning of Mandarin, French, Spanish, Arabic and other major world languages. Only two of the studies surveyed here did not look at the learning of English (Dupuy and Krashen, 1993 looked at the learning of French and Pitts, White and Krashen, 1989 investigated the learning of pseudo-words). Secondly, quite a number of these studies seem to have used convenience populations (i.e. those available in the researcher's own classes) and / or were conducted on highly educated individuals (those at college) rather than a more 'normal' profile of the population in general. Thirdly, there is also a tendency to use populations of individuals who are already proficient at learning second languages (e.g. English majors) and a tendency to use upper-streamed students rather than lower-streamed ones (see Evans, 1999 for one example). Fourthly, the subjects who have been investigated cover a very narrow ability range, most of whom might be considered 'intermediate' level. Fifthly, most research has been conducted with adults rather than children and in foreign language environments rather than second language environments. This narrow range is rather troubling. We cannot hope to have much to say about ER in general until we have extensive amounts of research into language learning from ER in a whole range of languages, ages, educational backgrounds and so on.

When we look at the methodology of the research that has been done, quite a different picture appears: almost the full range of experimental variables has been explored. There are experiments that have been conducted on individuals as case studies (Cho and Krashen, 1994; Grabe and Stoller, 1997) and on very large populations (e.g. Elley and Mangubhai, 1983; Lai, 1993). There are experiments that have investigated learning from short texts (e.g. only 1032 words in Day, Omura and Hiramatsu, 1991) to very large amounts of text (e.g. a graded reader a day in Lai, 1993; 1500 pages over several months in Mason and Krashen, in press; 18 graded readers in 9 weeks in Yamazaki, 1996; and 161,000 words by the Korean student Jin-hee in Cho and Krashen, 1994). There are experiments that have lasted up to two years (e.g. Elley's 'book flood' experiments, 1991) and some that have been over in minutes (e.g. Day, Omura and Hiramatsu, 1991). Some research into ER has investigated language development in children2 and others in adults. Some studies were with mono-lingual groups while others were with subjects of varied backgrounds.

There is also a wide range of testing instruments that have been used. Some studies (e.g. Laufer-Dvorkin, 1981) used a battery of in-house general proficiency tests while others used standardised commercially available tests (Evans, 1999 used KET; Hafiz and Tudor, 1989 used the NFER tests; Hayashi, 1999 used TOEFL; and Lituanas, Jacobs and Renandya, 1999, used the Informal Reading Inventory and the Gray Standardised Oral Reading Test). Sometimes essays are written pre- and post- and assessed for gains in writing ability (e.g. Hafiz and Tudor, 1990; Mason and Krashen, in press) and sometimes in-house research-specific tests have been used (e.g. Day, Omura and Hiramatsu, 1991, Pitts, White and Krashen, 1989).

From the survey of 28 pieces of experimental research mentioned in the Appendix several types of study are apparent. There are studies which compare ER with other treatments, and others which seek to show how ER benefits other language skills (e.g. the effect of ER on writing or on vocabulary building). Others only wish to determine whether ER can lead to gains in language development from exposure to text. These categories are by no means clearly defined and some studies can fit the profile of two or more. Each of these areas will be surveyed.


Studies comparing ER to other treatments

The focus of these studies is to compare ER to other treatments or approaches. This type of study (with the 'gains from exposure' literature) makes up the majority of studies in L2 ER research. There are two sub-groups. There are studies comparing ER with another treatment (such as a 'normal' class), and those that compare ER under different conditions (such as ER with book reports written in the L2, compared to ER with book reports written in English that were corrected and book reports written in English that were not corrected). There are several concerns with much of this research.

Firstly, in several studies (e.g. Evans, 1999; Mason and Krashen, all three experiments, 1997; Robb and Susser, 1989; and Yamazaki, 1996) extra time for contact with English was given to the experimental (ER) group. For example, in Robb and Susser (1989) the experimental group had to read 500 pages out of class during the year whilst the control group only had a short extra assigned reading per week. In Evans (1999) the ER group had extra reading while the controls did not. With this design we cannot see the comparative benefit of ER over other methods, as more exposure in one group will bias the results towards that group. We should therefore be cautious in interpreting this research as showing the effectiveness of ER over other methods.

Secondly, the data for a considerable number of these studies were probably affected by outside influences (this also applies to the 'gains from exposure' literature) where the tuition variable was not controlled. Some of this contamination was reported in the studies and some was not (see below). The most common factor influencing the study was the presence of concurrent classes or tuition that were not part of the study (Evans, 1999; Hayashi, 1999; Lai, 1993; all three experiments in Mason and Krashen, 1997; Mason and Krashen, in press; Renandya, Rajan and Jacobs, 1999; Robb and Susser, 1989; Tsang, 1996; and Yamazaki, 1996). In one study (Hafiz and Tudor, 1989), which is probably the most cited ER study, the data were collected in the UK, although the subjects lived in a Punjabi community. The effect of outside exposure in the community at large and from their other classes at school was hardly mentioned as influential in the study. This makes it extremely unlikely that the gains resulted from ER alone and makes it difficult to determine how much of the gains was due to ER and how much to the other tuition. As was mentioned earlier, the nature of assessing ER is that it will take time and practicalities demand that it be done with real classes. It is therefore vital to try to minimise the effect of the external influence, and to report as fully as possible how the external influence may have affected the results so that correct interpretation is possible.

Thirdly, ER is typically compared with instructional approaches which do not have the benefit of the 'rich' environment of the ER approach (Coady, 1997). Comparisons are made with 'audiolingual approaches' (Elley, 1991), or 'translation' (Yamazaki, 1996), or 'regular classes' (Mason and Krashen, 1997, experiment 2), or classes which were 'taught in the conventional way' (Lituanas, Jacobs and Renandya, to appear). The question of how ER compares with other rich environments has yet to be resolved.


'Gains in writing' experiments

This research asks whether writing ability can be affected by ER (Elley and Mangubhai, 1983; Hafiz and Tudor, 1990; Hedgcock and Atkinson, 1998; Janopoulos, 1986; Mason and Krashen, in press; Robb and Susser, 1989; Tsang, 1996 are but a few). A typical design is as follows. Students are given an essay test, they read something and they are given another essay test (most often with the same title, but not always). Then the essays are scored on a variety of measures to check for differences pre- and post-ER. Some studies (e.g. Mason and Krashen, in press) used only statistical data such as the number of words used, the number of clauses, the number of error-free clauses and so on. Other studies had an holistic evaluation (e.g. Mason and Krashen, 1997, experiment 2) and yet others had an evaluation of factors such as coherence, cohesion, organization, logical progression, impression and so on (Tsang, 1996). It is important to note clearly when citing this research that different procedures were used in the 'effects of ER on writing' research because the analyses are looking at different things. The advantage of statistical data is that they are numerical and can be easily analysed using a computer, but the disadvantage is that they do not indicate the quality of the writing. Thus in these types of analysis it may be best to combine all of these factors in the analysis as Tsang (1996) did.


'Gains in affect' experiments

This research looked at whether an ER approach has a positive effect on motivation, confidence and general perception of the usefulness of ER. The term 'pleasure' that is attached to this type of reading research is used in two ways. The first refers to reading that is done of the reader's own free will rather than as part of school work. The second occurs in research that asks about the reader's subjective reaction to ER.

The positive effect of ER on motivation and attitude to reading is very commonly reported and is probably the strongest finding in all the papers reviewed here (e.g. Constantino, 1994; Evans, 1999; Elley, 1991; Mason and Krashen, 1997, in press; Hayashi, 1999; Yamazaki, 1996). Some of these data come from formal post-reading interviews but much of this evidence is anecdotal. While there are measures of motivation (e.g. Smith, 1973) and ways to measure reading confidence, none of these has as yet been used to provide quantitative data.

Quite a number of studies have asked what readers feel about their reading and whether it was 'pleasurable'. McQuillan (1994) and Dupuy (1997) found that ER is preferred to grammar instruction and practice and to assigned readings. However, it should be noted that the preferences for other types of 'pleasurable' language instruction such as listening to music, watching videos, free conversation, surfing the Internet and so on were not asked which leaves open the question of a preference for ER over these other 'pleasurable' language pursuits.


'Gains from exposure' experiments

Experiments that have assessed gains from exposure to ER texts (most often they are called 'incidental learning' experiments) seek to demonstrate how much (usually vocabulary) has been learned. (There have been no studies that I know of that have directly researched the acquisition of grammar or syntax from being exposed to ER, although some of the 'gains in writing' experiments have been suggested as evidence for this.) This survey has found numerous problems with the 'gains from exposure' experiments and a few points will be made below.

Lack of quality control in test construction

It is very common in this research for the vocabulary or cloze tests to be written by the authors (e.g. Day, Omura and Hiramatsu, 1991; Pitts, White and Krashen, 1989; Mason and Krashen, 1997, in press; Yamazaki, 1999). Some of these in-house tests were subjected to extensive piloting and review (e.g. Yamazaki, 1999) while most were not (or at least were not reported to have been piloted and trialed, nor assessed for their quality). Some tests appear to be either of poor quality or insufficient care seems to have been taken in their construction3. The apparent lack of quality control (and even the lack of any mention of quality control procedures) in some of these tests is a matter of grave concern, as the quality of the data depends on the quality of these tests. In addition, only two of the 12 experiments that used their own test instrument published the test with the report.

Problems with the most commonly used test format for assessing gains from ER

The most common vocabulary test used for ER 'gains from exposure' research is the multiple-choice test (e.g. Day, Omura and Hiramatsu, 1991; Dupuy and Krashen, 1993; Pitts, White and Krashen, 1989). There are numerous reasons why this test may not be the most appropriate for assessing gains from exposure to ER texts. Firstly, the multiple choice test is very limited in its ability to assess gains from reading as it ignores many of the other potential gains or benefits from the reading of an extended text. This test is attempting to assess prompted recognition but other potential linguistic benefits that are largely ignored by multiple-choice tests include lexical access speed gains; the noticing of collocations, colligations or patterns within text; the learning of new word forms and the meaning of new words; the recognition of new word forms yet to be learned; an increase in the ability to guess from context; a (dis)confirmation that a previously guessed word's meaning is probably correct; recognition of new word associations; the raising of the ability to recognize discourse and text structure; an increase in the ability to 'chunk' text; the development of saccadic eye movements and so on and so on. Thus many 'gains' from ER are ignored by the multiple-choice test and many potential benefits of ER are underestimated.

Secondly, in addition to the inability of the multiple-choice test to capture many aspects of reading, the design of the test compounds the problem because of the nature of the test's criteria for successful completion. Multiple-choice tests are designed to assess receptive understanding and each item is either correctly answered or not; as such they have the problem of both ignoring and underestimating language gains at the same time. First, we need a little background. It is widely stated that not all words are equal, as some are more frequent than others and some are 'easier' to learn (e.g. Laufer, 1997). There is also general agreement that most words are not learned in one meeting, but need many meetings for the sight-sound correspondences to be made and for the receptive understanding of the word to take place (Nation, 1990, 1999). Research seems to indicate that it takes an average of about 10-20 meetings before a word is known receptively, with each meeting adding to the knowledge about the word until a certain threshold of knowledge is reached that allows successful understanding, or successful completion of a test item (Saragi, Nation and Meister, 1978; Nation, 1990, 1999). The threshold for success on multiple-choice vocabulary tests is little understood, but these tests (and other tests that have right/wrong criteria for success) are severely limited in that they reflect only the knowledge of words that have passed the 'success threshold' as a result of the reading. For example, if a learner has met the word abominable two times before the reading and meets it once more during the reading, then although the learner has gained a little piece of knowledge about the word (such as a greater awareness of its general meaning or its spelling) he would not have enough knowledge to tip it over the 10 to 20 meetings threshold into success. Thus, the learner's gain from reading abominable once is ignored by the strict criteria for success of the multiple-choice test and he gets zero on the test. Conversely, if a learner knew enough about abominable to meet the criteria for success before reading the text, and by reading the text her knowledge of abominable increased, this increase in knowledge also cannot be measured by the multiple-choice test and thus it will underestimate her gains.
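
The insensitivity of right/wrong scoring described above can be illustrated with a minimal sketch. It assumes, purely for illustration, a fixed threshold of 10 meetings and treats 'knowledge' as a simple count of meetings; the function names and the threshold value are hypothetical and are not taken from any of the studies surveyed.

```python
def item_score(meetings, threshold=10):
    """Right/wrong scoring: 1 once the number of meetings reaches the
    (assumed) success threshold, otherwise 0."""
    return 1 if meetings >= threshold else 0


def measured_gain(meetings_before, meetings_in_text, threshold=10):
    """Gain that a right/wrong test item registers after the reading."""
    before = item_score(meetings_before, threshold)
    after = item_score(meetings_before + meetings_in_text, threshold)
    return after - before


# Word met twice before and once in the text: knowledge grows, score does not.
print(measured_gain(2, 1))   # 0
# Word already over the threshold: knowledge grows, score cannot.
print(measured_gain(12, 1))  # 0
# Only a word sitting just under the threshold registers as a gain.
print(measured_gain(9, 1))   # 1
```

In this sketch only the third case shows up as a gain, although the learner's knowledge increased by the same amount in all three.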

Thirdly, this threshold is not a uniform one for all multiple-choice tests. A test with distractors with similar meanings (anger, irritate, annoy, and frustrate) would be more difficult than one in which the distractors are dissimilar (boat, tree, cat, and hospital). It is likely that a learner will have more trouble in determining the correct answer from a set of similar words as more knowledge is required to separate them. This means that results will vary considerably between different tests, including between multiple-choice tests whose distractors differ. Thus interpretation of the results of an experiment can only be done properly when the test is published with the research. In addition, there is no common agreement on the number of choices to be used in multiple-choice tests in ER research. Dupuy and Krashen (1993) used 3, while Pitts, White and Krashen (1989) and Day, Omura and Hiramatsu (1991) used 4. Fortunately, all three used a 'don't know' option to reduce guessing.

It is therefore clear that the full nature of vocabulary learning from the reading is not captured by the use of a multiple choice test and more sensitive measures (Joe, Nation and Newton, 1996) than multiple choice tests are necessary to capture the full nature of learning from exposure (see Waring, 1999 for a fuller discussion of these matters).

Lack of control for guessing

Some studies that used multiple-choice tests did not correct for guessing (e.g. Dupuy and Krashen, 1993) while others did (e.g. Day, Omura and Hiramatsu, 1991; and Pitts, White and Krashen, 1989). The guessing factor is important because uncorrected raw scores overstate true knowledge and yield misleading data. If a 40-item test has 4 choices (three distractors and a correct item) and the test taker knows none of the words, then wild guessing will yield an expected score of 40/4 = 10. If the test taker knows 20 items and guesses at a further 16, then her uncorrected score is likely to be 20 + (16 items / 4 choices) = 24. Although the guessing factor reduces with ability level as there are fewer items to guess at, it is a major factor for lower ability learners or for learners who have a tendency to guess 4. Thus it is crucial that guessing be controlled for in multiple-choice tests, and correcting scores for guessing is most often better than not adjusting the scores at all.
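
The arithmetic above can be sketched in a few lines. The first function reproduces the expected uncorrected score under blind guessing. The second applies the conventional correction-for-guessing formula (right answers minus wrong answers divided by the number of choices minus one); this formula is an assumption here, since the studies surveyed do not all report which correction, if any, they used. The function names are hypothetical.

```python
from fractions import Fraction


def expected_raw_score(known, guessed, choices):
    """Expected uncorrected score: known items plus the share of blind
    guesses that happen to be right (1 in `choices`)."""
    return known + Fraction(guessed, choices)


def corrected_score(right, wrong, choices):
    """Conventional correction for guessing: right - wrong / (choices - 1).
    Omitted ('don't know') items count as neither right nor wrong."""
    return right - Fraction(wrong, choices - 1)


# 40 items, 4 choices, nothing known, everything guessed.
print(expected_raw_score(0, 40, 4))   # 10
# 20 items known, 16 guessed, 4 left blank.
print(expected_raw_score(20, 16, 4))  # 24
# Of the 16 guesses, 4 are expected right and 12 wrong,
# so the correction recovers the 20 items actually known.
print(corrected_score(24, 12, 4))     # 20
```

Under these assumptions the corrected score returns to the number of items the test taker actually knew, which is the sense in which correcting is preferable to reporting raw scores.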

Sample sizes

These matters become especially relevant when the tests contain very few items. For example, Day, Omura and Hiramatsu (1991) tested only 17 items, and Dupuy and Krashen (1993) tested 30. At the other end of the spectrum we have Cho and Krashen (1994), who assessed each of their case studies on several hundred words that they underlined from their reading. Nation (1993) points out that the sample size is a crucial factor in determining if the test is reliable. If the sample size is too small there is a high chance of statistical error. He says that statisticians have determined the confidence interval within which the true score can be expected to lie. He points out that "if a learner's observed score on the test was 50 out of 100 (50%) we could be 90% sure that the true value of his or her score lay between 42 (42%) and 58 (58%) out of 100 (i.e. a range of plus or minus 8)" (pp. 35-36). In other words a 50% score on a test means that we can only be 90% sure that the subject's true score is between 42 and 58, and not that it is exactly 50%. Nation points out that if a test of 100 items has a 16-point confidence window (42 to 58) then a test with a much smaller sample size will have a much wider confidence window, which makes the test less reliable. A test with only 17 items would most probably be quite unreliable from this point of view.
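
To give a rough sense of how the confidence window widens as the number of items falls, the following sketch uses the normal approximation to a binomial proportion. This is an assumption: Nation's exact method is not stated here, although the approximation gives roughly plus or minus 8 points for a 50% score on 100 items, in line with the figure quoted. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def half_width(proportion, n_items, confidence=0.90):
    """Half-width of a two-sided confidence interval for an observed
    proportion correct, using the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(proportion * (1 - proportion) / n_items)


# Width of the 90% window around a 50% score for tests of different lengths.
for n_items in (100, 30, 17):
    points = 100 * half_width(0.5, n_items)
    print(f"{n_items} items: plus or minus {points:.1f} percentage points")
```

On this approximation a 17-item test has a window of about plus or minus 20 percentage points around a 50% score, more than twice as wide as that of a 100-item test.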

There are other equally serious factors that are affected by item sample size. It is a common finding in the L2 ER experiments surveyed here that the gains from learning from reading are low. Horst, Cobb and Meara (1998) report an average of 10 to 20% gains on short experiments, and much lower figures for longer texts (but no retention data are given). (See below for other reasons why even these low figures may be overestimates owing to the careful selection of tested words.) One possible reason for this apparently low intake in experiments with multiple-choice tests has to do with the relationship between the opportunity for success and the number of chances to demonstrate the learner's knowledge. We have seen that each word takes time to pass the 'success threshold', and that it takes between 10 and 20 meetings for this threshold to be met. Thus if a test has 60 items and each word is only met once, we can expect only 1/10 or 1/20 of these 60 words to pass the 'success threshold', or a maximum of 6 (60/10) or 3 (60/20) words to be gained. If the test item sample is only 20, then we can expect only one or two words to pass the threshold, and that is not enough to provide reliable data.
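
The reasoning behind these figures can be made explicit with a small sketch. It assumes, as a simplification that is not stated in the studies themselves, that the tested words are spread evenly across all possible numbers of prior meetings below the threshold, so that a single further meeting tips only those words that were one meeting short. The function name is hypothetical.

```python
def words_passing_threshold(n_items, threshold):
    """Count the tested words that one extra meeting pushes over the
    threshold, assuming prior meetings are spread evenly over
    0 .. threshold - 1 (a simplifying assumption)."""
    prior_meetings = [i % threshold for i in range(n_items)]
    return sum(1 for prior in prior_meetings if prior + 1 >= threshold)


for n_items in (60, 20):
    for threshold in (10, 20):
        passed = words_passing_threshold(n_items, threshold)
        print(f"{n_items} items, threshold {threshold}: {passed} word(s) gained")
```

Under this assumption a 60-item test registers 6 or 3 gained words, and a 20-item test only 2 or 1, which matches the estimates in the text.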

Few data on retention

Another very common element in the 'gains from exposure' research is the lack of concern for the retention of what was learned. Only one of the studies under investigation attempted to systematically gather retention data (Yamazaki, 1999). Retention data from the reading are important because they give us an idea of the quality, not only the quantity, of learning that occurs from exposure to the reading texts. Further, as most of the tests were given immediately after treatment, there is a very high probability that the subjects scored higher on the test than they would have if the test had been given a few hours or days later, due to the nature of short-term memory loss (Baddeley, 1997). Thus the 'real' and lasting gains demonstrated in the research would probably have been over-estimated. This result was found in Yamazaki (1999) and is common throughout the second language vocabulary learning literature (see Weltens and Grendel, 1993 for a discussion). We should therefore be cautious about accepting as fact that the gains reported in this kind of research are lasting, as we can expect a certain level of over-estimation due to the nature of language loss.

Controls not exposed to the target vocabulary

In some of the research that looked at how much can be learned from exposing subjects to a text, the controls were not exposed to the tested vocabulary. The assumption is that the controls should not need to see the vocabulary so that true learning could be measured. This design was used in the Pitts, White and Krashen (1989) replication of the Saragi, Nation and Meister (1978) Clockwork Orange study in which the subjects met 30 nadsat words (special vocabulary that only occurs in that book). Other studies that did not expose their controls to the tested vocabulary include Day, Omura and Hiramatsu (1991), Ferris (1988) and Hafiz and Tudor (1990). (In two 'gains from exposure' studies under review (Evans, 1999 and Lai, 1993) comparison groups were mentioned and tested but confusingly were not compared with the experimental groups, which raises questions as to whether the authors understood the design.)

'Gains from exposure' designs where the controls were not exposed to the tested vocabulary can tell us how many words were learned from exposure to an ER text. However, it is important to note that these studies cannot tell us anything important about whether ER 'works better' than any other treatment for language gains in areas such as vocabulary. This is because the language gains found in these studies might have been achieved more effectively through another treatment (say, direct vocabulary learning or work on improving dictionary skills). Thus these studies are basically saying that 'we gave the subjects something to read and they learned something' or 'subjects can learn X amount from reading' and nothing more. This crucial point seems to have been missed by many researchers, because it is very common for these studies to be cited as examples of how effective ER is. In fact no such conclusion could or should be drawn, as no comparisons were made in the studies and, by definition, things can only be considered effective when they are compared to something else.


Other general concerns

Several other types of concern are evident in this body of research.


Quite a number of these studies were probably influenced by contaminating factors and some examples have already been mentioned. Sometimes the contamination was faithfully reported (e.g. Elley, experiment 1, 1991; Evans, 1999; Robb and Susser, 1989; Yamazaki, 1999) and in other studies it was unreported (e.g. Horst, Cobb and Meara, 1998; Mason and Krashen, 1997, in press).

Several types of contamination were evident. Firstly, the subjects did not finish all their reading (Pitts, White and Krashen, 1989), or the same children were used as both the experimental and control group (Elley, experiment 1, 1991). Secondly, contamination was in evidence when the instruction was very similar in both control groups and treatment groups. For example, in Robb and Susser (1989) both the treatment group and the control group received reading strategy instruction, and in Lituanas, Jacobs and Renandya (1999) 45% of the experimental class's instruction was the same as that of the control group. Thirdly, in Dupuy and Krashen (1993) for example, the subjects were told to expect a test at the end of the reading and viewing. In academic settings it is to be expected that students who are told they will be tested will try extra hard to do well, and this may have raised the results above a 'natural' acquisition level. Fourthly, in other studies Hawthorne contamination effects were in evidence. These effects occur when a new element is introduced to the study. For example, in the REAP study in Elley (1991) some of the teachers taught both control and experimental groups, and new materials were introduced.

Ability levels

Another factor that needs to be discussed is pre-treatment ability level and the importance of controlling for ability levels. In some studies the pre-treatment ability levels were controlled or subjects were matched with similarly performing pairs in other groups (e.g. Elley and Mangubhai, 1983; Lituanas, Jacobs and Renandya, 1999; and Robb and Susser, 1989) or subjects were randomly assigned to groups (Day, Omura and Hiramatsu, 1991), while in other studies ability levels were not controlled (Dupuy and Krashen, 1993; Lai, 1993) or there was no randomisation of individuals as intact classes were used (e.g. Dupuy and Krashen, 1993; and Mason and Krashen, in press).

The lack of control for ability level can have adverse effects on the experiment because there is a definite advantage to the lower ability learner, who in normal circumstances can be expected to learn more in a given time than advanced learners. From a vocabulary perspective, Nation (1997) has demonstrated that as the beginner meets many more unknown words when reading than an advanced learner, she has more opportunities to pick up new language than an advanced learner, who has to read much more to meet the same number of unknown words. Thus in experiments where the pre-treatment ability levels of the subjects are not controlled for, we can expect more gains to be shown for beginners than for intermediate or advanced subjects, provided both groups have to read the same amount of graded readers. Similarly, in experiments where beginners and advanced learners read the same text we can expect beginners to have more chances to pick up language than more advanced learners. This implies that controlling the pre-treatment ability level is crucial in getting reliable results.

There are two qualifications to this position. Firstly, if motivation is not there then the weaker students might not make many gains despite the presence of much unknown language. In Lai (1993), one of the three groups, all of whom were given a book a day as summer reading, had far larger gains on the reading test than the other two groups (S2, an initially stronger group). Lai suggests that motivation may have played a part in explaining why the weaker learners did not gain as much as the more advanced learners. Secondly, there is probably a threshold under which learners may not be able to take advantage of being exposed to more unknown language. This was hinted at in several pieces of research (e.g. Laufer-Dvorkin, 1981; Lai, 1993). If there is too much new input and it is not comprehensible, then there are likely to be few gains. Conversely, if the input is lacking in new language there will be few chances to learn and few chances to demonstrate learning. Laufer (1989) and Liu and Nation (1985) have shown that unless there is a 95% or higher coverage of the words in a text the probability of successful guessing of unknown words (learning) will be severely reduced. Nation (1999) suggests it should be at least 98%. Thus, if the text is too difficult the weaker subjects will not be able to guess (learn) successfully, and the advanced ones will be limited by knowing most of the words anyway and thus will meet fewer unknown words and structures. In addition, the beginning level subjects may not be able to learn much because they cannot comprehend the surrounding text well enough to take advantage of all the new language. Therefore, if research is conducted where a mixed ability class all read the same text, the subjects' chance of taking advantage of the same text is limited by their ability.
Both of these points imply that more accurate results for the effect of a reading text on learners of a particular ability level will be obtained by finding learners with similar pre-treatment abilities, and that mixing learners of different abilities may confuse the issue.

Insufficient reporting

In some studies there was excellent reporting and in others there was very little detail. For example, we know very little about the effect of the subjects' background in learning French in the Dupuy and Krashen (1993) study. In other studies the amount of reading that was done was left unreported (e.g. Elley and Mangubhai, 1983; Elley, 1991, experiment 2; and Constantino, 1994), or there is insufficient reporting on how much was read. Not knowing how much was read makes interpretation almost impossible, but a lack of detail can also affect interpretation. A common problem is for the researcher to report how many books were read rather than how many pages or how many words. If both advanced and beginning learners read the same number of books, a beginner would read easier, more illustrated books which are usually shorter than those an advanced learner would read, and thus the page count is different for each. Reporting page numbers is a better method than just counting the number of books, but it is preferable to report the number of words that have been read (although as publishers do not indicate the length of their books, this will be troublesome for researchers to calculate).
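The difference between counting books and counting words can be made concrete with a small sketch. The book lengths below are hypothetical figures chosen only for illustration; they are not taken from any of the studies reviewed here.

```python
# Hypothetical reading logs: the word counts per book are invented for illustration.
beginner_books = [1_500, 2_000, 1_800, 2_200]      # short, heavily illustrated readers
advanced_books = [18_000, 22_000, 20_000, 25_000]  # longer, higher-level readers


def summarise(book_lengths):
    """Return (number of books, total running words read) for one learner."""
    return len(book_lengths), sum(book_lengths)


beginner = summarise(beginner_books)
advanced = summarise(advanced_books)

print("beginner (books, words):", beginner)
print("advanced (books, words):", advanced)
# Both learners report the same number of books, but their exposure differs greatly.
print("exposure ratio:", round(advanced[1] / beginner[1], 1))
```

Under these assumed lengths, a report of 'four books read' describes two very different amounts of input, which is why a word count is the more informative measure.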

Full reporting is also needed so that studies can be replicated (Waring, 1997). The nature of ER research means that replication will be difficult. However, this is not to say that procedures should not be put in place to ensure that replication can be done. Unfortunately, much of the L2 ER research cannot be replicated because the research was specific to a particular group, group-specific tests were used and there was insufficient reporting to allow for careful re-construction.

Do findings for children apply to adults?

Many of these studies assessed the effectiveness of ER with children (i.e. those under about 15) learning second languages. This research on children is widely cited as relevant to L2 ER without the qualification that children learn differently from adults, and it is not altogether obvious that this research necessarily applies to adults, and vice versa. There are crucial differences that may give us pause when assuming that they are the same. Firstly, children are characterized as learning without much apparent analysis, freely and naturally compared to adults, whereas adults learning second languages are characterized as requiring a lot more effort. Secondly, the testing procedures that have been used in some of this research suggest that they correspond much more to the forms of assessment used with L1 children than to those used with adults. Thirdly, many of the younger children in some of these studies would not yet have developed many of the necessary cognitive strategies for dealing with longer texts in second languages and may not be as able as adults to benefit from ER. This has been little explored.

Another concern centres on the applicability of L1 tests to L2 subjects. Hafiz and Tudor (1990) and Lituanas, Jacobs and Renandya (1999) both used assessment instruments that were designed for L1 rather than L2 subjects, and their applicability to L2 subjects has not yet been explored.

Longer term to internalise

Laufer-Dvorkin (1981) concluded that the nature of the treatment meant that it was unlikely that there had been sufficient exposure to the target vocabulary to make a difference. Lai (1993) also hinted at this as an explanation of why the weaker group did not progress as well as the others. Tsang also suggested that the 'lack of gains ... may be caused by insufficient input' (1996 : 227). This raises the question of what we mean by 'extensive' reading. Susser and Robb (1990), when reviewing various applications of 'extensive', found that they ranged from a page per day to at least two books a week. If we are to label a piece of research as relevant to ER then we need to have a common understanding of what we mean by 'extensive'. Further work in defining 'extensive reading' and standardisation of this definition is necessary if we are to compare like with like. Nation and Wang (1999) suggest that 'a book a week at the student's ability level' is sufficient for enough vocabulary recycling to take place where learning is possible. This amount of reading seems an adequate benchmark for it to be called 'extensive' reading.

Citing the work of others

It is common practice within research to cite the work of others to defend or add weight or evidence to one's argument. This is also very much in evidence in this literature. Some of the citations have been very clear about the research and have mentioned shortcomings and qualifications where necessary (e.g. Tsang, 1996). However, there are also papers which cite the ER research literature as fact with little regard for the problematic nature of much of the research. More worrying are the odd occasions when results are cited that bear little relation to what the research actually said. Indeed on occasions a piece of research is so mis-cited that it is almost unrecognisable from the original 5. It is to be hoped that future reporting of this literature will be thorough and accurate, in particular for research of which copies are difficult to locate.

Analytical errors?

Finally, it needs to be mentioned that in some of the most widely-cited studies rather odd statistical data are reported 6. While the source of these errors may be typographical either on the part of the author or the publisher, or a more serious problem with incorrect statistical procedures, it does raise concerns about the thoroughness of the research or the level of care taken in presenting the work.



The review has raised more questions about L2 ER research than it answered. Despite the problems mentioned above it is almost certain that measurable gains for learners reading extensively can be found. However, the extent and type of these gains is unknown for various input conditions and we do not know which conditions 'make a difference'.

The premise behind some of these studies is to demonstrate that learners can learn from ER, when in fact it is rather a moot point as to whether learners can learn from reading. Of course learners learn new language from input. How else do they learn? Meara (1997) suggests this is like putting seeds in a pot only to confirm that they will grow into flowers. Thus this avenue of research is somewhat of a dead-end once the initial studies have confirmed the truism. The important question is 'how much of what is learned and how well is it learned?'

It is important to make a distinction between studies looking at gains from extensive reading and those looking at incidental learning from reading. When assessing gains from exposure to extensive reading we should expect low gains, and several reasons for this have already been presented. Commonly, it is recommended that the students read at a level where 95% to 98% of the words on the page are understood (Nation, 1999) or where there are two or three unknown words on a page (Waring and Takahashi, 2000). It is precisely because ER is done at levels where few new items are introduced that little vocabulary will be 'learned'. Thus massive amounts of reading will be needed to provide enough input to make a difference to vocabulary gains, and many of the shorter studies reported here may not have been able to reflect natural gains from extensive reading. This is not to say that gains in new language are the only important reason second language learners should read extensively. Other excellent reasons include building reading fluency and reading speed, developing lexical access speed, building reading confidence and motivation and so on. However, if research is looking at incidental gains from reading, and for the purposes of the research the researcher does not consider it important that the text is long and graded to the student's level, then we can expect results to vary depending on the ability level of the learner, on the amount of grading and on the amount of reading done. Thus when citing this research, it is important to understand that one type of study can expect different gains from the other, and that to cite them together without mentioning the difference may be misleading.
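The arithmetic behind this expectation can be sketched as follows. The page length of 300 words and the target of 1,000 meetings with unknown words are assumptions made purely for illustration; only the 95% and 98% coverage figures come from the recommendations cited above.

```python
def unknown_per_page(coverage, words_per_page=300):
    """Expected number of unknown running words on one page.

    words_per_page=300 is an assumed page length, not a figure from the literature.
    """
    return (1 - coverage) * words_per_page


def words_needed(coverage, target_unknown_meetings):
    """Running words to be read in order to meet the target number of unknown tokens.

    Tokens are counted, so a repeated unknown word counts each time it is met.
    """
    return target_unknown_meetings / (1 - coverage)


for coverage in (0.95, 0.98):
    per_page = unknown_per_page(coverage)
    total = words_needed(coverage, 1000)
    print(f"{coverage:.0%} coverage: about {per_page:.0f} unknown words per page; "
          f"about {total:,.0f} running words to meet 1,000 unknown tokens")
```

The higher the coverage, the more text has to be read before the same number of unknown words is met, which is the sense in which short treatments are unlikely to show large vocabulary gains.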

Horst, Cobb and Meara, looking from an 'incidental' learning perspective, suggest that "one way of improving the methodology of this kind of study would be to test much larger numbers of potentially learnable words in order to ensure that the subjects have ample opportunity to demonstrate incidental gains" (1998 : 219). The assumption is that we need to understand incidental vocabulary learning from ER, so if we create conditions where more of it occurs, then we will be better able to understand both the process and the product. This could be done either by testing more words that are likely to be tipped over the 'success threshold' (i.e. those words which the students are expected to partially know already), or by having words repeated many times in the text, thus raising the chance of success. Several researchers have already tried to test only words that the learners are likely to know, or have modified the input to make the words more available to the subjects (e.g. Day, Omura and Hiramatsu, 1991), and this also needs to be considered when interpreting 'gains'.

This reviewer is not convinced by this avenue of research for ER because if researchers only test items which are likely to be learned, then greater gains will be shown than those that would have occurred naturally from exposure and these results will only cloud rather than clarify the picture for natural gains from exposure. This means one needs to be cautious when comparing studies and looking at language gains from this kind of research design as this design may greatly over-estimate natural gains (dependent on how much manipulation occurred). It is therefore important in experiments that seek to ask how much is naturally learned from exposure to ER, that the words selected for assessing gains be randomly and naturally selected, and that the tests be published with the research.

This has other implications for the comparison of ER studies that look at vocabulary gains. Some studies (as part of the research design) have looked at the effect of frequency on the acquisition of vocabulary from reading and have controlled the frequency of the test words in order to ascertain what the effect of repetition is (e.g. Horst, Cobb and Meara, 1998; and Yamazaki, 1999). The assumption is that the more repetitions there are in a text, the more gains there are likely to be. (There is some limited evidence for this position from these two studies. In Horst, Cobb and Meara, 1998, gains are reported for words met 8 times or more.) Thus caution must be exercised when comparing studies which have carefully controlled frequency (larger gains can be expected) with studies that have not purposely controlled input frequency (smaller gains can be expected). Special care should be taken when citing the 'gains from exposure' research to identify which studies modified the input to increase the potential for gains and which did not. Care should also be taken when citing research with small populations or case studies (e.g. Cho and Krashen, 1994; Grabe and Stoller, 1997) because the gains made in these studies are much more likely to reflect what these individuals did than to provide us with a picture of the larger population which they represent.

On a somewhat broader note, there is an assumption underlying much ER research that the learning of a second language can be measured by assessing ability at time 1, then introducing the input conditions and re-testing at time 2. This also assumes that the knowledge of a subject is stable at time 1 and is stable again at time 2. This is despite the commonly-held assumption that a second language is in a state of constant flux or interlanguage development. Research has shown that there is an element of both stability and instability in a learner's knowledge (Waring, 1999). If this is so, then the current ER research method that assumes stability at test time does not necessarily reflect the underlying instability in language knowledge. This inherent instability in the reading acquisition process means that it is difficult to assume that the knowledge at time 1 is stable and thus can be compared to stable knowledge at time 2. This is a common problem with the pre- / post-test designs which are common in ER research. How can we say that our learners 'gained' X amount if the underlying nature of language acquisition means that the knowledge was unstable anyway? This matter is far from resolved and we may need more sensitive assessment procedures that take this inherent instability into account.
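One way to see why a single pre- / post-test difference can mislead is to separate the items that were gained from those that were lost. The sketch below uses invented item-level results for one hypothetical learner; it is not data from any study, and the item labels are placeholders.

```python
# Hypothetical item-level results (True = answered correctly); invented for illustration.
pre = {"w1": True, "w2": False, "w3": True, "w4": False, "w5": True, "w6": False}
post = {"w1": False, "w2": True, "w3": True, "w4": True, "w5": False, "w6": False}

# Items that moved from unknown at time 1 to known at time 2.
gained = sorted(w for w in pre if not pre[w] and post[w])
# Items that moved from known at time 1 to unknown at time 2.
lost = sorted(w for w in pre if pre[w] and not post[w])
# The figure a simple pre/post comparison of total scores would report.
net_gain = sum(post.values()) - sum(pre.values())

print("gained:", gained)
print("lost:", lost)
print("net gain from total scores:", net_gain)
```

In this invented case the total score does not change even though four of the six items changed state, which is the kind of underlying movement that a comparison of totals cannot show.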

The above raises questions about the linearity of L2 ER research and its suitability for finding out about ER. Meara's point (1997), that researching whether vocabulary or structures can be learned from reading is like putting seeds in a pot just to check that they will grow, demonstrates the linearity of the research design. The question is essentially linear firstly, because it assumes that after reading extensively, more of feature X is added on the end (much like a pay bonus gets added to your salary), and secondly, because it is also concerned with input conditions and how they affect outcomes. It is by no means clear that the question asked by the current research paradigm, 'how much will students learn from reading extensively?', is an appropriate one to ask. This question is very narrow in scope and can only provide a very limited answer. For example, asking how much vocabulary is learned will only provide information about vocabulary gains (often measured by 'new meanings' understood), but will not tell us how the newly learned words have been fitted into the interlanguage system, whether any backsliding (negative effects on other words) has occurred, or countless other aspects of vocabulary. Nor will it tell us whether the word's collocations are better understood, or whether many other aspects of word learning have been mastered. Thus under this paradigm, if we wish to find out about all the effects of ER on a multitude of linguistic factors we have to test for each one. This then raises the question of how we can pull together a large array of data into a coherent whole which can help us answer some of the questions raised by Alderson at the beginning of this paper.

A broader and more interesting research agenda is raised by asking the question 'what effect does ER have' or the other way round - 'what changes as a result of ER?' This question is not primarily interested in how much has been learned, but is concerned with how the lexicon or a student's interlanguage or her reading abilities have changed globally as a result of the reading. At present there are few ways that we can assess these changes other than globally (see Waring 1999, for one way to do this for vocabulary). This research agenda means that we have to come up with assessment procedures that can assess the changes that result from ER and that too is no easy task.

Research Agenda

In the meantime, and until more global measures are found, there are several important unanswered questions with ER. Here is a list of some questions in ER research into second languages that are as yet little understood.

What are the essential features that allow reading to be called Extensive Reading?

How do learners from different language groups / ability levels benefit from ER?

Can all learners benefit from ER? (Are gains from ER independent of learning style?)

When is the optimum time to introduce ER into the curriculum?

What are the minimum linguistic requirements for students to read extensively?

What is the best way to prepare learners for reading extensively?

How do gains in ER compare with those from other 'rich' forms of input?

Which types of reading strategy training can have an effect on ER ability?

What is an optimum relationship between intensive reading, extensive reading and reading strategy instruction for learners of different L1s and different abilities?

What is the relationship between ER and motivation and confidence in reading?

How does ER impact other language skills? Which skills? Why? Are there more effective methods than ER in affecting these skills?

At what stage in the acquisition process do learners benefit most from ER?

What volume of text is necessary for gains to take place at different ability levels?

What are the different effects on learners of reading ER text at i-1, i, or i+1? How does this change with proficiency level? Is this consistent for all learners?

Do different types of gains occur at i-1, i, or i+1?

What is the optimum comprehension level for most gains to take place?

What is the optimum reading speed for gains to take place?

What is the optimum level of i for developing reading speed?

Which features of an L2 are more readily picked up through ER than other methods?

Is there an acquisition order for the language that can be learned from ER?

At what level do learners from different L1s benefit most?

How fragile is learning from ER? What is the optimum recycle rate for structure and vocabulary within ER, and at what levels? What are the best ways to reinforce learning from ER?

What text types lead to the most gains?

What is the optimum balance between simplification and elaboration of the text for comprehension and for language gains? How does this vary by reading level?

Does simplification or elaboration of the text lead to more language gains?

How is pragmatic competence affected by ER?

Does general ER prepare learners for technical ER? How?

What is the best way for a teacher to assess language gains from ER?


This review has pointed out numerous problems with this research and some of these are quite serious (e.g. contamination, poor tests and test method, and poor research design). Of the 25 studies that investigated 'gains from exposure', 'gains from writing' or compared ER with other treatments, a full 100% were contaminated by a) the presence of outside tuition or exposure, b) controls who were not exposed to the tested vocabulary, or c) longer exposure to English for the ER group. Some of these studies suffered from all these forms of contamination. This lack of experimental control, mostly a result of the use of convenience populations, means that while circumstantial evidence supporting ER abounds, the presence of contamination factors undermines the research, as it cannot provide unequivocal evidence of the effectiveness of ER. This is hardly a strong research foundation upon which the house of ER rests. However, as was mentioned before, it is extremely difficult to find or create experimental conditions when the nature of ER means that we can only measure its effectiveness over time, but stringent efforts must be made to find and create these conditions.

This reviewer thus concludes in the same vein as Coady (1997), Horst, Cobb and Meara (1998) and Nation (1999) that the L2 ER research body is a severely troubled one from an experimental point of view, as it is hard to find problem-free studies. These troubles are much more in evidence in the 'gains' and 'comparison' research than in research on 'ER and affect'. While not all the research is of equal concern, there are several oft-cited studies that may not be able to live up to the claims made for them as relevant to L2 pedagogy.

This review therefore suggests that we should treat the findings for the effectiveness of ER from the 'gains' and 'comparison' ER research with very considerable caution. It also suggests that we should be extremely cautious in proposing that there is 'strong evidence for the value of ER' (Lituanas, Jacobs and Renandya, 1999), and that we should take Krashen's very strong claim about the effect of reading with a ton of salt. The research certainly does not give us enough evidence to support his position because much of the evidence we have comes from troubled research. Thus we are a very long way away from being able to answer Alderson's questions.

This review has not turned me into a disbeliever. I believe very strongly that ER has an important place (not the only place) in second language learning. I sincerely hope that a relatively trouble-free research base will emerge in the future, one that pays heed to some of the problems that have been found here and that can relieve me of my nagging doubts about the present quality of much L2 ER research. I also hope it will allow us to develop a reliable base upon which those of us who care about ER can rest our case. Until then, I will finish by saying that


ER is good for second language learners (especially for affect). The research does not yet support a stronger conclusion, however. Reading is probably one way, and only one way, we become good readers; it seems that through ER we can develop a good writing style, an adequate vocabulary, advanced grammar, and it may help us to become good spellers..... but we still do not have the evidence to be sure.




Alderson, C. Assessing Reading. Cambridge: Cambridge University Press. 2000.

Baddeley, A. Human Memory. Theory and Practice. Hove: Psychology Press. 1997.

Cho, K. and S. Krashen. Acquisition of vocabulary from the Sweet Valley Kids series: Adult ESL Acquisition. Journal of Reading, 37: 662-667. 1994.

Choppin, B. Correction for guessing. In Keeves, J. (Ed.) Educational Research, Methodology and Measurement. Oxford : Pergamon Press. 1988.

Coady, J. Extensive reading. In Coady, J. and T. Huckin. Second language Vocabulary Acquisition: A rationale for Pedagogy. Cambridge: Cambridge University Press. 1997.

Constantino, R. Pleasure Reading Helps, Even If Readers Don't Believe It. Journal of Reading, 37 (6): 504-05. 1994.

Day, R., and J. Bamford. Extensive reading in the second language classroom. Cambridge: Cambridge University Press. 1998.

Day, R., C. Omura and M. Hiramatsu. Incidental EFL vocabulary learning and reading. Reading in a Foreign Language. 7 (2): 541-551. 1991.

Dupuy, B. Voices from the classroom: Students favor extensive reading over grammar instruction and practice, and give their reasons. Applied Language Learning, 8 (2): 253-261. 1997.

Dupuy, B. and S. Krashen. Incidental vocabulary acquisition in French as a foreign language. Applied Language Learning, 4 (1): 55-64. 1993.

Elley, W., and F. Mangubhai. The impact of reading on second language learning. Reading Research Quarterly, 19: 53-67. 1983.

Elley, W. Acquiring Literacy in a Second Language: The Effect of Book-Based Programs. Language Learning, 41(3): 375-411. 1991.

Ellis, R. Modified Oral Input and the Acquisition of Word Meanings. Applied Linguistics, 16 (4): 409-41. 1995.

Evans, S. Extensive Reading: A preliminary investigation in a Japanese Senior High School. MA Thesis: Columbia University (Tokyo). 1999.

Grabe, W. and F. Stoller. Reading and vocabulary development in a second language: a case study. In Coady, J. and T. Huckin. Second language Vocabulary Acquisition: A rationale for Pedagogy. Cambridge: Cambridge University Press, 98-122. 1997.

Hafiz, F. and I. Tudor. Extensive reading and the development of language skills. English Language Teaching Journal 43 (1): 4-11. 1989.

Hayashi, K. Reading strategies and extensive reading in EFL classes. RELC Journal, 30 (2): 114-132. 1999.

Hedgcock, J. and D. Atkinson. Differing Reading Writing Relationships in L1 and L2 Literacy Development? TESOL Quarterly, 27 (2): 329-33. 1993.

Horst, M., T. Cobb and P. Meara. Beyond a Clockwork Orange: Acquiring second language vocabulary through reading. Reading in a Foreign Language. 11 (2): 207-223. 1998.

Janopoulos, M. The relationship of pleasure reading and second language writing proficiency. TESOL Quarterly, 20 (4): 763-768. 1986.

Joe, A., P. Nation and J. Newton. Sensitive Vocabulary Tests. Draft paper. Victoria University of Wellington, New Zealand. 1996.

Krashen, S. The power of reading. Insights from the research. Englewood, Co.: Libraries Unlimited. 1993.

Lai, F. The Effect of a Summer Reading Course on Reading and Writing Skills. System, 21 (1): 87-100. 1993.

Laufer, B. What percentage of text-lexis is essential for comprehension? In: C. Lauren and M. Nordmann (Eds.). Special language: from humans thinking to thinking machines. Clevedon: Multilingual Matters. 1989.

Laufer, B. What's in a word that makes it hard or easy: some intralexical factors that affect the learning of words. In Schmitt, N. and M. McCarthy (Eds.): Vocabulary: Description, Acquisition and Pedagogy: Cambridge, Cambridge University Press. 140-155. 1997.

Laufer-Dvorkin, B. "Intensive" versus "Extensive" Reading for Improving University Students' Comprehension in English as a Foreign Language. Journal of Reading, 25 (1): 40-43. 1981.

Lituanas, P., G. Jacobs and W. Renandya. A study of Extensive reading with remedial students. In Y. M. Cheah and S. M. Ng (Eds.), Language instructional issues in Asian classrooms (pp. 89-104). Newark, DE: International Development in Asia Committee, International Reading Association. 1999.

Mason, B. and S. Krashen. Extensive Reading in English as a foreign language. System, 25 (1): 91-102. 1997.

Mason, B. and S. Krashen. Can we increase the Power of Reading by adding more output and/or more correction? Texas Papers in Foreign Language Education. In press.

McQuillan, J. Reading versus grammar: What students think is pleasurable and beneficial for language acquisition. Applied Language Learning, 5 (2): 95-100. 1994.

Meara, P. Towards a new approach to modelling vocabulary acquisition. In Schmitt, N. and M. McCarthy (Eds.): Vocabulary: Description, Acquisition and Pedagogy: Cambridge, Cambridge University Press. 109-121. 1997.

Nagy, W., P. Herman. and R. Anderson. Learning words from context. Reading Research Quarterly, 20: 233-253. 1985.

Nation, P. and M. Wang. Graded Readers and Vocabulary. Reading in a Foreign Language, 12 (2): 355-380. 1999.

Nation, P. Teaching and Learning Vocabulary. Boston, Ma.: Heinle and Heinle. 1990.

Nation, P. Using dictionaries to estimate vocabulary size: essential, but rarely followed procedures. Language Testing, 10 (1): 27-40. 1993.

Nation, P. Learning Vocabulary in Another Language. English Language Institute Occasional Publication 19. Victoria University of Wellington, New Zealand. 1999.

Nation, P. The language learning benefits of extensive reading. The Language Teacher, 21(5): 13-16. 1997.

Pilgreen, J. & S. Krashen. Sustained silent reading with English as a second language high school students: impact on reading comprehension, reading frequency, and reading enjoyment. School Library Media Quarterly, 22: 21-23. 1993.

Pitts, M., H. White, and S. Krashen. Acquiring second language vocabulary through reading: a replication of the Clockwork Orange study using second language acquirers. Reading in a Foreign Language. 5 (2): 271-275. 1989.

Polak, J., and S. Krashen. Do we need to teach spelling? The relationship between spelling and vocabulary reading among community college ESL students. TESOL Quarterly, 22: 141-146. 1988.

Renandya, W., B. Rajan, and G. Jacobs. Extensive reading with adult learners of English as a second language. RELC Journal. 30 (1): 39-61. 1999.

Robb, T. N. and B. Susser. Extensive reading vs Skills Building in an EFL Context. Reading in a Foreign Language. 5 (2): 239-51. 1989.

Saragi, T., P. Nation and G. Meister. Vocabulary Learning and Reading. System, 6 (2): 72-8. 1978.

Smith, J. A quick measure of achievement motivation. British Journal of Social and Clinical Psychology, 12 (2): 137-143. 1973.

Susser, B. and T. Robb. EFL extensive reading instruction: research and procedure. JALT Journal, 12 (2): 161-185. 1990.

Tsang, W. Comparing the effects of reading and writing on writing performance. Applied Linguistics, 17: 210-233. 1996.

Tudor, I. and F. Hafiz. Extensive reading as a means of input to L2 learning. Journal of Research in Reading. 12 (2): 164-178. 1989.

Tudor, I. and F. Hafiz. Graded readers as an input medium in L2 learning. System, 18 (1): 31-42. 1990.


Waring, R. The Negative Effects of Learning Words in Semantic Sets: a Replication. System, 25 (2): 261-74. 1997.

Waring, R. Guest editor. "Special edition on Extensive Reading". The Language Teacher, 21 (5). 1997.

Waring, R. Tasks for Assessing Receptive and Productive Second Language Vocabulary. Ph.D. Thesis. University of Wales. 1999.

Waring, R. and S. Takahashi. The Oxford University Press Guide to the 'Why' and 'How' of Using Graded Readers. Tokyo: Oxford University Press. 2000.

Weltens, B. and M. Grendel. Attrition of vocabulary knowledge. In: R. Schreuder and B. Weltens (Eds.) The Bilingual Lexicon. Amsterdam: Benjamins. 1993.

Yamazaki, A. Vocabulary Acquisition through Extensive Reading. Unpublished Dissertation, Temple University. 1996.


Appendix 1






Laufer-Dvorkin, 1981

Compared 4 adult ESL groups on their reading development. One had ER only, one had intensive reading only, and 2 groups had both. All groups had reading skills training. Tested pre- and post using in-house tests of 3 skills.

Intensive class performed better on the post-test. The ER group said that ER was too superficial to be of benefit.

Unusual definition of ER (only 7-10 pages per class). 'ER' involved in-class reading and skills work. Not enough reading to make it an ER class. Outside exposure unknown. The classes were very similar in many ways. Test was mainly on strategies.

Elley and Mangubhai, 1983

A 2 year study compared 2 groups of 380 Fijian children (ESL) learning English with graded readers. Matched controls followed the normal English language program.

After one year, substantial improvement in reading and word recognition. After 2 years this extended to all aspects of L2 abilities including oral and written production.

Unknown how much was read.

Huge number of minus scores on some tests which makes interpretation difficult as few details about the tests are given.

The effect of outside exposure to English is unknown.

Janopoulos, 1986

79 adult ESL subjects were interviewed about how much 'pleasure' reading they did which was correlated with success on a written placement essay test.

Amount of L2 pleasure reading was associated (0.76) with English writing proficiency. No similar effect for heavy pleasure readers in L1. The relationship was correlational not causal.

Amount read may be only one factor in explaining better writing proficiency. Others may include higher motivation (to read or write), L1, previous experience with writing etc. None of these variables was controlled. It might also be that the ability to write well enables students to read more pleasurably and thus more is done.

Ferris, 1988

ESL university subjects read Animal Farm. Same M/C test pre and post. Controls only did the vocabulary test.

The subjects who read the book made significantly better gains on the test than the control subjects, who did not read it.

Confirmed the truism that subjects can learn vocabulary from reading. As the controls did no reading, we cannot say that ER made the difference, only that reading did. No retention data.

Pitts, White and Krashen, 1989

35 and 16 ESL subjects read 2 chapters (6700 words) of A Clockwork Orange. Group 2 also saw a video. M/C test. Controls only took the test.

Group 1 gained 6.1% (1.81 / 30 words; s.d. 4.26).

Group 2 gained 8.1% (2.42 / 28 words; s.d. 2.64).

Difficult text (50% did not finish).

Modest gains only. High sds show many students had zero scores. No retention data. Controls were not exposed to the vocabulary at all.

Robb and Susser, 1989

Compared the acquisition of a group of 63 Japanese college students (av. 600+ pages of graded readers at home and SRA materials in class) with a group who used a 'skills' based reading book (n = about 63). Assessed by a 4 skills test and reading speed.

ER > Skills on most skills measures.

ER > S on reading speed.

Anecdotal evidence that the reading helped their writing.

Reading skills were also taught with the SRA cards (and they had some intensive reading instruction in another class), so we cannot say SRA reading made the difference.

ER subjects read more than Skills subjects, therefore unequal time exposed to English.

Hafiz and Tudor, 1989

Compared 16 English-born Pakistani children (10-11 yrs) who read graded readers out of class with controls who did little reading. (Often cited as an L2 study, but the Ss were probably bilingual as most had been born and educated in the UK despite speaking Punjabi at home.)

On a battery of tests, the Exp. group significantly outperformed Controls on vocabulary, reading comprehension and writing. Relaxed atmosphere promotes growth.

The data were collected in the UK, so non-ER contamination was a very high possibility! Controls did no extra outside reading, so it does not follow that ER is better than any other form of reading, but the result may indicate that ER helps in some way (as could other factors).

Hafiz and Tudor, 1990

Compared 25 male 15-16 year old Pakistani children with matched controls. Exp. group had 4 hours of English plus 4 more hours of silent self-selected graded reading (90 hours total). 6 essays formed the test. Controls (n=24) only took the 6 essay tests.

Exp. group improved on almost all measures of vocabulary and writing ability. Best gains shown in fluency and in range of expression. Gains in writing accuracy are also strong.

Controls had no English input at all, so the gains in the ER group are meaningless for ER. Unclear if gains are due to the English class or the ER. Writing gains measured using L1 measures. Essay topic (a description) may have been biased towards the 'treatment' Ss because this group had had extensive exposure to descriptive discourse.

Day, Omura, Hiramatsu, 1991

92 High school and 200 College students in Japan read an edited 1032 word text and were compared with Controls who only took test and did no reading. 17 item M/C test.

Gains of 8.5% (HS) and 33% (Uni) in vocabulary (gains as % over what the controls knew) and significant differences for experimental vs control group.

The tested words were purposely selected to give the subjects opportunities to learn them, so the gains are inflated relative to what an unmodified text would yield. Very small test sample, so the data will not be very reliable. No pre-test to determine if the groups were different. Large variations for exp. groups. Controls were not exposed to the tested vocabulary.

Elley, 1991

Experiment 1

"The Niue study"

The 'Fiafia study' compared the learning of the same group of elementary-age children learning English from stories read by the teacher ('shared-book approach') and an audiolingual approach. 3 tests given.

Over the year, Ss gained 32% on Reading comprehension, 98% on Word recognition and 67% on Oral language.

The same children were used as the experimental group and control group one year apart, which invalidates between-method comparisons as the comparison started with the children at a different base ability. The gains scores are thus meaningless to ascertain whether the shared-book approach made the difference.

Elley, 1991

Experiment 2

"Book Flood"

Compared 3 groups of 9-11 year-old Fijians (2 classes over 2 school years) on Silent reading, Shared book approach and a control (audiolingual approach) group. Battery of tests given.

Shared-book students outperformed control on almost all measures as did silent reading group. The effect of outside exposure is unknown.

No data on how much was read. Different tests used in different classes and years. While the data show significant differences by method, the data on p.387 show decreasing scores on some tests in their second years in both classes and in all groups. Thus the residual gains scoring method effect seems rather large, or even inappropriate.

Elley, 1991

Experiment 3

"The REAP study"

3 separate studies of Singaporean 6 year olds who benefited from the Shared book approach (REAP Ss) were compared with an 'orthodox audiolingual approach'. Battery of 7 tests.

REAP Ss > non-REAP Ss in each study. Main effects for word recognition, Oral Language and Accuracy and comprehension measures. Significant differences found for vocabulary in 2 of the 3 studies.

Some 'contamination' as teachers and tests crossed between groups and new materials were introduced. No apparent pre-tests to ascertain if any of the groups differed pre-treatment.

Dupuy and Krashen, 1993

Wanted to ascertain how much vocabulary is learned from exposure. 15 learners of French read a text with colloquial words it was assumed Ss would not know. They also watched a video with 8 of the tested words. 22 Controls only had test.

Exp. group outperformed the 2 controls (14.9 vs. 8.0 and 8.9) on the 30 item m/c test (30% did not finish reading). Experiment confirms that Ss can learn from exposure.

Test biased to the ER group as Controls did not see the vocabulary. No randomization and no pre-testing. Few data on subjects' backgrounds. No correction for guessing. Some items on test susceptible to intelligent guessing.

dfs in the statistics seem very odd.

Lai, 1993

'Summer reading' study

Assessed the increase in reading ability of 3 experimental groups of 11-15 year old Hong Kong Ss (n=266) who read an average of 16 graded readers in and out of class with teacher guidance over a 4 week period. A standardised reading comprehension (RC) test and a reading speed test were used. The controls with a similar background had read graded readers and taken the same RC test in a different experiment.

It was found that in only 1 of the 3 groups the more Ss read, the more gains they made on the RC test. Gains in the tested areas were shown once the Ss had attained a certain proficiency level. The reading speed of only 2/3 groups increased significantly. Group 3 (who wrote an essay) showed some gains pre- and post in essay writing (not on all measures).

"As there were interaction and output activities during class time, we cannot say that ..... reading was the only factor affecting language development" (p.94). Results differed between the 3 groups who all had about the same input which points to uncertainty in supporting the universality of a comprehensible input position. Unclear why comparison data with the control group was not presented or analysed in the paper.

Cho and Krashen, 1994

This study assessed how much vocabulary was learned from ER. 4 adult ESL subjects (3 Korean, 1 Spanish) were asked to read Sweet Valley books in their own time. Koreans tested (translation) on words they underlined as unknown. Spanish subject given 165 word test of the words the Koreans underlined.

Number of words read, and their learning rate.

Mi-ae (56,000): 1 / 1,497

Su-jin (126,000): 1 / 7,200

Jin-hee (161,000): 1 / 19,634

Alma (70,000): 1 / 7,000
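Each rate above reads as one word learned per N words read, so the number of words each subject appears to have gained follows by simple division. The sketch below does that arithmetic; the helper name is hypothetical, and the resulting counts are derived here from the listed figures rather than quoted from the study.

```python
# Words read and learning rate (one word learned per `rate` words read),
# copied from the entries listed above.
subjects = {
    "Mi-ae": (56_000, 1_497),
    "Su-jin": (126_000, 7_200),
    "Jin-hee": (161_000, 19_634),
    "Alma": (70_000, 7_000),
}


def implied_words_learned(words_read: int, words_per_gain: int) -> float:
    """Hypothetical helper: words read divided by words read per word learned."""
    return words_read / words_per_gain


for name, (words_read, rate) in subjects.items():
    learned = implied_words_learned(words_read, rate)
    print(f"{name}: about {learned:.1f} words from {words_read:,} words read")
```

On these figures the implied gains are a few dozen words at most, even though each subject read tens of thousands of words.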

Anecdotal evidence that oral ability improved after reading.

Motivation and confidence in English increased.

Learning rate is very slow. Amount of vocabulary one can learn depends on one's starting level. Stronger readers meet unknown words less often and have to read more to gain one more word. Unclear if all unknown words were underlined. Questionable data from Alma (words from her own reading were not tested).

Constantino, 1994

Case study of 3 pleasure readers and 2 non-pleasure readers to assess levels of motivation from ER. Pleasure readers (PR) were encouraged to read as much as they could.

PR subjects' motivation in reading improved, as did their confidence and self-perception of general language ability. Non-PR Ss remained frustrated readers.

Anecdotal evidence of increased motivation. We do not know how much reading made a difference to motivation.


Tsang, 1996

Compared 144 11-17 year old subjects in a Hong Kong English-medium school in 3 conditions over 24 weeks. All students in the 3 groups had the same regular classes plus a treatment. Group 1 read 8 self-selected books and did 8 book reports. Group 2 wrote 8 extra English essays. Group 3 had 8 extra maths assignments with no extra English. Assessed pre- and post by writing on the same essay.

Significant gains in content and written language use for G1 compared to G2 and G3. Few gains in vocabulary seen.

Amount of ER in G1 was a small % of the total English input (8 books compared with 110 hrs of class time plus outside exposure from an English-medium environment). It is unlikely such a small amount of reading would have had such a major effect over the other groups. Other studies do not show such a difference with even 3 times as much reading.

Yamazaki, 1996

2 groups of Japanese High Schoolers were compared for 2 treatment conditions in one of several concurrent English classes over 9 weeks. Each student in G1 (n=31) read the same 18 graded readers out of class and did 9 'read faster' exercises in class. G2 (n=43) read and translated passages from graded readers and did vocabulary memorization tasks in class, and were given 9 translation tasks out of class. Assessed with a series of tests and on vocabulary selected from the graded readers.

There was no significant difference between the two groups, although both groups' vocabulary increased significantly (26% gain in the ER group). Delayed post-test showed vocabulary loss but not to pre-test levels. How often a word occurred in the text had no effect on whether it was learned. Positive influences on confidence in reading in English were reported for the ER subjects.

Selection of the tested vocabulary was biased to the ER group as they had been exposed to it all in their reading but the translation students had not. This may therefore mean that the vocabulary learning for the translation group is under-estimated. Also the time spent doing translation tasks probably outweighed the time spent reading graded readers, thus it is possible to conclude that the translation strategy may have been more successful. The 26% gain in vocabulary also comes from the other outside exposure, not only the graded readers.

Grabe and Stoller, 1997

A case study of an adult beginning reader of Portuguese in Brazil. Studied by reading the first page of the newspaper and looking up unknown words, by watching TV and other input. Assessed by in-house discrete translation vocabulary tests, reading comprehension (translation) test, listening tests and cloze tests.

Scores on all tests increased markedly. Results show that an intensive word study program can have dramatic effects on language development. (This study was included because it is sometimes cited as an ER study when in fact it is not in comparison to the others mentioned here.)

'Extensive Reading' here is meant to mean 'read widely but very intensively' (i.e. with lots of discrete word study and dictionary use). Large gains are not surprising in the initial stages of learning to read (all input is new and thus is available for learning).

Mason and Krashen, 1997

Experiment 1

Compared 30 reluctant Japanese college readers, who read an average of 30 books over a semester and wrote a diary in Japanese about their progress, with Controls who had 'intensive reading'. 100 item in-house cloze test.

Significant gains over a control group on cloze test. Anecdotal evidence of increased motivation and general English ability.

Exp group spent more time reading than did controls. Larger gains for weaker students are to be expected as they know less. High variation in gains in control group (possible cross-overs in ability?).

Mason and Krashen, 1997

Experiment 2

Compared 40 University and 31 Junior college students who read graded readers and wrote English summaries of each book with 39 and 18 Controls who had regular intensive reading instruction. 100 item cloze test.

Significant difference on cloze test for controls taught regular classes. Written summaries showed a marked increase in quality to 'good' from 'average' or 'not good'. 36 of 37 said their writing had improved. 32 of 38 said their reading improved their writing.

The extra time the ER group spent reading may have led to more acquisition.

The effect of the extra instruction in the students' concurrent English classes is unknown (6 unreported extra classes per week known at the university). Odd dfs on some data (possible data analysis mix-ups?).

Mason and Krashen, 1997

Experiment 3

Compared 2 groups who read graded readers (one wrote summaries in English n=40, the other in Japanese n=36), with a comparison group n=38 who did traditional intensive reading only. 100 item cloze test, written summaries pre- and post and a reading comprehension test (post only).

ER groups (combined data) outperformed the IR group on the cloze test (Japanese group did not outperform the control). The Japanese group outperformed the English group in writing. Both exp. groups outperformed Controls on reading comprehension test. Reading speed increased more for the Japanese group than the English, but both much better than Controls.

Possible floor effect on the Japanese group's reading speed. Little is known of the background of the three classes and other factors that may have contributed to gains. Their outside exposure to English was unreported despite there being 6 other English classes per week. Odd dfs on some data (possible data analysis mix-ups?).

Renandya, Rajan and Jacobs, 1999

Asked how much can be learned from ER. 49 Vietnamese read an average of 728pp in 6 weeks, and did book reports and talked about their books. Pre- and post in-house test. Ss completed a questionnaire on reading while in Singapore.

Significant but low correlations between gain scores on the test with

a) amount read while in Singapore (0.39)

b) amount read in L1 (0.45)

c) amount of newspapers read while in Singapore (0.36)

Outside reading and language study was also done in other classes, so we cannot be sure gains on the language test were from ER only. Over 60% of the variance unaccounted for on most tests. Data are correlational not causal.
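The remark about unaccounted variance follows from squaring a correlation: r squared is the share of variance in the gain scores that a single predictor accounts for, and the remainder is left unexplained. A minimal sketch using the three coefficients listed above (the function name is illustrative only):

```python
def unexplained_variance(r: float) -> float:
    """Share of variance NOT accounted for by a predictor with correlation r."""
    return 1.0 - r ** 2


# Correlations with gain scores, as listed in the entry above.
correlations = {
    "amount read while in Singapore": 0.39,
    "amount read in L1": 0.45,
    "newspapers read while in Singapore": 0.36,
}

for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}, r^2 = {r ** 2:.3f}, "
          f"unexplained = {unexplained_variance(r):.1%}")
```

Taken one at a time, each coefficient leaves roughly 80% or more of the variance unexplained, which is consistent with the comment above.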

Horst, Cobb and Meara, 1998

The teacher read aloud The Mayor of Casterbridge to 34 adult Omani subjects to assess how many words would be learned from exposure to them. A 45 item M/C test and a 13 item word association test were given as pre- and post tests.

Gain of 22% (5 words) of the unknown words in the M/C test. Gain of 16% (1.8 words) on the word association test.

When using a "gain formula", higher learning rates were found.

Shows that more able students can pick up more vocabulary. However, being read to may have required less cognitive effort and may have meant less 'learning'. Gain scores of other studies cited in this paper are under-reported. Outside exposure was likely in an intensive English program.

Hayashi, 1999

100 Japanese university sophomores had 90 minutes per week and read an average of 759 pages in a semester out of class. Subjects were assessed pre and post by a TOEFL test and a vocabulary test. A questionnaire on reading strategies was also given.

A 0.48 correlation between vocabulary test score and the number of pages read and a 0.43 correlation between reading test and number of pages read. Questionnaire data showed an increase in the students' perception of improvement in their reading, and that writing book reports helped their writing. Found that reading a lot in L1 and L2 was the 'most important factor for improving reading skills rather than just teaching reading strategies'.

The correlation data do not tell us that the high score was caused by the amount of reading as there are alternative explanations (e.g. having a high score may enable Ss to read more). This question was not resolved by performing statistical tests on the pre- and post TOEFL and vocabulary tests. The amount of outside exposure is not reported. We cannot say for sure whether it was ER that had an effect on language acquisition. No retention data collected.

Evans, 1999

29 experimental Japanese high school students read an average of 8 books over 6 months. 36 Controls had no extra English input. The two groups were matched on other English input. Ss were assessed on their gain scores on the Key English Test.

Both groups gained over the time period, but no between-group data are presented, so this confirms only that the Ss can learn from ER input.

Experimental group had more input from their reading and could be expected to gain more.

Mason and Krashen, in press

Compared 104 EFL Japanese female college English majors who had had no contact with English outside the class under 3 conditions. G1 wrote reports in L1, G2 wrote reports in English without teacher feedback, G3 wrote reports and had feedback. Summary writing ability (error free clauses and volume) assessed. Average of 1500 pages of graded reading for each of the 3 groups. 100 item cloze test.

100 item cloze test pre and post showed gains for all students but none significant between groups. Most gains were in writing. (No holistic / organization / coherence or impressionistic measures taken of their writing ability.)

The effect of the (unreported) extra instruction in the Ss' 6 other English classes is unknown.

No random selection.

No holistic evaluation of essays.

Unequal initial ability between groups? G2 started and finished much weaker at writing than G1 or G3 (possibly did not reach a threshold for the effect of the reading to show in writing). G3 considerably stronger starting writers.

Lituanas, Jacobs and Renandya, in press

2 groups of 30 matched pair 'remedial' 13-14 year old Filipinos were compared. While both groups had a regular English class, one group had an ER program (45% reading 45% ER reading skills) while Controls had a regular reading class. Tested pre- and post with 2 reading tests.

ER > Regular reading class on both Informal Reading Inventory and the Gray Standardized Oral Reading Test.

Reading tests are for L1 not L2 learners. Confirms that Ss can learn from input. ER subjects exposed to outside tuition so we cannot necessarily attribute gains to ER only.



1. In this paper ER is equated with 'Pleasure reading', 'Sustained Silent Reading' and other forms of reading where the texts are considered to be 'Extensive Reading' texts (see Day and Bamford, 1997, pp. 6-8 for a discussion of approaches to reading that can be considered 'Extensive Reading').

2. For the purposes of this survey children are defined as those under high school age or about 15 to 16.

3. For example, some items on Dupuy and Krashen's (1993) test seem to have been fairly easy to guess intelligently. The test also did not contain only the supposed 'colloquial' test items, and there are 3 spelling mistakes in the test items. This reviewer managed a score of 14 out of 30 with only a minimal amount of schoolboy French (as there are only 3 choices, wild guessing will get a score of 10). Another example of poor quality control is found in Cho and Krashen (1994), where one subject was tested on 161 words met by the other students. This makes it troublesome to interpret what she had gained from her reading, as she had not met the words she was tested on.
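The chance score quoted in this note is the expected number correct when every item is answered by blind guessing, that is, the number of items divided by the number of choices. A small sketch of that arithmetic, with an illustrative function name:

```python
def expected_chance_score(n_items: int, n_choices: int) -> float:
    """Expected number correct when every item is answered by blind guessing."""
    return n_items / n_choices


# Figures from the note above: a 30 item test with 3 choices per item.
chance = expected_chance_score(n_items=30, n_choices=3)
reviewer_score = 14  # the reviewer's score reported in the note above

print(f"expected chance score: {chance:.0f} out of 30")
print(f"reviewer's margin over chance: {reviewer_score - chance:.0f} items")
```

This is only the expected value; individual guessers will score above or below it.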

4. Here are two 'standard correction' equations

$$S = \frac{c}{c-1}\,E \qquad \text{or} \qquad S = R - \frac{E}{c-1}$$

S = the corrected score
E = the number of incorrect items
c = the number of choices
R = raw score

Using these two equations our hypothetical learner who knew 20 items would be awarded a corrected score of 21.33 for equation 1, and 18.67 for equation 2. Neither equation is perfect, as neither predicted the 20 items our learner knew. These standard methods of correction have been criticised ever since they were introduced in the 1920s because they can lead to negative scores and they ignore that a subject may have eliminated one or more choices (see Choppin, 1988 for a fuller discussion and other more complex equations).
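As a rough check on the two corrected scores, the sketch below applies both equations of this note as they are printed. The input values (4 choices, 24 items right, 16 wrong) are an assumption, not figures stated in this note; they were picked because they reproduce the 21.33 and 18.67 quoted above. The function names are likewise illustrative.

```python
def corrected_score_eq1(errors: int, choices: int) -> float:
    """Equation 1 as printed in the note: S = c / (c - 1) * E."""
    return choices / (choices - 1) * errors


def corrected_score_eq2(raw: int, errors: int, choices: int) -> float:
    """Equation 2 as printed in the note: S = R - E / (c - 1)."""
    return raw - errors / (choices - 1)


# ASSUMED inputs (not given in this note): c = 4 choices, R = 24 right, E = 16 wrong.
c, R, E = 4, 24, 16

print(round(corrected_score_eq1(E, c), 2))     # 21.33
print(round(corrected_score_eq2(R, E, c), 2))  # 18.67
```

Under these assumed inputs neither result equals the 20 items the learner is said to have known, which is the point the note makes.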

5. Ellis (1995: 424) for example, cites the Dupay (sic) and Krashen (1993) study as testing 42 (actually 15) L2 learners of French learning from 'Trois Hommes et un coffin' (sic) (couffin) after 80 (actually 40) minutes of exposure to reading.

6. In Dupuy and Krashen (1993) for example, a t-test was used to compare 15 experimental subjects with 2 control groups of 9 and 13 (i.e. comparisons involving 15 + 9 = 24 and 15 + 13 = 28 subjects respectively). The degrees of freedom were reported as 14 and 14 (the dfs for a matched t-test), when the standard way of calculating degrees of freedom in a normal t-test involving two independent groups is n - 2, where n is the combined number of subjects. Thus the dfs should be 22 and 26. If inappropriate procedures were applied to the data, this may have compromised the findings and the claims based upon these findings. Similarly confusing data are found in all three of the Mason and Krashen (1997) experiments and elsewhere in the L2 ER literature.
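The degrees of freedom discussed in this note can be checked with a few lines of arithmetic. The sketch below uses the group sizes given above; the function names are illustrative and stand for the textbook formulas for an independent-groups t-test and a matched (paired) t-test.

```python
def df_independent(n1: int, n2: int) -> int:
    """Degrees of freedom for a t-test on two independent groups: n1 + n2 - 2."""
    return n1 + n2 - 2


def df_matched(n_pairs: int) -> int:
    """Degrees of freedom for a matched (paired) t-test: number of pairs - 1."""
    return n_pairs - 1


experimental = 15
for control in (9, 13):
    print(f"{experimental} vs {control}: df = {df_independent(experimental, control)}")

# A matched test on the 15 experimental subjects gives the df that was reported.
print(f"matched test on {experimental} subjects: df = {df_matched(experimental)}")
```

The independent-groups formula gives 22 and 26, while 14 is what a matched test on 15 subjects would give.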