The ‘State Rating Task’ - An Alternative Method of Assessing Receptive and Productive Vocabulary.

 

Rob Waring

 

Published as: Waring, R. (2000). The ‘State Rating Task’ - an alternative method of assessing receptive and productive vocabulary. Kiyo, Notre Dame Seishin University: Studies in Foreign Languages and Literature, 24(1), 125-154.

Introduction

The tasks traditionally used for assessing second language receptive and productive vocabulary have been multiple-choice tests (Symonds, 1926; Symonds and Penney, 1930), translation tests (Takala, 1984; Morgan and Oberdeck, 1930), sentence or word completion tests (Waring, 1996; Laufer, 1998; Laufer and Paribakht, 1998) and definition tests (Schmitt, 1997). The multitude of tasks that have been used to assess receptive and productive vocabulary shows little agreement as to what might be considered a receptive or a productive task (Waring, 1999).

 

A review of the history of vocabulary assessment techniques (Waring, in preparation) has shown that many of the assessment techniques currently in use for assessing receptive and productive vocabulary have remained largely unchanged since their introduction over 70 years ago. Thorndike (1924), for example, wrestled with the construct definition of multiple-choice tests. His criterion for knowing a word was “to be able to define it passably, or recognize a definition of it among 3 or 4 wrong definitions”. He conceded that this definition of word knowledge was “lamentably vague, but is the best we can do at present” (p. 74). Seventy-five years later these criteria are still commonly used, and they are still lamentably vague.

 

Despite considerable research into the properties of multiple-choice tests (see Henning, 1991 for one example), the technique's ability to assess vocabulary knowledge remains largely unquestioned. Indeed, its use as a testing instrument both in research and in assessment is so widely accepted that it is the default choice for many. The same could also be said for the other types of tests mentioned above.

 

It is also common in the literature to find the terms recognition vocabularies equated with receptive or passive vocabulary and recall vocabularies with productive or active vocabularies. Often little care is taken to distinguish them (Melka, 1997). Most of the research into receptive and productive vocabulary has ignored the vocabularies of speaking, listening, reading and writing, but instead has involved asking how many words can be recognized or recalled in particular circumstances.  However, at the task level we often see the terms recognition test and receptive test used interchangeably, as are recall tests and productive tests.  This has led to considerable confusion in deciding how and why we should label a specific task as a receptive or productive one, or to put it another way, how to find an appropriate receptive or productive vocabulary test for a given receptive and productive vocabulary research question.  There are no clear-cut answers or guidelines, only a long tradition of certain tasks being selected as either a measure of receptive or productive vocabulary. It is no surprise therefore to note that it is rare in vocabulary research for the researcher to explain why a particular technique was used for assessing receptive and productive vocabulary and how the task or technique can be said to be measuring receptive and productive vocabulary.

 

By contrast, it is commonly accepted by researchers that knowledge of a word develops from the receptive to the productive, and thus that the two lie on a continuum of development (Melka, 1997; Piggott, 1981; Faerch, Haastrup and Phillipson, 1984; Palmberg, 1987). Indeed, dissenting views are hard to come by. It is not quite clear why this notion is so pervasive among those theorizing about the nature of receptive and productive vocabulary. Quite probably it stems from the notion that a word must be received before it is produced. This has led Henriksen and Haastrup (1998) and Haastrup and Henriksen (1998) to claim that a series of tasks can be said to lie on a knowledge continuum from the receptive to the productive. Furthermore, they suggest that tasks placed along this cline can assess a learner’s progress in the learning of a word as it moves from the receptive to the productive. However, this notion is not without difficulties. These tasks are shown in Figure 1.

 

Figure 1: An Operationalisation of a Knowledge Continuum and its relationship with kinds of tests.  From Henriksen and Haastrup  (1998:71).

 

 

The problems with the assessment of vocabulary knowledge and growth are legion (see Anderson and Freebody, 1981; Meara and Buxton, 1987; and Wesche and Paribakht, 1996). Chief among these problems is the unreliability of the task in assessing the construct being examined. For example, a multiple-choice test or a selection test is purported to assess a subject’s ability to recognize an item from distractors and thus it is often seen as a measure of receptive or recognition ability. However, as I have argued elsewhere (Waring, 1999: 62ff), successful completion of a multiple-choice item depends on a whole host of mental processes which involve both recall and recognition.

 

While we may be able to find suitable tests of immaculate sensitivity that can reliably form a hierarchy from selection tests to cued recall tests along the Henriksen / Haastrup continuum, this would do nothing to resolve the basic issue of whether these tasks are measuring receptive and productive vocabulary. My concern is with why we would want to do so. The fact that we may be able to find such a hierarchy does not make it a valid thing to do unless we understand how the tests relate directly to the underlying construct being tested. We may be able to show that two multiple-choice tests of differing sensitivity score higher than two cued recall tests of differing sensitivities, but we have not shown that we are measuring receptive and productive vocabulary. Indeed, Waring (1999) found systematic differences between the responses to a multiple-choice test and a cued recall test. We cannot assume that this is a systematic difference between receptive and productive vocabulary unless we conclusively demonstrate that these tests assess one and only one of these constructs. Two points arise from these findings.

 

Firstly, any difference between receptive and productive test scores obtained by assessing word knowledge along a continuum may be inherent either in the task or in the knowledge underlying performance on the task. Although it would be possible to find a set of tasks that reliably discriminate between levels of task difficulty, we will not necessarily know whether this relates to a similar scale of word knowledge. This implies that we will not find a continuum of knowledge from the receptive to the productive along which tasks may be placed, because we cannot determine to what degree responses to a test are a result of the task demands or of the underlying knowledge. There may still be a continuum from receptive to productive vocabulary, but these tests are not suited to demonstrating it.

 

Secondly, we may conclude that these tests are not really ‘receptive’ or ‘productive’ tests at all, but rather something more complex. It has yet to be shown that demonstrating one’s ability to discriminate between distractors is a valid way of assessing recognition or even receptive vocabulary. Similarly, it has yet to be shown that demonstrating one’s ability to do sentence completion tasks is a valid way to demonstrate recall or even productive vocabulary. Even if we could show this, further requirements would have to be met before we could compare receptive and productive vocabulary using these tests. Researchers would have to show that the two tests were wholly separate in the knowledge sources they assessed. The receptive test would have to measure receptive knowledge and only receptive knowledge, and the productive test would have to measure productive knowledge and only productive knowledge. We must show that no productive knowledge is required in the receptive test and no receptive knowledge in the productive test. If there were a cross-over in knowledge use, then the results would be less than clear. Similarly, if we see receptive tests as tests of one’s ability to recognize test items and productive tests as tests of one’s ability to recall, we also face difficulties. Such difficulties include showing that a multiple-choice test is, and only is, a test of recognition and that a cued recall test is, and only is, a test of recall. This is clearly a very difficult task and not one that is likely to be resolved soon.

 

Confusion between the test format and the mental processes underlying the successful completion of a test item lies at the heart of the difficulties we have just found. There will always be a tension between the construct, the theory and the test format, and we must operationalize the construct in the best manner we can. But we must also be clear about some of the sources of confusion that surround these issues. The terms recognition and recall denote both test formats (a recognition test or a recall test) and mental processes (the process of recognition and the process of recall). The empirical evidence for calling a recognition test a test of recognition, or a recall test a test of recall, is thin on the ground, despite the tradition of doing so. It is not self-evident that the two are in unity.

 

Typically, when a researcher is measuring recognition or passive or even receptive vocabulary, a ‘recognition test’ is the measuring instrument used. ‘Recognition tests’ are typically framed as those that require a subject to select or recognize (the mental process) the target from a given number of responses, as in a multiple-choice test or a matching test. A ‘recall / active test’ is typically framed as one that requires a subject to recall (the mental process) previously learned material. Tests of active vocabulary are usually sentence or word completion, essay writing and even L1 to L2 translation. Immediately we can see a rather broad spectrum of test formats that cover these notions of recognition / passive tests and recall / active vocabulary tests, and it is not clear that only recognition or recall is involved.

 

A similar argument may be made for active and passive tests and active and passive vocabularies. As was shown above, active and passive or receptive and productive vocabularies are generally described as those for speaking and writing or listening and reading. Morgan and Oberdeck (1930), for example, use this description in their measurement of active and passive vocabulary, but their tests are not of speaking or writing, nor of listening or reading. Their active and passive tests are L1 to L2 translation and multiple-choice recognition. It is difficult to see what these test formats have to do with the four vocabularies of speaking, writing, reading and listening. Melka reiterates this point in her survey of the literature by saying that “it is not obvious that any particular form of test is either specifically or adequately suited for testing either Reception or Production” (1997: 97).

 

Even with the vaguest and most liberal criteria for defining receptive vocabulary and productive vocabulary it is extremely difficult to disentangle what each test format is actually testing. If two of these tests are used side by side, with each assessing either so-called receptive or so-called productive vocabulary, then the difficulties become even more noticeable. The complexities are particularly acute if one is attempting to measure a particular trait or knowledge source. Thus clear and consistent definitions of the products and processes involved may elude us for some time.

 

In all tests there will never be a perfect match between the desired word knowledge and the ability of the test to assess this knowledge. As we have seen, a rather large assumption must be made that a multiple-choice test is assessing receptive or recognition vocabulary. Similar difficulties were highlighted with other instruments. A central concern in finding suitable tasks to assess receptive and productive vocabulary is finding tasks that require as few assumptions as possible.

 

Self-Report tasks and Knowledge Scales

All the tests that have been mentioned above are objective tests (tests for which there is a correct answer). We have seen many of the difficulties involved in assessing receptive and productive vocabulary using these test formats. There are, however, other testing procedures we can adopt that may prove to be suitable alternatives to the use of objective tests.

 

Any task we may use for researching receptive and productive vocabulary acquisition should be able to reflect the developmental patterns of receptive and productive vocabulary. Ideally this should be done in ways that require few assumptions that a particular aspect of word knowledge is being shown in the task. This implies that the task construct should be as clearly defined as possible so that we can match a task method with the construct. That is, the task should be able to do what it is setting out to do.

 

It is likely that there is an underlying knowledge about words which is mediated by the mental processes of receiving and producing language. The receptive and productive vocabulary product (the test score) is probably the result of the interaction between the mental processes of reception and production and the underlying word knowledge. The degree of control one has over the interaction between the processes and the underlying word knowledge may also be reflected in the surface receptive and productive vocabulary product. In other words, it is not just what you know, but the ability to control what you know, and to reflect on this knowledge, that can be reflected in performance on a test. Thus, performance on a test is a function of one’s underlying knowledge and one’s ability to control what happens in the receptive and productive mental processes. From a researcher’s point of view it would behove us to find a task that allows us to access word knowledge without having to mediate it through a complicated task procedure. This implies we should find a task that requires the smallest level of assumption that it is actually accessing the word knowledge we want to assess. Therefore, tests such as multiple-choice tests or cued recall tests may not be helpful to us, as they require a very high level of inference that they are directly accessing the word knowledge we want to assess.

 

The simplest way for a researcher who is interested in finding out whether a learner knows a word is to ask her about it. The advantage of this is that it strips away a layer of assumption that the information desired is what is being given by the learner. For example, if a researcher wants to know whether a subject can ‘understand’ a given word, the most obvious way would be to ask her rather than have her knowledge mediated by the additional layer of assumption that would be present in an objective test. If the subject’s knowledge is mediated by a multiple-choice test (or some other test) then a necessary assumption needs to be made that the multiple-choice test does indeed measure her understanding and not something else. A self-report measure directly asks about the knowledge needed and thus no such large assumption is required. These self-report types of task may prove useful in stripping away some of the multiple layers of inference between the test construct and the sought-after knowledge.

 

However, the use of a self-report measure is not without its own difficulties. Chief among these is ensuring that the knowledge the subject provides us with is an accurate reflection of her competence. We must also ensure that we obtain this information in reliable ways. Other difficulties concern whether subjects can reliably report what they know, or even whether they ‘know’ what they know. The use of self-report tasks within second language receptive and productive vocabulary acquisition is an unexplored area and one that is in need of exploration. But first we must find a way to do it.

 

In recognition of many of the problems with traditional vocabulary assessment tools, considerable progress has recently been made in the search for alternatives to these tests. Dominant among these new alternatives are the Vocabulary Knowledge Scales. These scales are of varying types, all of which are self-report tasks. A self-report task is one which requires the learner to respond to a vocabulary test item by expressing that knowledge in her own words, or as a response to some pre-defined response boundaries. For example, a learner may be asked to say whether she knows a word within the boundaries of a YES/NO task (also known as a checklist task). Alternatively, she may respond by selecting one of a series of responses along a scale of word knowledge such as Zimmerman’s (1997) scale. The learner responds as below.

 

Figure 2: Zimmerman’s Knowledge Scale

a)    I don’t know the word

b)    I have seen the word before but I am not sure of the meaning

c)     I understand the word when I see it or hear it in a sentence, but I do not use it in my own speaking or writing

d)    I can use the word in a sentence

                          table  d                bride  c                 wealth a

Other examples of such scales can be found in Eichholz and Barbe (1961); Heim and Watts (1958, 1961); D’Anna and Zechmeister (1991); Wesche and Paribakht (1993); and Paribakht and Wesche (1996).

 

The essential difference between the objective tests we have already examined and a self-report task is that the learner rather than the examiner decides at what level the word is known or not. The degree of knowledge that can be demonstrated in a multiple-choice test is decided by the test creator, whereas in the self-report task no such assumption is needed as there are no right or wrong answers. Essentially, self-report tasks are based on the notion that we are able to say how we feel, what our opinions are, how we would, or do react to certain stimuli. In other words these instruments ask about our metacognitive knowledge - the knowledge about our learning. 

 

There are several advantages to using self-report tasks like Zimmerman’s. Firstly, as only a single judgment is required, the task can be performed quickly and easily. This means many words can be tested in a very short space of time. Subjects can rate 150 items in 10 minutes using such a task, a rate of response that would be impossible on multiple-choice tests. Secondly, as there is no ‘right answer’ to these tests, one does not have to be concerned with the effect of distractors, contextual factors and so on, which can greatly affect performance on a test. Thirdly, with objective tests a researcher has to assume that the test construct is assessing a particular type of vocabulary. In Zimmerman’s rating scale, however, the vocabulary knowledge being assessed is stated explicitly. In this task the learner is being directly asked ‘how well do you know this word?’ Therefore, we can be far more confident that the test construct matches the knowledge being provided by the learner.

 

A more recent incarnation is the Vocabulary Knowledge Scale (VKS) of Wesche and Paribakht (1993) and Paribakht and Wesche (1996). This scale differs from the Zimmerman scale in that it requires verifiable evidence of the knowledge held at Levels III, IV and V. Their particular aim is to have the VKS seen as a “practical instrument for use in studies of the initial recognition and use of new words” (1996: 29). The basic idea of the scale is to measure progressive degrees of word knowledge. This is their scale.

 

Figure 3:  The Vocabulary Knowledge Scale from Wesche and Paribakht (1993)

 

I:     I don't remember having seen this word before

II:    I have seen this word before but I don't know what it means

III:   I have seen this word before and I think it means ________ (synonym or translation)

IV:   I know this word. It means __________ (synonym or translation)

V:    I can use this word in a sentence. e.g.: ___________________ (if you do this section, please also do section IV)

 

These self-report Vocabulary Knowledge Scale tasks all attempt to assess aspects of receptive and productive vocabulary. In their present forms, however, there are some difficulties. Firstly, there is an assumption that receptive vocabulary is lower on the scale than productive vocabulary. This is reflected in the assignment of a linear ordinal scale to word knowledge levels. Although this may seem plausible or even logical, it remains a matter for theorists to demonstrate. The tests are heavily weighted in favour of receptive ability, and with only one item at the productive level there is insufficient evidence of the depth of knowledge of that ability.

 

Secondly, as Read (1998) and others have noted, Knowledge Scales suffer from description difficulties which make them internally inconsistent in several ways. A variety of keywords are used as knowledge prompts, such as know, have seen, means and can use, which can lead to confusion. A learner could know a word’s pronunciation, for instance, but never have seen it in writing. This difficulty is particularly evident in D’Anna and Zechmeister’s scale. There are several sub-scales of word knowledge and control which lead to an unnecessarily complex scale. This complexity makes it difficult to disentangle the aspects of word knowledge being assessed.

 

Thirdly, the assignment of ordinal numbers to the stages of the scale makes interpretation rather difficult. The major problems come from the mean scores, derived from these ordinal numbers, that are used in experiments based on a VKS measure. For example, what would a score of 3.5 on a pre-test and 3.7 on a post-test mean? Can we say that a learner has ‘gained’ knowledge in the interim? If the scores were 2.1 and then 4.8, the problem would be less obvious, as it would be easier to show that knowledge had been gained, but the question of what it meant would remain. Does it mean, for example, that a gain of 0.2 means the word is better known, or more often recognized, or better used? This kind of reporting hides valuable developmental patterns. Two learners, both with a mean rating of 2.5, might have completely different profiles. One learner’s average rating may be made up of 10 unknown words and 10 Level V words, whereas a second learner’s scores may be made up of equal numbers of Level II and Level III ratings.
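To illustrate the point, here is a minimal sketch using hypothetical ratings rather than data from any of the studies cited here; it assumes the five scale levels are simply scored 1 to 5 and shows how two quite different rating profiles collapse onto the same mean.

    # Hypothetical illustration of how averaging ordinal scale ratings hides
    # developmental profiles: two learners with identical means know very
    # different things about their word sets.
    from collections import Counter
    from statistics import mean

    learner_1 = [1] * 10 + [5] * 10   # ten Level I words and ten Level V words
    learner_2 = [2] * 10 + [4] * 10   # ten Level II words and ten Level IV words

    for name, ratings in [("Learner 1", learner_1), ("Learner 2", learner_2)]:
        print(name, "mean =", mean(ratings), "profile =", dict(Counter(ratings)))

    # Both means are 3.0, yet Learner 1 mixes full productive knowledge with no
    # knowledge at all, while Learner 2 has partial knowledge throughout.

A State-by-State breakdown of the same profiles, of the kind argued for below, would keep this difference visible.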

 

Fourthly, Wesche and Paribakht are attempting to verify knowledge along a scale but it is not altogether clear whether we could find a scale of task difficulty as this does not necessarily relate to receptive and productive vocabulary knowledge. As we have seen, the uses of such tasks to verify knowledge along these scales might complicate rather than clarify matters. It would seem from this that attempting to verify knowledge held by learners at various levels is an extremely complex business and one which involves many rather arbitrary assignments of tests to knowledge stages.

 

Lastly, there is a confusion within the Wesche and Paribakht scale regarding linearity. Their scale is both linear and non-linear at the same time. The linear nature of the Knowledge Scales is due to their scoring method.  The stages are arranged in an order and each is assigned a score according to how much it reflects the learner’s overall word knowledge.  However, the knowledge asked of the learner at each stage involves more than one aspect of word knowledge.  For example, Level III of Wesche and Paribakht’s scale asks for a response to I have seen this word before and I think it means ________. The two aspects of knowledge are have seen this word before and I think it means _____. One aspect of word knowledge is to do with recognizing that the learner has met the word before, and the other is that she can ascribe a meaning to the word. These two aspects of word knowledge are being assessed but they are represented by a single stage.  The other stages in the Wesche and Paribakht scale function similarly. In Wesche and Paribakht’s scale the knowledge required is multi-faceted and thus not linear, but the scoring is linear. This conflict within the Wesche and Paribakht scale means that their scoring method makes the non-linear nature of the knowledge into a linear scale, and thus misrepresents the nature of the knowledge provided by the learner.

 

There seems to be considerable tension in the conceptualisation of what the levels or stages mean in some of these Vocabulary Knowledge Scales. These instruments are attempting to measure the various stages of acquisition. However, their stages are essentially not on a linear scale because the data are nominal. Each stage of these Scales has multiple knowledge sources, which means they are not really stages but States of knowing a word. If we conceive of the stages in Vocabulary Knowledge Scales as States, then each State is functionally independent and does not fit on a linear scale. This means that we should not score responses to such a task with a linear scoring scale. We may wish to label a State with a number, but we cannot add up scores and divide them to make averages as Wesche and Paribakht did. This view of word knowledge as States of ‘knowing’ is captured in Multi-State models of vocabulary testing, to which I shall now turn.

 

Multi-State Models of Vocabulary Testing

Traditionally, the assessment of vocabulary knowledge has been done by taking a snap-shot of vocabulary ability at time 1 and comparing it with that at time 2. This method assumes that vocabulary knowledge is stable at both data collection times. It is also concerned with how many words are learned or retained rather than with what has changed in the lexicon as a result of the treatment. Multi-State models do not assume that words are stable in the same sense as the more traditional methods of assessing vocabulary. These models are concerned with the levels of stability and instability in the lexicon and seek to follow how words may be known on one day, forgotten the next, but used the day after (Meara and Rodriguez Sanchez, 1993).

 

Multi-State models investigate the ability of humans to discriminate between known and unknown information.  This skill is the basis for decisions regarding learning (e.g. Bianz, Versonder and Voss, 1978 cited in Zechmeister, Rusch and Markell, 1986), knowledge of an event (King, Zechmeister and Shaughnessy, 1980) and reliability of statements about recognition vocabulary (Hart, 1965, 1967).

 

It is widely accepted by language researchers and language learners that vocabulary learning in general is not an ‘all-or-nothing affair’ but is more likely a rather messy process.  It is also widely accepted that we hold varying degrees of word knowledge.  Full mastery of many words would be difficult if not impossible to attain for many L2 learners and indeed some native speakers. If we accept this, then we can identify certain basic States of ‘knowing’ of words. Multi-State models of vocabulary testing assume that the knowledge of any word can be said to be in a certain State of ‘knowing’ at any moment in time.  Examples of such States may be a ‘no knowledge’ State (State A), a ‘partial knowledge’ State (State B) and a ‘knowing’ State (State C).  A learner might respond to a test word or phrase by saying that she knows table well enough to assign it into State C, but has never seen perambulate and rates it A. Similarly, she knows something about explore and rates it as a partial knowledge State B item. As several States of knowing are often identified it might be more appropriate to call models of vocabulary testing based on this idea ‘Multi-State’ models.

 

A Multi-State model assumes that all States are interconnected with no necessary assumption that one State is higher or lower in knowledge than another along a continuum. An example of a 5 State Multi-State model is presented in Figure 4.  This shows the five possible States into which a learner may rate her knowledge of a single word. The labels for each State in Figure 4 are examples that have been taken from the Knowledge Scales above. The interconnectivity of all States allows a learner to rate her knowledge in a particular State at a given time. In this way we can obtain a profile of the knowledge held at a point in time for a set of words. 

 

Figure 4: An example metacognitive Multi-State model with 5 States.

 

In a Multi-State model, words are seen to be in a particular State at a given time (t1) such as in Figure 5.

 

Figure 5: Several words rated by State at time 1

There is no assumption that there are words in all States, but all words should be in a particular State. This way of looking at word knowledge also allows us to follow changes in knowledge over time. If we collect data on a whole set of words we will be able to see the developmental patterns in the set rather than at the item level.  The holistic view taken by Multi-State models is one that holds that the whole rather than the individual items will tell us about the patterns of development.  This development can be seen when a learner rates her knowledge of a set of words at several data times.  For example, a learner may rate her knowledge of a word at time 1 as State B and at time 2 (t2) she may rate it as State D. However, it is important to note that by choosing State D at t2 there is no assumption in this model that State C has been passed through on the way as was implied in the linear scoring method of Knowledge Scales.  Figure 6 shows how the words have changed by State at t2.  We see that the same words are there, but several more have been added. Also some words have moved between States.

 

Figure 6: Several words rated by State at time 2
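Since Figures 4 to 6 are schematic, the following is a minimal, hypothetical sketch of the kind of State snapshots they depict: each word is assigned to exactly one State at each data time, and development is traced simply by listing which words changed State. The words and ratings below are illustrative only.

    # Hypothetical State snapshots at two data times; no assumption is made
    # that a word passes through intermediate States between t1 and t2.
    t1 = {"table": "A", "storm": "B", "widow": "D", "threaten": "E", "branch": "E"}
    t2 = {"table": "A", "storm": "A", "widow": "B", "threaten": "D", "branch": "E",
          "flower": "D"}   # a word added to the test set at t2

    for word in sorted(t2):
        before = t1.get(word, "not rated")   # words new at t2 have no t1 rating
        after = t2[word]
        if before != after:
            print(f"{word}: {before} -> {after}")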

 

Multi-State models of vocabulary testing are not primarily concerned with how lexicons grow. Growth can imply a single direction either towards a single point as in a scale of development, or along a continuum or cline.   Multi-State models of vocabulary testing are concerned with how lexicons change.  Change implies a multi-dimensional lexicon that can develop into a more complex, more highly intraconnected or interconnected unit.  This does not necessarily mean that the lexicon is getting ‘larger’. There are many ways that change can occur within a lexicon.  For example, it can mean a lexicon that attrites, or one that is undergoing some systematic change or one that is acting in a near random manner. This simple distinction between investigating vocabulary development as ‘growth’ and as ‘change’ implies a fundamentally different way of approaching the measurement of vocabulary development from Knowledge Scales and traditional tests. Multi-State models concern themselves with the developmental patterns at work underlying the changing nature of lexicons. ‘Growth’ then, from a Multi-State model view is the by-product of ‘Change’.

 

This model of vocabulary acquisition implies that we should be able to construct a task or series of tasks that will allow us to map the changes happening within the lexicon over time.  Such a task might thus be called a State Rating Task (SRT).  The essential properties of such a task would be ones that allow a subject to rate each word into a different State of knowing with no necessary assumption that one State is higher or lower than another. As ratings of words in a SRT involve knowledge about perceptions of word knowledge rather than verifiable evidence for word knowledge (say multiple-choice tests) it is essential that we understand the metacognitive basis of responses to tasks based on Multi-State models.

 

Palinscar and Brown (1989 cited in Schouten-van-Parreren, 1994) state that metacognitive knowledge is i) insight into the quality of one’s own knowledge, skills, strategies and attitudes, and ii) insight into the demands of the task.  This knowledge is the relatively stable information which humans have about their cognitive processes and those of others (Flavell and Wellman, 1977).  Learners are also able to refer to the ‘domain knowledge’ about the topic which they are learning. Domain knowledge includes conceptual and factual knowledge and the manner in which this knowledge is organized and communicated. Vocabulary knowledge and world/content knowledge are examples of domain knowledge (Grabe, 1991). Wenden (1998) states that learners generate their own hypotheses about factors which contribute to learning. Research into metamemory has shown that learners make some attempt to validate their own hypotheses and they link them together in a logical fashion.  This led Wenden to conclude that

 

“these terms point to the following characteristics of metacognitive knowledge and beliefs [they are]:

(1) a part of a learner’s store of acquired knowledge

(2) relatively stable and statable

(3) early developing

(4) a system of related ideas

(5) an abstract representation of a learner’s experience” (1998 p. 517)

 

This suggests that learners are able to report systematic, stable and principled insights into their own knowledge of words. This should not be surprising, because part of the formulation process that goes into making a sentence involves, at least in part, a judgment about whether a word can be used or not. If a learner is thinking of how to express a particular concept, she may come up with two or three suitable words or phrases, but may decide not to use a particular form because she does not have the confidence to know how to fit it with other words. On the other hand, she may know it well enough to demonstrate that knowledge in another way, such as on an objective test or in written work where more time is available. It is therefore not surprising that learners know whether they can use words or not and can demonstrate this in principled and systematic ways.

 

These kinds of tasks rest on the assumption that i) we can report what we know and ii) we report it reliably. Schouten-van-Parreren (1994, 1996) tested these notions with 75 Dutch learners learning French. The subjects were asked to rate their knowledge of a set of words into the 4 categories of have never seen the word before, have seen the word but forgotten its meaning, know the ‘approximate’ meaning and know the exact meaning.  Schouten-van-Parreren thus asked for ratings for knowledge of meaning. After the survey, an unexpected test of their receptive knowledge was given.  The correlation between the metacognitive task and the vocabulary test was a respectable 0.59, which is fairly high for 75 subjects. In a similar test with different groups ‘learning’ words the correlations were higher at 0.79 and 0.81 for two measures. Dolch (1932) also found that children know which words they don’t know. Barrow, Ishino and Nakanishi (1999) in a large-scale study of word knowledge also show that second language learners are capable of introspecting into their own knowledge.  Thus it seems that such tasks reliably tap word knowledge to at least a limited degree. 

 

A State Rating Task

 

Confusion surrounding the definition of the terms receptive vocabulary, productive vocabulary, active and passive vocabulary and so on is at the root of the problems facing their measurement. The terms Understanding and Use vocabulary are relatively free of the complexities that, by association, afflict receptive and productive vocabulary and active and passive vocabulary. Thus, henceforth I shall direct my attention to Understanding and Use vocabulary within the context of a State Rating Task. Broadly speaking, Understanding vocabulary is that vocabulary which a learner understands when it is met in reading or listening, and Use vocabulary is that which a learner can use in writing and speech.

 

Following the above there are several qualities a State Rating Task might have if it is to function as a valid task of the self-rating of Understanding and Use vocabulary knowledge.

 

A)       The design must be transparent to the subject using it. 

B)       There must be a way for the subjects to show the difference between ‘known’ and  ‘unknown’ words and be able to report this reliably.

C)       The rubric must be able to allow for multiple levels of knowledge of Understanding and  Use vocabulary.

D)      The rubric should be simple enough so that at first meeting most of the intent of the task is immediately understood.  It should take little time to learn.

E)       The rubric’s wording or visual structure should be such that it can be understood by the second language learners who are going to be using it.

F)       There must be some ways in which the ratings that are being assessed can be verified. 

 

In order that a SRT meet criterion A above, a group of second language learners was consulted about the construction of the SRT that could best be used by them. The subject group that helped in the assessment of the various rubrics consisted of a class of second language learners who met 5 days a week over a four-month period. 46 separate SRTs and tasks built around several major themes were presented to the subject group over this period for their assessment. The intention of this search was to try to meet the criteria laid out above. Space does not allow for a detailed analysis of the strengths and weaknesses of all 46 SRTs tested; details of these, and of the final version presented here, can be found in Waring (1999, Ch. 8).

 

The final version of the 46 SRTs uses the two knowledge sources of Understanding and Use vocabulary and is presented in Figure 7.

 

Figure 7: The Final Version of the SRT

                                       I think I understand          I understand
                                       this word                     this word

I do not know this word                                  E

I don’t know how to use this word      D: I think I understand       C: I understand this
                                          this word but I don’t         word but I don’t
                                          know how to use it            know how to use it

I know how to use this word            B: I think I understand       A: I understand this
                                          this word and I know          word and I know
                                          how to use it                 how to use it


        table      A                   threaten     E
        flower     D                   widow        B
        branch     E                   storm        A

 

When doing this task the subject is presented with a list of words, against each of which she has to assign a letter: A, B, C, D or E. All words must be rated and blanks are not acceptable. The subject has to make several lexical decisions in her assessment of her knowledge of a word. The first decision is to assess whether a word is known or not. If the word is unknown, in the sense that the subject cannot provide a meaning or translation equivalent for the item, then the subject is required to write the letter E next to the item. If a word feels familiar but no meaning can be ascribed to it, it is also to be rated E. Similarly, if the word reminds the subject of another word, or of one in her own language, but not the form presented, it is also to be rated as unknown. If the subject rates the item as E she moves on to the next item; if it is not rated E she has to make two further decisions about that test item, which can be taken in either order. The subject has to decide at what level the word is understood when met in listening or reading, by selecting the D/B column if the item is not understood well or the C/A column if the item is usually understood. She also has to decide how well she can use the item in speech or writing. This is done by selecting either the D/C row if the item cannot be used, or the B/A row if it can be used well. The final step is to arrive at a single State rating (A, B, C or D) from the combination of the two previous decisions. Thus a subject who can understand a word well in listening or reading but cannot use the word would assign a C rating to that word.
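The decision procedure just described can be summarised in a short sketch. The function below is purely illustrative (its name and the three yes/no decisions are my own shorthand, not part of the task itself); it simply maps the sequence of decisions onto a single State letter.

    def rate_word(has_meaning: bool, fully_understands: bool, can_use: bool) -> str:
        """Map the lexical decisions described above onto one SRT State letter."""
        if not has_meaning:
            return "E"                       # no meaning or translation can be given
        if fully_understands:
            return "A" if can_use else "C"   # the 'I understand this word' column
        return "B" if can_use else "D"       # the 'I think I understand' column

    # A word understood well in reading but which cannot be used is rated C.
    print(rate_word(has_meaning=True, fully_understands=True, can_use=False))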

 

The State Rating Task presented in Figure 7 has no verification test as part of the test design. The Wesche and Paribakht Knowledge Scale had specific tests selected to assess knowledge at Levels III, IV and V; however, the many difficulties with verifying this knowledge suggest that a verification procedure would be unnecessary.

 

There are several advantages to using SRTs for assessing vocabulary knowledge. Several have already been mentioned above: the ability of SRTs to look at changes over time using a single measure, the advantages of non-linear data analysis, and the very clear relationship between the knowledge sought and the knowledge provided. There are other advantages that need to be mentioned.

 

Firstly, Multi-State models would allow us to see vocabulary development from the outset from a primarily holistic viewpoint. This is because we are primarily focused on, and concerned with, the development of a set of words using a single task. If we were using more traditional multiple-choice tests, cued recall tests and so on to test vocabulary, then we would have to construct a test battery that would serve at least two functions. It must allow us to synthesise the results into a whole, and it must also allow us to accurately and reliably tap the relevant knowledge sources. Therein lies a tension. Under a Battery Test viewpoint we would be working with many different test formats, test sensitivities, and often different word sets. We would also need to concern ourselves with the possibility of between-measure learning. In the end, in order to make sense of the data, we would have to reduce the results into a synthesis. This reduction of detail would inevitably compromise the data when analysing them and identifying patterns of development. If we adopt a holistic perspective at the outset we do not need to i) identify the sub-components, ii) find an appropriately sensitive test of each component, and iii) synthesise the results into a whole.

 

Secondly, with SRTs we can try to find individual development patterns. When examining patterns of development we might find for example that different learners report varying developmental speeds and patterns between States. One person can move quickly from State A to State B while others may stay there for quite some time. Another learner may report wild fluctuations between datatimes, while yet others report slow steady progress. Thus, the more holistic nature of these models will allow us to gain insight into the way a learner is responding to the development of their lexicon or the way they respond to a particular treatment. We may find for example, that as a result of extensive reading there is a reported change in States for certain words (those found in the text as opposed to those not found in the text) from State 1 to State 2 that may not have been picked up by say, a multiple-choice test that is interested only in complete knowledge at one level.

 

Thirdly, SRTs will give us some indication of the quality of the development within a lexicon, not only the product. Many traditional vocabulary size tests produce raw estimates of vocabulary size. These tests are often used at the beginning and end of a course to measure the ‘growth’ in vocabulary. However, this looks only at the product and not at the quality of the developmental patterns involved. Let us assume we have two learners, both of whom start the course with the same vocabulary size. During the course both learners show an increase on a multiple-choice test of vocabulary size of 500 words. From this figure we can know very little about the quality of development over that period. If the learners had been tested with a SRT we may have seen a different picture of development. Learner A may have shown a significant shift from mainly State 0 (‘no knowledge’) ratings to State 2 (‘well known’) ratings while Learner B showed mainly a shift from State 1 to State 2. Both learners fared equally on the test, but Learner A had shown more development. A State model of vocabulary development can trace this development within a single test. More traditional vocabulary tests cannot do this without a battery of tests, which would lead to synthesis problems. Therefore Multi-State models can help to distinguish between learners who are reporting the same raw ‘growth’ as measured by, say, the Levels Test or the EFL Vocabulary Tests (Meara, 1992).

 

How then is development measured using the SRT?  Development can be measured by matrix analysis which allows us to compare data collected at different times, for example before and after a particular treatment. The essential difference between the SRT and the Knowledge Scales is that although the SRT may be designed to represent knowledge linearly, the data should be treated as nominal data. This means that although we can ask the learner about the degrees of how well a word is known, we cannot ascribe a score to this.  We cannot therefore add up the total ratings to a set of words and divide by the number of States to reach an average score.  We must analyse the data by comparing which words moved from State to State item by item for the two data times (t1 and t2).  For example, we may find the t1 and t2 ratings look like the data in Table 1.

 

Table 1: A hypothetical time 1 (t1) to time 2 (t2) distribution of test scores for a SRT with 165 test items.

                                      t2 ratings
                     State E   State D   State C   State B   State A

           State E      15         3         0         1         0
           State D       4        20         6         9         3
t1         State C       1         3        16         5         1
ratings    State B       0         2         4        12         6
           State A       0         1         3        15        35

In this table we can see that 15 items were rated as being in State E at t1 and stayed in the same State at t2. Three items that were in State E at t1 are now in State D at t2.  No items moved from State E to State C but one item moved to State B from State E, and no items jumped from State E to State A. And so on for the other categories.  In this way we can see the development between States over time. Depending on the treatment between the two datatimes we may see large or small changes.  Data presented in this way allows us to treat the data as nominal data which gives us the advantage of treating each State as independent.  It is worth noting in this table that typically State movements tend to occur to ‘nearby’ States (State E to D for example) rather than to ‘far’ States (State E to State A).
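A minimal sketch of how such a matrix could be built from paired ratings is given below; the pairs are hypothetical and stand in for one test item each, rated by the same learner at the two data times.

    # Build a t1-to-t2 transition matrix of the kind shown in Table 1.
    STATES = ["E", "D", "C", "B", "A"]

    def transition_matrix(pairs):
        """Count State movements; pairs holds one (t1 State, t2 State) per item."""
        matrix = {s1: {s2: 0 for s2 in STATES} for s1 in STATES}
        for s1, s2 in pairs:
            matrix[s1][s2] += 1
        return matrix

    # Hypothetical paired ratings for six items.
    pairs = [("E", "E"), ("E", "D"), ("D", "B"), ("C", "C"), ("B", "A"), ("A", "A")]
    m = transition_matrix(pairs)

    print("t1/t2  " + "  ".join(STATES))
    for s1 in STATES:
        print(s1 + "      " + "  ".join(str(m[s1][s2]) for s2 in STATES))

Because the matrix simply counts items, each cell remains nominal and no ordering of the States is imposed by the analysis itself.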

 

SRT Reliability

The State Rating Task (SRT) outlined here meets all of the criteria set out above. However, there are a few remaining unanswered questions. The most basic of these concerns whether learners are able to use this instrument and whether the data they provide us with are meaningful. The second question concerns whether the SRTs mentioned above have at least face validity for language learners. If learners feel they are not able to use these tasks then their use needs to be re-considered. I shall first turn to the second question before presenting a detailed analysis of the different forms of reliability of the SRT.

 

Thirty subjects were interviewed about their reactions to, and feelings about, the SRT. One aspect of this kind of test is the correct use and interpretation of the various States. We have to be sure, especially if the SRT is to be used for between-group or between-subject assessments, that the SRT is understood by the subjects. This was assessed by asking the subjects to reconstruct the rubric without looking at it after they had used the SRT for the first time (a practice session). A subject’s own interpretation of the SRT is likely to be shown by the way that the SRT is reconstructed. All 30 of the subjects were able to identify the 5 States and the major separation between State E and the group of other States. All subjects were able to identify which letter (A, B, C, D or E) went in which part of the rubric. All the subjects were able to identify the dual knowledge sources of Understanding and Use vocabulary. All the subjects noted that each knowledge source had two levels. All but one subject got these in the correct order in the rubric; this subject had reversed her Understanding and Use vocabulary positions in the SRT (State B was confused with State C). Very few of the subjects got the exact wording in each State, but the general flavour of the attempts of 27 of the subjects showed that they had reconstructed the SRT satisfactorily (including the subject who had reversed States B and C). Three subjects could only do part of the reconstruction. The subjects were invited to write a few lines about their opinions of the SRT; 28 of the 30 subjects did so. In the main, the comments said that the test was not difficult to do although some of the words were unknown. The unfamiliarity of the test design did not seem to affect the subjects in their responses to it.

 

These data show that the explanations of the design and the detailed focus on the various aspects of the rubric had been satisfactorily understood and had been remembered.  This means that if the subjects are sufficiently ‘trained’ and reminded of the rubric design then the SRT can be learned and we may be assured that the subjects are rating to the same basic plan.  However, this does not necessarily imply that all subjects will rate the same word into the same State, nor does it mean that each subject interprets that State in the same way.

 

The reliability of the SRT was assessed in three ways. The first (Experiment 1) was to compare responses on the SRT with actual knowledge, and the second was to ascertain the degree of concurrent validity of the SRT with several standard language proficiency tests (Experiment 2). Experiment 3 investigated whether the SRT ratings are supplied consistently rather than erratically.

 

Experiment 1

29 English as a second language learners volunteered to be part of this experiment. Three tests were used in Experiment 1. The first was a 125-item SRT, which was followed by a Translation test and a Sentence definition test. 29 subjects were given both the SRT and the Translation test. The Sentence test gives the subject another way to demonstrate their understanding of the word. The Translation test required the subjects to translate the test items used in the SRT into their own language. Explicit instructions were given for them not to guess and to include two or more equivalents if need be, with the ‘closest’ translation written first. 95% of the responses appeared against single test items. As the subjects were from several language backgrounds, native speaker markers for the tests had to be found. Each test was marked by one native speaker and checked by a second where possible. The marker was required to give a full mark to each equivalent, or a half mark if the subject’s intended meaning was considered close semantically. Only 83 (2%) of the 3625 possible translations were assessed in this way.

 

The Sentence test required the subjects to write a sentence showing their knowledge of the test item. Only 13 subjects completed the Sentence test and the SRT. The instructions explicitly asked the subject to use the word in such a way as to demonstrate their understanding of it. Subjects were told that they would be marked on their ability to demonstrate their understanding of the word, not necessarily on whether the word could be used. Full points were awarded for a sentence that demonstrated their understanding even if grammatical errors were present, as this was not a test of grammatical accuracy but one of showing that the meaning was known. Inflectional errors were ignored. Derivational errors such as helpful instead of helping were given half marks. Half points were awarded in some cases because the SRT has partial knowledge as a component within it (the I think I understand this word column). The SRT was presented first. Before the test was presented, two practice tests were given to sensitise the subjects to the test format. At this time the test was explained in detail and several examples of how to respond were given. The Translation test and the Sentence test were given after the SRT.

 

The data were analysed by selecting three groupings of State ratings for comparison with the other tests. This was done because we could not identify a single State that would correspond to the test against which it would be correlated. The first grouping was the known / unknown dichotomy, which involved the sum of the four ‘knowing’ States A, B, C and D. The second was the set of three States that had at least one element of know how to use or understand this word in their rating, that is, the sum of States A, B and C. This group can be compared with the State D ratings, which contain no ‘higher’ level of knowledge as they contain only think I understand and don’t know how to use and are thus at a ‘lower’ level of knowledge. The third was the State A only ratings. The correlations with the State A only ratings should be lower than the grouped correlations with the other tests because there is no partial knowledge component, whereas most of the other tests have a partial knowledge component in their marking.

 

Correlation coefficients (Table 2) were calculated for the SRT and the other tests by counting the number of test items rated in sets of States and correlating them against the number of correct responses in the other tests. The correlations between the Translation test and the Sentence test with the combined total for States A, B and C were .87 and .93.  The correlations between the Translation test and the Sentence test with the combined total for States A, B, C and D were .87 and .97. The correlations were lower for State A only ratings at .61 and .87 as one would expect.
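As a minimal sketch of this procedure (the per-subject data below are hypothetical and SciPy is assumed to be available), the grouped-State counts can be correlated with objective test scores as follows.

    from scipy.stats import pearsonr

    # Hypothetical per-subject data: SRT State ratings and translation-test scores.
    srt_ratings = {
        "s1": ["A", "A", "B", "D", "E"],
        "s2": ["C", "D", "E", "E", "E"],
        "s3": ["A", "B", "B", "C", "D"],
        "s4": ["E", "E", "D", "B", "A"],
    }
    translation_scores = {"s1": 4, "s2": 1, "s3": 4, "s4": 2}

    subjects = sorted(srt_ratings)
    for label, group in [("A+B+C+D", "ABCD"), ("A+B+C", "ABC"), ("A only", "A")]:
        counts = [sum(r in group for r in srt_ratings[s]) for s in subjects]
        scores = [translation_scores[s] for s in subjects]
        r, p = pearsonr(counts, scores)
        print(f"States {label}: r = {r:.2f}, p = {p:.3f}")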

 

Table 2: The correlations between the Translation and Sentence tests with three sets of SRT ratings.

 

                             SRT Combined totals        SRT Combined totals        SRT total for
                             for States A, B, C and D   for States A, B and C      State A only

Translation Test (n = 25)    0.87                       0.87                       0.61
                             p < .001                   p < .001                   p < .001

Sentence Test (n = 11)       0.93                       0.92                       0.87
                             p < .001                   p < .001                   p < .001

 

The item-level data show two kinds of responses. The first are the words said to be known that were known (rated A, B, C or D with a correct response on the test) and the words said not to be known that were not known (rated E with an incorrect response on the test). The second are the words said to be known but which were not known (rated A, B, C or D, but incorrect on the test) and the words said not to be known that were in fact known (rated E, but a correct answer was given). 86% of the items were reported accurately on the Translation test and 14% were not, while 84% of the items were reported accurately on the Sentence test. A discussion of the results will be left until later.
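A minimal sketch of this item-level agreement calculation, using hypothetical ratings and responses rather than the experimental data, is as follows.

    def self_report_accuracy(ratings, correct):
        """Proportion of items where a 'known' rating (A-D) coincides with a
        correct test response, or an E rating with an incorrect response."""
        agree = sum((state in "ABCD") == ok for state, ok in zip(ratings, correct))
        return agree / len(ratings)

    # Five hypothetical items: SRT ratings and whether the translation was correct.
    ratings = ["A", "D", "E", "B", "E"]
    correct = [True, True, False, False, True]
    print(f"{self_report_accuracy(ratings, correct):.0%} reported accurately")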

 

Experiment 2

Four tests were used in this experiment. Two of the tests were the receptive and productive versions of the Levels Tests (Nation, 1990; Laufer and Nation, 1995). These tests were used as reliability measures for two reasons. Firstly, these widely used test items have been shown to have high levels of scalability (Read, 1988). Secondly, the tests range in difficulty from common words to rare words, which allows the full vocabulary ability of the subjects to be assessed. The receptive test is a matching multiple-choice test of 90 items. The productive Levels Test is a cued recall test in which the subject has to complete a word in a sentence, as in He’s not married, he’s a bac______. The third test was the Nelson Quickcheck Test (Fowler and Coe, 1987), a widely used general language multiple-choice proficiency test used for placement and diagnostic testing of second language learners. Its focus is mainly on grammatical competence. There are four versions of the test, each with four levels of increasing difficulty and 25 questions per level. The scalability of the test was high (over 80% on all criteria). All of the 28 subjects who took the Nelson Quickcheck tests and the SRT completed the tests. The fourth test was the same SRT that was used in Experiment 1.

 

The results for the three tests with the three sets of SRT ratings are shown in Table 3. The same sets of States that were used in Experiment 1 were also used here.

 

Table 3: The correlations of the SRT with three other tests. Number of cases in parentheses.

                                              SRT combined total     SRT combined total     SRT total
                                              States A, B, C, D      States A, B, C         State A only
Vocabulary Levels Test Receptive (n = 24)     0.84  (p < .001)       0.89  (p < .001)       0.61  (p = .002)
Vocabulary Levels Test Productive (n = 20)    0.62  (p = .004)       0.68  (p < .001)       0.49  (p = .029)
Nelson Quickcheck Test (n = 24)               0.76  (p < .001)       0.75  (p < .001)       0.68  (p < .001)

 

 

All the correlations were significant. The correlations between the combined State ratings (States A, B, C and D, or A, B and C) and the Translation and Sentence tests were between 0.87 and 0.93, and those with the Vocabulary Levels Test (Receptive) were 0.84 and 0.89. The Vocabulary Levels Test (Productive), whether marked strictly or leniently, produced a respectable 0.61 to 0.68, as did the Nelson Test at 0.75 and 0.76. The correlations are lower for the State A only ratings, as Tables 2 and 3 show, but all are still significant.

 

Table 4: Correlation of the tests used for assessing the reliability of the SRT.

 

                              receptive Vocabulary     productive Vocabulary
                              Levels Test              Levels Test              Translation Test       Sentence Test
productive Vocabulary
Levels Test                   .83  p = .001  (18)              ---
Translation Test              .89  p = .001  (23)      .69  p = .001  (19)              ---
Sentence Test                 .86  p = .001  (11)      .68  p = .044  (9)       .94  p = .001  (11)           ---
Nelson Quickcheck Test        .67  p = .001  (24)      .41  p = .087* (18)      .69  p = .001  (23)    .76  p = .007  (11)

*  Not significant

 

Table 4 presents the inter-test correlations for all the concurrent validity tests in Experiments 1 and 2. The five reliability tests were correlated against each other to confirm that they were themselves sound measures against which the SRT could be assessed. Not all the subjects took all the tests; the number of subjects used in each correlation is given in parentheses. All correlations except that between the Nelson Quickcheck test and the Productive Levels Test were significant and showed high levels of inter-test agreement. The correlation between the Sentence and Translation tests was 0.94. A discussion of this will be left until later.
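
A correlation matrix of this kind, with each pair of tests based on a different number of subjects, can in principle be produced with pairwise deletion, as sketched below. The test names, scores and missing-data pattern are invented; the point is only the pairwise-deletion logic.

```python
# A sketch of an inter-test correlation matrix with pairwise deletion:
# each pair of tests uses only the subjects who took both.  The scores
# and missing-data pattern are invented.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

scores = pd.DataFrame({
    "VLT_receptive":  [55, 60, np.nan, 72, 48, 66],
    "VLT_productive": [30, np.nan, 25, 41, 22, 38],
    "Translation":    [18, 20, 14, np.nan, 12, 22],
    "Sentence":       [7, 9, 6, 10, np.nan, 9],
})

tests = list(scores.columns)
for i, a in enumerate(tests):
    for b in tests[i + 1:]:
        pair = scores[[a, b]].dropna()        # keep subjects who took both tests
        r, p = pearsonr(pair[a], pair[b])
        print(f"{a} vs {b}: r = {r:.2f}, p = {p:.3f}, n = {len(pair)}")
```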

 

Experiment 3

The intention of this experiment was to ascertain the degree of test / re-test reliability of the SRT ratings, that is, to check that subjects would rate their knowledge of the same set of words in the same State at time 1 and time 2, tested three days apart (no change in knowledge should be evident over such a short interval). 15 subjects were given the same 148 item test twice and asked to rate their knowledge using the SRT. Different words from those in Experiments 1 and 2 were used, to check that the very high reliability figures for SRT rating were not simply a function of that word set. Three of the 1924 items from the 15 subjects were not rated. Table 5 is read in the following way: at t1 an average of 74.4 of the 148 items were rated as being in State A and were also rated that way at t2; 2.1 of the items in State A at time 1 moved to State B at time 2, and so on. If we group the items rated either in the same State or in 'near States' (those next to the diagonal, reflecting relatively small changes), we find that 81.1% of the items were rated in this way.
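
The construction of a cross-tabulation like Table 5, and the 'same or near State' figure, can be sketched as follows. The two sets of ratings are invented; in the actual study each subject contributed 148 item ratings at each time.

```python
# A sketch of the test / re-test cross-tabulation and the 'same or near
# State' percentage.  The per-item ratings here are invented.
import pandas as pd

states = ["A", "B", "C", "D", "E"]
t1 = ["A", "A", "B", "C", "E", "E", "D", "A", "B", "E"]   # ratings at time 1
t2 = ["A", "B", "A", "C", "E", "D", "D", "A", "E", "E"]   # ratings at time 2

table = pd.crosstab(pd.Categorical(t1, states), pd.Categorical(t2, states),
                    rownames=["Time 1"], colnames=["Time 2"], dropna=False)
print(table)

# Items rated in the same State, or in an adjacent ('near') State, both times.
position = {s: i for i, s in enumerate(states)}
near = sum(abs(position[a] - position[b]) <= 1 for a, b in zip(t1, t2))
print(f"Same or near State: {near / len(t1):.1%}")
```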

 

Table 5: The mean responses by State groupings for the 14 subjects.

                                        Time 2
                    A        B        C        D        E      Total
          A        74.4      2.1      1.7      1.4      1.9     81.5
          B         6.2      3.3      0.7      1.1      0.7     12.0
Time 1    C         2.6      0.6      1.1      0.9      0.3      5.4
          D         1.4      1.6      0.5      4.6      3.8     11.9
          E         3.4      1.6      0.5      3.1     28.5     37.1
          Total    88.1      9.1      4.5     11.0     35.1    147.8

 

Table 6 shows the movement in and out of grouped States. Here we need to look at the percentage of words rated as known (operationalized as items not rated in State E) that stayed within that group at the two test times, and the percentage of items rated as not known that remained not known.

 

Table 6: The average percentage of responses for the 14 subjects on the two SRTs.

                                            Time 2
                            State A     States B, C or D     State E
          State A             50.4             3.5              1.3
Time 1    States B, C or D     6.9             9.6              3.2
          State E              2.3             3.5             19.3

 

Table 6 shows that consistent A, B, C or D ratings make up 70.4% of all ratings (the sum of the four top left cells) and consistent State E ratings (the bottom right cell) make up a further 19.3%. This gives a total of 89.7% of the items that were consistently rated within their State group, leaving 10.3% that moved between groups (known to not known, or not known to known). This will be discussed further below.
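
The known / not-known consistency figure is simply the proportion of items that stay on the same side of the State E boundary at the two times, as the following sketch (with invented ratings) shows.

```python
# A sketch of the known / not-known consistency calculation: collapse the
# States into 'known' (A-D) and 'not known' (E) at each time and count the
# items that stay in the same group.  The ratings are invented.
t1 = ["A", "A", "B", "C", "E", "E", "D", "A", "B", "E"]
t2 = ["A", "B", "A", "C", "E", "D", "D", "A", "E", "E"]

def known(state):
    # States A-D all involve some claim to knowledge; E is 'don't know'.
    return state != "E"

consistent = sum(known(a) == known(b) for a, b in zip(t1, t2))
print(f"Consistent known / not-known ratings: {consistent / len(t1):.1%}")
```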

 

Discussion of the three experiments

The overall reliability results are extremely encouraging. Of particular interest are the very high correlations between the SRT ratings and their verification with the Translation and Sentence tests, at 0.87 and 0.93 in Experiment 1. This shows that when the subjects rate a word as known the word is in fact known, and when it is rated as not known it is indeed not known. The correlations with the general proficiency tests in Experiment 2 are again remarkably high, indicating that subjects who rate their knowledge of a set of words highly also have high levels of language ability, while those who rate their knowledge as low do not perform well on these tests. This suggests that the SRT can be used with subjects of varying proficiency, or with subjects who rate their knowledge differently. The correlations of the State A only scores with the other tests in Experiments 1 and 2 were lower than those of the grouped ratings, which confirms that partial knowledge needs to be taken into account when deciding which State groupings to correlate with which tests.

 

These correlations show that the reliability tests inter-correlate very well, as well as correlating well with the SRT. This indicates that the tests themselves were reliable, and thus that comparing them against the SRT was justified. The only pairing that did not correlate significantly was the Nelson Quickcheck test with the Vocabulary Levels Test (productive), which is not surprising as the Nelson test is basically a grammar test rather than a vocabulary test like the others. All other tests correlated at a minimum of 0.67 and some as high as 0.94. Combined with the extremely high SRT correlations, this suggests that any of these tests might function equally well as a measure for the verification of knowledge rated with SRTs. The test / retest results in Experiment 3 show that subjects report their knowledge consistently: almost 90% of the words received the corresponding known / not known rating on the second test.

 

Most of the responses to the SRT are systematic and principled, but we do see individual variation. The individual data show why the counts at BA (B at t1 and A at t2), CA and ED are higher than might be expected from the constant State ratings (AA, BB etc.). There are considerable jumps to these States by particular individuals (13 items from B to A by S19, 27 from B to A by S22, 23 from C to A by S4 and 13 from D to E by S20). The individual data also show that a jump of 11 E to D ratings for S8 and 8 E to D ratings for S17 accounts for most of the flux (the 1.3% reported in Table 7) in ratings for the group. What seems to be happening is that there are individual response patterns to the State ratings. For example S22, the second highest State A rater of all the subjects, rated 91 items in State A at t1 but 118 at t2, a difference almost entirely accounted for by a B to A change. Similarly, S4 reported that 23 items changed from C to A while 17 items moved from other States into State C, and only 58.8% of his ratings were within the same State, by far the lowest for all subjects. However, he still maintained a 95.3% known / not known consistency in his ratings; his variability was within the group of 'known' States. These tendencies are unusual in the overall pattern of responses for the group as a whole. Most subjects reported no large jumps at all.

 

Conclusion

The aim of this paper has been to find a plausible alternative to the use of standard objective vocabulary tests for the assessment of receptive and productive vocabulary and to assess the tool’s reliability. The three experiments have shown that the SRT can gather reliable and principled self-reports of word knowledge, which may provide us with some tantalising glimpses into the developmental patterns underlying Understanding and Use vocabulary.

 

The strengths of the SRT are numerous. It is easy to use, it allows subjects to rate items quickly, it allows them to demonstrate their knowledge reliably, and it provides information about the interaction of Understanding and Use vocabulary. None of these features is revolutionary, but taken together they give us a tool which allows us to ask deeper questions about vocabulary acquisition than we have been able to ask to date. However, the SRT is still in need of refinement. My hope is that other researchers will take up the SRT to investigate Understanding and Use vocabulary, and particularly its longitudinal development (Singleton, 1999).

 

References