Review of Diack, Hunter. 1975. Test your own Wordpower. Paladin.

Test your own Wordpower. This sort of book reminds me somewhat of the DeBono IQ tests I used to take to amuse myself in the 1980s. I vividly recall taking the IQ tests and seeing my score gradually increase as I went through the book. To my delight at the time, by the end of the book I ended up a genius. So I tried the Mensa books and was somewhat deflated to find that I was no longer so clever. Curiosity and the challenge were the things that attracted me to books of this type. If I had seen Diack's book at that time no doubt I would have bought it and even tried the tests. Looking at this book now, I am convinced that I would have approached these tests with some awe. There are lots of words in this book I don't know, have never seen and will never see again. Some of them don't look like English words even now.

Test your own Wordpower. It is still an alluring title even now, but for very different reasons. Years of looking into vocabulary tests, and the knowledge that my IQ can apparently increase with test-taking ability, have taught me to be somewhat sceptical about vocabulary tests - especially those based on quasi-science. If a book of this type is to be of any practical use to anyone then it must have a certain face validity - it must seem to test that which it is purported to test. It must also have a degree of content validity - the test must actually be testing what it says it does and be based on sound principles.

The book is effectively divided into two parts. One part contains the instructions for the Literacy Levels Test (for adult native speakers of English) and then presents 50 tests to use for estimating one's vocabulary. The second is a long discussion of the nature of vocabulary and concerns the counting of words; the uses and abuses and the joys of dictionaries; the reasons one might want to extend one's vocabulary; vocabulary and intelligence; general and technical vocabulary and so on. This section also explores the vocabulary profile of several British newspapers by comparing their vocabulary to that in the Literacy Levels Test.

The test design
Each test is a simple recognition test of words loosely presented in order from the most common to the least common, based on levels of development (more on this later). The subject has to look at the list of words in sequence and decide if she can provide 'at least one acceptable meaning' for each word. If so, the subject marks the item as correct and carries on through until 10 items have been found that are not known. At the end of the test the subject must look back at the words and check that the markings are correct, because the last 5 items marked as correct must be defined. This checking procedure is meant to ensure that the subject does in fact know the marked items. After definitions (or sketches, synonyms etc.) have been given, the person must then check these last 5 items in a dictionary to ensure that at least one meaning for the word was in fact known. The number of items marked as known would then represent the person's score. If some of those last 5 words are wrong then it is suggested that one should simply take those items off the test score. In order to calculate one's vocabulary size, the person then multiplies the test score (or the average of several tests) by 600 to reach a number that reflects how many words are known.
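The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the procedure as the review reports it, not code from the book; the function names are hypothetical, and the only figures taken from the source are the multiplier of 600 and the check of the last 5 items.

```python
WORDS_PER_ITEM = 600   # multiplier reported in the review
ITEMS_CHECKED = 5      # the last 5 items marked as known are verified


def adjusted_score(items_marked_known, failed_checks):
    """Raw test score after the dictionary check.

    failed_checks is how many of the last 5 checked items turned out
    to be wrong; those are simply taken off the score.
    (Hypothetical helper, named here for illustration only.)
    """
    if not 0 <= failed_checks <= ITEMS_CHECKED:
        raise ValueError("failed_checks must be between 0 and 5")
    return items_marked_known - failed_checks


def estimate_vocabulary(test_scores, words_per_item=WORDS_PER_ITEM):
    """Average one or more test scores and scale to a word count."""
    if not test_scores:
        raise ValueError("at least one test score is required")
    mean_score = sum(test_scores) / len(test_scores)
    return mean_score * words_per_item
```

For instance, a subject who marks 40 items as known and finds that 2 of the last 5 fail the dictionary check would have an adjusted score of 38, which the multiplier turns into an estimate of 22,800 words.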

Goulden, Nation and Read (1990) used this technique as the basis for their tests (as did Hosking, 1987 and English, 1985), but changed the testing procedure slightly. Their more satisfactory method was to require subjects to go back through the tests from the end, checking in the dictionary all the items in reverse order until 5 correct items in a series have been found.

Test construction
Several comments need to be made about the construction, marking and validity of the test. Central to the notion of marking the words one knows is the assumption that the words are presented in rank order. This is because one must check in the dictionary the last 5 words that were marked. The test items are culled from the Concise Oxford, Everyman's English Dictionary, Chambers' Twentieth Century, Webster's Third and Roget's Thesaurus, among others. Throughout this book few of the scientific methods used in compiling the tests are referred to. There is mention of some dictionary sampling procedures but there is little on which words were selected and why. We are given insights into the sources of some of the testing procedures and the compilation of word lists by references to the work of Thorndike and Lorge (1944) and Seashore and Eckerson (1941) and the comments on it by Lorge and Chall (1963). Although this work was not cited directly, it is unmistakable.

The notion of level
Diack notes (p14) in a 'flash of the obvious' that "1) everyone acquires his vocabulary in a particular order and one word after another" and "2) though the order is different from one person to another, there is a considerable amount of overlap, e.g. everyone brought up in English-speaking countries learns the word breakfast before he learns the word seriation". This notion of the serial nature of word learning is used as the basis of the 6 levels. Each level is said to measure 'levels of development' (p15) from children to highly knowledgeable adults, not frequency of occurrence in English. Diack states that frequency of occurrence in text is not the same as knowledge.

Because vocabulary is serial in nature, according to the author (see below), this means the test is attempting to test levels of 6000 words, with each level corresponding to a level of development from children's vocabularies to adult vocabularies, to superior adult vocabularies. The words are set out in the test into 6 levels, each reflecting a band of 6000 words, and thus the test is said to measure up to 36,000 words. Diack suggests that most adults over 20 will score between 30 and 40 and that number will reflect one's general vocabulary. A score below 18,000 would be that of school children and some adults who have not had the benefit of much higher education. A score of 18,000 to 24,000 would be the average for those who are well-educated and have 'lively minds'. Those above 30,000 would be rare beasts indeed.

Unfortunately, it is not clear how the items in each of the 6 bands for the test were identified. The closest we get for the first level is the notion (uncited) that there are 12,000 concepts in English which involve the 'lowest levels of abstraction' and it is these that the children may know as they are 'incapable of taking part in this kind of thinking we call abstract'. I would have preferred the list to be prepared based on actual knowledge of these words. Beyond the earliest levels we have little idea how words were allocated to the Levels. In fact it seems no lists were made and intuition was the main strategy in allocating words to levels. This is hinted at on page 31, where in the analysis of the British newspapers 100 items from each paper were culled which in '[the author's] judgement came highest in the Literacy Levels Test'.

The notion that we learn words in a serial manner, 'one word after another', strikes me as being rather simplistic. There are at least two sides to the serial nature of vocabulary growth. On the one hand there are words which are learned before others (breakfast before seriation were the examples presented) and we tend to learn prototypes before non-prototypes. This is no doubt a result of exposure, need and development as Diack suggests. This is at the word by word level. The second part of the serial equation has to do with the incremental learning of individual words, which unfortunately is an issue not addressed here. If we learn words 'one after another' it negates the whole idea of partial knowledge, as it assumes we must finish the learning of one word before starting to work with the next.

The notion of partial knowledge says that we can work with learning several (hundreds or even thousands) of words at one time, all with varying degrees of knowledge. Some words are well-known whilst others are only a flicker in the lexicon. Anyone who takes these tests will readily see that full control is not always available. If we had only no control or full control, then the tip-of-the-tongue phenomenon would not occur, as we would either know a word or we would not. Several times I found myself trying to find a meaning for a form I was familiar with. Even when I did find a meaning it was not always complete. For example in test 14, pastiche is a word I knew something about at one time, and I know it has something to do with music or drama, but as I am not a musician I have not learned this word. I marked it as unknown. I looked it up in the dictionary and found I was right that it is a word used in music (a dramatic, literary, or musical piece openly imitating the previous works of other artists, often with satirical intent). How should I have scored this partial knowledge? Unfortunately there are no guidelines.

Reliability of these tests
As intuition was used in compiling these tests we should check the reliability of these intuitions. We should find at least three things. Firstly, that no matter which test one takes, one should get approximately the same score. This would show that the distribution of items by level between tests was in fact fairly reliable. Secondly, as it is said that we learn words in order, we should find that test takers in general progress through each test getting the same items correct, in order, as other test takers. The difference between test takers should be evident in the level at which they cease to know words. Thirdly, it is suggested that the more widely-read someone is, the better they will do on the test. This implies there is a relationship between the range of texts in which the word appears and how well read a person is. That is, if a given word appears in only a few texts, then a well-read reader who will have covered a wide range of texts will have met it and may have learned it, whereas less well-read readers would not. Therefore an index by which this test may be assessed is the relationship between the item's position on the test and the range of texts in which it appears. We should find that items early in the list will appear in a wide range of texts and those later in the list in very few texts. I set out to test these three notions.

I took 10 of these tests. I scored 35, 38, 43, 47, 37, 35, 39, 40, 43 and 39. The average score was 39.6 (s.d. 3.81 and range 35 to 47). This shows me to have an average vocabulary of about 23,760 words (plus or minus 3.81 x 600 = 2,286 words), so my vocabulary is between 21,474 and 26,046 words. This means I am anywhere between fairly average for a university graduate and 'among the most widely read in the country'. I would not rate myself in the latter category. This to me is rather a wide range of assessment, and if this test had been used as part of the assessment for a job and I was being rated against my peers, success would have depended on which test I had taken. These tests are to be praised in a sense because the spread of scores was not much wider.
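The figures above can be checked with a short Python sketch using only the standard library. The ten scores and the multiplier of 600 come from the review; the function name is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, chosen because it reproduces the reported s.d. of 3.81 to two decimal places.

```python
from statistics import mean, stdev

# The ten scores reported in the review.
SCORES = [35, 38, 43, 47, 37, 35, 39, 40, 43, 39]


def vocabulary_band(scores, words_per_item=600):
    """Return (estimate, low, high) in words.

    low and high are one sample standard deviation either side of
    the mean score, scaled by the book's multiplier.
    (Hypothetical helper, named here for illustration only.)
    """
    m = mean(scores)
    sd = stdev(scores)  # assumption: sample s.d., n - 1 denominator
    return (m * words_per_item,
            (m - sd) * words_per_item,
            (m + sd) * words_per_item)


estimate, low, high = vocabulary_band(SCORES)
```

Because the review rounds the standard deviation to 3.81 before multiplying, the band computed here differs from the printed 21,474 to 26,046 by a couple of words at each end.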

To test the second notion I asked 3 people to take the same 3 tests and measured which words they all marked as correct. The subjects were almost unanimous up to about word 30 on all 3 tests, but thereafter results varied tremendously. Again this test is to be praised for finding a cut-off point which these native speakers all attained.

The third notion was tested using the 850,000 token British National Corpus word list to check for the range of texts in which each item appeared. This word list shows the actual frequency of occurrence of items in the 100 million word corpus and the number of texts (of 4214 texts) in which the word appeared from a range of texts, many general and many specialised. Following the logic presented here, the words early on the test should appear in more texts than words later on the test. Three tests were randomly selected for analysis, which were tests number 14, 38 and 22. Each of the words in the 3 tests was ranked by its position in the test and by the number of texts in which it occurs in the 100 million word corpus. We know a rank order has been assigned to the assessment as the subject is required to stop once 10 unknown items have been met. A Spearman rank correlation test was performed on the 3 tests. The results show that the correlation between actual rank and rank determined by the number of texts in which a given word appears in the BNC was r=0.83 for test 22, r=0.82 for test 38 and r=0.77 for test 14 (all significant to p < .001). The inter-test correlations average 0.75. These are remarkable figures for a test made from intuition. However, these correlations would need to be much nearer to 0.95 or above to have any kudos in the world of academic research. Nevertheless the results are impressive.
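For readers unfamiliar with the statistic, a Spearman rank correlation can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the procedure or software actually used for this review: the function names are hypothetical, tied values are given the average of the ranks they span, and the correlation is computed as Pearson's coefficient on the ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data, NOT taken from the BNC or from the book:
# item position on a test, and number of texts containing the item.
positions = [1, 2, 3, 4, 5, 6]
text_counts = [4000, 3500, 3600, 900, 40, 2]
# Negating the counts ranks the most widely used word first, so a
# positive rho means early items appear in more texts.
rho = spearman_rho(positions, [-c for c in text_counts])
```

The sign convention matters: if the text counts were ranked in ascending order instead, a test that behaved as Diack intends would produce a strongly negative coefficient of the same size.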

However, in reality the test is not really a 60 item test at all. It is presumed that most people can get at least all of level 2 and most of level 3 correct and almost none of levels 5 and 6. This leaves only the 20 words at levels 3 and 4 to distinguish between the vocabularies of most adults, which makes it in effect a 20 item test. As a single item represents 600 words at each of these 2 levels, the words at these levels should be extra carefully chosen. It is in essence this most critical part of the test that determines the final vocabulary score of a subject. A second set of correlations was performed on only the words at levels 3 and 4 to see if they could reliably distinguish between people's vocabularies. The position of the items in the test and the number of texts in which they occur were re-ranked for the 20 items at levels 3 and 4. The correlations are not significant for any of the three tests and all are negative. For test 22 the correlation is -0.34 (p < .141), for test 38 it is -0.28 (p < .235) and for test 14 it is -0.28 (p < .239). Even the correlations between the three tests only averaged 0.20 and none is significant. The results are less impressive now. The critical section of the Literacy Levels Test does not reliably distinguish between those who have read many texts and those who have not, according to the BNC word list.

The Literacy Levels Test is good at getting ballpark figures, but not so good at getting an accurate picture. We have seen from studies of the subjective assessment of the difficulty and frequency of words (Ringeling, 1977; Arnaud, 1989) that subjects are very reliable in distinguishing high-frequency words from low-frequency ones, but less so on the words in the middle. The Literacy Levels Test to its credit has been able to find words that distinguish between these two extremes, but has not been able to find a reliable middle ground.

The social importance of such tests
As so much credence is put, and has historically been put, on vocabulary size as a measure of intelligence (now they are tests of 'reasoning' or 'literacy'), such a test must responsibly measure what it sets out to measure. As the test is in the public domain anyone could easily have bought it at the local bookstore and it could be used by anyone who cares to use it. The uses of the test could range from a rather naive instrument more for amusement than anything else, as I suggest is the intent here, to a test that can be used with malicious intent. Vocabulary size tests, when foisted onto people, are often not welcome and can be considered very threatening. They leave one open to ridicule and embarrassment and can damage one's self-esteem. In extreme cases such a test can be seen to 'prove' that one does not need to go to university to be intelligent (or the opposite) if a well-read non-graduate gets a higher score than a graduate. This kind of potential for abuse and intellectual comparison is unavoidable. Indeed it is hinted that these tests have been used in this way, by giving several anecdotes showing how well-read people who did not go to university can score higher on these tests than graduates. It is pointed out that graduates who scored lower than non-graduates usually do not report their results so openly. The potential for abuse will always remain with vocabulary size tests, and that can never be stopped. What we linguists can do is to ensure that any tests put in the public domain are at least reliable, so that if abuse were to occur, at least the results would not be in dispute. As shown above, with these tests it is probable that one's vocabulary size varies from test to test, but it is highly unlikely that anyone will do more than 10 of the tests at most, and many people only 2 or 3. Once one has a rough score there would be no need to go on. What then is the purpose of having 50 tests rather than, say, 5?

Hard work unrecognized
A level of validity could have been shown for the Literacy Levels Test if there had been a presentation of a few anecdotes of when these levels seemed to work and were able to distinguish between various individuals ranging from the less well-read to the more well-read. Unfortunately, the Literacy Levels Test is left for us to use with little regard for the consequences. This is a shame because the book demonstrates quite clearly that a lot of thought has gone into the preparation of these tests. If it had been possible to take the tests a step further and there had been some test piloting, maybe finding funding and some good marketing, the Literacy Levels Test could still be in general circulation both as a test for the lay person and for researchers looking for a simple measure of concurrent validity. I checked all the major bibliographies for references to this work up to the present day (several million references) in the academic literature and found none. Not one. The book seems to have been forgotten, if it ever was noticed by the academic community at all. This is a crying shame. There is no review of it even in the encyclopaedic Educational Measurements bibliographies (Buros, 1937-1994). Several references were made, however, to work done with the Standard Reading Tests that he had compiled with J. C. Daniels (1970) some 5 years prior to the publication of this book.

The definition of a word
One disappointing, though understandable, aspect of the book is that the sticky business of defining a word is not addressed. The only insights we get come from the choice of test items. We see only uninflected base verbs, no plurals, and usually the least derived form of words with its own sense (for example justify is given, not justification, justifiable or even just).

General or technical vocabulary
There is a hint of inconsistency in deciding what the tests are actually testing. On page 14 it is said that each test is a '60 word sample from the total vocabulary of English'. However, by page 27 this becomes "the tests in this book measure general vocabulary - that is to say, the number of words you finally arrive at will not give you credit for knowing highly technical terms or even dialect words" (italics his). We would therefore assume that the words should generally be readily available to the public and not restricted to obscure and very restricted domains. When we look at some of the words on these tests we see a whole range of words, many of which look technical or specialised to me (however one defines technical vocabulary). Examples from a single randomly selected test (31) illustrate this. It has the following words: meninx (a membrane, especially one of the three membranes enclosing the brain and spinal cord in vertebrates); trapezium (a quadrilateral having no parallel sides); reticulum (the second compartment of the stomach of ruminant mammals); haustellum (a portion of the proboscis that is adapted as a sucking organ in many insects) and noctule (a large, reddish-brown insectivorous bat of the genus Nyctalus). It is clear to me at least that these words are not general but highly specialised.

This makes me wonder if the top end of the tests is actually a bit of overkill. On page 14 it is said that no one knows all the words in a modest 40,000 word dictionary, but many (10-15%) of the test items do not even appear in the BNC 850,000 token word list. I cannot imagine many children under 16 would take this test (if they did, floor effects would be evident) and thus the first two levels serve no real assessment purpose, and neither do the last two. Would it not be simpler and more reliable to have a 60 word test for only levels 3 and 4, or even 2, 3, 4 and 5, as these are the ones distinguishing between adult vocabularies?

Range
In the introduction Thorndike's list was mentioned to be of some "slight use". It is unfortunate that in the compilation of these tests little if any reference was made to the statistical nature of the range of texts words appear in. The American Word Frequency List for example (Carroll et al, 1971) contains range data next to each word. As many of the items on these tests are specialised it would have helped considerably if a word list that showed the range of texts in which these words appear had been consulted. It is my contention that a statistically derived range figure for each word would have been a more reliable guide to level than intuition.

Polysemy
Another problem with the test is that occasionally polysemes are evident. This is a problem because we cannot be sure that the test taker is aware of the sense which the test is attempting to test. And as various senses do not all have the same frequency, nor appear in an identical range of texts, we can get over- or under-estimates. In one test (38) selected at random we find backing, rook, harrow (possibly a town), shrink (a psychologist?), and several other words which may have different senses such as potential (my dictionary lists 5 senses). This does not add to the reliability of the test.

A strategy component
One notion left unmentioned in the introduction is the role of strategy in taking such a test. As the test was intended for the general lay population, we can assume that the vast majority of people are not test-taking experts aware of test-taking strategies and so on. Granted, the test was intended as a within-subject measure of vocabulary size, but as mentioned above people do compare scores. If two vastly different subjects with the same vocabulary size take the same test the results may be different. One person may be a risk taker whose standard for the correctness of a definition is 'near enough is good enough'. The other is more conservative and accepts only a strict definition of what it means to 'be able to give a definition of a word'. The vocabulary knowledge is the same but the scores are different. One way round this (albeit more threatening) is to take the test with another person and decide together if the definition is acceptable.

Summary
The Literacy Levels Tests are not that bad after all. By no means are they wonderful, but they do stand up to a limited degree of scientific investigation. This is not to say that the tests should be used indiscriminately, but they can be used with care so long as there is no serious intent behind them. One of the most remarkable and uplifting things about this book is that an amateur can make something as complex as an adult vocabulary size test and do it so well - and without recourse to electronic texts, word lists and corpora. The story of this book is unfulfilled promise. The Literacy Levels Tests are not bad, indeed they are quite good, but if only those extra few steps had been taken they could have been very good and we'd all be using them today.

References
Arnaud, P. 1989. Estimations subjectives des frequences des mots. Cahiers de Lexicologie. 54 (1): 69-81.
Buros, Oscar K. (Ed.) 1937-1994. The Mental Measurements Yearbook. New Brunswick: Rutgers University Press.
Carroll, J. B., Davies, P. and Richman, B. 1971. The American Heritage Word Frequency Book. Houghton Mifflin, Boston; American Heritage, New York.
Diack, H. and J.C. Daniels. 1970. Standard Reading Tests.
English, Fiona. 1985. Measuring Vocabulary in non-native speakers of English. MA Thesis, Birkbeck College, London.
Goulden, R., I. S. P. Nation, and J. Read. 1990. How large can a receptive vocabulary be? Applied Linguistics. 11, 4, 341-363.
Hosking, Patricia. 1987. Estimating vocabulary size in non-native speakers of English. MA Thesis, Birkbeck College, London.
Lorge, I. and Chall, J. 1963. Estimating the size of vocabularies of children and adults: an analysis of methodological issues. Journal of Experimental Education. 32, 2, 147-157.
Ringeling, T. 1984. Subjective Estimations as a Useful Alternative to Word Frequency Counts. Interlanguage Studies Bulletin - Utrecht. 8(1) 59-69.
Thorndike, E. L. and Lorge, I. 1944. The Teacher's Word Book of 30,000 Words. Teachers College, Columbia University.