Textbook Site for:
Assessment In Special and Inclusive Education, 9/e
John Salvia, The Pennsylvania State University
James E. Ysseldyke, University of Minnesota
How to Review a Test

We have often been asked how we go about analyzing and reviewing tests—both for this book and in general. So we have decided to include a how-to section. Before starting an analysis of a test, you must first assemble the materials. We find that it is best to order a specimen kit and any supplementary manuals available. Be prepared to experience difficulty obtaining material from some test publishers. When you request a specimen kit and supplementary materials, you will occasionally receive all materials. More often, when you review specimen sets, you’ll learn that additional materials must be ordered separately. Sometimes, it takes a very long time to figure out just what is published where. It may take up to six months to acquire all the material on a test. Sometimes, you just never obtain materials. Patience and perseverance are almost always required.

When materials arrive, prepare yourself properly to begin your review. The right setting is very important. A well-ventilated, well-lit room (preferably a bit on the chilly side) and a hard, straight-backed chair are essential.

Next, and more important, adopt a show-me attitude. Do not expect test authors to admit in the manuals that the test was poorly normed because there was no money to pay testers or that the test has inadequate reliability because they didn’t develop enough test items. Test authors put the best possible face on their tests, as might be expected. You simply cannot accept the claims made by test authors and their colleagues who write the technical manuals. If you accepted them at their word, they would only have to say that they had a "good, reliable, valid, and well-normed test." Test authors must demonstrate that their tests are reliable, valid, and well normed.

After assembling the relevant materials and finding a suitable place in which to ask, "Where’s the proof?" we usually follow these procedures. First, we skim through the material to get a general idea of what the test is intended to do and what is included in each document that accompanies it. We generally keep notes on several separate sheets of paper—one for each topic that we consider: background and purposes, behavior sampled, scores, norms, reliability, and validity.

Then we reread the manuals. You might expect that test authors would organize test manuals neatly so that you could turn to the table of contents, find the section on, for example, reliability, and turn to the pages indicated. Sometimes, yes—but more often, no. If a manual does not have section headings or chapters, we just begin reading and making notes under our headings. If a manual is divided into sections, we start by reading about the behaviors sampled by the test. (It doesn’t matter too much where you start, except that validity is best left until last.) Test manuals frequently contain a useful description of the behaviors sampled, but more often they merely name the domains sampled. For example, the authors of a test may say that it assesses reading, but that does not tell you whether it assesses reading recognition, reading comprehension, or oral reading. Look at the test directions (especially directions on how to score student responses) and the protocol (the answer form). These materials generally give you a pretty good idea of what behaviors are actually measured. Then, try to describe the behaviors in straightforward terms—avoid psychological and educational jargon.

Next, we look at the section on norms. When evaluating a test’s norms, first note the ages (or grades, in the case of achievement tests) of the students on whom the test was normed. Then look for statements describing the students, and expect the description of the norm groups to be quantified. For example, you should anticipate that the test author will tell you how many boys, how many girls, and how many persons from various ethnic or racial groups were tested at each age or grade. You should also look for geographic information. For example, what percentage of the sample lived in big cities or in the Northwest? Finally, expect socioeconomic information about the students: parents’ occupations, parents’ educational attainment, or income of the household. Look for an explicit comparison of the characteristics of the norm group with the national population, as described in the most recent census. (Sometimes, test authors include all the data and all the comparisons in neat tabular form.) You may find substantial discrepancies between the norm sample and the population. Generally, we look for correspondence between the sample and the population within about 5 percentage points. Thus, if 31 percent of the sample lived in the Southwest and only 26 percent of the U.S. population lives in the Southwest, we would not be overly concerned about the discrepancy. We realize that this is an arbitrary margin of error. If you prefer a different one, that’s fine.
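The comparison of norm sample and census figures can be sketched in a few lines of Python. This is only an illustration, not a procedure taken from any test manual: the function name `flag_discrepancies`, the category labels, and every figure except the Southwest pair (31 and 26 percent) are hypothetical, and the 5-point tolerance is the arbitrary margin mentioned above.

```python
# Hypothetical figures for illustration; only the Southwest pair (31 vs. 26)
# comes from the text. All values are percentages.
TOLERANCE = 5.0  # percentage points; an arbitrary margin of error


def flag_discrepancies(sample_pct, census_pct, tolerance=TOLERANCE):
    """Return {category: gap} for categories whose gap exceeds the tolerance.

    gap = sample percentage minus census percentage, in percentage points.
    """
    flagged = {}
    for category, sample_value in sample_pct.items():
        gap = sample_value - census_pct[category]
        if abs(gap) > tolerance:
            flagged[category] = gap
    return flagged


sample = {"Southwest": 31.0, "Northeast": 12.0, "Large cities": 40.0}
census = {"Southwest": 26.0, "Northeast": 20.0, "Large cities": 38.0}

discrepancies = flag_discrepancies(sample, census)
print(discrepancies)  # {'Northeast': -8.0}
```

With these made-up numbers, the Southwest gap of exactly 5 points is tolerated, while the hypothetical Northeast gap of 8 points is flagged. A reviewer who prefers a stricter or looser margin can pass a different `tolerance`.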

Information on scores is apt to be located in many places: in the section on scoring the test, in the description of the norms, in a separate section on scores, in the section dealing with the interpretation of scores, or in the norm tables themselves. Generally, the best place to find information on the types of scores available is in the norm tables. These tables allow the conversion of raw scores to derived scores and subtest scores to total scores. The next best place to look is in the section on scoring the test. There you find the directions for crediting responses and combining raw scores into derived scores. In the norms section, you may find phrases such as "percentile norms" or "grade-equivalent norms," sure tip-offs that percentiles and grade equivalents will be available. In the sections on interpretation, you may find information on the proper interpretation of derived scores. For example, many test authors will tell you the mean and standard deviation of standard scores and how they are to be interpreted. However, it pays to double-check against the norm tables themselves because test authors occasionally err in their descriptions.

Finding reliability data may be more difficult. If there is a section on reliability, the task is fairly simple. You want to see whether there is evidence of each appropriate type of reliability. Demand numbers; do not settle for statements about the test’s reliability. The authors should show statistical proof of reliability. Read the tables. You can anticipate finding estimates of generalization across items (split-half, KR-20, coefficient alpha, alternate-form, and so on) and across time (test–retest reliability). If the scoring is difficult, you should also find a section on interscorer agreement. (Data on the extent to which you can generalize across scorers can often be found in the section on scoring.) You should find reliability estimates for each subtest at each grade or age. In addition, tables giving the standard error of measurement for each subtest at each grade or age are occasionally provided.
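For readers who want to see what two of these statistics involve, here is a minimal sketch using the standard textbook formulas: coefficient alpha computed from item and total-score variances, and the standard error of measurement computed as the standard deviation times the square root of one minus the reliability. The function names, the tiny response matrix, and the values SD = 15 and r = .91 are all hypothetical and are not drawn from any published test.

```python
import math
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a matrix of item scores.

    item_scores: one row per student, one column per item.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # transpose rows into item columns
    sum_item_variances = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


# Hypothetical right/wrong (1/0) responses: 5 students, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = coefficient_alpha(responses)
sem = standard_error_of_measurement(sd=15, reliability=0.91)
print(round(alpha, 2), round(sem, 2))  # 0.8 4.5
```

For items scored right or wrong, as in this example, coefficient alpha gives the same value as KR-20. The point of the sketch is only that each reliability claim in a manual should be traceable to numbers of this kind.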

The next step is very important: You must determine what scores are to be interpreted, because those are the ones you must judge for adequacy. In the sections dealing with score interpretation, you will often find the scores that the test authors think are most important. Many tests have subtests that are combined into a total score. Sometimes, the subtest scores are stressed over the total score (for example, in the Illinois Test of Psycholinguistic Abilities, or ITPA), whereas in other tests, the total score or part scores are stressed more than subtest scores (for example, in the Wechsler Intelligence Scale for Children–Revised, or WISC-R). The scores that are identified as important and that are to be interpreted must meet the minimum desirable standards of reliability. Consequently, different tests are held to different standards. For example, the subtests on the ITPA must meet a higher standard of reliability than the subtests on the WISC-R because we are urged by the authors to interpret the ITPA subtests but not the WISC-R subtests.

If there is no section on reliability, check the table of contents to see whether there are tables for standard errors of measurement or reliability coefficients. You can usually find all the information that you need in the tables without reading the text. If there are no tables and no section on reliability, there may be no data on reliability in the manuals; this happens frequently. However, reliability information may be hidden in the section on validity, in the section on scores, or in the section on interpretation. Keep a lookout for it as you skim and read.

The evaluation of a test’s validity is the most difficult aspect of reviewing a test. If the norms and the reliability are inadequate, there will be severe problems with validity. Even if they are adequate, the authors must still prove that the test is valid for each recommended use. This means that you must learn how the authors recommend using the test. Do not expect to find this information in a section labeled "validity." More often, you will find such statements in the beginning of the test manuals or in the promotional materials.

You will always find a statement to the effect that the test measures some domain. How do the test authors prove this? Data on content validity are often included in a section called "the development of the test" or "selection of items." In these sections, the authors explain how they chose the items in the test. For other tests, the information will be buried elsewhere in the manual. For still others, there will be no mention of how items were chosen and therefore no proof of content validity.

Depending on the particular type of test, you may also find information on concurrent, predictive, and construct validity. Again, you must remember that the purpose of presenting these data is to demonstrate that the test measures the domain its authors claim it measures. The data should logically bear on the issue of validity.

Beyond claims that the test assesses a particular domain, you may find assertions that the test can be used in particular ways. This is especially true of tests of achievement, which authors often assert can be used in program planning. When we see such assertions, we expect to find a large number of test items appropriate for each grade. We look at the test items and at the norm tables to get an idea of the difference in the number of test items at each grade. All you have to do is find the raw score at the fiftieth percentile at two adjacent grades. For example, suppose that 17 points correct was the fiftieth percentile at the second grade and 21 points correct was the fiftieth percentile at the third grade. Then, only 4 raw-score points would separate second- and third-grade work. This is probably too few items on which to base an educational plan, although the test may well discriminate among test takers.
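The arithmetic in this check is simple enough to sketch. In the Python below, the grade 2 and grade 3 values (17 and 21) are the ones used in the example above; the grade 4 value and the function name `items_between_grades` are hypothetical, added only to show the check applied across a whole norm table. The sketch assumes the table lists consecutive grades.

```python
# Raw score at the fiftieth percentile, by grade.
# Grades 2 and 3 come from the example in the text; grade 4 is hypothetical.
median_raw_score = {2: 17, 3: 21, 4: 30}


def items_between_grades(medians):
    """Return {(lower_grade, upper_grade): raw-score points separating them}."""
    grades = sorted(medians)
    return {
        (lower, upper): medians[upper] - medians[lower]
        for lower, upper in zip(grades, grades[1:])
    }


gaps = items_between_grades(median_raw_score)
print(gaps[(2, 3)])  # 4
```

A gap of 4 raw-score points between second and third grade is the figure that the paragraph above judges too small to support an educational plan.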

Sometimes, we are told that scores can be used in particular ways. Such assertions are often found in the interpretation sections of the manuals. For example, you may find information on critical levels of performance; the authors may tell you that scores below a particular value are indicative of potential problems or that students earning such scores require special instructional interventions. Check out each assertion for the use of the test, and look for proof.

Finally, we tend to be suspicious of strange formulations of reliability, validity, or scores. You should be, too. Remember, the test author should provide all the necessary data in clear and usable form. If it isn’t there, it isn’t your fault, and you should use the test cautiously—or not at all.


Copyright Houghton Mifflin Company. All Rights Reserved.