
Psicothema, 1994. Vol. 6 (nº 3). 535-556

Ronald K. Hambleton

University of Massachusetts at Amherst

Measurement theory and practice have changed considerably in
the last 25 years. For many assessment specialists today, item response theory
(IRT) has replaced classical measurement theory as a framework for test development,
scale construction, score reporting, and test evaluation. The most popular of
the item response models for multiple-choice tests are the one-parameter (i.e.,
the Rasch) and three-parameter models. Some researchers have been quite adamant
about using only the one-parameter model and have been rather critical of applications
of multi-parameter models such as the three-parameter model. In this paper,
nine arguments are offered for continuing research and applying multi-parameter
IRT models. Also, the position is taken that both single *and* multi-parameter
IRT models (and many others) have potentially important roles to play in the
advancement of measurement practice and judgments about which models to use
in particular situations should depend on model fit to the test data.

Measurement theory and practice have changed considerably since
the seminal publications of Lord (1952, 1953a, 1953b), Birnbaum (1957, 1958a,
1958b), Lord and Novick (1968), Rasch (1960), and Fischer (1974). Since the
publication of Lord and Novick's *Statistical Theories of Mental Test Scores*
in 1968, large numbers of item response models have been proposed, estimation
and goodness-of-fit procedures developed, and small- and large-scale applications,
almost too numerous and varied to count, have followed. A quick check of the
*Journal of Educational Measurement* and *Applied Psychological Measurement*
since 1977 located over 200 papers! And these papers represent only a fraction
of the publications that are available to interested readers. For useful summaries
of current as well as new item response theory (IRT) models, readers are referred
to Hambleton, Swaminathan, and Rogers (1991), Lord (1980), Thissen and Steinberg
(1986), McDonald (1982, 1989), Mellenbergh (1994), Masters (1982), Masters and
Wright (1984), Goldstein and Wood (1989), and van der Linden and Hambleton (in
press). Unidimensional and multi-dimensional models to handle dichotomous as
well as polytomous data with various properties (e.g., nominal, ordinal, interval)
from the cognitive, affective, and psychomotor domains, can now be found in
the measurement literature (see van der Linden & Hambleton, in press).

Some of the largest and most influential assessment instruments
in the country are influenced in some way or other by item response models.
Arguably the most important data to address the third goal of President Bush's
Education 2000 is provided by the *National Assessment of Educational Progress,*
an assessment system for grades 4, 8, and 12 based upon the three-parameter
logistic model. Two of the major standardized achievement tests, the *Comprehensive
Tests of Basic Skills* and the *Metropolitan Achievement Tests*, are
developed and scaled with the three-parameter and one-parameter logistic models
(or Rasch model, as it is often called), respectively. Major national selection
tests such as the *Scholastic Aptitude Test, Graduate Management Admissions
Test, Law School Admission Test,* and the *Graduate Record Exam* utilize
the three-parameter model in item selection, scoring, equating, detecting differentially
functioning test items, and other ways. The *Armed Services Vocational Aptitude
Battery* (ASVAB) was calibrated with the three-parameter logistic model.
And, numerous other achievement, aptitude, and personality tests could be added
to this list. Testing applications based upon IRT principles affect
millions of students in the U.S. and in other countries each year (Hambleton,
1989).

Working within the item response theory field is a group of researchers led by Professor Benjamin Wright and several prominent former students from the University of Chicago, who reject many of the current IRT models and their applications. Their rejection appears to be based on theoretical as well as empirical grounds (e.g., Wright, 1968, 1977, 1984; Wright & Stone, 1979). Their special interest is in the Rasch model (and extensions) and its applications to test data (Rasch, 1960). Many European scholars (e.g., Fischer, 1974; Gustafsson, 1980a, 1980b), in addition to Wright and his co-workers, have been responsible for important Rasch model developments.

Whereas many IRT researchers have taken the position that the
selection of psychometric models should be based in the final analysis on their
psychological meaningfulness, as well as goodness of fit evidence and utility,
and have adopted multi-parameter IRT models in their work, Professor Wright
and his co-workers argue for «fundamental measurement.» Their models preclude
item discrimination and pseudo-guessing parameters, for example, because models
with these parameters violate the assumptions of simple ordering of persons and/or items required
of conjoint measurement models and hence do *not* lead to «fundamental
measurement.» On the other hand, multi-parameter models have found wide use
in many testing applications. What seems clear is that researchers fall into
two categories: those who are basically model builders and are willing to use
many IRT models in their psychometric work when they fit, and those who believe
in fundamental measurement and restrict their attention, therefore, to Rasch
models and extensions which do meet the strict criteria or assumptions of conjoint
measurement.

In the remainder of this paper, nine arguments will be offered to support the strong current interest in and use of IRT models that include more than a single model parameter to account for item difficulty:

1. Nearly a century of testing experience.

2. Reasonableness of fitting IRT models to data.

3. Central importance of the property of «parameter invariance.»

4. Usefulness of item discrimination as an IRT model parameter.

5. Usefulness of the «pseudo-guessing» parameter as an IRT model parameter.

6. Reasonableness of IRT multi-parameter software packages.

7. Successes in applications of multi-parameter IRT models.

8. Shortcomings in the arguments to support the Rasch model.

9. Promising future of multi-parameter IRT models.

The remainder of this paper will be organized around the nine arguments. Concluding remarks will summarize the arguments and point to future IRT directions and development.

1. *Nearly a Century of Experience*

Item response theory, as clearly outlined in Lord and Novick
(1968) and Lord (1980), was developed over at least a 50-year period and evolved
from classical measurement theory. Basic concerns for item characteristic curves
and invariant item and ability parameters can be traced to work by Tucker (1946)
and Gulliksen (1950). What troubled Tucker, Gulliksen, Lawley, Lord, and other
psychometricians about their models and methods in the 1940s was that they produced
item and ability parameters which were sample dependent. For example, Figure
1 highlights the problem (see Lord, 1953b). The same group of examinees would
have relatively *low* true scores on a difficult test and relatively *high*
true scores on an easy test measuring the same ability. However, examinees approach
both tests with the same ability. Lord felt it would be useful to find psychometric
models which could incorporate the characteristics of the test items into the
ability parameter estimation process so that the abilities and a single ability
distribution could be estimated. Popular item statistics such as item difficulty
(e.g., p-value) and item discriminating power (e.g., the point-biserial correlation),
too, were sample dependent, which limited their usefulness in test design. In
fact, interest in the biserial correlation as a measure of item discrimination
increased in the 1950s because of evidence that it tended to be more sample
independent than other item discrimination indices.
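The test dependence of true scores described above can be illustrated with a short sketch. It borrows the one-parameter logistic curve purely for convenience; the item difficulties, the ability value, and the scaling constant `D = 1.7` are assumptions chosen for illustration and are not taken from Lord (1953b) or Figure 1.

```python
import math

def p_correct(theta, b, D=1.7):
    """One-parameter logistic probability of a correct response (assumed form)."""
    return 1.0 / (1.0 + math.exp(-D * (theta - b)))

def true_score(theta, difficulties):
    """Expected proportion-correct score on a test for an examinee of ability theta."""
    return sum(p_correct(theta, b) for b in difficulties) / len(difficulties)

# Hypothetical item difficulties for two tests measuring the same ability.
easy_test = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_test = [0.0, 0.5, 1.0, 1.5, 2.0]

theta = 0.0  # the same examinee takes both tests
easy_score = true_score(theta, easy_test)
hard_score = true_score(theta, hard_test)
print(f"easy test: {easy_score:.2f}   hard test: {hard_score:.2f}")
```

The examinee's ability is held fixed, yet the expected proportion-correct score is higher on the easy test than on the hard one, which is the sample (test) dependence that motivated the search for invariant parameters.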

The goal of these early psychometricians was to obtain *invariant*
parameters: descriptors of items which would not depend upon the particular
examinee sample, and descriptors of examinees (i.e., ability scores) which would
be independent of the sample of items from the larger domain of items measuring
the ability of interest. Lord's contributions in 1952 and 1953 (1952, 1953a,
1953b) were especially important. In these papers, he developed the normal ogive
model, offered approaches for model parameter estimation and model fit, and
formally connected IRT models to well-known classical measurement models. McDonald
(1989, p. 209) went so far as to say «...just about anything done in the field
of binary item response theory can be thought of as a footnote to the seminal
research of Frederic M. Lord.» This statement captures the sentiments
of many psychometricians.

Lord offered a two-parameter normal-ogive unidimensional model
in 1952: unidimensional because this was then, and remains today, a reasonable
assumption for many sets of test data, normal-ogive ICCs (rather than logistic)
because normal ogives were popular (at the time) «S»-shaped curves bounded by
0 and 1 and had been used successfully by other psychometricians (Tucker, 1946),
*and* two item parameters because of the long tradition of utilizing two
item statistics, difficulty and discrimination, in test development and test
analysis work. On this last point, both item statistics had simple and important
relationships to test score characteristics. Thus, the use of a second item
parameter was «not an obsession for complexity...» as one critic reported, but
rather a response to well-established and validated psychometric procedures.
An item discrimination parameter was placed in the model by Lord because of
empirical evidence of its importance and utility. To assume all discriminating
powers equal, when it was well known then, as it is today, that items typically
show variability in their discriminating power, would have been unlikely.

Later, Birnbaum (1957, 1958a, 1958b) substituted the logistic
function for the normal-ogive, and added an additional parameter to the model
to account for non-zero performance of low-ability examinees. In 1960, Georg
Rasch published his own version of a one-parameter IRT model for use with achievement
tests, which, in fact, in a different (but equivalent) form, had been known
by Lord in 1952. But Lord rejected the model because he felt it was unsuitable
for use with multiple-choice items. Rasch was a statistician and not a
psychometrician himself; otherwise he would have known of the earlier work by Lord and
others which appeared in *Psychometrika* and *Educational and Psychological
Measurement*. He desired separability of item and person parameters, and he
achieved it with his one-parameter model. In contrast, many psychometricians
of the day valued (at least) two item parameters in their psychometric models,
and were prepared to give up sufficient statistics (which were available with
the Rasch model) to improve the fit of their models. Also, as Lord (1980) noted
later, sufficient statistics are *not* guaranteed by the Rasch model. They
are only present when this model *fits* the data.

Were it not for the lack of computer power in the 1950s, applications
of item response models would have proceeded more quickly. Conditions were more
favorable in the late 1960s and serious research began in many places. Publication
of Lord and Novick's (1968) *Statistical Theories of Mental Test Scores*
was influential, and Wright (1968) was influential, too. His publications (Wright,
1968; Wright & Panchapakesan, 1969), his level of energy, and his stimulating
Rasch model training programs at the annual meetings of AERA attracted many
researchers to the IRT area, though not always to his philosophical viewpoint.

Rasch model advocates have adopted «specific objectivity» and «sufficient statistics» as fundamental and essential in their work. To many psychometricians, their position represents a narrow basis on which to build a measurement model. It must be kept in mind that most (perhaps all) psychometricians would be willing to use the Rasch model when it can be demonstrated that it fits their data. In this way, they regard the Rasch model as one of many logistic models that lead to invariant model parameters. When it fits, the Rasch model can be used. In contrast, Rasch model advocates adhere to one model (or, more correctly, to one family of models) and they will redefine the ability measured by the test (explicitly, and, more often, implicitly, to the detriment of construct validity of the ability of interest) by deleting misfitting items.

2. *Models or Data: Which Should Come First?*

Few psychometricians would dispute the point that models are valuable in advancing psychometric work such as constructing tests and scaling and equating scores. And, many psychometricians would be quick to argue that meaningful model building follows from careful data analysis. Lord, Bock, Samejima, etc., are model builders. Lord, for example, delayed his IRT research program for 15 years (from 1952 to 1967), until he was able to derive parameter estimates for a model (the three-parameter model) that he felt he needed to fit multiple-choice test data. Justification for his position comes from several empirical studies (perhaps the best evidence is reported in Lord, 1970).

Scientists develop models to explain or fit their data, not usually the reverse. Otherwise, we might still be thinking that the sun travels around the earth and the earth is the center of the universe. Scientists in the social sciences build their models and theories from carefully analyzing their data. Awkward data may force changes in a model, but these data are not discarded for a more appropriate set.

But, as John Tukey was quoted (by Howard Wainer) as stating, «All models are wrong, but they may be useful!» Consider this quote:

That the model is not true is certainly correct, no models are - not even the Newtonian laws. When you construct a model, you leave out all the details which you, with the knowledge at your disposal, consider inessential...

Models should not be true, but it is important that they are applicable, and whether they are applicable for any given purpose must of course be investigated. . . . we may tentatively accept the model described, investigate how far our data agree with it, and perhaps find discrepancies which may lead us to certain revisions of the model.

This is a quote with which many psychometricians could
agree. It describes how we go about our psychometric work. Perhaps this quote
came from Fred Lord or Darrell Bock, two of the leading model builders in psychometric
methods. Actually, the quote came from Georg Rasch, in his book, *Probabilistic
Models for Some Intelligence and Attainment Tests*. Georg Rasch was obviously
not in support of models being more important than the data. Those who take
this position in advocating the Rasch model seem to be in opposition to Rasch's
advice about conducting psychometric research.

Does a model first, data second approach have any merit? «Specific objectivity» is a useful feature of a measuring instrument. «Sufficient statistics» are valuable, too. A psychometric model with these properties could be designed and was, in the form of the Rasch model. Is it of any value? If the model fits data, it could be. Few would dispute this position. As Rasch himself notes, empirical evidence should be compiled to check the fit of the model, and «perhaps find discrepancies which may lead us to certain revisions of the model» (Rasch, 1960). Of course, one alternative is to discard items and/or persons which are not consistent with the model. To many, this position is not defensible. Curriculum specialists would find even less use for measurement specialists if we asked them to narrow their test content, or delete some of their most discriminating items, so as to capitalize on the features of a simple psychometric model.

A simple example of the point that sufficient statistics are
not essential in testing is easy to find. The mean of a set of test scores is
a sufficient statistic which is used to estimate the population mean in a normal
distribution. However, the sample mean is *not* a sufficient statistic
if the distribution of scores is non-normal. In fact, considerable misinformation
may be conveyed about the distribution of scores were it to be used. One popular
response of statisticians or data analysts is to report the median, which is
*not* a sufficient statistic, but which conveys more valuable data about
the test score distribution than the mean with non-normal distributions. At
the very least, statistics which are not sufficient can and do often provide
valuable information.

Whether sufficiency is essential or not, the key point with respect to ability estimation is that sufficient statistics are only available to the extent that the Rasch model fits the data. When the basic model assumptions are violated, test score is not a sufficient statistic for the estimation of ability. Lord (1980) and others have made these points often. It is not a matter of model robustness either. Either sufficient statistics are present or they are not.
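The dependence of sufficiency on model fit can be seen in a small numerical sketch. Two response patterns with the same number-correct score are compared: if the raw score is sufficient for ability, the ratio of their likelihoods cannot depend on ability. The item parameters, the patterns, and the scaling constant `D = 1.7` below are hypothetical values chosen only for illustration.

```python
import math

def pattern_likelihood(theta, pattern, a, b, D=1.7):
    """Likelihood of a 0/1 response pattern under a logistic model with c = 0."""
    like = 1.0
    for u, a_i, b_i in zip(pattern, a, b):
        p = 1.0 / (1.0 + math.exp(-D * a_i * (theta - b_i)))
        like *= p if u == 1 else (1.0 - p)
    return like

def likelihood_ratio(theta, pat1, pat2, a, b):
    """Ratio of the likelihoods of two patterns at the same ability."""
    return pattern_likelihood(theta, pat1, a, b) / pattern_likelihood(theta, pat2, a, b)

b = [-1.0, 0.0, 1.0]               # hypothetical item difficulties
pat1, pat2 = (1, 1, 0), (0, 1, 1)  # both patterns have a raw score of 2
thetas = (-1.0, 0.0, 1.0)

equal_a = [1.0, 1.0, 1.0]    # equal discriminations: the Rasch case
unequal_a = [0.5, 1.0, 2.0]  # unequal discriminations: Rasch assumptions violated

rasch_ratios = [likelihood_ratio(t, pat1, pat2, equal_a, b) for t in thetas]
twopl_ratios = [likelihood_ratio(t, pat1, pat2, unequal_a, b) for t in thetas]
print("equal discriminations:  ", [round(r, 2) for r in rasch_ratios])
print("unequal discriminations:", [round(r, 2) for r in twopl_ratios])
```

With equal discriminations the ratio is the same at every ability level, so the particular pattern carries no information about ability beyond the raw score. With unequal discriminations the ratio changes with ability, and the raw score is no longer sufficient.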

Sufficient statistics are valued by statisticians but they are not the only basis on which to produce a test theory. Lord's (1980) view was that the concept of test information should be central. He felt maximizing the information provided by a test was a desirable goal, and scoring weights for items should be chosen to maximize test information. Readers are referred to Lord (1980) to see the derivation of optimal scoring weights for the one-, two-, and three-parameter logistic models. The loss of information in ability estimation with the one-parameter model can be substantial when items vary substantially in their discrimination power, and performance of low-ability examinees is not zero (see, for example, Lord, 1980).
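A minimal sketch of the information argument follows. It assumes the standard result that the information provided by a weighted score is the squared sum of weighted ICC slopes divided by the weighted sum of item variances, and that the weights P'/(PQ) maximize it; readers should consult Lord (1980) for the derivations. The four items, the ability value, and `D = 1.7` are hypothetical.

```python
import math

D = 1.7  # scaling constant (an assumed, conventional value)

def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def icc_slope(theta, a, b, c):
    """Derivative of the three-parameter logistic ICC with respect to theta."""
    p = icc(theta, a, b, c)
    return D * a * (1.0 - p) * (p - c) / (1.0 - c)

def score_information(theta, items, weights):
    """Information at theta of the weighted score sum(w_i * u_i)."""
    num = sum(w * icc_slope(theta, *it) for w, it in zip(weights, items))
    den = sum(w * w * icc(theta, *it) * (1.0 - icc(theta, *it))
              for w, it in zip(weights, items))
    return num * num / den

def optimal_weights(theta, items):
    """Weights P'/(PQ), which maximize the score information at theta."""
    return [icc_slope(theta, *it) / (icc(theta, *it) * (1.0 - icc(theta, *it)))
            for it in items]

# Hypothetical (a, b, c) triples with varying discrimination and non-zero c.
items = [(0.5, -1.0, 0.2), (1.0, 0.0, 0.2), (2.0, 0.5, 0.2), (1.5, 1.0, 0.2)]
theta = 0.0

unit_info = score_information(theta, items, [1.0] * len(items))  # number-correct score
opt_info = score_information(theta, items, optimal_weights(theta, items))
test_info = sum(icc_slope(theta, *it) ** 2 / (icc(theta, *it) * (1.0 - icc(theta, *it)))
                for it in items)
print(f"number-correct: {unit_info:.3f}   optimal weights: {opt_info:.3f}")
```

The equally weighted (number-correct) score, which is the scoring rule implied by the one-parameter model, yields less information than the optimally weighted score whenever the optimal weights differ across items, as they do here.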

3. *Parameter Invariance is the Cornerstone of IRT*

The property of invariance of ability and item parameters is
the *cornerstone* of IRT. It is the major distinction between IRT and classical
test theory (see, for example, Hambleton & Jones, 1993). Figures 1 and 2
highlight ability parameter invariance (over tests of differing difficulty)
and item parameter invariance (over two examinee samples of differing ability).
The property implies that the parameters that characterize an item do not depend
on the ability distribution of the examinees and the parameter that characterizes
an examinee does not depend on the set of test items.

The invariance property is a characteristic of *all* item
response models. Of course, the property is only present when the IRT model
fits the test data, and when model parameters are estimated properly. As Lord
(1980) reminds us, invariance is a property of the model parameters, *not*
the estimates. Thus, care must be taken in practice to estimate model parameters
well. For example, a homogeneous low-performing group would be a poor choice
of examinee sample to calibrate item statistics on a set of relatively difficult
items.

Some will argue that item parameters cannot be invariant if
the choice of examinee sample needs to be considered. But this argument is incorrect.
An example may be helpful. In the linear regression model, the regression line
for predicting a variable Y from a variable X is obtained as the line joining
the means of the Y variable for each value of the X variable. When the regression
model holds, the same regression line will be obtained within any restricted
range of the X variable, that is, in any subpopulation on X, meaning that the
slope and intercept of the line will be the same in any subpopulation on X.
A derived index such as the correlation coefficient, which is not a parameter
that characterizes the regression line, is *not* invariant across subpopulations.
The difference between the slope parameter and the correlation coefficient is
that the slope parameter does not depend on the characteristics of the subpopulation,
such as its variability, whereas the correlation coefficient does. Note, however,
that the proper *estimation* of the regression line does require a heterogeneous
sample. A homogeneous sample of examinees will provide unstable estimates of
the model parameters. The same concepts also apply in item response models,
which can be regarded as nonlinear regression models.
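The regression analogy can be checked with a small simulation. The sketch below assumes a true line with intercept 2.0 and slope 0.5, standard normal X, unit-variance errors, and a subpopulation defined by X > 0.5; all of these values are arbitrary choices made for illustration.

```python
import random

def slope_and_r(xs, ys):
    """Least-squares slope and Pearson correlation for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]  # assumed linear model

full_slope, full_r = slope_and_r(xs, ys)

# Restrict the range of X to form a more homogeneous subpopulation.
sub = [(x, y) for x, y in zip(xs, ys) if x > 0.5]
sub_slope, sub_r = slope_and_r([x for x, _ in sub], [y for _, y in sub])

print(f"full sample:   slope={full_slope:.2f}  r={full_r:.2f}")
print(f"subpopulation: slope={sub_slope:.2f}  r={sub_r:.2f}")
```

The slope is recovered in both groups, up to sampling error, while the correlation drops in the restricted group. The slope estimate in the restricted group is also less precise, which is the estimation point made above about homogeneous samples.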

It is important to determine whether invariance holds, since every application of item response theory capitalizes on this property. Although invariance is clearly an all-or-none property in the population and can never be observed in the strict sense, we can assess the «degree» to which it holds when samples of test data are used. For example, if two samples of different ability are drawn from the population and item parameters are estimated in each sample, the congruence between the two sets of estimates of each item parameter can be taken as an indication of the degree to which invariance holds. The degree of congruence can be assessed by examining the correlation between the two sets of estimates of each item parameter or by studying the corresponding scatterplot. Figure 3 shows a plot of the difficulty values for 75 items based on two samples from a population of examinees. Suppose that the samples differed with respect to ability. Since the difficulty estimates based on the two samples lie on a straight line, with some scatter, it can be concluded that the invariance property of item parameters holds. Similar checks can be carried out on other item parameters in the model. Some degree of scatter can be expected because of the use of samples; a large amount of scatter would indicate a lack of invariance that might be caused either by model-data misfit or poor item parameter estimation (which, unfortunately, are confounded).

The assessment of invariance described above is clearly subjective but is used because no objective criteria are currently available. Such investigations of the degree to which invariance holds are, as seen above, investigations of the fit of the model to the data, since invariance and model-data fit are equivalent concepts (see, for example, Hambleton, Swaminathan, & Rogers, 1991).
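The congruence check described above might be sketched as follows. The two lists of difficulty estimates are invented for illustration (they are not the Figure 3 data), and the helper names `pearson_r` and `fit_line` are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equally long lists of estimates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical difficulty estimates for the same eight items,
# calibrated separately in a low-ability and a high-ability sample.
b_low = [-1.8, -1.1, -0.6, -0.2, 0.3, 0.7, 1.2, 1.9]
b_high = [-1.6, -1.2, -0.5, -0.1, 0.2, 0.9, 1.1, 2.0]

r = pearson_r(b_low, b_high)
slope, intercept = fit_line(b_low, b_high)
print(f"r={r:.3f}  slope={slope:.2f}  intercept={intercept:.2f}")
```

A correlation near 1 with little scatter about the fitted line would be read as support for item parameter invariance; how much scatter is tolerable remains a matter of judgment, as noted above.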

A similar example can be given to highlight ability invariance
(see Hambleton, Swaminathan, & Rogers, 1991). Invariance of item and ability
parameters is a feature of *all* item response models, one-, two-, and
three-parameter logistic models and others, and can be obtained in the model
parameter *estimates* when the model fits the data, and appropriate data
collection designs are used. The less complex IRT models typically require simpler
designs. The parallel to fitting linear and non-linear regression models is
obvious.

4. *Item Discrimination*

One of the strongest criticisms of the one-parameter logistic model is that it does not account for variability in item discriminating power. As Traub (1983) so cogently states,

... these [Rasch] assumptions fly in the face of common sense and a wealth of empirical evidence accumulated over the last 80 years...The fact that otherwise acceptable achievement items differ in the degree to which they correlate with the underlying trait has been observed so very often that we should expect this kind of variation for any set of achievement items we choose to study. (p. 64).

Table 1 contains the means, standard deviations, and low and
high values of the item biserial correlations on the nine subtests in the *Armed
Services Vocational Aptitude Battery* (ASVAB). This is an excellent example
of data for which an item discrimination parameter is needed to fit the test
data.

Lord (1952) certainly was not prepared to proceed with his
IRT research program until he could satisfactorily estimate *both* item
difficulty and discrimination (see Bejar, 1983, pp. 12-13). Many other researchers,
too, doubt the wisdom of discarding the item discrimination parameter. Hundreds
of applications of multi-parameter IRT models attest to this point.

Two common arguments for using an item discrimination parameter
are (1) you don't want to eliminate some of your best items, and (2) a concern
that the ability of interest may be changed if non-fitting items are removed
from the test or item pool. Figures 4 and 5 highlight the first concern using
over 250 1977 NAEP Mathematics Assessment items (Hambleton & Rogers, 1990).
Items with low and high biserial correlations are *not* fit well by a one-parameter
model (see Figure 4). The fits are considerably better with the two-parameter
model and the curvilinear pattern vanishes (see Figure 5). Users of the one-parameter
model would be forced to eliminate some of the best items in the test or simply
proceed with all of the items and a model that doesn't fit their data. But,
in this latter case, the presence of the invariance property would depend upon
model robustness. Neither option seems satisfactory. The choice of the better
fitting two-parameter model is the decision many researchers would make.

Table 2 highlights the second concern. The example is from
a one- and three-parameter model analysis of the 75-item *Maryland Functional
Reading Test.* Item fit statistics (above and below 1.0) are reported for
each of five content areas measured in the test. Items measuring «Main Idea»
are fitted less well than items measuring the other four content areas. Deleting
items in the «main idea» content area would definitely distort the ability measured
by the full test.

In a recent paper, Masters (1988) alerted the measurement field
to the fact that *not* all items with high discriminating power are necessarily
useful. This is an interesting point, and it is possible that some items may
be showing high discrimination for inappropriate reasons. Proper empirical and
judgmental item bias investigations can help reduce the problem of misinterpreting
high item discriminating power. In fact, when item bias studies are conducted,
certain items show up as discriminating in the majority group and non-discriminating
in the minority group. The result is known as non-uniform bias and can be detected
with two- and three-parameter model item bias studies (see, for example, Hambleton
& Rogers, 1989). Also, results reported in Masters' (1988) paper again
highlight the well-known point that valid test development procedures require
more than a simplistic look at item statistics. Careful, thoughtful, systematic
test development work is needed to ensure that test scores are valid.

Some have argued that item discrimination might be modelled by multidimensional IRT models. Still, the advantages and disadvantages of a one-parameter multidimensional Rasch model versus a one-dimensional, two-parameter model would need to be determined. Certainly, in 1994, the answer seems easy. Two-parameter logistic models can handle the variability in item discrimination well, and the application of any multidimensional model, even a simple multidimensional model, is fraught with problems at this time (e.g., parameter estimation, interpretations). Some recent research by Glas (1992) is responsive to concerns about this point. His solution involves fitting a multi-dimensional Rasch model in which the items measuring each dimension in the solution space have their own level of discriminating power. The model fitting begins by first forming subtests of items with similar discriminating powers. But the procedure is very complicated, not ready for wide use at this time, and, interestingly, requires multiple item discrimination values, one for each subtest.

Finally, measurement with the Rasch model has recently been enriched by contributions from several Dutch researchers who have extended the Rasch model by adding an item discrimination parameter (Verstralen & Verhelst, 1991a, 1991b). What makes this work novel is that they have developed a method for using «imputed values.» They found that the addition of an item discrimination parameter improved model fit and preserved the property of specific objectivity, while not complicating model parameter estimation. The important point at this juncture is that item discrimination was found to be a useful addition to the Rasch model. The topic of model parameter estimation will be considered in Section 6.

5. *Pseudo-Guessing Parameter*

The inclusion of a pseudo-guessing parameter in an IRT model has caused considerable controversy. For one, as Mellenbergh (1994) has noted, the three-parameter logistic model does not conveniently fit into a general framework of psychometric models. Another problem has to do with the psychological interpretation of the parameter. And, finally, parameter estimation has been difficult.

With respect to interpretation, the parameter is not strictly a «guessing parameter.» Lord (1974) determined this in his evaluative study of item parameter estimates from LOGIST. He coined the term «pseudo-chance level parameter.» For others, the parameter is simply needed to account for the non-zero item performance of low-ability candidates. But numerous data analyses (see, for example, Hambleton & Cook, 1983; Hambleton & Rogers, 1990; Hambleton, Swaminathan, & Rogers, 1991) have highlighted the utility of the *c*-parameter in improving model fit. Hambleton, Swaminathan, and Rogers fitted the one-parameter and three-parameter logistic models to NAEP math items and then considered the (absolute-valued) standardized residuals (i.e., misfit statistics) for items sorted by format (open-ended and multiple-choice) and difficulty. The results, reported in Table 3, are clear and compelling. When the fit is good, mean residuals around a value of .80 are expected. The findings are that the three-parameter model generally fits the data well and better than the one-parameter model for the four combinations of item format and level of difficulty. The one-parameter model fit is especially poor when guessing arises with the hard multiple-choice items.

With respect to estimation, Thissen and Wainer (1982) noted
very large standard errors for the *c*-parameter estimates.

Fortunately, proper choice of examinee sample and/or the use
of Bayesian priors in the estimation procedure (for example, such priors are
used in BILOG) can be very helpful in reducing parameter estimation errors (see,
for example, de Gruijter, 1984; Swaminathan & Gifford, 1986). Researchers
have often settled for a common *c*-parameter value across items. This
model is sometimes called a modified three-parameter model, and has resulted
in improved model fit over the two-parameter model.

An interesting problem arises with small sample sizes. Do you fit a model with a single difficulty parameter? Lord (1983) showed that when sample sizes are less than 200, a Rasch model with equal scoring weights is better than a two-parameter model with relatively poor estimates of the item discrimination parameters. On the other hand, in a very nice simulation pertaining to optimal item selection in test development, de Gruijter (1986a) showed that even with small samples, if guessing occurs, a model that handles guessing is preferable to the Rasch model.

$$P_i(\theta) = c + (1 - c)\left[1 + e^{-Da(\theta - b_i)}\right]^{-1}$$

The poor estimation of *c*-parameters in small samples
is addressed in two ways: (1) only a single *c*-parameter is estimated,
and (2) a Bayesian prior on the *c*-parameter estimation process can be
used.
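As an illustration of the model form displayed above, with one difficulty per item and a discrimination and pseudochance level shared by all items, a minimal sketch follows. The function name and the default values of `a` and `c` are hypothetical; `D = 1.7` is the usual scaling constant that brings the logistic curve close to the normal ogive.

```python
import math


def icc_common_c(theta, b_i, a=1.0, c=0.2, D=1.7):
    """Probability of a correct response under the displayed model.

    theta: examinee ability
    b_i:   difficulty of item i (the only item-specific parameter)
    a, c:  discrimination and pseudochance level shared by all items
           (default values are illustrative assumptions)
    """
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b_i)))
    return c + (1.0 - c) * logistic


# At theta == b_i the probability is halfway between c and 1.
print(icc_common_c(0.0, 0.0))
# For very low ability the curve flattens out at c rather than at zero.
print(round(icc_common_c(-50.0, 0.0), 6))
```

The lower asymptote at `c` is what lets the model accommodate non-zero performance by low-ability examinees.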

6. *IRT Computer Software Packages*

Solving hundreds and often thousands of non-linear equations simultaneously is a complex numerical analysis problem. This is the problem that BILOG (Mislevy & Bock, 1986), LOGIST (Wingersky, Barton, & Lord, 1982), MULTILOG (Thissen, 1986), and MicroCAT (Assessment Systems Corporation, 1988) have been designed to address. In solving these equations to obtain model parameter estimates and standard errors, it is common for numerical analysts to place constraints on some or all of the parameters, to incorporate prior beliefs about the parameters (e.g., the c-parameter cannot be less than zero, and is unlikely to exceed .30), and to use sensible initial values of the parameters (see, for example, Mislevy & Stocking, 1989).
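To make the role of constraints and initial values concrete, here is a minimal sketch of the simplest piece of the problem: Newton-Raphson estimation of a single ability under the two-parameter logistic model with item parameters treated as known. It is not how any of the programs named above is implemented; the function names, the bounds of ±4, the starting value, and the item parameters are all assumptions chosen for illustration.

```python
import math

D = 1.7  # usual logistic scaling constant


def p_2pl(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def estimate_theta(responses, items, start=0.0, lo=-4.0, hi=4.0,
                   tol=1e-8, max_iter=50):
    """Maximum likelihood ability estimate with the estimate kept in [lo, hi].

    responses: list of 0/1 item scores
    items:     list of (a, b) pairs, assumed known
    Returns (theta, interior), where interior is False when the estimate
    is stuck on a bound or the iteration limit was reached.
    """
    theta = start
    for _ in range(max_iter):
        grad = 0.0  # first derivative of the log-likelihood
        info = 0.0  # negative second derivative (test information)
        for u, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            grad += D * a * (u - p)
            info += (D * a) ** 2 * p * (1.0 - p)
        # Newton-Raphson step, clamped to the allowed range.
        new_theta = min(max(theta + grad / info, lo), hi)
        if abs(new_theta - theta) < tol:
            return new_theta, lo < new_theta < hi
        theta = new_theta
    return theta, False


# Hypothetical item parameters (a, b).
ITEMS = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.0, 1.0)]

print(estimate_theta([1, 1, 1, 0, 0], ITEMS))  # interior solution
print(estimate_theta([1, 1, 1, 1, 1], ITEMS))  # perfect score: held at the bound
```

The perfect-score case shows why constraints are needed: without the upper bound the estimate would drift toward infinity, which is one simple source of the convergence problems discussed in this section.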

Several facts about these computer programs are not disputed. First, they take considerably more time to provide two- and three-parameter model estimates than programs providing Rasch model estimates, such as BIGSCALE (Wright, Schultz, & Linacre, 1989) and RIDA (Glas, 1990). Second, LOGIST, BILOG, and others sometimes encounter convergence problems with some items and/or persons (see, for example, Yen, Burket, & Sykes, 1991). These events will occur with more frequency when tests are short, and examinee sample sizes are small, though some estimation methods, such as Bayesian, seem to reduce the number of problems (Swaminathan & Gifford, 1985, 1986). That the problems of convergence, large standard errors, and so on, are generally known to users is due to the substantial amount of research that has been conducted.

Considerably less technical information is available on BICAL, BIGSCALE, MICROSCALE, and other software packages which are used to obtain Rasch model parameter estimates. Even in this considerably simpler case, bias in the unconditional maximum likelihood parameter estimates has been observed (de Gruijter, 1986b, 1990; Divgi, 1986; van den Wollenberg, Wierda, & Jansen, 1988) and the goodness of fit statistics associated with the Rasch model have been challenged (see, for example, Divgi, 1986; Rogers & Hattie, 1987). Rogers and Hattie concluded, «the results reported in this study suggest that the mean-square residual and total-t person and item fit statistics will contribute very little to an investigation of one-parameter model fit and, indeed, if relied on solely, will provide incorrect information about the appropriateness of the model» (p. 56).

Divgi (1986) reported that the percent of items misfitting
the Rasch model in the famous *Anchor Test Study* of the middle 1970s (Loret,
Seder, Bianchini, & Vale, 1974) increased from 17%, as reported with the
improper statistical tests, to 68% with what he claimed were the correct statistical
tests. Unfortunately, however, the impression remains among many practitioners
(based upon the Rentz & Rentz [1979] study) that the Rasch model fits all
types of data well. To quote Divgi (1986, p. 284): «It seems safe to conclude
that the studies that found the Rasch model satisfactory for multiple-choice
tests did so because their methods of analysis were not powerful enough.» Fortunately,
improved statistical tests for the Rasch model appear to be on the way (Smith,
1988, 1991).

Our point is not to argue that the two studies by Rogers and Hattie and Divgi invalidate most of the conclusions about Rasch model fit using BICAL. In fact, papers by Henning (1989) and Smith (1988, 1991) have been helpful in understanding and explaining some of the technical problems with various goodness-of-fit measures.

Rather, the point is that the Rasch model estimation and goodness-of-fit
procedures, particularly as implemented in BICAL and its successors, are *not*
without controversy themselves. Rasch model research perhaps would be enhanced
if more up-to-date information were available about the main programs in use
in the U.S. More documentation is available on several programs in use in Europe
(see, for example, Glas, 1990; Verhelst et al., 1991).

In a rapidly developing field such as IRT, it is not surprising to see constant updating of computer software packages to respond to new requests (e.g., standard errors), to technical advances (e.g., Bayesian priors), and to correct errors (e.g., goodness of fit statistics). In fact, such developments are constructive and necessary, but these developments make the task of providing timely comments on software packages difficult. Perhaps it is sufficient to report that the main software packages (LOGIST, BILOG, MULTILOG) for obtaining parameters for the two-parameter and three-parameter logistic models and the graded response model work quite well under reasonable conditions of test length and sample size (see, for example, Mislevy & Stocking, 1989; Reise & Yu, 1990; Yen, 1987). Precise numbers needed are impossible to specify because they interact with desired degree of parameter precision, estimation procedures, and examinee ability distribution. For example, a smaller heterogeneous sample is generally preferable to a larger, more homogeneous sample in item parameter estimation with the two- and three-parameter models.

What does it mean to say that a software program works? One requirement is that the estimation process recover known model parameters in simulation studies without bias and with standard errors that are consistent with the sample sizes. In the most comprehensive study to date, Mislevy and Stocking (1989) have provided the clearest account of the relative strengths and weaknesses of LOGIST and BILOG and, more generally, of the technical problems that arise with joint and marginal maximum likelihood estimation and Bayesian estimation, and how they are addressed in the programs. Basically, the recovery of true ability and item parameters was done well by both programs with moderately long tests (n = 45) and large sample sizes. LOGIST was not very successful with a short test (n = 15), a finding which has been reported at other times also (see, for example, Lord, 1974). BILOG, with its use of Bayesian priors, is more successful than LOGIST with the shorter test lengths. Other comparative studies by Yen (1987), Hulin, Lissak, and Drasgow (1982), and, more recently, Wingersky (1992), have shown the utility of the LOGIST program under many conditions that occur in practice. Still, it must be said that there is plenty of room for improvement in parameter estimation, and neither program, but especially LOGIST, is intended for the IRT newcomer or for small-scale testing applications.
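The recovery criterion described here is usually summarized with two numbers per parameter type: bias (the mean signed error) and root mean squared error. A minimal sketch follows; the function name is hypothetical and the difficulty values are made up for the example, not results from any of the studies cited.

```python
import math


def recovery_summary(true_values, estimates):
    """Bias and RMSE of parameter estimates against known generating values."""
    if len(true_values) != len(estimates) or not true_values:
        raise ValueError("need two non-empty lists of equal length")
    errors = [est - true for true, est in zip(true_values, estimates)]
    n = len(errors)
    bias = sum(errors) / n                          # mean signed error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # overall accuracy
    return bias, rmse


# Made-up generating and estimated item difficulties, for illustration only.
true_b = [-1.0, 0.0, 1.0]
estimated_b = [-0.9, 0.1, 1.1]
bias, rmse = recovery_summary(true_b, estimated_b)
print(round(bias, 3), round(rmse, 3))
```

In a simulation study these summaries would be computed over many replications and compared across test lengths and sample sizes.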

Mislevy and Stocking (1989) have suggested that many of the problems observed in LOGIST and BILOG are common to other computer programs as well that attempt to solve many non-linear equations simultaneously. de Gruijter (1990), Divgi (1986), and others, for example, have noted that unconditional parameter estimation methods popular with the Rasch model are biased.

Yen, Burket, and Sykes (1991) have recently addressed the problem
of non-unique solutions to the likelihood equations for the three-parameter
logistic model and offered ways in which the problem can be addressed in practice.
Some researchers will no doubt point again to flaws in the three-parameter model.
Recall, though, as noted by Yen et al. (1991), that the problems are due to
correct guessing on the part of some examinees, *not* due to the three-parameter
model. Use of the Rasch model will eliminate the multiple maxima problem but
can worsen model fit. Choose your poison! Fortunately, there are possible ways
to identify and handle multiple maxima when they occur. And, other IRT models
which provide more psychologically satisfying responses to the guessing problem
are under development (see, for example, Goldstein & Wood, 1989).

Some, such as Wright (1984), have argued that the technical problems described above are insurmountable and are best overcome with the use of a «better behaved» Rasch model. But two points might be noted. First, these problems are being identified because of the careful and serious way in which researchers working with these multi-parameter models and procedures are proceeding. Problems such as the handling of omits and «not reached» items which have seriously concerned Lord since as early as 1952 or the serious problems of «item order» and «context effects» on item parameter estimation (Yen, 1980; Zwick, 1991) are either unknown to many users of the Rasch model or unappreciated. These problems and many others are every bit as serious for the validity of Rasch model applications as they are for applications of the multi-parameter models. Second, for many psychometricians, the goal is to find IRT models or other models that can be used to represent their data. Simple but non-fitting models are of little interest.

In sum, many research studies (Harwell & Janosky, 1991; Hulin, Lissak, & Drasgow, 1982; Lord, 1974; Ree, 1979; Skaggs & Stevenson, 1989; Swaminathan & Gifford, 1983; Vale & Gialluca, 1988; and others) have shown that logistic models can recover the true parameters in simulation studies when proper estimation designs (i.e., suitable samples in size and location on the ability scale, test lengths, and priors) are used. Generally, item difficulty parameters can be properly estimated with smaller sample sizes than other item parameters.

7. *Successful Applications of Multi-Parameter IRT Models*

A complete accounting of the successful applications of multi-parameter
IRT models is beyond the scope of this paper. This topic itself could serve
as the basis for a long paper. Readers are referred to Hambleton and Swaminathan
(1985), Hambleton, Swaminathan, and Rogers (1991), Lord (1980), and the plethora
of articles in the *Journal of Educational Measurement, Applied Psychological
Measurement,* and *Applied Measurement in Education,* and other measurement
journals.

In view of the serious doubts expressed by Wright (1984) about
the utility of multi-parameter IRT models, it seems only necessary to highlight
a few prominent examples to counter his objections. Perhaps the most visible
multi-parameter IRT application for assessments of interest to policy makers
is the *National Assessment of Educational Progress.* Since 1984, scale
construction and score reporting have been done with the three-parameter logistic
model. This year, too, multi-parameter model extensions to handle polychotomously
scored items will be introduced. CTB/McGraw-Hill/Macmillan has been using the
three-parameter logistic model in its test development, scale construction,
and score reporting of standardized achievement tests. Perhaps the «Cadillac»
of state assessment systems was in California, and this system, until it was
discontinued last year, used multi-parameter IRT models and was carefully monitored
for technical quality by Darrell Bock from the University of Chicago. The
*Law School Admissions Test*, the *Graduate Management Admissions Test,*
the *Scholastic Aptitude Test,* the *Graduate Record Exam,* and the
*Test of English as a Foreign Language* are a number of other high profile,
national or international tests that use multi-parameter IRT models in test
design, item bias analyses, score equating, etc.

8. *Shortcomings in the Arguments to Support the Rasch Model*

Controversy has followed the Rasch model since its introduction to U.S. measurement specialists in a paper by Wright (1968). Goldstein (1981), Whitely and Dawis (1974), Divgi (1986), and McDonald (1985, 1989) are some of the best known measurement experts to challenge the validity of the Rasch model. Shortcomings in the arguments advanced by Wright (1968, 1977, 1984) for the Rasch model include:

1. Placing too much importance on sufficient statistics;

2. Failing to satisfactorily account for variability in item discriminating power;

3. Failing to satisfactorily account for the non-zero item performance of low-performing examinees;

4. Failing to successfully attend to problems such as «omits,» «not reached,» item context effects, item order effects, etc., which impact on model utility and validity;

5. Providing incorrect results on model fit because of the use of inappropriate statistics;

6. Failing to acknowledge (or recognize) that the properties
of item and ability invariance are characteristics of *all* IRT models
which fit the data to which they are applied;

7. Recommending the deletion of misfitting items (rather than adding model parameters) which can distort the trait underlying the test;

8. Failing to recognize serious problems with the Rasch model in important applications such as vertical equating (see, for example, Schulz, Perlman, Rice, & Wright, 1992); and

9. Overstating problems with current multi-parameter IRT software packages.

Each of the nine points has been addressed directly or indirectly
in one or more of the previous sections. It is important to add, however, that,
though there are many shortcomings in the arguments used to support the Rasch
model, the model (and extensions) itself has an important role to play in many
testing applications. A review of Wilson (1992) and the *JEM*, *APM*,
and *AME*, to name just four major references, will turn up many important
technical advances and applications of the Rasch model, including use in (1)
national standardized achievement tests, (2) item banking projects, and (3)
many state testing and credentialing exam programs. Perhaps the best argument
to support its use is that it sometimes provides a close fit to actual test
data. Also, the model may have special utility in situations where samples are
modest in size and the need for high precision in ability estimates is not great
(such as in some school applications).

9. *Promising Future of IRT Models*

While IRT provides solutions to many testing problems that
previously were unsolved or solved in a less than satisfactory way, it is not
a magic wand which can be waved to overcome deficiencies such as poorly written
test items and poor test designs. In the hands of careful test developers, however,
IRT models, the Rasch model *and* multi-parameter models, and IRT methods
can become powerful tools in the design and construction of sound educational
and psychological instruments, and in reporting and interpreting test results.
But it is highly unlikely that any single family of IRT models will be able
to meet the challenges and demands on measurement practices in the coming decade.

Research on IRT models and their applications is being carried out at a phenomenal rate (see Thissen & Steinberg, 1986, and Mellenbergh, 1994, for taxonomies of models; and van der Linden & Hambleton, in press). Entire issues of several journals have been devoted to developments in IRT. For the future, two directions for research appear to be especially important: polytomous unidimensional response models and both dichotomous and polytomous multidimensional response models. Research in both directions is well under way (Masters & Wright, 1984; McDonald, 1989; van der Linden & Hambleton, in press). With the growing interest in «authentic measurement,» special attention must be given to IRT models that can handle polytomous scoring, since authentic measurement is linked to performance testing and non-dichotomous scoring of examinee performance.

Multidimensional IRT models were introduced originally by Lord and Novick (1968), Samejima (1974), and, more recently, by Embretson (1984), Fischer and Seliger (in press), and McDonald (1989). Multidimensional models offer the prospect of better fitting current test data and providing multidimensional representations of both items and examinee abilities. It remains to be seen whether parameters for these multidimensional models can be properly estimated, and whether multidimensional representations of items and examinees are useful to practitioners.

Goldstein and Wood (1989) have argued for more IRT model-building in the future, but feel that more attention should be given to placing IRT models within an explicit linear modeling framework. Advantages, according to Goldstein and Wood, include model parameters that are simpler to understand, easier to estimate, and that have well-known statistical properties.

Three other areas are likely to draw special attention from educators and psychologists in the coming years. First, large-scale state, national, and international assessments are attracting considerable attention, and will continue to do so for the foreseeable future (see, for example, the Third International Mathematics and Science Study involving over 60 countries). Item response models are being used at the all-important reporting stages in these assessments. It will be interesting to see what technical controversies arise from this type of application (see, for example, Zwick, 1991). One feature that plays an important role in reporting is the ICC. Are ICCs invariant to the nature and amounts of instruction? The assumption is that ICCs are invariant, but substantially more research is needed to establish this point.

Second, cognitive psychologists such as Embretson (1984) are interested in using IRT models to link examinee task performance to their ability through complex models that attempt to estimate parameters for the cognitive components that are needed to complete the tasks. This line of research is also consistent with Goldstein and Wood's goal to see the construction of more meaningful psychological models to help explain examinee test performance. See, for example, recent work by Mislevy and Verhelst (1990), and Sheehan and Mislevy (1990), which is along these general lines.

Third, educators and psychologists are making the argument
for considerably more use of test scores than simply rank ordering of examinees
on their abilities or determining whether they have met a particular achievement
level or standard. Diagnostic information is becoming increasingly important
to users of test scores. *Inappropriateness measurement,* developed by
M. Levine and F. Drasgow (see, for example, Drasgow, Levine, & McLaughlin,
1987), which utilizes IRT models, provides a framework for identifying aberrant
responses of examinees and special groups of examinees on individual and groups
of items. Such information can be helpful in successful diagnostic work. More
use of IRT models in providing diagnostic information can be anticipated in
the coming years.
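As one concrete illustration of how an IRT model can flag an aberrant response pattern, the sketch below computes the standardized log-likelihood index often written l_z. It is offered only as a widely used example of this kind of index, not as the specific procedure of the study cited above; the function name and the model probabilities are assumptions made for the example.

```python
import math


def lz_index(responses, probs):
    """Standardized log-likelihood of a 0/1 response pattern.

    responses: list of 0/1 item scores for one examinee
    probs:     model probabilities of a correct response at the
               examinee's estimated ability (assumed to lie strictly
               between 0 and 1, and not all equal to .5)
    Large negative values suggest a pattern that is unlikely under the model.
    """
    log_lik = 0.0   # observed log-likelihood of the pattern
    expected = 0.0  # its expected value under the model
    variance = 0.0  # its variance under the model
    for u, p in zip(responses, probs):
        q = 1.0 - p
        log_lik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (log_lik - expected) / math.sqrt(variance)


# Hypothetical probabilities for five items ordered from easy to hard.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
consistent = [1, 1, 1, 0, 0]  # passes easy items, fails hard ones
aberrant = [0, 0, 0, 1, 1]    # fails easy items, passes hard ones

print(lz_index(consistent, probs) > lz_index(aberrant, probs))
```

The reversed pattern receives a strongly negative value, which is the kind of signal that diagnostic follow-up would start from.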

*Conclusions*

There has been a considerable amount of debate in the measurement literature about the merits of various IRT models, estimation procedures and designs, and the specific steps for equating tests, identifying DIF, and constructing tests. There are even researchers who have some serious reservations about the whole IRT direction in measurement (see, for example, Hoover, 1992).

In 1994, it is nonsensical to argue that the Rasch model (and
its extensions) are the *only* models providing a sound technical basis
for constructing tests and evaluating test scores. The testing field has many
outstanding examples of successful applications of multi-parameter IRT models.
And, technical advances on many fronts (see, for example, Suen & Lee, 1992)
will make future applications even more successful. The argument in this paper
is that there is room for many IRT models, but only models which fit data, and
have parameters which can be estimated well will be of interest. One important
example of model fit was presented earlier in the paper and the results were
poor for the one-parameter model. But, our point was not to reject the Rasch
model for every application. Our only purpose was to highlight the need for,
and the acceptability of, other models. In fact, we laud the excellent work
of many one-parameter model researchers and acknowledge the role their research
has played in the technical foundations of IRT. We note, too, in passing, that
many of the improvements in the Rasch model methods and procedures have been
due to constructive suggestions from IRT researchers not exclusively interested
in the Rasch model. These are constructive actions on the part of many researchers
to make IRT models more useful in practice.

In a way, the long-standing debate between one-parameter and three-parameter model advocates has been counter-productive. There is a need for both psychometric models. In fact, there is substantial evidence to suggest that both models have been used very successfully in many types of testing applications. On the negative side, the focus of attention on the one- and three-parameter logistic models has meant that there is less familiarity on the part of practitioners with many new promising directions for psychometric model-building. Goldstein and Wood (1989), Garcia-Perez and Frary (1991), Mellenbergh (1994), and McDonald (1982, 1989) have all suggested new models or classes of models and, of course, extensions of logistic models to handle polychotomous as well as multi-dimensional data are well under way (see, for example, Embretson, in press; Fischer & Seliger, in press; Reckase, in press).

The main points of this paper are that both the Rasch model and the more general (i.e., multi-parameter) logistic models have important roles to play in the field of testing. But model fit is essential, and it is far better to find models that fit the test data than to discard data simply to fit the Rasch model. Educational and psychological testing practices are changing. New item formats and scoring schema are being introduced to measure higher-order thinking skills. These new measures should not be narrowed to enable a simple (or even extended) Rasch model to fit the resulting data. Curriculum specialists would be rightly shocked. At the same time, to argue that variation in item discrimination is really «multidimensionality» in the test data and should be handled by multidimensional Rasch models is to disregard 80 years of measurement experience. To advocate the use of multidimensional IRT models that are still in their early developmental stage seems highly inappropriate and fails to recognize the success of unidimensional multi-parameter models in fitting educational and psychological data.

**Notes**

^{1} Paper presented at the meeting of NCME, San Francisco, 1992.

^{2} Laboratory of Psychometric and Evaluative Research
Report No. 235. Amherst, MA: University of Massachusetts, School of Education.

**References**

Assessment Systems Corporation. (1988). *User's manual for
the MicroCAT testing system* (Version 3). St. Paul, MN: Author.

Bejar, I. I. (1983). Introduction to item response models and
their assumptions. In R. K. Hambleton (Ed.), *Applications of item response
theory* (pp. 1-23). Vancouver, BC: Educational Research Institute of British
Columbia.

Birnbaum, A. (1957). *Efficient design and use of tests of
ability for various decision-making problems* (Series Report No. 58-16, Project
No. 7755-23). Randolph Air Force Base, TX: USAF School of Aviation Medicine.

Birnbaum, A. (1958a). *On the estimation of mental ability*
(Series Report No. 15, Project No. 7755-23). Randolph Air Force Base, TX:
USAF School of Aviation Medicine.

Birnbaum, A. (1958b). *Further considerations of efficiency
in tests of a mental ability* (Series Report No. 17, Project No. 7755-23).
Randolph Air Force Base, TX: USAF School of Aviation Medicine.

De Gruijter, D. N. M. (1984). A comment on «Some standard errors
in item response theory.» *Psychometrika, 49*, 269-272.

De Gruijter, D. N. M. (1986a). Small N does not always justify
Rasch model. *Applied Psychological Measurement, 10*(2), 187-194.

De Gruijter, D. N. M. (1986b). The bias of an approximation
to the unconditional maximum likelihood procedure for the Rasch model. *Kwantitatieve
Methoden, 20,* 49-55.

De Gruijter, D. N. M. (1990). A note on the bias of UCON item
parameter estimation in the Rasch model. *Journal of Educational Measurement,
27(3),* 285-288.

Divgi, D. R. (1986). Does the Rasch model really work for multiple
choice items? Not if you look closely. *Journal of Educational Measurement,
23*, 283-298.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. *Applied Psychological Measurement, 11(1)*, 59-79.

Embretson, S. E. (1984). A general latent trait model for response
processes. *Psychometrika, 49,* 175-186.

Embretson, S. E. (in press). Multicomponent response models.
In W. J. van der Linden & R. K. Hambleton (Eds.), *Handbook of item response
theory.* New York: Springer-Verlag.

Fischer, G. H. (1974). *Einführung in die Theorie psychologischer
Tests* (Introduction to the theory of psychological tests). Bern: Huber.

Fischer, G. H., & Seliger, E. (in press). Multidimensional
linear logistic models for change. In W. J. van der Linden & R. K. Hambleton
(Eds.), *Handbook of item response theory.* New York: Springer-Verlag.

Garcia-Perez, M. A., & Frary, R. B. (1991). Finite state
polynomic item characteristic curves. *British Journal of Mathematical and
Statistical Psychology, 44,* 45-73.

Glas, C. A. W. (1990). RIDA: *Rasch incomplete design analysis*.
Arnhem, The Netherlands: National Institute for Educational Measurement.

Glas, C. A. W. (1992). A Rasch model with a multivariate distribution
of ability. In M. Wilson (Ed.), *Objective measurement: Theory into practice*
(Volume 1). Norwood, NJ: Ablex Publishing Co.

Goldstein, H. (1981). Limitations of the Rasch model for educational
assessment. In C. Lacey & D. Lawton (Eds.), *Issues in evaluation and
accountability* (pp. 172-188). London: Methuen.

Goldstein, H., & Wood, R. (1989). Five decades of item
response modelling. *British Journal of Mathematical and Statistical Psychology,
42,* 139-167.

Gulliksen, H. (1950). *Theory of mental tests*. New York:
Wiley.

Gustafsson, J. E. (1980a). A solution of the conditional estimation
problem for long tests in the Rasch model for dichotomous items. *Educational
and Psychological Measurement, 40,* 377-385.

Gustafsson, J. E. (1980b). Testing and obtaining fit of data
to the Rasch model. *British Journal of Mathematical and Statistical Psychology,
33,* 205-233.

Hambleton, R. K. (1989). Principles and selected applications
of item response theory. In R. L. Linn (Ed.), *Educational Measurement* (3rd
ed., pp. 147-200). New York: Macmillan.

Hambleton, R. K., & Cook, L. L. (1983). Robustness of item
response models and effects of test length and sample size on the precision
of ability estimates. In D. Weiss (Ed.), *New horizons in testing* (pp.
31-49). New York: Academic Press.

Hambleton, R. K., & Jones, R. W. (1993). Comparison of
classical test theory and item response theory and their applications to test
development. *Educational Measurement: Issues and Practice, 12(3),* 38-47.

Hambleton, R. K., Murray, L. N., & Williams, P. (1983).
*Fitting item response models to the Maryland Functional Reading Tests* (Laboratory
of Psychometric and Evaluative Research Report No. 139). Amherst, MA: University
of Massachusetts, School of Education.

Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially
biased test items: Comparison of IRT area and Mantel-Haenszel methods. *Applied
Measurement in Education, 2(4)*, 313-334.

Hambleton, R. K., & Rogers, H. J. (1990). Using item response
models in educational assessments. In W. H. Schreiber & K. Ingenkamp (Eds.),
*International developments in large-scale assessment* (pp. 155-184). Windsor,
UK: NFER-Nelson.

Hambleton, R. K., & Swaminathan, H. (1985). *Item response
theory: Principles and applications.* Boston, MA: Kluwer Academic
Publishers.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991).
*Fundamentals of item response theory.* Newbury Park, CA: Sage.

Harwell, M. R., & Janosky, J. E. (1991). An empirical study
of the effects of small datasets and varying prior variances on item parameter
estimation in BILOG. *Applied Psychological Measurement, 15,* 279-291.

Henning, G. (1989). Does the Rasch model really work for multiple-choice
items? Take another look: A response to Divgi. *Journal of Educational Measurement,
26(1)*, 91-97.

Hoover, H. D. (1992, April). *Some shortcomings of item response
models.* Paper presented at the meeting of AERA, San Francisco.

Hulin, C. L., Lissak, R. I., & Drasgow, F. (1982). Recovery
of two- and three-parameter logistic item characteristic curves: A Monte Carlo
study. *Applied Psychological Measurement, 6,* 249-260.

Lord, F. M. (1952). *A theory of test scores* (Psychometric
Monograph No. 7). Iowa City, IA: Psychometric Society.

Lord, F. M. (1953a). An application of confidence intervals
and of maximum likelihood to the estimation of an examinee's ability. *Psychometrika,
18,* 57-75.

Lord, F. M. (1953b). The relation of test score to the trait
underlying the test. *Educational and Psychological Measurement, 13,*
517-548.

Lord, F. M. (1970). Item characteristic curves estimated without
knowledge of their mathematical form: A confrontation of Birnbaum's logistic
model. *Psychometrika, 35,* 43-50.

Lord, F. M. (1974). Estimation of latent-ability and item parameters
when there are omitted responses. *Psychometrika, 39*, 29-51.

Lord, F. M. (1980). *Applications of item response theory
to practical testing problems.* Hillsdale, NJ: Lawrence Erlbaum.

Lord, F. M. (1983). Small N justifies Rasch model. In D. J.
Weiss (Ed.), *New horizons in testing* (pp. 51-61). New York: Academic
Press.

Lord, F. M., & Novick, M. R. (1968). *Statistical theories
of mental test scores*. Reading, MA: Addison-Wesley.

Loret, P. G., Seder, A., Bianchini, J. C., & Vale, C. A.
(1974). *Anchor test study* (Final Report). Princeton, NJ: Educational
Testing Service.

Masters, G. N. (1982). A Rasch model for partial credit scoring.
*Psychometrika, 47,* 149-174.

Masters, G. N. (1988). Item discrimination: When more is worse.
*Journal of Educational Measurement, 25,* 15-29.

Masters, G. N., & Wright, B. D. (1984). The essential process
in a family of measurement models. *Psychometrika, 49*, 529-544.

McDonald, R. P. (1982). Linear versus nonlinear models in item
response theory. *Applied Psychological Measurement, 6(4)*, 379-396.

McDonald, R. P. (1985). *Factor analysis and related methods.*
Hillsdale, NJ: Lawrence Erlbaum Associates.

McDonald, R. P. (1989). Future directions for item response
theory. *International Journal of Educational Research, 13(2),* 205-220.

Mellenbergh, G. J. (1994). Generalized linear item response
theory. *Psychological Bulletin, 115(2),* 300-307.

Mislevy, R. J., & Bock, R. D. (1986). *BILOG: Maximum
likelihood item analysis and test scoring with logistic models.* Mooresville,
IN: Scientific Software.

Mislevy, R. J., & Stocking, M. L. (1989). A consumer's
guide to LOGIST and BILOG. *Applied Psychological Measurement, 13(1)*,
57-75.

Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses
when different subjects employ different solution strategies. *Psychometrika,
55(2),* 195-215.

Prestwood, J. S., Vale, C. D., Massey, R. H., & Welsh,
J. R. (1985). *Armed Services Vocational Aptitude Battery: Development of
an adaptive item pool* (AFHRL-TR-85-19). Brooks Air Force Base, TX: Manpower
and Personnel Division.

Rasch, G. (1960). *Probabilistic models for some intelligence
and attainment tests.* Copenhagen: Denmarks Paedagogiske Institut.

Reckase, M. (in press). A linear logistic model for dichotomous
item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), *Handbook
of item response theory.* New York: Springer-Verlag.

Ree, M. J. (1979). Estimating item characteristic curves. *Applied
Psychological Measurement, 3*, 371-385.

Reise, S. P., & Yu, J. (1990). Parameter recovery in the
graded response model using MULTILOG. *Journal of Educational Measurement,
27(2),* 133-144.

Rentz, R. R., & Rentz, C. C. (1979). Does the Rasch model
really work? *Measurement in Education, 10(2),* 1-8.

Rogers, H. J., & Hattie, J. A. (1987). A Monte Carlo investigation
of several person and item fit statistics for item response models. *Applied
Psychological Measurement, 11(1)*, 47-57.

Samejima, F. (1974). Normal ogive model on the continuous response
level in the multidimensional latent space. *Psychometrika, 39,* 111-121.

Schulz, E. M., Perlman, C., Rice, W. K., & Wright, B. D.
(1992). Vertically equating reading tests: An example from Chicago public schools.
In M. Wilson (Ed.), *Objective measurement: Theory into practice, Volume 1*
(pp. 138-156). Norwood, NJ: Ablex Publishing Co.

Sheehan, K., & Mislevy, R. J. (1990). Integrating cognitive
and psychometric models to measure document literacy. *Journal of Educational
Measurement, 27(3),* 255-272.

Skaggs, G., & Stevenson, J. (1989). A comparison of pseudo-Bayesian
and joint maximum likelihood procedures for estimating parameters in the three-parameter
IRT model. *Applied Psychological Measurement, 13*, 391-402.

Smith, R. M. (1988). The distributional properties of Rasch
standardized residuals. *Educational and Psychological Measurement, 48,*
657-667.

Smith, R. M. (1991). The distributional properties of Rasch
item fit statistics. *Educational and Psychological Measurement, 51,* 541-565.

Suen, H. K., & Lee, P. S. C. (1992). Constraint optimization:
A perspective of IRT parameter estimation. In M. Wilson (Ed.), *Objective
measurement: Theory into practice* (pp. 289-300). Norwood, NJ: Ablex Publishing.

Swaminathan, H., & Gifford, J. A. (1983). Estimation of
parameters in the three-parameter latent trait model. In D. Weiss (Ed.), *New
horizons in testing* (pp. 13-30). New York: Academic Press.

Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation
in the two-parameter logistic model. *Psychometrika, 50*, 349-364.

Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation
in the three-parameter logistic model. *Psychometrika, 51*, 589-601.

Thissen, D. M. (1986). MULTILOG: *Item analysis and scoring
with multiple category response models* (Version 5). Mooresville, IN: Scientific
Software.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item
response models. *Psychometrika, 51*, 567-577.

Thissen, D., & Wainer, H. (1982). Some standard errors
in item response theory. *Psychometrika, 47,* 397-412.

Traub, R. E. (1983). A priori considerations in choosing an
item response model. In R. K. Hambleton (Ed.), *Applications of item response
theory* (pp. 57-70). Vancouver, BC: Educational Research Institute of British
Columbia.

Tucker, L. R. (1946). Maximum validity of a test with equivalent
items. *Psychometrika, 11*, 1-13.

Vale, C. D., & Gialluca, K. A. (1988). Evaluation of the
efficiency of item calibration. *Applied Psychological Measurement, 12(1)*,
53-67.

Van den Wollenberg, A. L., Wierda, F. W., & Jansen, P.
G. W. (1988). Consistency of Rasch model parameter estimation: A simulation
study. *Applied Psychological Measurement, 12,* 307-313.

Van der Linden, W. J., & Hambleton, R. K. (Eds.). (in press).
*Handbook of item response theory.* New York: Springer-Verlag.

Verhelst, N. D., Glas, C. A. W., & Verstralen, H. H. F.
M. (1991). *OPLM* (The One Parameter Logistic Model). Arnhem, The Netherlands:
CITO.

Verstralen, H. H. F. M., & Verhelst, N. D. (1991a). *The
sample strategy of a test information function in computerized test design*
(Measurement and Research Department Reports, 91-6). Arnhem, The Netherlands:
CITO.

Verstralen, H. H. F. M., & Verhelst, N. D. (1991b). *Decision
accuracy in IRT model* (Measurement and Research Department Reports 91-7).
Arnhem, The Netherlands: CITO.

Whitely, S. E., & Dawis, R. V. (1974). The nature of objectivity
with the Rasch model. *Journal of Educational Measurement, 11(3),* 163-178.

Wilson, M. (Ed.). (1992). *Objective measurement: Theory
into practice.* Norwood, NJ: Ablex Publishing Co.

Wingersky, M. S. (1992). *Significant improvements to LOGIST*
(Research Bulletin 92-22). Princeton, NJ: Educational Testing Service.

Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982).
*LOGIST user's guide.* Princeton, NJ: Educational Testing Service.

Wright, B. D. (1968). Sample-free test calibration and person
measurement. In *Proceedings of the 1967 Invitational Conference on Testing
Problems* (pp. 85-101). Princeton, NJ: Educational Testing Service.

Wright, B. D. (1977). Solving measurement problems with the
Rasch model. *Journal of Educational Measurement, 14(2)*, 97-116.

Wright, B. D. (1984). Despair and hope for educational measurement.
*Contemporary Education Review, 1,* 281-288.

Wright, B. D., & Panchapakesan, N. (1969). A procedure
for sample-free item analysis. *Educational and Psychological Measurement,
29*, 23-48.

Wright, B. D., Schulz, M., & Linacre, J. M. (1989). BIGSCALE: Rasch analysis computer program. Chicago: MESA Press.

Wright, B. D., & Stone, M. H. (1979). *Best test design.*
Chicago: MESA Press.

Yen, W. M. (1980). The extent, causes, and importance of context
effects on item parameters for two latent trait models. *Journal of Educational
Measurement, 17*, 297-311.

Yen, W. M. (1987). A comparison of the efficiency and accuracy
of BILOG and LOGIST. *Psychometrika, 52,* 275-291.

Yen, W. M., Burket, G. R., & Sykes, R. C. (1991). Nonunique
solutions to the likelihood equation for the three-parameter logistic model.
*Psychometrika, 56(1)*, 39-54.

Zwick, R. (1991). Effects of item order and context on estimation
of NAEP reading proficiency. *Educational Measurement: Issues and Practice,
10,* 10-16.

*Accepted: June 30, 1994*