Mrs. Greenwood struggles every time she grades her students. She hates to give low grades to students who are trying, even if they are not doing very well. ... she wants to be as fair as she can, so she decides to measure the students' performance carefully at the beginning and end of the grading period on tests that are very similar. She goes so far as to make random forms of tests ... She then subtracts the pretest score from the posttest score and gives grades based on the size of the improvement. She then discusses the idea with the principal and, the principal claims that the method is unfair. Is it?
Hills responds to this by saying that yes, it is unfair. In this case, the teacher is selecting students at BOTH ends of the pretest scale. The scores of those lowest on the pretest will be expected to be LESS far below the mean on the posttest while the scores of those highest on the pretest WILL be expected to be closer to the mean on the posttest. This is due to regression to the mean and imperfect correlation between pretest and posttest. Thus, those at the low end on the pretest will gain the most while those at the high end of the pretest will gain less ... thus, the low scorers on the pretest will get the highest grades.
With Don doing some of the "math" ... we were able to show that this is NOT necessarily the case. As it turns out, Hills' paper makes the implicit assumption that there will be a NEGATIVE correlation between the scores on the pretest and the gain scores ... ie, those with the lowest pretest scores will gain the most. But, this is not the whole story ... as Paul Harvey says. It happens to be the case that the two factors that determine what the correlation will be between pretest scores and gain ... and its sign ... are the correlation between the pre and post measures AND the ratio of the posttest variance to the pretest variance. Specifically, the pre-gain correlation is positive whenever the pre-post correlation times the ratio of the posttest standard deviation to the pretest standard deviation is greater than 1. Thus, if the correlation between the pre and post measures is fairly high ... say .7 or so AND there is sufficiently more variability on the posttest than on the pretest (for r = .7, a posttest variance a bit more than twice the pretest variance), then the correlation between pretest scores and the gain can actually be positive. What this means is that those who score highest on the pretest will actually make the largest gains and, would be given the highest grades by Mrs. Greenwood.
In our paper, we actually show a graph with the correlation between pre and post on the baseline, and the correlation between pre and GAIN on the vertical axis ... with many different lines IN the graph depicting the ratio of the post variance (from 10 times to .75 times) to the pre variance. Please see the paper to get the graph ... or send me a note and I will send you one .. Roberts
One of the fallacies in Hills' paper is the apparent assumption that the distribution of scores on the pretest AND the posttest would look the same; ie, appear as though they were the distributions from parallel tests according to classical test theory (same means, variances, etc.). But, this cannot be true. At pretest time, scores will logically tend to be relatively low ... and unless there are some students who know a lot ... there will not be too much range to the scores either. Then, over the course, all the students learn something and ... it is likely that the best students will learn the most. So, at posttest time, the distribution moves way up the score scale ... with higher mean AND very likely a larger variance. As long as the test (both pre and post) does not have a serious ceiling effect ... then the best students will put some distance between themselves and the weaker students ... thus creating MORE variance on the posttest than there was on the pretest. This is precisely the situation where the "math" shows that there can be a positive correlation between the pre scores and the gain. Thus, while Hills' scenario CAN be true, it does NOT have to be true. In fact, it is our guess that more often than not, it will not be true because of the math involved and the way the better students can accomplish more provided that the tests that are used have sufficient ceiling for them to show their wares.
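The dependence of the sign on both the pre-post correlation and the spread of the two tests can be sketched with the standard formula for the correlation between a score and a difference score. This is only an illustration: the function name and the example standard deviations are assumptions made here, not values from the paper.

```python
import math

def pre_gain_correlation(r_pre_post, sd_pre, sd_post):
    """Correlation between pretest scores and gain (post minus pre).

    cov(pre, gain) = r * sd_pre * sd_post - sd_pre**2
    var(gain)      = sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post
    """
    cov = r_pre_post * sd_pre * sd_post - sd_pre ** 2
    var_gain = sd_pre ** 2 + sd_post ** 2 - 2 * r_pre_post * sd_pre * sd_post
    return cov / (sd_pre * math.sqrt(var_gain))

# Equal spread on both tests (the parallel-tests picture): negative correlation.
equal_spread = pre_gain_correlation(0.7, 10.0, 10.0)

# Posttest SD twice the pretest SD (variance ratio of 4): positive correlation.
wider_posttest = pre_gain_correlation(0.7, 10.0, 20.0)

print(round(equal_spread, 2), round(wider_posttest, 2))
```

With r = .7, the sign flips once the posttest SD is more than about 1.43 times the pretest SD, which is the point where r times the SD ratio passes 1.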
NOTE: For the next two weeks, we will have presentations ... first up will be on Stevens' levels of measurement ... stay tuned for summaries.
Note: The summary below for each of the scaling methods is done by me ... and NOT by the student who made the presentation, though admittedly, I did paraphrase some of the handout material from some of the students. The purpose of this summary is to provide some overview of the method ... and not really an overview of the person's presentation. In each case, I have given you at least one seminal paper about the method. It is ALWAYS A GOOD IDEA to examine the original ... and not only some 10th generation version.
Note: Any questions about any of the short descriptions below ... send me an email note ... Mail to Roberts
Stevens, S. S. (1951). Mathematics, measurement, and psychophysics. In Handbook of experimental psychology. New York: Wiley and Sons, Inc. pp 1-49.
One of the most famous issues for discussion in the area of measurement, and appropriate use of statistics ... starts here. Distinction is made between categorical and numerical responses to scaling instruments. Categorical of course would be yes or no responses such as belonging to one religion or another, etc. Numerical (which could be discrete or continuous) could be # of books in different libraries, or blood pressure.
Core Concepts:
Two other references one might pursue on this:
Key reference in this area is Stephenson, W. (1953). The study of behavior: q-technique and its methodology. Chicago: The University of Chicago Press.
The Q sort grew out of a general methodology of Stephenson for the study of verbalized attitudes, self descriptions, and preferences. A basic notion was that it is more important to examine WITHIN person variations in statements about preferences than BETWEEN person variations. Let's say that you have a series of photographs that represent different men from different cultures. Now, the task in Q sort is to arrange these photos in "piles" ... say 10 to 15 ... according to which you prefer most (to least). You might start at a gross level ... put them into 3 piles (really prefer, really DON'T prefer, not sure) ... and then gradually work into more piles with finer gradations between piles. In general, Ss are instructed to work from the extremes to the middle ... with increasing numbers of photographs being placed in the middle piles. The general assumption in this method is that in the general population, there will be essentially a normal distribution of preferences.
If each S sorts into the same number of piles and puts about the same # of cards into each pile ... then we would have similar means and standard deviations across Ss though ... where each S places the cards could obviously be different. Therein lies the way in which different Ss are "placed" along some continuum as to their preferences.
Many of the criticisms about Q sort are more related to the methods of analysis than to the Q sort technique itself. For example, let's say that the same person is asked to sort the stack of photographs into two orderings ... one where they are asked to base it on "how others see me" and the other on "how I see myself". Thus, each cluster of photos will perhaps be in a different ordered location ... under the two sets of conditions. Then ... we will have X and Y columns where X shows the ranking number under condition 1 ... and Y will represent the ranking value under condition 2. A simple correlation between the two will show the degree of consistency, or lack of it, across the two conditions. Stephenson argues that this comparison WITHIN the S can bring about valuable information about that person.
However, while there may be situations where the Q methodology is very relevant to identifying preferences within Ss under different conditions, the technique itself is very cumbersome. It may appear that the card sorting task would be easy ... and if there are only 10 or 15 cards, that might be true ... but to sort into successively finer grained piles when you have say 100 or 120 stimulus cards is NOT easy.
Another reference of interest for reading about Q sort is Nunnally, Psychometric Theory, McGraw-Hill Book Company ... 1978, pp 613-625.
Primary reading for this is: Guttman (1950). The basis for scalogram analysis. In S. Stouffer, et al (Eds). Measurement and prediction. Princeton: Princeton University Press.
Central to Guttman scaling is the notion of a unidimensional scale AND the idea that there is a gradient on which Ss will be located. Like other scaling methods, it is assumed that a S who has a higher value on the gradient does indeed have a higher level/amount of the attitude in question. A wide range of attitude scalings is possible with the Guttman technique ...
However, developing or coming up with items on which to scale people is not necessarily easy. But ... it is assumed that a S who agrees with a more specific statement (ie, I think that high school education is valuable to society) would also agree to a more general item ALONG THE SAME DIMENSION (I think education is valuable to society). Thus, the goal in Guttman scaling is to create a set of items that can be ordered from more general to more specific ... when assessing the attitude or characteristic under investigation.
The task then, after this set of X number of ORDERED statements is created, is to have Ss look at each statement ... and then ask the S whether he/she agrees with the statement. While the statements are generally mixed up in their delivery to the Ss, there is an implicit predetermined scaled order to them. The question is: as we go from more general statements to more specific, will the S agree to each more general one ... but get to a point where he/she does NOT agree to ANY of the more specific statements? THAT is the empirical question. If the scale is completely valid and reproducible, then the answer will be yes. Thus ... if we reorder the items and see the responses of the S ... there will be a series of AGREES down to some statement ... and then DISAGREES FOR EVERY ITEM AFTER THAT. Thus ... the score that is given to the S will tell how many statements down from the most general the S has agreed to ... and thus also the point at which the responses all change to disagree.
Assume that there are 5 items X1 to X5 ... where X5 is the most general and X1 is the most specific. Now ... according to the pure Guttman scaling view ... here are the possibilities. Person 1 agrees with X5 and all below .. give a score of 5. P2 agrees to X5 down to X2 but .... can't agree to X1 ... score = 4. Now, finally, a person who can't agree to any of them gets a score of 0. We assume of course that the higher the score, the stronger the attitude AND ... based on the score ... we know PRECISELY how the person responded.
But ... the problem is ... what if P1 agrees with X5 down to X3 ... but disagrees with X2 but AGREES with X1? This does not jibe with the pure Guttman scaling notion and ... the score of 4 in that case will NOT allow you to perfectly reproduce their item response pattern. And, here lies the crux of the problem: HOW MANY SCALES OF MULTIPLE ITEMS CAN YOU CREATE WHERE THE RESPONSES ARE THAT REPRODUCIBLE BASED ON THE TOTAL SCORES? The answer is ... not many ... except in somewhat constrained situations. However ... it has been found that if you can create sets of ordered items where the reproducibility coefficient is about .85 or higher ... this method can yield interesting information about the attitudes of Ss.
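To make the reproducibility idea concrete, here is a minimal sketch that counts how far each response pattern departs from the "pure" pattern implied by its total score. The error-counting rule used here (compare each pattern with the ideal pattern for the same total) is only one of several conventions and is an assumption of this sketch, as are the function names and the four made-up respondents.

```python
def guttman_errors(responses):
    """Count responses that differ from the ideal pattern for the same total.

    `responses` holds 1 (agree) or 0 (disagree), ordered from the most
    general item (X5) to the most specific item (X1).
    """
    score = sum(responses)
    ideal = [1] * score + [0] * (len(responses) - score)
    return sum(1 for got, want in zip(responses, ideal) if got != want)

def reproducibility(all_responses):
    """Coefficient of reproducibility = 1 - errors / total responses."""
    total = sum(len(r) for r in all_responses)
    errors = sum(guttman_errors(r) for r in all_responses)
    return 1 - errors / total

# Hypothetical respondents; items ordered X5, X4, X3, X2, X1.
subjects = [
    [1, 1, 1, 1, 1],  # pure pattern, score 5
    [1, 1, 1, 1, 0],  # pure pattern, score 4
    [1, 1, 1, 0, 1],  # score 4, but disagrees with X2 and agrees with X1
    [0, 0, 0, 0, 0],  # pure pattern, score 0
]

rep = reproducibility(subjects)
print(round(rep, 2))
```

Under this counting rule the third respondent contributes two errors, so 2 of the 20 responses cannot be reproduced from the total scores.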
The Likert scale method .. pronounced LICKert ... is perhaps the most popular and "famous" of all the techniques. The seminal reading is: Rensis Likert, A technique for the measurement of attitudes, Archives of Psychology, #140, New York, 1932. If you want to get the REAL scoop ... read this paper.
Likert was, as many others were of that time, interested in the measurement of attitudes. Thurstone had done much of the work but, his method of equal appearing intervals not only had some questionable assumptions but, was cumbersome. Thus, and something folks tend to overlook, a major purpose in developing the Likert scaling method was to see if simplification could be accomplished in this important assessment area.
To explore several alternatives to traditional Thurstonian scaling, Likert examined different response techniques and several content domains: international relations, race relations, economic conflict, and religion. Several sources were tapped to obtain items for each of the scales above. Here are some of the response methods he looked at:
Likert provided several pages on psychological interpretations and, an appendix on methods of constructing attitude scales.
Anyone who uses Likert methods ... should read this paper so as to get a better appreciation for what he did .... and did NOT do.
Two seminal sources are Lord, F. M. (1952). A theory of test scores, Psychometric Monograph, #13, 517-548 ... and Gulliksen, H. (1950), Theory of mental tests. New York: Wiley ... and one recent introductory source is ... Suen, H.K. (1990). Principles of test theories, Hillsdale: Lawrence Erlbaum Associates.
No "theory" has dominated general test construction and the use of tests as has classical test theory (even though Item Response Theory has come onto the scene in recent times ... and has made a big dent). Of course, the main way in which scores are obtained in classical test theory is by summing across items ... the separate item scores. The question then is: how do we interpret these scores?
First of all, classical test theory assumes that items on a scale more or less have been randomly sampled from the larger domain or universe of ALL possible items. Therefore, the score one obtains on the specific test is simply an indicator of one's true score ... which we might think about as being either a score with no error or ... the score one would have gotten IF he/she had been forced to respond to ALL possible items from the domain. In this sense ... classical test theory is similar to traditional statistical sampling theory in that the statistic (based on sample data) is taken as the best estimate (under reasonable sampling conditions) of the parameter.
The score like the statistic in classical test theory is called the OBSERVED score ... or O for short. But, underneath that O ... is the truth ... or the true score or T for short. The T is that mystical latent value that we would see if we had been able to test the person on all possible items. Thus, any difference between T and O is called error ... or E for short. Therefore, a simple but fundamental equation in classical test theory is the following:
O = T + E ...
where E of course could be + or - ... and could add to or subtract from the true score ... which would make the O either larger or smaller than it should be. It is very important to realize that in classical test theory ... one can ONLY see the O ... and the T and E concepts are just that .... latent concepts and unobservable. To that extent ... there is a bit of mysticism in classical test theory but ... that's nothing new ... the same applies in all other scaling methods too ... even in IRT!
Fundamental to the tenets of classical test theory is the notion of reliability or ... the precision with which the O comes close to being the same thing as the T. Reliable measures are ones where ... the rank order or alignment of the examinees in terms of total test scores ... is identical or nearly identical with the alignment of their T scores. UNreliable measures would be ones where the O's canNOT be taken as very good estimates of the positions of the T values amongst the group tested. Think about it this way: if we had a very UNreliable method of assessment in classrooms for the purposes of assigning grades ... the person who you give a B to may in fact be an A student ... or an F student.
Statistical reasoning plays a large role in the development of the basic "equations" of reliability ... and related indices. For example, the RELIABILITY INDEX is simply the Pearson correlation between the O and T values ... Of course, since T is unknown ... this is a hypothetical concept. Now ... in correlation statistics ... we would like the O to predict the T ... with as little error as possible. Thus ... in regression lingo ... we can think of O as a predictor of T. Therefore, we can ask: what is the proportion of variance that is attributable to T, based on knowing O? That value is simply r SQUARED ... and that would give us:
r squared (O,T) = var(T)/var(O) ... and this is the reliability COEFFICIENT. In most sources, the reliability coefficient is listed as an r .... for Pearson correlation ... and there will be a double subscript of the same symbol ... like r(xx) or r(11).
Another notion important to classical test theory is that of parallel tests. If in fact a RANDOM sample of items yields an O for a particular examinee ... then another random sample of items of the same size from the same domain should result in an O for that examinee that is about the same. Since this would be true across all examinees, then the means and variances on parallel versions of the test should be about the same. Thus ... the commonality between the two parallel tests, which can be assessed by a simple correlation between the two sets of scores, is seen as the reliability coefficient. Thus ... r(X,Y) ... where X and Y are parallel forms of a test ... is THE estimate of the r(XX) or r(11) ... which is an estimate of var(T)/var(O). Thus ... we could use the square root of that as an estimate of the correlation between the Os and the Ts ... and in fact we could predict the T based on this correlation ...
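Since T and E can never be seen in real data, one way to get a feel for these relationships is to simulate them. The sketch below invents true scores and errors (the normal distributions, the SDs of 10 and 5, the sample size, and all variable names are assumptions for illustration only) and then checks that the squared O-T correlation and the parallel-forms correlation both land near var(T)/var(O) = 100/125 = .8.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000

true_scores = [rng.gauss(50, 10) for _ in range(n)]     # T, with SD 10
form_x = [t + rng.gauss(0, 5) for t in true_scores]     # O = T + E on form X
form_y = [t + rng.gauss(0, 5) for t in true_scores]     # O = T + E on form Y

reliability_index = pearson(form_x, true_scores)        # r(O,T)
reliability_coef = reliability_index ** 2               # near var(T)/var(O)
parallel_forms_r = pearson(form_x, form_y)              # r(X,Y)

print(round(reliability_coef, 2), round(parallel_forms_r, 2))
```

In practice only the last quantity, r(X,Y), can be computed, which is why parallel forms carry so much of the weight in the theory.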
There are numerous ways to estimate the reliability coefficient but, they are all based on the fundamental idea of parallel tests. We could find the correlation between two administrations of the SAME test ... or the correlation between alternate forms of the same test ... or could find the correlation between half tests ... and then make a Spearman-Brown adjustment to boost the reliability estimate back up to what the FULL length test should be.
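The Spearman-Brown adjustment mentioned above is a one-line formula. A minimal sketch follows; the function name and the half-test correlation of .60 are made up for illustration.

```python
def spearman_brown(r, length_factor=2.0):
    """Projected reliability when a test is lengthened by `length_factor`.

    With length_factor = 2 this is the usual split-half step-up:
    r_full = 2 * r_half / (1 + r_half).
    """
    return length_factor * r / (1 + (length_factor - 1) * r)

half_test_r = 0.60                          # hypothetical half-test correlation
full_test_r = spearman_brown(half_test_r)   # stepped up to full length

print(round(full_test_r, 2))
```

The adjustment assumes the two halves behave like parallel tests, which is the same assumption the rest of the reliability estimates lean on.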
As far as test construction goes ... central to the development of good items ... and therefore good tests ... are the methods of item analysis which examine in particular ... item difficulty and discrimination. More on these methods can be found in most elementary measurement books.
As a final note ... it should be mentioned that since classical test theory and the estimation of reliability depends so heavily on correlational statistics ... any factor that has an impact on correlations will ultimately have a spillover impact on reliability estimation: ie, restriction of range of scores is but one important example.
                      FATHER                              (Construct)
  kind  _____:_____:_____:_____:_____:_____:_____  cruel  (Scale)
          +3    +2    +1     0    -1    -2    -3          Distance (Intensity)
                          ORIGIN
        Positive                        Negative          Direction (Quality)

The item itself is called a scale, and several scales on the same concept represent a multidimensional space. The respondent places an "x" on the one equal-interval segment which best represents his/her feelings towards the concept. As one moves further away from the origin of the semantic space (i.e., the middle segment of the scale), the intensity or strength of feeling the individual has towards one of the adjectives with respect to the concept increases. Additionally, the quality of the affect is represented by the direction of the response, either towards the positive or negative adjective.
Osgood et al. (1957) analyzed 20 different concepts which were rated by 100 individuals using 50 sets of bipolar adjectives. The resulting factor analysis revealed three dimensions of the semantic space which account for roughly 50% of the variance: evaluative (e.g., "nice-awful"), potency (e.g., "strong-weak"), and activity (e.g., "excitable-calm"). Several other factors exist, some general and some specific to the particular construct being examined.
A few references:
Key to all of this would be to estimate somehow what the M1 and M2, etc. values would be AND .. if we knew the S value ... and assumed it to be the same for all Ms .... we could then estimate how many S UNITS APART EACH SUBSEQUENT M is from each other M. For example, M2 and M3 might be estimated to be .5 S units apart ... while M3 and M4 might be estimated to be .7 S units apart. In the overall scheme of things therefore, we would assume that M2 and M3 are closer together on the underlying psychological continuum ... whereas M3 and M4 are further apart on the underlying psychological continuum. Now ... without boring you to MORE tears ... let's simply assume that through some simplifying moves ... we derive a system that will first set S = 1 ... a common unit of measurement or the standard deviation value or the discriminal dispersion value ... and then start all the Ms at a point of 0 ... the LEAST favored stimulus ... and then scale the other Ms to the right of 0 ... based on the S value of 1 AND the proportionate favorings of M2 compared to M1, and M3 compared to M2, etc. For example, let's assume that M1 is the least favored stimulus object and we set it to 0 ... and the next favored stimulus is say preferred 98% of the time to M1 ... remember the z score and the PR value above? We would then give M2 a score value of about 2 (the z score for a PR of 98 is about 2.05) ... 2 of the S units of 1 away from M1. We could then scale M3 on up the list ... in a similar way. (Note: I have simplified the method above ... actually ... in finding the final scale differences ... instead of simply asking how much is M2 favored over M1 .... we look at how much is M2 favored over ALL the others ... and go from there. The net effect of that is to scale down the differences among the stimuli but .... this is but a minor detail ... trust me!)
Well ... how do we do this? First, we have a set of n stimuli ... let's say for our purposes ... 7. For n stimuli ... there are n(n-1)/2 pairs ... or 7(6)/2 = 21 different combinations taken 2 at a time. Thus ... for these 7 ... we make up all the 21 possible pairs ... and then present the set (in some random order) to each subject ... and have the subject for EACH pair ... select the one he/she favors or views more positively ... if they can't make a choice ... you MAKE them make a choice .... we want NO wimps here. After all subjects have responded to all 21 pairs of stimuli ... then you are ready to begin the process of analysis ... and ultimately derive the scale values of the M1, M2, etc. up to M7 ... starting at 0 ... and going from there to the right. So .... to illustrate an approximation to this process ... let's do the following.
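Making up the pairs and shuffling their presentation order is easy to automate. A small sketch, with the function name, the stimulus labels, and the seed all chosen here just for illustration:

```python
import itertools
import random

def paired_comparisons(stimuli, seed=None):
    """Return every unordered pair of stimuli, in a shuffled presentation order."""
    pairs = list(itertools.combinations(stimuli, 2))
    random.Random(seed).shuffle(pairs)
    return pairs

stimuli = ["M1", "M2", "M3", "M4", "M5", "M6", "M7"]
pairs = paired_comparisons(stimuli, seed=1)

print(len(pairs))                                   # n(n-1)/2 for n = 7
print(len(paired_comparisons(list(range(20)))))     # n(n-1)/2 for n = 20
```

The count grows quickly with n, which is worth keeping in mind before committing to a long stimulus list.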
The first thing we need to do is for each of you to find which stimulus you favored most often over the others ... Max value is 6 ... and which one you favored least. Then, we will summarize that information across all of us .... over the 7 stimuli. This will then enable us to order the stimuli in a matrix .... from M1 to M7 ... from least to most. Then we will move on from there to a matrix of frequencies .... where the top half of the square represents the number of times M whatever is favored over another M. On the bottom half of the square ... you will have the opposite number ... ie, how many times M whatever was NOT favored over the other M. And, what would go along the diagonal? Well ... we would assume that each M when compared to the same M (though we don't actually do this) would be split 50/50 so ... 1/2 of the total number of responses would go there. For example, if we had 50 raters of these 21 pairs ... we would put 25 down the upper left to lower right diagonal ....
To try a demo in class, I made up a sheet where I paired all combinations of Hot Dog (1), Big Mac (2), Grilled Cheese Sandwich (3), Corned Beef Sandwich (4), Pizza (5), Clam Chowder (6), and Salad (7)... and then had the class and myself do the 21 paired choices ... which one would you prefer to have for lunch? After this, the first thing I did was to make a matrix where I tabulated how many times each stimulus was preferred over every other stimulus across the 7 raters we had ... and converted these into proportions and the table looked like:
Stimulus     1     2     3     4     5     6     7
    1       .5    .43   .71   .57   .86   .29   .86
    2       .57   .5    .57   .71   .71   .14   .57
    3       .29   .43   .5    .57   .57   .57   .71
    4       .43   .29   .43   .5    .57   .43   .43
    5       .14   .29   .43   .43   .5    .29   .29
    6       .71   .86   .43   .57   .71   .5    .71
    7       .14   .43   .29   .57   .71   .29   .5
The .5 value down the diagonal is an assumption that is made that if you pair one stimulus up with itself ... that one half of the time you would pick 1 or the other. To read this table above, the .86 value under the "5" heading means that 86% of the class FAVORED pizza over #1 or Hot Dog. Of course, if that is true, then if you look at the heading "1" down to the intersection of 5 ... this means that 14% favored Hot Dog in that dual choice. Other values are read accordingly.
The next step in this process is to convert the above proportions to z values in a normal distribution ... that is, assuming that the proportion is like a percentile rank in the normal distribution, what z value would that produce? Of course, the diagonal values would all be 0s since that is the z score that has a percentile rank of 50. For a PR of 43, the z would be -.18, and for PR of 71, the z would be .55, etc. Thus, we will have a table of z scores that corresponds to the proportion table above ... and I will "spare" you that for the moment.
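The proportion-to-z lookup is just the inverse of the standard normal cumulative distribution, so it does not need a printed table. A minimal sketch using Python's standard library (the helper name is an assumption made here):

```python
from statistics import NormalDist

def proportion_to_z(p):
    """z score whose percentile rank in the standard normal equals 100 * p."""
    return NormalDist().inv_cdf(p)

z_43 = proportion_to_z(0.43)
z_50 = proportion_to_z(0.50)
z_71 = proportion_to_z(0.71)

print(round(z_43, 2), round(z_50, 2), round(z_71, 2))
```

One practical wrinkle: a proportion of exactly 0 or 1 has no finite z value, so `inv_cdf` raises an error for those inputs and such cells need separate handling.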
Then what we do is to SUM each column ... sum for column "1" and "2", etc. ... and take the average of each sum (divide by the 7 entries in the column). What you get is the following:
Col        1      2      3      4      5      6      7
Sum     -1.16   -.56   -.34   1.09   3.09  -2.73   1.63
Ave      -.17   -.08   -.05    .16    .44   -.39    .23
Stim      Hot   BigM   GrCS   CoBS    Piz  ClCho    Sal
Scale     .22    .31    .34    .55    .83     0     .62
Now, what we do is to make the lowest rated item be 0 and we can do that by ADDING the + value for the largest NEGATIVE ... -.39 ... to all Averages ... and if you do that and then finalize the scale values, you will get what is listed as SCALE above. Guess what? My class favors having pizza for lunch ... but don't try to make them go to Kern (our local food hangout here close to our building at Penn State) and have clam chowder! (though personally, I LOVE that stuff!). What is important here are two things: 1) the relative ordering of the stimuli, and 2) the approximate distances between them. Note that there appears to be less difference in our preferences for Big Macs and Grilled Cheese sandwiches .. than between say Pizza and Salad.
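The whole chain (proportions to z scores, column averages, then shifting so the lowest item sits at 0) can be sketched in a few lines. The proportions are the ones in the table above; the function and variable names are assumptions of this sketch, and it follows the simplified column-average version described here rather than any fuller Thurstone procedure.

```python
from statistics import NormalDist

labels = ["Hot", "BigM", "GrCS", "CoBS", "Piz", "ClCho", "Sal"]

# props[i][j] = proportion of raters favoring column stimulus j over row stimulus i.
props = [
    [.5,  .43, .71, .57, .86, .29, .86],
    [.57, .5,  .57, .71, .71, .14, .57],
    [.29, .43, .5,  .57, .57, .57, .71],
    [.43, .29, .43, .5,  .57, .43, .43],
    [.14, .29, .43, .43, .5,  .29, .29],
    [.71, .86, .43, .57, .71, .5,  .71],
    [.14, .43, .29, .57, .71, .29, .5],
]

def scale_values(p):
    """Column-average z scores, shifted so the least favored stimulus is 0."""
    n = len(p)
    z = [[NormalDist().inv_cdf(cell) for cell in row] for row in p]
    col_means = [sum(z[i][j] for i in range(n)) / n for j in range(n)]
    lowest = min(col_means)
    return [m - lowest for m in col_means]

scale = dict(zip(labels, scale_values(props)))
for name in labels:
    print(name, round(scale[name], 2))
```

Run on the proportions as printed, this sketch agrees with the hand-worked scale values to within rounding for most of the stimuli. For Hot Dog it gives a column sum nearer -2.16 than -1.16, so that one entry may be worth rechecking against the raw tallies; the ordering of the foods is the same either way.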
As a simplification of the above method, I simply found the totals of preferences (each could be favored over another 6 times ... and there were 7 of us ... so the max preference possible would have been 42) and, if you line them up below the SCALE values ... you would have seen ... 16, 19, 20, 24, 29, 14, and 25. If you correlate the scale values with the simple preference totals, you get an r = .99!!!!
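For anyone who wants to check that correlation, here is a short sketch using the scale values and preference totals exactly as listed above (the helper function is written out here only so the example stands alone):

```python
def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Order: Hot Dog, Big Mac, Grilled Cheese, Corned Beef, Pizza, Clam Chowder, Salad
scale = [0.22, 0.31, 0.34, 0.55, 0.83, 0.0, 0.62]
totals = [16, 19, 20, 24, 29, 14, 25]

r = pearson(scale, totals)
print(round(r, 2))
```

A correlation this high suggests that, at least for a small set like this one, the simple tallies carry nearly the same ordering and spacing information as the z score scaling.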
Another issue is what type of scale values you might have arrived at if you had been using a Likert approach ... where you would have say a 1 to 5 rating scale for EACH of the 7 foods ... For example, you could have had statements like ... I like to have Pizza for lunch, or I do not like to have Clam Chowder for lunch ... and then seen if the ordering and approximate relative distances between the foods is very similar to what we did more formally above.
Finally, there is a logistics problem with paired comparisons: if you have many, many stimuli, the actual task of doing all the paired comparisons becomes cumbersome. For example, if you had 20 stimuli ... there would be 20(19)/2 or 190 choices that would have to be made AND placed on pieces of paper for Ss to respond to.
A few references to IRT are the following:

For item B, we see that even at low levels of ability, examinees have a good chance of responding correctly and even before approaching midrange ability, it appears that Ss will be answering the item correctly. For item C, only the highest in ability will be having a good chance of answering the item correctly.
Now we see two strange ICCs ... for items D and E. For item D, regardless of the ability, Ss are responding correctly at about the same rate (maybe .7?). This seems weird ... you would think that the better students would get the item correct more often but, that is not happening in this case. Finally, for item E, we see an opposite pattern where the low ability examinees are answering the item correctly but .... higher and higher levels of ability Ss are doing worse and worse. Say what???
In IRT lingo, items A, B, and C have different difficulties ... with B being the easiest and C being the hardest. For item D, it does not discriminate at all since the chance for correctly answering the item stays the same at ALL levels of ability. And finally, item E would be the worst ... where the Ss who you would suspect would NOT answer it correctly do ... and those who you would think could answer it correctly ... don't. More on this later.
In IRT, there is a stress on ICCs and knowing as much as possible about how an item functions. The notion is ... the more you know about how the items function, the better you are able to put together a test that will accomplish what you need. For example, if you want to make finer distinctions among lower ability Ss, surely you would not want to clog up your test with harder items ... this will do you NO good. So, if we have a collection of ICCs ... we can select from those that show easiness and put THEM on the test along with just a few that are slightly harder ... so, look at the top graph in the figure below.

Another characteristic that we can examine in ICCs is the DISCRIMINATION value ... via IRT. So, look at the bottom graph in the figure above. Notice in this case, I have made 3 different ICCs ... that all have the same b or difficulty value; ie, if you follow the horizontal line over and down, it hits 0 for all 3 curves. But, the ICCs are clearly different despite that fact. It appears that they differ in relative slope .. with A being the steepest and C being the relatively flattest. So, what does that mean? In IRT jargon, the relative steepness is a measure of the item's discrimination value. Here is how it works. Notice that I have put two vertical lines down to the baseline at about -.5 and about +.5 so, the distance between those two points is about 1 theta unit of ability; think of it like 1 full standard deviation value in z score terms. An item that discriminates better will be one where between that distance of -.5 and +.5 ... there is a SHARPER difference in the likelihood of answering the item right ... or a bigger difference in p values. So, let's look at that for item C. If you follow the vertical line at -.5 up to the C line ... and then go to the left to find the p value, it would be about .45. Now, going up the vertical line at +.5 we find the p value (to the left) where it hits the C line to be about .6 ... or a change of about .15 in the probability of answering the item right. If we do the same for line B ... we find p values (for the -.5 and +.5 vertical lines) to be about .38 and .65 or a change of about .27; and for line A ... the comparable values would be ps of about .2 and .8 or a change of about .6. Now ask yourself the following question: which line ... A or B or C ... will make us most confident that a person who has ability of -.5 is CLEARLY DIFFERENT than a person with ability of +.5 ... in terms of correctly answering the question?
Certainly, it would be line A since the difference in ps is largest and, we are more certain that the +.5 person is really of HIGHER ability than the person of -.5 ... BASED ON THIS ITEM. Line A of course has the sharpest slope and, the slope or steepness value is called 'a' in IRT and is the discrimination value. For most ICCs, discrimination values range from about .3 (pretty flat) to 2 or more (pretty steep).
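The same reading-off-the-graph exercise can be done numerically if we assume a particular curve for the ICC. The sketch below uses the two-parameter logistic form, which is an assumption here since these notes show the curves only as pictures; the function names and the two 'a' values are likewise just for illustration, and some programs also multiply 'a' by a scaling constant that is left out.

```python
import math

def icc(theta, a, b=0.0):
    """Two-parameter logistic ICC: probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_change(a, low=-0.5, high=0.5, b=0.0):
    """Change in the probability of a correct answer between two ability levels."""
    return icc(high, a, b) - icc(low, a, b)

flat_item = p_change(0.3)    # a = .3, a pretty flat ICC
steep_item = p_change(2.0)   # a = 2, a pretty steep ICC

print(round(flat_item, 2), round(steep_item, 2))
```

With b = 0 every one of these curves passes through p = .5 at theta = 0, so the only thing separating them is how much p changes across that 1 theta unit, which is the discrimination idea described above.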
In classical test theory, discussed earlier, the difficulty value is the proportion of examinees who answer a question correctly but ... in IRT, it is a location parameter on the ICC. Also, in CTT, the discrimination value is usually a correlation between the item score AND the total test score with higher correlations meaning better discrimination ... in IRT, discrimination or 'a' is a slope parameter. Thus, in IRT ... the concepts of item difficulty and item discrimination come from parameters of the ICCs. This is fundamental if you want to understand IRT ... and other latent trait theories.
First, in CTT, the normal way one scores tests is to have a key ... overlay it on the responses of an examinee ... then total up the item scores where each item is given the same weight; you either get it right or wrong ... and each right response is = 1 point. However, in IRT, that may NOT be the case. In the 1 parameter model ... scoring treats each item with the same weight since each item is assumed to be equally discriminating. But ... what about the 2 parameter model? Here ... we assume that items will vary as to their 'a' or discrimination values and ... why should we give the same score weight to an item that does not discriminate as well as to one that discriminates better? So ... in the two parameter model ... those items that are responded to correctly are weighted by some function of the discrimination value ... thus two people who had 5 items correct may NOT wind up with the same scores depending on how discriminating the items were that they answered correctly. Scoring in the 3 parameter model is even hairier and I won't discuss it here.
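A toy comparison shows how two examinees with the same number right can part ways once discrimination enters the scoring. The weighting used here (simply adding up the 'a' values of the items answered correctly) is an assumed stand-in for "some function of the discrimination value"; real 2 parameter scoring estimates ability from the full response pattern, and the 'a' values and response patterns below are invented.

```python
def number_right(responses):
    """CTT-style score: one point per correct item."""
    return sum(responses)

def weighted_score(responses, discriminations):
    """Sum of the discrimination values of the items answered correctly."""
    return sum(a * u for u, a in zip(responses, discriminations))

# Hypothetical 'a' values for 8 items, from least to most discriminating.
a_values = [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]

person_1 = [1, 1, 1, 1, 1, 0, 0, 0]   # right on the 5 least discriminating items
person_2 = [0, 0, 0, 1, 1, 1, 1, 1]   # right on the 5 most discriminating items

print(number_right(person_1), number_right(person_2))
print(round(weighted_score(person_1, a_values), 1),
      round(weighted_score(person_2, a_values), 1))
```

Both people have 5 right, yet the weighted totals differ because person 2's correct answers came on the items that separate ability levels more sharply.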
Secondly, IRTers take great pride in saying that in IRT ... the parameters are INVARIANT ... ie, don't change from sample to sample of examinees. For example, in CTT ... the difficulty value could be quite different when given to the advanced class ... and much lower when given to the remedial class. However, in IRT ... the item has ONE ICC shape ... and there is a specific equation for that particular curve. Now ... if you happen to be using it with a low ability group ... you may only have the lower part of the ICC but ... IF THAT ICC WERE EXTENDED and you went over from the vertical axis at .5 and down to the extended baseline ... you come up with the same theta baseline value for that ICC. Hence, IRTers claim that ICCs are invariant across subsamples in terms of difficulty. I won't get into the arguments that detract from that simplicity here but ... there are several. Well ... that's about all the time we have for IRT .. though there is a lot more to tell.