Note: My buddy Bob Frary at VPI ... and his friend Miguel Garcia-Perez ... have done some excellent work related to the none of the above (NOTA) option. Their position is that the presence of NOTA in items tends to make them somewhat more difficult and, because of that, potentially more discriminating. One paper is: Frary (1991). The NOTA option: an empirical study, Applied Measurement in Education, 4(2), 115-124. Anyone interested in this topic ... I would encourage you to have a look at this paper (OR SEE THE EXPANDED VERSION DOWN BELOW) ... or if you have other questions, send a note to Bob ... Bob the NOTA fellow
Eliminate incorrect options: With this method, the examinees are told to mark through all the options that they think are INCORRECT ... being careful not to also mark through the correct answer. For example, say that a person knows for sure the correct choice ... then he/she marks (if there are 5 options for example) the 4 incorrect choices .... and leaves standing the correct choice. On this item, the score is 4. If one clearly knows that choices A and C are not correct but is not sure about the others ... then he/she marks through those 2 ... and leaves the others standing alone. Assuming that the correct choice is not A or C ... his/her item score is 2. On a 20 item test with 5 options for each item, the max score possible is 4 times 20 = 80 ... and theoretically the min is 0 times 20 = 0 ... assuming that the examinee marked through the correct choice (but this is not likely) on each item. Method tends to spread the scores out more and allow better accounting for partial knowledge.
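Here is a minimal Python sketch of the elimination scoring just described. The function name is made up for illustration, and it ASSUMES that marking through the keyed answer earns 0 on that item (which is what the "theoretical min of 0" above implies); some versions of the method penalize that mistake instead.

```python
def elimination_item_score(crossed_out, key):
    """Score one item under the eliminate-incorrect-options method.

    crossed_out: set of option labels the examinee marked through.
    key: label of the correct option.
    Assumption: marking through the keyed answer earns 0 for the item.
    """
    if key in crossed_out:
        return 0
    # One point for every incorrect option that was eliminated.
    return len(crossed_out)


# Five-option item (A-E) keyed "B".
full = elimination_item_score({"A", "C", "D", "E"}, "B")     # knows the answer
partial = elimination_item_score({"A", "C"}, "B")            # rules out two
blunder = elimination_item_score({"B", "C"}, "B")            # crossed out the key

print(full, partial, blunder)   # 4 2 0
print(20 * full)                # max on a 20 item test: 80
```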
Mark until correct: Another variant of the one above is to have the examinee do the opposite ... mark until they hit the correct answer. Now, if the person really really knows it for sure, they will get this on the first shot ... if not, it might take 2 or more stabs until reaching "bingo!". The method when proposed used a punch device that provided immediate feedback to the examinee. Here, LOW scores mean best ... since that represents fewer stabs to get to the correct answers. Again, tends to spread the variance out more ... and give more credit for partial knowledge.
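A small sketch of how mark-until-correct scoring could be tallied. The function name and the sample responses are hypothetical; the only thing taken from the description above is that the item score is the number of stabs needed, so LOW totals are best.

```python
def attempts_until_correct(choices_in_order, key):
    """Count how many options were marked up to and including the key."""
    for attempt, choice in enumerate(choices_in_order, start=1):
        if choice == key:
            return attempt
    raise ValueError("the keyed option was never marked")


# Hypothetical three-item test: (options marked in order, keyed answer).
responses = [
    (["C"], "C"),            # knew it: 1 stab
    (["A", "D"], "D"),       # narrowed to two: 2 stabs
    (["B", "A", "E"], "E"),  # mostly guessing: 3 stabs
]

total = sum(attempts_until_correct(marks, key) for marks, key in responses)
print(total)  # 6 ... the best possible total on three items would be 3
```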
Confidence or probabilistic testing: There are several varieties of this but, in essence, one is told to allocate confidence (or p's or money) to the choices in terms of the likelihood (to the examinee) of being correct. For example, let's say that a person really really knows the answer. Well, he/she would put all they have to bet on the correct choice and receive max credit for the item. If one can narrow the choices down, let's say to A and C (and assume that one of these is correct), then he/she might opt for putting 50% on each ... In this method, IF ONE DOES NOT KNOW ANYTHING AT ALL ... rather than guess, he/she is supposed to equally allocate to the choices (say 20% to each). Scoring is based on how much confidence or p or money has been allocated TO THE CORRECT ANSWER. In this method, how much you put on any other option is totally irrelevant. And, usually in this system, one attempts to discourage random guessing (when the examinee does not know) and therefore, penalizes severely the examinee who puts 0 bet on the correct choice. Thus, if one wants to take many risks and guesses, one quickly finds out that his/her score could be NEGATIVE ... and a total score in that area surely means a lot of guessing went on.
There is a lot of literature on this topic and a couple of not too recent references are the following: (1) Hambleton, Roberts and Traub (1970). A comparison of the reliability and validity of two methods for assessing partial knowledge on MC tests. Journal of Educational Measurement, 7, pp 75-82 ... and (2) Reid (1979). An investigation of factors underlying the utility of confidence testing procedures. Doctoral dissertation, Penn State University.
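The description above does not say which scoring rule is used, and there are several. Purely as an illustration, here is a sketch of ONE rule that has the properties described: credit depends only on what was put on the correct choice, spreading evenly earns nothing, and a 0 bet on the correct choice is hit hard. The truncated logarithmic rule and the floor of -2 are my assumptions for the sketch, not the rule from the studies cited.

```python
import math


def confidence_item_score(allocation, key, floor=-2.0):
    """Hypothetical truncated log rule for one confidence-tested item.

    allocation: dict mapping every option to the proportion bet on it.
    key: the correct option.
    floor: assumed penalty when (almost) nothing is on the key.
    """
    p = allocation.get(key, 0.0)
    k = len(allocation)
    if p <= 0:
        return floor
    # 1 when everything is on the key, 0 when spread evenly over k options.
    return max(floor, 1 + math.log(p) / math.log(k))


sure = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0}
split = {"A": 0.5, "B": 0.0, "C": 0.5, "D": 0.0, "E": 0.0}
even = {opt: 0.2 for opt in "ABCDE"}
wrong_bet = {"A": 0.0, "B": 1.0, "C": 0.0, "D": 0.0, "E": 0.0}

for bets in (sure, split, even, wrong_bet):
    print(round(confidence_item_score(bets, "A"), 3))
```

With this rule, an examinee who keeps betting everything on wrong options piles up floor penalties and ends with a negative total, which is the pattern the text describes.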
While to me the above variations on conventional MC testing are more interesting ... and almost to a tee, examinees who are given the chance to use the methods will tell you that they like them better and feel that they are getting a better shake at being fairly evaluated, the evidence for the increase in reliability and/or validity of these is rather dismal. One of the problems is that they tend to take MORE testing time ... and time is money AND ITEMS ... when testing. So, even if there were a little gain in say reliability ... the time needed for examinees to respond to some fixed number of items is longer ... and one could use conventional methods with MORE items in the same amount of testing time ... as a way to compensate. But, even considering that ... there still is not much evidence showing that these variations are better. But ... they ARE more interesting and worth trying out on examinees ... even if ONLY to show that there are more ways to handle MC tests.
13. We also know that even if assumption 2 is correct, there is no guarantee that you will select the choice that is keyed as the correct choice. Why? That would assume that the items are so well constructed that the knowledgeable examinee can see through the item and find the correct answer. But, due to ambiguity or a number of other things, this may not happen. Now, this is not as major a threat to the violation of the assumptions as the first but, nonetheless, it is such that the second assumption is not completely true.
14. Because of the invalidity of the first assumption, the third assumption cannot be true either. If for example, you have some knowledge about the item, and can eliminate say 2 options of the 5 ... then you MAY guess among the remainder of 3 but ... that's a wild guess out of 3 ... not a wild guess out of 5. In essence, you have made a 3 choice item out of a 5 choicer and, clearly, you have increased the odds of your getting it correct even though you don't have full knowledge.
The bottom line is that the assumptions that are necessary to develop the correction for guessing formula and make it valid are not true, at least not totally.
A. For the 30 that you had no idea of, given 5 options, we would expect that you would get about 1/5 correct or 6. From that ... you would miss about 24. But, from the 20 that you were able to turn into TF kinds of items, you would get about 10 correct and 10 wrong. So, on this test, the total number of wrong would be expected to be about 24 + 10 = 34. Now, putting this into a correction for guessing formula:
Corrected score = # correct - # wrong/(# options - 1) = 16 - 34/4 = 16 - 8.5 = 7.5
B. In this case, we would expect your corrected test score to be about 7.5 but, if you were really fully guessing at all items, we would expect your score to reduce to 0. Thus, clearly on average, the correction for guessing formula makes assumptions that would UNDERCORRECT YOUR SCORE compared to what it assumes to be happening in the case (like the scenario paints) that you either fully know it or don't ... with no in-between ground possible. That is, the score you actually get will look better than what the correction for guessing formula would assume if you don't have full knowledge on each and every item. In one sense, this is good in that the formula does not make matters worse.
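The arithmetic in A and B can be checked with a few lines of Python. This is just a sketch of the scenario above (30 items with no idea, 20 items narrowed to two choices, 5 options each); the variable names are mine.

```python
def corrected_score(n_right, n_wrong, n_options):
    """Correction for guessing: right minus wrong / (options - 1)."""
    return n_right - n_wrong / (n_options - 1)


n_options = 5
blind_items = 30      # no idea at all: chance is 1 in 5
narrowed_items = 20   # cut down to two choices: chance is 1 in 2

expected_right = blind_items / n_options + narrowed_items / 2    # 6 + 10 = 16
expected_wrong = (blind_items + narrowed_items) - expected_right  # 24 + 10 = 34
partial_knowledge = corrected_score(expected_right, expected_wrong, n_options)

# Compare with blind guessing on all 50 items.
guess_right = 50 / n_options                                      # 10
pure_guessing = corrected_score(guess_right, 50 - guess_right, n_options)

print(partial_knowledge)  # 7.5
print(pure_guessing)      # 0.0
```

The 7.5 versus 0 gap is the undercorrection: the formula only wipes out the score completely when every miss came from a blind guess among all 5 options.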
But, of course, if there are no omits and the r between # right and corrected score is 1 (the corrected score is then just a linear function of # right), then what is the real reason for or benefit from using the correction formula? While the corrected scores will for sure be lower across the class of examinees, the rank order of examinees will not change. So ... ??? This is a good question and, about the only "excuses" for using the correction formula are: a) to get your score closer to the "truth" and b) to discourage random guessing. After all, guessing adds to the "measurement error" and, that is not good for such things like reliability.
Normally, the instructions will inform examinees that they should not engage in complete random guessing on an item. That is, if they have no idea whatsoever about what the right answer is ... THEN OMIT THE ITEM. However, if they are really able to eliminate one or more options as being incorrect, then go ahead and make a response since on average, you will be ahead of the game. Such a direction is only fair. But, does it really work?
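To see why the rank order cannot change when nobody omits, here is a short sketch. The raw scores are made up; the 50 item, 5 option test is the one from the example above. With no omits, # wrong = # items - # right, so the corrected score is a straight-line function of # right.

```python
def corrected_no_omits(n_right, n_items, n_options):
    """Formula score when every item is answered (wrong = items - right)."""
    n_wrong = n_items - n_right
    return n_right - n_wrong / (n_options - 1)


raw_scores = [12, 25, 31, 40, 47]   # hypothetical # right for five examinees
corrected = [corrected_no_omits(r, 50, 5) for r in raw_scores]

for r, c in zip(raw_scores, corrected):
    # For 50 items and 5 options the line is: corrected = 1.25 * right - 12.5
    print(r, c, 1.25 * r - 12.5)
```

Every corrected score is lower than the raw score, but since the relation is a line with positive slope, the examinees stay in exactly the same order.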
Frary, Bob (1988). Formula scoring of multiple-choice tests (correction for guessing). No. 3 in the series; Instructional Topics in Educational Measurement, B. S. Plake, Editor. Educational Measurement: Issues and Practices, 7(2), 33-38.
Iteman does traditional classical test theory item analysis ... and works from a text .dat file that you create in a word processor or text editor. If you have large sets of data .... you can scan answer sheets with code information for examinees and then go from there. What I did for the demo was to handload a small data set from a course I teach ... it had 18 students and there were 30 four option MC items. The way you set up the data file looks as follows:
30 8 9 2 <---- Control line ... #items, etc.
141131113241231311424331324133 <---- Key
444444444444444444444444444444 <---- How many alternatives
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy <---- Item inclusion for scale key
01141431113243232313224333324233 <--- First 2 digits examinee code ... others item responses
02141131113242243311424331322234
03141131123241242311424331322132
04121131213441232311424331322323
05141131113243231311424331324333
06131132113411232313424332124233
07131131413241231311424331324123
08141131213243211311424111324123
09141132213244232311124331324133
10131131113241223312424331324134
11131132113241232211324331324133
12141131113244243313424331312134
13141131213241233311324331324133
14121432213244234313424341324333
15121131113241233311424331324333
16141131113241231311424331324133
17141132113241233311424331324133
18131131113241232311424331324333

After creating the .dat file ... you will run ITEMAN and store the results of the item analysis in an .out ... or output file. Here is the output for a few items ...
MicroCAT (tm) Testing System
Copyright (c) 1982, 1984, 1986, 1988 by Assessment Systems Corporation
Item and Test Analysis Program -- ITEMAN (tm) Version 3.00
Item analysis for data from file c:\550\406.dat Page 1
                   Item Statistics              Alternative Statistics
              -----------------------   -----------------------------------
Seq.  Scale   Prop.            Point          Prop.              Point
No.   -Item   Correct  Biser.  Biser.   Alt.  Endorsing  Biser.  Biser.  Key
----  -----   -------  ------  ------   ----- ---------  ------  ------  ---
  1    0-1     1.000   -9.000  -9.000     1     1.000    -9.000  -9.000   *
                                          2     0.000    -9.000  -9.000
                                          3     0.000    -9.000  -9.000
                                          4     0.000    -9.000  -9.000
                                        Other   0.000    -9.000  -9.000
Item 1 was a bust ... everyone got it right ... so things like point biserials are screwed up!
  2    0-2     0.556    0.311   0.247     1     0.000    -9.000  -9.000
                                          2     0.167    -0.415  -0.278
                                          3     0.278    -0.057  -0.043
                                          4     0.556     0.311   0.247   *
                                        Other   0.000    -9.000  -9.000
Here is what is in the output. Seq No is the chronological order of the item on the test. Scale-item would give subscale order ... 0-1, 0-2, 2-1, 2-2, etc .... if there are multiple subscales. P value (item difficulty) is next ... followed by Biserial and Point Biserial values (item discrimination). Then you get item option information ... which option, how many selected each (17% opted for choice 2), and the point biserials ASSUMING each choice were correct: for choice 4 (the correct choice), r = .25 ... some positive r between item score and total scores.
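For anyone who wants to see what is behind these numbers, here is a rough Python sketch that recomputes the proportion choosing each option, and an option-by-total point biserial, from the 18 records in the data file shown earlier. The function names are mine, and the point biserial here is a plain Pearson r between "picked this option" (0/1) and total score with the item left in the total; I am ASSUMING that is comparable to what ITEMAN does, so don't expect the r values to match the printout exactly.

```python
from statistics import mean, pstdev

KEY = "141131113241231311424331324133"
RECORDS = """01141431113243232313224333324233 02141131113242243311424331322234
03141131123241242311424331322132 04121131213441232311424331322323
05141131113243231311424331324333 06131132113411232313424332124233
07131131413241231311424331324123 08141131213243211311424111324123
09141132213244232311124331324133 10131131113241223312424331324134
11131132113241232211324331324133 12141131113244243313424331312134
13141131213241233311324331324133 14121432213244234313424341324333
15121131113241233311424331324333 16141131113241231311424331324133
17141132113241233311424331324133 18131131113241232311424331324333""".split()

# Drop the 2 digit examinee code; keep the 30 item responses.
responses = [rec[2:] for rec in RECORDS]
totals = [sum(a == k for a, k in zip(resp, KEY)) for resp in responses]


def option_proportions(item):
    """Proportion of examinees choosing each option on a 1-based item."""
    picks = [resp[item - 1] for resp in responses]
    return {opt: picks.count(opt) / len(picks) for opt in "1234"}


def point_biserial(item, option):
    """Pearson r between choosing `option` (0/1) and total score."""
    x = [int(resp[item - 1] == option) for resp in responses]
    sx, sy = pstdev(x), pstdev(totals)
    if sx == 0 or sy == 0:
        return None  # no variance, r is undefined (ITEMAN prints -9.000)
    mx, my = mean(x), mean(totals)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, totals)])
    return cov / (sx * sy)


print({opt: round(p, 3) for opt, p in option_proportions(2).items()})
print(point_biserial(1, "1"))   # None: everyone got item 1 right
print(point_biserial(2, "4"))   # discrimination for the keyed option on item 2
```

The option proportions for item 2 come out to .000, .167, .278 and .556, the same as the Prop. Endorsing column above.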
Scale Statistics
----------------
Scale: 0
-------
N of Items 30
N of Examinees 18
Mean 25.167
Variance 5.806
Std. Dev. 2.409
Skew 0.019
Kurtosis -0.705
Minimum 21.000
Maximum 30.000
Median 25.000
Alpha 0.483 <---- Don't tell my class about this pathetic reliability! Have to make the test harder!
SEM 1.733
Mean P 0.839
Mean Item-Tot. 0.295
Mean Biserial 0.459
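A quick check on one of these: the SEM in the printout is consistent with the usual formula, SEM = SD times the square root of (1 - reliability). The sketch below just plugs in the rounded SD and Alpha from the output, so it agrees with the printed 1.733 only up to rounding.

```python
import math

sd = 2.409      # Std. Dev. from the scale statistics
alpha = 0.483   # Alpha from the scale statistics

sem = sd * math.sqrt(1 - alpha)
print(round(sem, 2))  # 1.73
```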
ITEMAN will allow you to work with tests with right/wrong answers, like classroom tests AND will allow you to do analysis for measures like attitude scales (multipoint items). You can define with the item inclusion key above several subscales ... and then the scale statistics at the bottom of the output file will have these separated by subscale ... and will show you the intercorrelations of the scale scores. This is a nice program and is easy to use. Again, for more info .. follow the link above to Assessment Systems.
At VPI, Bob then solicited instructors who either used or would be willing to use NOTA on some of the items on their tests. In addition, when the NOTA options were used for items, he made sure that the NOTA option WAS ACTUALLY THE CORRECT CHOICE about the same proportion of times that one would expect just by chance alone. In this context, he found 20 tests using NOTA from 10 instructors ... and these represented about 1000 items ... split about 725 to 275 for non NOTA items and NOTA items respectively.
For the analysis, Bob first compared the difficulties and item discriminations for the batch of non NOTA items versus the NOTA items. p values were slightly higher for non NOTA items ... .66 versus .61. There were no differences in item discrimination values. Then, he compared within the NOTA items, the difficulties and discrimination values for NOTA items when it was NOT the correct choice, to those NOTA items where it was the correct choice. Here, the NOTA items where it was the correct choice turned out to be slightly more difficult than the NOTA items where it was not the correct choice (.58 versus .61). Again, the item discrimination values were not different.
Thus, from a purely empirical point of view, it appears (if you buy his results!) that the presence of NOTA items actually makes the tests a tad more difficult ... whether this really translates into more discriminating items (which is more important really) is still up for debate. Bob and I have debated this back and forth ... and regardless of one's views ... his really is about the best empirical study around on this issue.
So, the first thing I did (1st in a 3 stage study) was to ask students and faculty what they thought a trick test item was ... and also a few other related items like: did they tend to appear on hard or easy tests, did they tend to be items like MC or essay, and did these items get on tests by accident or were they written deliberately. I also asked them to try their hand at constructing an item they thought was tricky but ... that was a bust! I made up a simple front and back of a page survey form, handed it out (NO RANDOM SAMPLING PLAN HERE!) to 174 students and 41 faculty, and eagerly awaited the results. What did I find? First, most felt that selection type items like MC and TF were the places where one would most likely see trick items. Second, they also felt that trick items tended to appear on hard tests (sure .. that's what makes them tricky, right?). And third, they also almost overwhelmingly felt that trick items that got on tests were placed there DELIBERATELY ... and did not just drop down from the sky by accident.
Most important in this first phase was the compilation of the open ended responses to what they considered to be the important components of a trick item: I had them define them and then I sifted
Category             Ab Ratings   Paired Comp
---------------------------------------------
Intention               5.53         4.02
Trivial content         4.11         2.33
2 Fine Disc             4.87         3.09
Noise in stems          4.30         2.30
Mulp Corr Answers       5.31         3.72
Opp principle           4.16         2.40
Ambiguous               4.96         3.02
---------------------------------------------
Hey .... don't mind the wiggly cols of numbers! I just threw you a curve!

The two methods showed almost identical orderings of the 7 categories ... and most interestingly, the intention one was on top in both cases. None of the categories to my view were considered to be unimportant to the definition of trick questions.
Now, the real TRICKY part of this study was phase 3 where we tried to construct some trick and non trick items, and then test whether students could actually correctly sort them into the appropriate piles. First, we picked intro stat .... and then constructed some items that would have covered material that all students would have been exposed to in their courses. We started by making up 20 trick and 20 non trick items, and then narrowed these down to 35 ... and then picked at random 25 to make up the final form. As it turned out, 13 were trick items and 12 were non trick, and these were mixed up on a single test form. Then, we gave the form to over 100 students. It is important to note that students were NOT asked to work the problems ... but rather were to indicate on a scale of 1 to 4 ... their feeling about whether it was a trick item or not: (4=def think it is a trick item, ..... 1=def think it is NOT a trick item).
For scoring, we had a predetermined key as to which items were trick and non trick. Then, we would score each person's responses as: if they put 4 or 3 AND it was a trick item, they got it correct and a value of 1 or if they put 2 or 1 and it was NOT a trick item, they also got a score of 1. The other response matches were given 0 and considered to be incorrect.
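The scoring rule is simple enough to sketch in a few lines. The function name and the five sample ratings below are made up for illustration; only the rule itself (3 or 4 counts as "trick", 1 or 2 counts as "not trick", 1 point per match with the key) comes from the description above.

```python
def sort_score(ratings, is_trick):
    """Number of items a student sorted correctly.

    ratings: 1-4 ratings, one per item (3 or 4 means "I think it's a trick").
    is_trick: the predetermined key, True for trick items.
    """
    return sum(int((rating >= 3) == trick) for rating, trick in zip(ratings, is_trick))


# Hypothetical student on a five item form.
key = [True, False, False, True, True]
ratings = [4, 3, 2, 1, 4]

print(sort_score(ratings, key))  # 3 of 5 sorted correctly
```

On the real 25 item form, a student flipping a coin on every item would be expected to score 12.5, which is the chance value used in the analysis.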
The first thing we did was to test the hypothesis that Ss were able to correctly sort items to trick or non trick using a null value of 12.5 (chance in this situation). They were ... garnering a score of 14.6. They did do better than 12.5, significantly, BUT .... considering that 25 is the max score, NOT much better by some absolute criterion. We also looked at whether they had a more or less difficult time sorting trick (out of 13) or non trick (out of 12) items. They correctly sorted 5.4 out of 13 for trick, and 9.2 out of 12, for non trick. They could significantly get non trick items sorted better than trick items. Finally, since the intentionality of the item was considered to be the most important component in defining a trick item, we sorted the trick items into those that appeared to have some deliberate intentionality factor versus those that did not. Our hypothesis was that since the intention was in the eye of the test item constructor, it would be harder for a S to find them, compared to other more obviously tricky items. However, there was no significant difference in the correct sort rates for these 2 types of trick items.
Anyone interested in more detail about this study should check out: Roberts, DM (1993). An empirical study on the nature of trick test questions, Journal of Educational Measurement, V30, #4, pp331-344.
TQ allows you to work in 2 broad areas: item maintenance, and printing tests/items. It can work easily with recognition types of items ... and allows the input of essay types ... and helps you form an item pool where you can select subsets of items for a test. Item entry is easy and the printed copy of tests is good.
For item entry, you are given a screen that allows 10 or so lines of input for the stem ... word wrap is in operation ... and there are some minimal graphing capabilities. First you type the stem. Then, when that is ok, you hit ESC and go to the first alternative. Concurrently ... you see the stem while you are typing each alternative. After typing the alternative, you will have to indicate if it is a fixed choice or moving type of alternative ... fixed means it will be printed in that EXACT order ... moving means it could be randomly put in any spot. So, you will categorize the options as fixed incorrect or fixed correct, or moving correct or moving incorrect. After you finish the complete item, you go on to the next ... or exit.
When you want to print a test, you will go into a test creation screen where you will indicate which topics and items you want to use: ta 1,2,4,5 tb 3,4,6,7, ... etc. ... and keep specifying items until your test is complete. After this is done, you will be able to add headers, and other information about the test. Also, you can get an answer sheet printed out for their filling in of the answers, if you want. After one test is printed, you will be prompted if you want another form ... and if you answer yes, whether to randomize the answers. For those items that are "moving" items, it will rearrange the alternatives for the item ... though the sequence of the items will stay the same. Overall, this is an easy program to use ... has reasonable flexibility, and makes a nicely formatted printed test. I use it in my stat courses and it works fine.
TestQuest is ONLY an item banking program ... it does no analysis nor does it allow the inputting of analysis information (like item difficulty) which you can later use to make item selections. But, I don't see that as a problem. Anyone interested might contact Chris Mayer at the above phone/address ... and ask for more info. I have written a letter recently and will post an update here if I hear of any additional information about price, availability, support, etc.
THE FACTOR ANALYSIS BASIC TOUR
FACTOR PATTERNS IN CORRELATIONAL DATA
Note: Assume that r's of .3 or .4 = variables go together
          Matrix 1                          Matrix 2
     1   2   3   4   5   6             1   2   3   4   5   6
 1   -  72  62  58  74  82         1   -  17  56  72  21  03
 2       -  53  62  47  58         2       -  14  03  53  64
 3           -  63  71  58         3           -  63  01  09
 4               -  53  54         4               -  17  09
 5                   -  68         5                   -  72
 6                       -         6                       -

          Matrix 3                          Matrix 4
     1   2   3   4   5   6             1   2   3   4   5   6
 1   -  03  19  02  11  07         1   -  03  72  01  19  15
 2       -  53  63  16  01         2       -  17  03  14  59
 3           -  51  08  20         3           -  20  08  13
 4               -  22  08         4               -  68  09
 5                   -  17         5                   -  17
 6                       -         6                       -

          Matrix 5
     1   2   3   4   5   6
 1   -  03  07  02  13  09
 2       -  21  17  09  17
 3           -  03  14  07
 4               -  02  09
 5                   -  16
 6                       -
So, here are some data from 4 variables with n=10
MTB > prin c16-c19 <--- Cols in MTB where data are
ROW M1 M2 M3 M4
1 36 41 45 45
2 52 58 57 41
3 51 51 48 65
4 47 46 40 63
5 56 49 48 36
6 66 72 65 56
7 48 48 44 43
8 61 55 50 70
9 50 39 50 55
10 33 40 37 46
MTB > corr c16-c19
M1 M2 M3
M2 0.805
M3 0.782 0.843
M4 0.329 0.178 0.055
If you examine the intercorrelations, M1-M3 correlate highly with each other but, M4 seems not to correlate with M1-M3.
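For those without Minitab, the same correlations can be reproduced with plain Python. This is only a sketch; the pearson helper is written out by hand so nothing beyond the standard library is needed.

```python
import math

M1 = [36, 52, 51, 47, 56, 66, 48, 61, 50, 33]
M2 = [41, 58, 51, 46, 49, 72, 48, 55, 39, 40]
M3 = [45, 57, 48, 40, 48, 65, 44, 50, 50, 37]
M4 = [45, 41, 65, 63, 36, 56, 43, 70, 55, 46]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


variables = {"M1": M1, "M2": M2, "M3": M3, "M4": M4}
names = list(variables)

# Print the lower triangle, laid out like the Minitab corr output.
for i, row in enumerate(names[1:], start=1):
    cells = [f"{pearson(variables[row], variables[col]):6.3f}" for col in names[:i]]
    print(row, " ".join(cells))
```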
MTB > let c20=(.8*c16)+(.7*c17)+(.8*c18)
MTB > prin c16-c20
ROW M1 M2 M3 M4 A
1 36 41 45 45 93.5
2 52 58 57 41 127.8
3 51 51 48 65 114.9
4 47 46 40 63 101.8
5 56 49 48 36 117.5
6 66 72 65 56 155.2
7 48 48 44 43 107.2
8 61 55 50 70 127.3
9 50 39 50 55 107.3
10 33 40 37 46 84.0
MTB > corr c16-c20
M1 M2 M3 M4
M2 0.805
M3 0.782 0.843
M4 0.329 0.178 0.055
A 0.933 0.942 0.928 0.211
Well, the way we defined Factor A and the resulting r of the factor A scores and M1-M4 shows that our definition worked well for M1-M3 but, didn't do much for predicting or explaining M4. Residuals give you some idea about how good the "fit" was (Factor A rule and factor scores TO the data ... M1-M4). But, to see those, you have to do regressions and then examine the residuals. The A factor scores variable is used to predict M1-M4 ... then residuals are examined.
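Here is the Factor A step sketched in Python: build the factor scores with the .8, .7, .8 rule from the Minitab "let" command, then correlate them with M1-M4. Helper names are mine.

```python
import math

M1 = [36, 52, 51, 47, 56, 66, 48, 61, 50, 33]
M2 = [41, 58, 51, 46, 49, 72, 48, 55, 39, 40]
M3 = [45, 57, 48, 40, 48, 65, 44, 50, 50, 37]
M4 = [45, 41, 65, 63, 36, 56, 43, 70, 55, 46]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Factor A rule: a weighted sum of M1-M3 (M4 is left out on purpose).
factor_a = [0.8 * a + 0.7 * b + 0.8 * c for a, b, c in zip(M1, M2, M3)]
print([round(score, 1) for score in factor_a])

for name, values in (("M1", M1), ("M2", M2), ("M3", M3), ("M4", M4)):
    print(name, round(pearson(factor_a, values), 3))
```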
MTB > regr c16 1 c20;
SUBC> resi c21. <---- leftover M1 after Fact A scores removed
MTB > regr c17 1 c20;
SUBC> resi c22. <--- leftover M2 after Fact A scores removed
MTB > regr c18 1 c20;
SUBC> resi c23. <--- leftover M3 after Fact A scores removed
MTB > regr c19 1 c20;
SUBC> resi c24. <---- leftover M4 after Fact A scores removed
MTB > name c21='Er1A' c22='Er2A' c23='Er3A' c24='Er4A'
MTB > prin c21-c24
ROW Er1A Er2A Er3A Er4A
1 -4.56888 0.52419 4.11021 -4.5705
2 -4.62285 1.48202 3.32608 -12.7061
3 0.41494 0.51537 -0.86589 12.8493
4 2.54634 1.64227 -3.98333 12.4288
5 4.19802 -2.70065 -1.83495 -16.4642
6 -3.44730 2.66698 1.11369 -1.0098
7 1.01890 1.11668 -1.99599 -8.2223
8 4.61118 -1.28413 -3.48756 16.3542
9 2.97209 -7.93010 3.96674 3.7656
10 -3.12245 3.96736 -0.34900 -2.4250
Above is called the first residual matrix. Note that the errors are relatively smallER using factor A scores to predict M1-M3 but, the errors are much larger for M4. This means that when we applied the Factor A rule, not that much was left over (residuals) when partialing it from M1-M3 but, substantial residuals were there when partialing it from M4. BACK TO THE DRAWING BOARD FOR M4!! We now need to start the cycle over again but, remember, we now have a first residual matrix so ... this is the point at which we need to work. Taking Factor A out of the original M1-M4 data matrix left us with the first set of residuals: how much of THAT can we now explain or eliminate by now defining a second Factor B?
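The residual step is just a simple regression of each variable on the Factor A scores, keeping what is left over. A sketch in Python (helper names mine), which also shows how much bigger the leftovers are for M4:

```python
M1 = [36, 52, 51, 47, 56, 66, 48, 61, 50, 33]
M2 = [41, 58, 51, 46, 49, 72, 48, 55, 39, 40]
M3 = [45, 57, 48, 40, 48, 65, 44, 50, 50, 37]
M4 = [45, 41, 65, 63, 36, 56, 43, 70, 55, 46]

factor_a = [0.8 * a + 0.7 * b + 0.8 * c for a, b, c in zip(M1, M2, M3)]


def residuals(y, x):
    """Residuals from the least squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


first_resid = {
    "Er1A": residuals(M1, factor_a),
    "Er2A": residuals(M2, factor_a),
    "Er3A": residuals(M3, factor_a),
    "Er4A": residuals(M4, factor_a),
}

for name, errors in first_resid.items():
    leftover = sum(e ** 2 for e in errors)   # sum of squared residuals
    print(name, round(errors[0], 5), round(leftover, 1))
```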
Working with the first residual matrix, let's define Factor B as: .7 * Er4A
MTB > let c25=.7*c24
Now we use new FACTOR scores to predict first residual matrix
MTB > regr c21 1 c25;
SUBC> resi c26. <--- leftover for resid M1 after Fact B taken out
MTB > regr c22 1 c25;
SUBC> resi c27. <--- leftover for resid M2 after Fact B removed
MTB > regr c23 1 c25;
SUBC> resi c28. <-- leftover for resid M3 after Fact B removed
MTB > regr c24 1 c25;
SUBC> resi c29. <--- leftover for resid M4 after Fact B taken out
MTB > name c26='Er1B' c27='Er2B' c28='Er3B' c29='Er4B'
How do the factor scores from the Factor B rule relate to the first residual matrix?
MTB > corr c21-c25
Er1A Er2A Er3A Er4A
Er2A -0.601
Er3A -0.624 -0.249
Er4A 0.373 -0.065 -0.388
C25 0.373 -0.065 -0.388 1.000
Note that our Factor B rule and factor scores don't correlate too well with residuals on M1-M3 but perfectly (and this clearly is not usually the case) with the residuals on M4. What about the second residual matrix?
MTB > prin c26-c29
ROW Er1B Er2B Er3B Er4B
1 -4.01530 0.43479 3.63486 0
2 -3.08386 1.23347 2.00458 0
3 -1.14138 0.76672 0.47050 0
4 1.04095 1.88539 -2.69066 0
5 6.19220 -3.02272 -3.54732 0
6 -3.32499 2.64723 1.00866 0
7 2.01480 0.95584 -2.85115 0
8 2.63033 -0.96422 -1.78664 0
9 2.51599 -7.85643 4.35839 0
10 -2.82872 3.91993 -0.60121 0
So, how do the Factor A scores correlate with the variables, M1-M4? On the second page, you saw how A correlated with M1-M4. On the page above, you see how Factor B scores correlate with the first residual matrix. We can put this information in a table.
Fac A Fac B
M1 .933 .373
M2 .942 -.065
M3 .928 -.388
M4 .211 1.00
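Putting the two cycles together, here is a sketch that rebuilds the little Fac A / Fac B table from the raw data. It follows the same hand-made rules used above (Factor A = .8, .7, .8 on M1-M3; Factor B = .7 times the M4 residual); this is the "logical" procedure from this page, NOT a real principal components or factor analysis routine.

```python
import math

DATA = {
    "M1": [36, 52, 51, 47, 56, 66, 48, 61, 50, 33],
    "M2": [41, 58, 51, 46, 49, 72, 48, 55, 39, 40],
    "M3": [45, 57, 48, 40, 48, 65, 44, 50, 50, 37],
    "M4": [45, 41, 65, 63, 36, 56, 43, 70, 55, 46],
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals from the least squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Cycle 1: Factor A scores, then the first residual matrix.
factor_a = [0.8 * a + 0.7 * b + 0.8 * c
            for a, b, c in zip(DATA["M1"], DATA["M2"], DATA["M3"])]
first_resid = {name: residuals(values, factor_a) for name, values in DATA.items()}

# Cycle 2: Factor B is defined on the M4 residuals.
factor_b = [0.7 * e for e in first_resid["M4"]]

# "Loadings": Factor A with the raw variables, Factor B with the first residuals.
loadings = {
    name: (pearson(factor_a, DATA[name]), pearson(factor_b, first_resid[name]))
    for name in DATA
}

print("     Fac A   Fac B")
for name, (load_a, load_b) in loadings.items():
    print(f"{name}  {load_a:6.3f}  {load_b:6.3f}")
```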
Anyone who wants to input the data from above for the variables M1-M4 ... and subject them to a principal components and/or factor analysis using something like Minitab will see the same GENERAL pattern come out ... though obviously the exact values for loadings, etc. will NOT be the same. All this goes to show that with a small intercorrelation matrix AND a fairly clear pattern ... one could do a logical FACTOR ANALYSIS and have it be close to the "real thing" ... but try that with a 60 by 60 matrix for example where there is NO way to look at the correlations and figure out what is happening.