ITEM WRITING RULES: added March 19

We have completed our discussion of general item writing rules, primarily for paper and pencil type tests, and here is a list of some and some brief comments. If you have any ideas to add or comments, send me a note at: Instructor Guy Roberts
GENERAL RULES
I first like to put items into 2 large pots: recognition and supply. A recognition item is one where the answer options are given ... then you select one; or the supply type where you have to give the answer ... write something out, etc. Some authors prefer to call these fixed choice or complex type ... you pay your money and take your choice! So, given that ... here are some things that seem to apply across the board, no matter what the specific type of test is.

  1. Item should test some important learning objective: bottom line, don't waste space testing trivial stuff
  2. Problem should be clearly formulated
  3. The simpler the language, the better: don't try to be a word wizard!
  4. Avoid extraneous information ... ie, get to the point!
  5. Avoid textbook quotes: certainly, we can do better than this!
  6. Avoid opinion items: when you are testing for knowledge, that is
  7. Use good grammar: Ain't a bad idea!
  8. Items should be independent of other items: if one item gives clues to another item or DEPENDS on information from another item ... then score on test is confusing
  9. Use appropriate type of item format for what you are testing
  10. MOST IMPORTANT ... Let someone else review items prior to use ... I can't stress this too much!

    MULTIPLE CHOICE
    We need to remember that a MC item has a stem ... where you set up the problem ... and then options, including both the correct choice and the distractors. In fact, all the recognition items are multiple choice ... the distinction then becomes one of HOW many choices are provided for you to select from.

  11. Put as much of the item in the stem as possible: Reason? Cuts down on item reading time
  12. Put item in positive terms ... avoid negative phrases
  13. But, if you need to put in a negative, use EMPHASIS like underline or bold
  14. Answers need to be CORRECT ... or CLEARLY THE BEST
  15. Distractors should be plausible ... not just fillers that no one pays any attention to
  16. Lengths of options should NOT give a clue to right choice ...
  17. POSITION of correct choice should NOT give a clue to right answer ... you can double check this after finishing the test ... and make adjustments if necessary
  18. Use "all of the above" and "none of the above" carefully.

    Note: My buddy Bob Frary at VPI ... and his friend Miguel Garcia-Perez .. have done some excellent work related to the none of the above (NOTA) option. Their position is that the presence of NOTA in items tends to make them somewhat more difficult and, because of that, potentially more discriminating. One paper is: Frary (1991). The NOTA option: an empirical study, Applied Measurement in Education, 4(2), 115-124. Anyone interested in this topic ... I would encourage you to have a look at this paper (OR SEE THE EXPANDED VERSION DOWN BELOW) ... or if you have other questions, send a note to Bob ... Bob the NOTA fellow
  19. Avoid complex item types like using 3 options ... then making more options like: A and B, B and C only ... etc. Too confusing and does not really accomplish much
    BINARY TYPE (alias True False)
  20. Only include ONE idea per item
  21. Keep items short
  22. Write so item is unequivocally true or false ... NO partly true and partly false items!
  23. Avoid negative wordings at ALL costs!
  24. Avoid specific determiners like ALL, or ALWAYS, or NEVER ... in TF items
    MATCHING
    Matching items are perhaps underutilized ... Keep in mind that a matching item has 2 parts: the stimulus set part and the response set part.

  25. Use homogeneous material in both sets ... ie, don't try to make 5 different kinds of items in one
  26. Make relatively short items ... but MORE ITEMS (ie, don't have 1 page with 25 in the stimulus and 30 in the response set ... totally unmanageable!)
  27. Have an unequal number in the two sets .... DON'T HAVE EQUAL AND THEN SAY ... EACH STIMULUS GOES WITH 1 AND ONLY 1 RESPONSE! Why? Well, say you have 5 on each side ... and you know only 4 ... you get the 5th free as a bonus ... or say you miss 1, you necessarily miss 2 ... so score is not very clear in this case. Better to say have 4 on stimulus side ... and 6 on matching side .. and say that any response may be used 0 times or more than once. KEEPS THE student THINKING .. !
  28. Use some logical ordering (alphabetical for example) for both sets
  29. Be clear on the BASIS that the match is to be made ...
    SHORT ANSWER OR COMPLETION
  30. Silly to say but, ONLY short answers to be required! (Kind of obvious right?)
  31. Make sure that only 1 answer is acceptable
  32. For completion, only key words or phrases should be where the blanks are ... would be rather silly to make a blank for an AN or a THE .... wouldn't it? (Of course, there are exceptions ... I can see a fill in the blank in an English class where an AN or THE would be THE key word)
  33. Make blanks the same lengths ... and put near end. Reasons? We don't want length of blank to give a clue ... and putting near end makes student able to do items faster.
    EXTENDED RESPONSE OR ESSAY
    These items are the ones where you force the examinee to write a paragraph or 2 ... or more, in response to some general (or itemized list of subparts) question.

  34. Restrict these to higher level objectives ... ie, don't use essay to test facts!
  35. Make sure the task is clearly defined. I prefer to have a stem part ... which sets the tone .. then 2 or 3 specific items .... so the examinee knows exactly what he/she has to respond to
  36. Provide adequate time for responding ... and perhaps suggest times for each item and/or give point values so examinee can use his/her time to their best advantage
  37. DO NOT GIVE A CHOICE AMONGST THE ITEMS ... like answer any 4 of the 7. This is really very bad measurement practice and ... is usually in response to the previous caution ... that the test is too long for the time given. It is bad practice for several reasons: as many combinations of 4 out of 7 that there are, that is how many different tests you will have, and some combinations are definitely harder than others. Also, there will for sure be some examinees who can ONLY ANSWER 4 .... and breathe a sigh of relief .... while others can do all 7 and you do NOT allow them to show the extent of their knowledge. Under such methods, test scores are higher on average and show LESS variance (not good for reliability) across examinees.
  38. Really, essays are good for seeing if the examinee can WRITE and ORGANIZE thoughts ... and should be restricted to those kinds of objectives
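To put a number on rule 37, here is a quick count of how many different "tests" an "answer any 4 of the 7" instruction creates. This is only a sketch using Python's standard library; the variable names are mine.

```python
from itertools import combinations
from math import comb

n_items, n_required = 7, 4

# Number of distinct subsets of 4 items chosen from 7.
n_forms = comb(n_items, n_required)

# Enumerate them: each subset is, in effect, a different test form.
forms = list(combinations(range(1, n_items + 1), n_required))

print(n_forms)     # 35
print(forms[:3])   # [(1, 2, 3, 4), (1, 2, 3, 5), (1, 2, 3, 6)]
```

So with 7 essay items and a free choice of 4, examinees may be sitting 35 different tests, and there is no guarantee those forms are equally hard.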
    Scoring and other problems with Extended Response Items
  39. Use point allocation method if possible ... creating model good and bad answers is a good idea
  40. Score ONE item for ALL examinees .... then go to the second, third, etc. This helps for better consistency in scoring
  41. Keep examinee unknown to you if possible ... not easy to do but means bias is less likely to creep in
  42. Have more than one scorer if possible ... again not easy but a good idea
  43. Essays are EASY to construct but HARD to score
  44. The comparable thing to guessing on selection items is BLUFFING
  45. Essays restrict content sampling ... only so much you can cover on 1 test
  46. Scores on essays are influenced by factors (neatness of writing, etc.) OTHER than content of response
  47. Essays can actually promote POOR WRITING SKILLS since the time limits are usually not adequate for making a thoughtful response ... thus, hurry up and write it fast sets in
    OTHER ASSESSMENT STRATEGIES
    We went through some other methods for making assessments ... and quickly gave some plusses and minuses.

  48. Open book exams: many times, examiner makes this part more difficult
  49. Oral exams: problem of test time ... to cover all examinees, and consistency of questions asked
  50. Take home tests: who does the work?
  51. Retests: great idea for attempting to assess "learning"
  52. Collaborative testing: like group projects where one score goes to all doing the project
  53. Journals: nice to have students keep chronology of what they are doing
  54. Portfolios: nice to have students keep records of their products ... too bad we don't practice that more with college students ... would help them when looking for a job!
  55. Performance tests: if you want to see if they can change a tire, watch them change a tire ... really this is best general method for seeing if examinee knows his/her stuff but ... complicated to pull off in classroom settings

ALTERNATIVE WAYS TO GIVE MC TESTS

On Monday, we talked about several ways to give/take/score MC tests beyond the standard conventional way ... which is to select the best answer and you get a point if your answer matches the key. Here is a brief description of the alternative methods.

Directions Same/Scoring Different
Here we have the method of assigning differential weights to the alternatives depending on expert opinion as to which is second best, third best, etc. For example, the correct choice gets a value of 1, second best a value of .8, etc. The directions to the examinee are the same ... select the choice you think is best BUT, you tell them that there are different scores possible for the item depending on whether they select the correct, next best, etc. This method tends to reward examinees for partial knowledge ... better than conventional scoring (either 1 or 0).
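A minimal sketch of differential weighting next to conventional scoring. The option letters and the weights other than 1 and .8 are made up for illustration; in practice they would come from expert judgment.

```python
# Hypothetical expert weights for one 4-option item: B is the keyed answer,
# C is judged second best. The values 0.3 and 0.0 are illustrative only.
option_weights = {"A": 0.0, "B": 1.0, "C": 0.8, "D": 0.3}

def weighted_item_score(choice, weights):
    """Return the expert-assigned weight of the selected option (0 if omitted)."""
    return weights.get(choice, 0.0)

def conventional_item_score(choice, key):
    """Conventional scoring: 1 if the choice matches the key, else 0."""
    return 1 if choice == key else 0

for choice in ["B", "C", "A"]:
    print(choice,
          conventional_item_score(choice, "B"),
          weighted_item_score(choice, option_weights))
```

The examinee who picks the second-best option gets 0 under conventional scoring but .8 here, which is where the credit for partial knowledge comes from.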

Directions Different/Scoring Different
There are several that fit into this category.

Eliminate incorrect options: With this method, the examinees are told to mark through all the options that they think are INCORRECT ... being careful not to also mark through the correct answer. For example, say that a person knows for sure the correct choice ... then he/she marks (if there are 5 options for example) the 4 incorrect choices .... and leaves standing the correct choice. On this item, the score is 4. If one clearly knows that choices A and C are not correct but is not sure about the others ... then he/she marks through those 2 ... and leaves the others standing alone. Assuming that the correct choice is not A or C ... his/her item score is 2. On a 20 item test with 5 options for each item, the max score possible is 4 times 20 = 80 ... and theoretically the min is 0 times 20 = 0 ... assuming that the examinee marked through the correct choice (but this is not likely) on each item. Method tends to spread the scores out more and allow better accounting for partial knowledge.
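The scoring just described can be sketched as follows. One assumption on my part: an item where the examinee marks through the keyed option is scored 0, which is how I read the "theoretical minimum of 0" above. Other penalties are possible.

```python
def elimination_item_score(crossed_out, key):
    """Score one item: one point per incorrect option marked through.

    Assumption: marking through the keyed option scores 0 for the item.
    """
    crossed = set(crossed_out)
    if key in crossed:
        return 0
    return len(crossed)

# A 5-option item (A-E) keyed "E".
print(elimination_item_score("ABCD", "E"))   # 4: knows the answer for sure
print(elimination_item_score("AC", "E"))     # 2: partial knowledge
print(elimination_item_score("AE", "E"))     # 0: marked through the key

# Maximum on a 20-item, 5-option test.
print(20 * elimination_item_score("ABCD", "E"))   # 80
```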

Mark until correct: Another variant of the one above is to have the examinee do the opposite ... mark until they hit the correct answer. Now, if the person really really knows it for sure, they will get this on the first shot ... if not, it might take 2 or more stabs until reaching "bingo!". The method when proposed used a punch device that provided immediate feedback to the examinee. Here, LOW scores mean best ... since that represents fewer stabs to get to the correct answers. Again, tends to spread the variance out more ... and give more credit for partial knowledge.
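A small sketch of mark-until-correct scoring, where the item score is simply the number of stabs taken. The response data are invented for illustration.

```python
def attempts_until_correct(marks, key):
    """Return how many options were marked, up to and including the key."""
    for attempt, option in enumerate(marks, start=1):
        if option == key:
            return attempt
    raise ValueError("the examinee never marked the keyed option")

# Hypothetical examinee: options marked on each of three items, in order.
responses = [["B"], ["A", "D"], ["C", "A", "B"]]
keys = ["B", "D", "B"]

total = sum(attempts_until_correct(m, k) for m, k in zip(responses, keys))
print(total)   # 1 + 2 + 3 = 6, and LOWER is better
```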

Confidence or probabilistic testing: There are several varieties of this but, in essence, one is told to allocate confidence (or p's or money) to the choices in terms of the likelihood (to the examinee) of being correct. For example, let's say that a person really really knows the answer. Well, he/she would put all they have to bet on the correct choice and receive max credit for the item. If one can narrow the choices down, let's say to A and C (and assume that one of these is correct), then he/she might opt for putting 50% on each ... In this method, IF ONE DOES NOT KNOW ANYTHING AT ALL ... rather than guess, he/she is supposed to equally allocate to the choices (say 20% to each). Scoring is based on how much confidence or p or money has been allocated TO THE CORRECT ANSWER. In this method, how much you put on any other option is totally irrelevant. And, usually in this system, one attempts to discourage random guessing (when the examinee does not know) and therefore, penalizes severely the examinee who puts 0 bet on the correct choice. Thus, if one wants to take many risks and guesses, one quickly finds out that his/her score could be NEGATIVE ... and a total score in that area surely means a lot of guessing went on.
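Here is one way the idea could be sketched. The scoring function below is NOT a published rule: crediting the share placed on the key and charging a flat penalty of -1 for a zero bet are assumptions chosen only to show the mechanics described above. Actual confidence-testing schemes use their own scoring functions.

```python
def confidence_item_score(allocation, key, zero_penalty=-1.0):
    """Score one item from the share of confidence placed on the keyed option.

    allocation maps option -> proportion and must sum to 1.
    Assumption: credit equals the share on the key; a zero bet on the key
    earns the (hypothetical) penalty instead.
    """
    if abs(sum(allocation.values()) - 1.0) > 1e-9:
        raise ValueError("allocations must sum to 1")
    on_key = allocation.get(key, 0.0)
    return on_key if on_key > 0 else zero_penalty

key = "C"
sure = {"C": 1.0}
narrowed = {"A": 0.5, "C": 0.5}
no_idea = {opt: 0.2 for opt in "ABCDE"}
bad_gamble = {"A": 1.0}

for name, alloc in [("sure", sure), ("narrowed", narrowed),
                    ("no idea", no_idea), ("bad gamble", bad_gamble)]:
    print(name, confidence_item_score(alloc, key))
```

Only the amount placed on the keyed option matters, and the examinee who bets everything on a wrong option ends up worse off than the one who admits to knowing nothing.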

There is a lot of literature on this topic and a couple of not too recent references are the following: (1) Hambleton, Roberts and Traub (1970). A comparison of the reliability and validity of two methods for assessing partial knowledge on MC tests. Journal of Educational Measurement, 7, pp 75-82 ... and (2) Reid (1979). An investigation of factors underlying the utility of confidence testing procedures. Doctoral dissertation, Penn State University.

While to me the above variations on conventional MC testing are more interesting ... and almost to a tee, examinees who are given the chance to use the methods will tell you that they like them better and feel that they are getting a better shake at being fairly evaluated, the evidence for the increase in reliability and/or validity of these is rather dismal. One of the problems is that they tend to take MORE testing time ... and time is money AND ITEMS ... when testing. So, even if there were a little gain in say reliability ... the time needed for examinees to respond to some fixed number of items is longer ... and one could use conventional methods with MORE items in the same amount of testing time ... as a way to compensate. But, even considering that ... there still is not much evidence showing that these variations are better. But ... they ARE more interesting and worth trying out on examinees ... even if ONLY to show that there are more ways to handle MC tests.

CORRECTION FOR GUESSING FORMULA IS UP NEXT!

Have you ever wondered about what the correction for guessing formula is and, perhaps more important, what does it do? Well, here is a "brief" discussion about that topic.
ASSUMPTIONS
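We first need to go over the assumptions. As the discussion below treats them, there are three: (1) on each item you either know the answer fully or know nothing about it, (2) if you know the answer, you will select the keyed choice, and (3) if you don't know the answer, you guess at random among all of the options.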

HOW THE FORMULA IS DEVELOPED
Let's outline a scenario and then show how the formula is developed based on that scenario. Assume that you take a 50 item MC test, where there are 5 options for each item: 4 incorrect and 1 correct. You happen to get 38 correct .... and you missed 12.
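The usual reasoning behind the formula, applied to this scenario, can be sketched as below. It leans entirely on the assumption that every wrong answer came from a wild guess among all 5 options, so the 12 wrongs are 4/5 of the items guessed at. The function name is mine.

```python
def estimate_lucky_guesses(n_wrong, n_options):
    """Estimate how many correct answers were lucky wild guesses.

    Assumes every wrong answer came from a wild guess among all options,
    so wrongs are (n_options - 1)/n_options of all the items guessed at.
    """
    n_guessed = n_wrong * n_options / (n_options - 1)
    return n_guessed - n_wrong   # same as n_wrong / (n_options - 1)

n_right, n_wrong, n_options = 38, 12, 5

lucky = estimate_lucky_guesses(n_wrong, n_options)
corrected = n_right - lucky

print(lucky, corrected)   # 3.0 35.0
```

In words: 12 wrongs imply about 15 guessed items, of which about 3 came out right by luck, so the 38 is corrected down to 35.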

ARE THE ASSUMPTIONS CORRECT?
12. We know of course that the first assumption is NOT true. While some items you might know for sure, others you are not so sure even though you might have SOME information about the content of the item. Thus, if you had some knowledge, then you are likely able to eliminate one or more options as being incorrect ... and thus increase the odds that you will obtain the correct answer.

13. We also know that even if assumption 2 is correct, there is no guarantee that you will select the choice that is keyed as the correct choice. Why? That would assume that the items are so well constructed that the knowledgeable examinee can see through the item and find the correct answer. But, due to ambiguity or a number of other things, this may not happen. Now, this is not as major a threat to the violation of the assumptions as the first but, none-the-less, it is such that the second assumption is not completely true.

14. Because of the invalidity of the first assumption, the third assumption cannot be true also. If for example, you have some knowledge about the item, and can eliminate say 2 options of the 5 ... then you MAY guess among the remainder of 3 but ... that's a wild guess out of 3 ... not a wild guess out of 5. In essence, you have made a 3 choice item out of a 5 choicer and, clearly, you have increased the odds of your getting it correct even though you don't have full knowledge.

The bottom line is that the assumptions that are necessary to develop the correction for guessing formula and make it valid are not true, at least not totally.

HOW DO THE VIOLATIONS IMPACT ON THE CORRECTED SCORE?
Well, if you really had been fully guessing on all items, we would expect you to get 10 correct out of 50 items where there were 5 choices each time. So, you would get 10 correct and the correction formula would take 1/4 of the 40 wrongs and subtract from 10 and get you back to 0. After all, under this scenario ... since you know NOTHING .... your score should be ZERO! But, because of violation of assumption 1 primarily, there will be many items where you know more than nothing but less than everything. So, in those cases, you will eliminate some options you know are incorrect ... and gain a better chance of guessing among the remainder. So, just for illustration sake, assume that on these 50 items, about 20 are ones you know enough to eliminate 3 options. Now, for these 20, assuming 1 of the remaining 2 is correct and 1 is incorrect, if you guess at all 20 ... you should get about 10 correct and miss about 10. So, in this scenario ... we would have the following:

A. For the 30 that you had no idea of, given 5 options, we would expect that you would get about 1/5 correct or 6. From that ... you would miss about 24. But, from the 20 that you were able to turn into TF kinds of items, you would get about 10 correct and 10 wrong. So, on this test, the total number of wrong would be expected to be about 24 + 10 = 34. Now, putting this into a correction for guessing formula:

Corrected score = # correct - # wrong/(# options - 1) = 16 - 34/4 = 16 - 8.5 = 7.5

B. In this case, we would expect your corrected test score to be about 7.5 but, if you were really fully guessing at all items, we would expect your score to reduce to 0. Thus, clearly on average, the correction for guessing formula makes assumptions that would UNDERCORRECT YOUR SCORE compared to what it assumes to be happening in the case (like the scenario paints) that you either fully know it or don't ... with no in-between ground possible. That is, the score you actually get will look better than what the correction for guessing formula would assume if you don't have full knowledge on each and every item. In one sense, this is good in that the formula does not make matters worse.
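The two cases can be checked side by side with a few lines of code. This is just the arithmetic from the scenario above written out; the function and variable names are mine.

```python
def corrected_score(n_right, n_wrong, n_options):
    """Formula score: rights minus wrongs divided by (options - 1)."""
    return n_right - n_wrong / (n_options - 1)

N_ITEMS, N_OPTIONS = 50, 5

# Case 1: pure guessing on all 50 items, expect 1/5 right.
pure_right = N_ITEMS // N_OPTIONS             # 10
pure_wrong = N_ITEMS - pure_right             # 40
pure = corrected_score(pure_right, pure_wrong, N_OPTIONS)

# Case 2: 30 blind guesses (1/5 right) plus 20 items narrowed to 2 options.
partial_right = 30 // N_OPTIONS + 20 // 2     # 6 + 10 = 16
partial_wrong = N_ITEMS - partial_right       # 34
partial = corrected_score(partial_right, partial_wrong, N_OPTIONS)

print(pure)      # 0.0
print(partial)   # 7.5
```

The formula brings the pure guesser back to 0 but leaves the partial-knowledge guesser with 7.5, which is the undercorrection described above.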

WHAT ABOUT OMITTED ITEMS?
If you get to an item and don't make any response ... the correction for guessing formula is impotent to act. While we could arbitrarily call this an error, since you did not respond, we have no basis to assume that you DON'T know it. After all, it is possible that you simply skipped the item on the first encounter but, forgot to go back to it. Therefore, omitted items cannot be assumed to be wrong and therefore, we cannot subtract from the # correct based on omits being considered wrongs. Thus, omits play no role in using the correction for guessing formula. You might say ... WOW .... what a loophole! But, don't get overly excited ... if you don't respond ... you can't get any credit for it either!
WHAT IMPACT CAN OMITS HAVE ON THE CORRECTED SCORE?
Even though omits don't count, they can have some impact on corrected scores. Consider the following. Let's assume that in the scenario above, 2 examinees get 38 correct but ... for the remaining 12 ... which neither really knows ... one omits 4 and the other one does not. For the one who does not omit, we have 38 correct and 12 wrong; for the one who omits, we have 38 correct and 8 wrong. For the first, the corrected score would be 38 - 12/4 = 35 but for the second, the corrected score would be 38 - 8/4 = 36. In this case, there is a differential correction made and puts the two examinees in different final score positions ... one 35 and the other 36 ... so it would stand to reason that the instructor would treat them differently (ie, perhaps assign different grades) based on their different corrected scores. This is a problem when using the correction for guessing formula IF some examinees omit and some perhaps do not. Assuming the same # of correct responses, differential omitting can (assuming the same amount of remaining knowledge) force the correction formula into placing one examinee in a different score position than another with the same amount of knowledge. So, what is done?
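The two examinees can be run through the formula directly. A minimal sketch, with labels of my own choosing:

```python
def corrected_score(n_right, n_wrong, n_options=5):
    """Formula score; omitted items enter neither the rights nor the wrongs."""
    return n_right - n_wrong / (n_options - 1)

examinees = {
    "answers everything": {"right": 38, "wrong": 12, "omitted": 0},
    "omits four":         {"right": 38, "wrong": 8,  "omitted": 4},
}

scores = {name: corrected_score(rec["right"], rec["wrong"])
          for name, rec in examinees.items()}

for name, score in scores.items():
    print(name, score)   # 35.0 for the first, 36.0 for the second
```

Same number right, presumably the same knowledge, but a one-point gap created only by the decision to omit.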
RESPONDING TO ALL ITEMS
Given the above problem, what is the relationship between corrected scores when omits differentially occur and when omits don't occur? Well ... assuming there are NO omits, if we put into one column your uncorrected score, and into another column your corrected score, there will be a perfect correlation between # correct and corrected score. Try it and see for yourself! Thus, if there are no omits, there is a perfect r between # right and corrected score. But, that is not necessarily the case if there are differential omits.
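If you would rather let the computer "try it and see", here is a sketch with five made-up examinees on a 50-item, 5-option test. The data are invented; only the formula comes from the discussion above.

```python
from math import sqrt

def corrected_score(n_right, n_wrong, n_options=5):
    return n_right - n_wrong / (n_options - 1)

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

N_ITEMS = 50
rights = [38, 38, 30, 25, 42]     # hypothetical number-right scores
omits = [0, 4, 10, 0, 2]          # hypothetical omit counts

# No omits: every item not right is wrong.
no_omit = [corrected_score(r, N_ITEMS - r) for r in rights]

# Differential omits: wrongs = items - rights - omits.
with_omit = [corrected_score(r, N_ITEMS - r - o) for r, o in zip(rights, omits)]

r_no_omit = pearson_r(rights, no_omit)
r_with_omit = pearson_r(rights, with_omit)

print(round(r_no_omit, 6))    # 1.0
print(r_with_omit < 1)        # True
```

With no omits the corrected score is just a straight-line function of # right, so r has to be 1; once omits differ across examinees, that line is broken.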

But, of course, if there are no omits and the r between # right and corrected score is perfect, then what is the real reason for, or benefit from, using the correction formula? While the corrected scores will for sure be lower across the class of examinees, the rank order of examinees will not change. So ... ??? This is a good question and, about the only "excuses" for using the correction formula are: a) to get your score closer to the "truth" and b) to discourage random guessing. After all, guessing adds to the "measurement error" and, that is not good for such things like reliability.

WHAT ROLES DO THE DIRECTIONS TO THE EXAMINEE PLAY?
When you take a test, there are some directions. And ... part of those directions should address the fact of whether or not the correction for guessing formula will be used. Clearly, if you don't inform students of its use, it is sort of NOT telling them the truth about how their test scores will be handled. Such practice would border on unethical behavior .... So, if it is to be used, examinees should be told of that. What should they be told in that case?

Normally, the instructions will inform examinees that they should not engage in complete random guessing on an item. That is, if they have no idea whatsoever about what the right answer is ... THEN OMIT THE ITEM. However, if they are really able to eliminate one or more options as being incorrect, then go ahead and make a response since on average, you will be ahead of the game. Such a direction is only fair. But, does it really work?
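Why "ahead of the game"? A sketch of the expected gain from guessing on one 5-option item under formula scoring, assuming the key is among the options left standing and the guess among them is random. The function name is mine.

```python
from fractions import Fraction

def expected_guess_gain(n_options, n_remaining):
    """Expected formula-score gain from guessing among the remaining options.

    A right answer earns +1; a wrong answer costs 1/(n_options - 1).
    Assumes the keyed option is among those remaining.
    """
    p_right = Fraction(1, n_remaining)
    penalty = Fraction(1, n_options - 1)
    return p_right - (1 - p_right) * penalty

for remaining in (5, 4, 3, 2):
    print(remaining, expected_guess_gain(5, remaining))
# 5 0
# 4 1/16
# 3 1/6
# 2 3/8
```

A blind guess breaks even against an omit (which scores 0), but every option you can rule out tips the expected gain in your favor.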

RISK TAKING AND DIRECTIONS ABOUT GUESSING
If it is planned to use the correction formula, you MUST inform the examinee. But ... who will follow the directions correctly? The ones who will tend to follow the directions and NOT respond when they are not sure are those who tend not to be risk takers; ie, those who are timid about guessing and possibly getting a penalty applied. But, what about those who are risk takers? Well, they will forge "full steam ahead" and more or less ignore the directions NOT to guess if you really don't know it. What will this do? Well, we saw that the first assumption about you either know it or don't cannot possibly be true. Therefore, even on items that you are not completely sure of ... you probably do know something. In that case, if you carefully examine the item, you probably can (across many items like this on the test) eliminate some of the incorrect response options and therefore increase the likelihood of getting the item correct. In this sense, you beat the correction formula. But, if you are timid ... you are likely to ignore and omit items which you could have perhaps gotten correct if you further considered them. So ... what tends to happen is that those who take the risk tend to gain relative to those who are timid ... BUT WHO HAVE THE SAME LEVEL OF OVERALL KNOWLEDGE! Thus, in a sense, the directions about proper response strategy to the examinee will work against those who are less inclined to take risks ... ie, FOLLOWING THE DIRECTIONS IS ACTUALLY NOT IN THE BEST INTERESTS OF EXAMINEES!
WRAP UP
Because of the invalidity of the assumptions that "make the correction formula work" and because of the problem that use of directions to the examinee about the application of the correction formula tends to discriminate against timid examinees, the use of the formula is NOT generally recommended. In fact, most major testing companies that used to use the formula, do not anymore.
References
Two good references on this are:
Rowley and Traub (1977) Formula scoring, number right scoring, and test-taking strategy. Journal of Educational Measurement, Vol 14, # 1, pp 15 - 22.

Frary, Bob (1988). Formula scoring of multiple-choice tests (correction for guessing). No. 3 in the series; Instructional Topics in Educational Measurement, B. S. Plake, Editor. Educational Measurement: Issues and Practices, 7(2), 33-38.

ITEMAN ITEM ANALYSIS PROGRAM

I gave a demo in my office on the ITEMAN item analysis program from Assessment Systems ... if you want more info ... you can check their homepage at ... ITEMAN PROGRAM

Iteman does traditional classical test theory item analysis ... and works from a text .dat file that you create in a word processor or text editor. If you have large sets of data .... you can scan answer sheets with code information for examinees and then go from there. What I did for the demo was to handload a small data set from a course I teach ... it had 18 students and there were 30 four option MC items. The way you set up the data file looks as follows:


 30 8 9  2   <---- Control line ... #items, etc.

141131113241231311424331324133  <---- Key

444444444444444444444444444444   <---- How many alternatives

yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy  <---- Item inclusion for scale key

01141431113243232313224333324233   <--- First 2 digits examinee code ... others item responses

02141131113242243311424331322234

03141131123241242311424331322132

04121131213441232311424331322323

05141131113243231311424331324333

06131132113411232313424332124233

07131131413241231311424331324123

08141131213243211311424111324123

09141132213244232311124331324133

10131131113241223312424331324134

11131132113241232211324331324133

12141131113244243313424331312134

13141131213241233311324331324133

14121432213244234313424341324333

15121131113241233311424331324333

16141131113241231311424331324133

17141132113241233311424331324133

18131131113241232311424331324333



After creating the .dat file ... you will run ITEMAN and store the results of the item analysis in an .out ... or output file. Here is the output for a few items ...


                     MicroCAT (tm) Testing System               

Copyright (c) 1982, 1984, 1986, 1988 by Assessment Systems Corporation



       Item and Test Analysis Program -- ITEMAN (tm) Version 3.00



Item analysis for data from file c:\550\406.dat                    Page  1





                 Item Statistics             Alternative Statistics

             -----------------------   -----------------------------------

Seq.  Scale   Prop.           Point            Prop.            Point

No.   -Item  Correct  Biser.  Biser.   Alt.  Endorsing  Biser.  Biser. Key

----  -----  -------  ------  ------   ----- ---------  ------  ------ ---



  1   0-1     1.000   -9.000  -9.000     1     1.000    -9.000  -9.000  *

                                         2     0.000    -9.000  -9.000  

                                         3     0.000    -9.000  -9.000  

                                         4     0.000    -9.000  -9.000  

                                       Other   0.000    -9.000  -9.000  

Item 1 was a bust ... everyone got it right ... so things like point biserials are screwed up!



  2   0-2     0.556    0.311   0.247     1     0.000    -9.000  -9.000  

                                         2     0.167    -0.415  -0.278  

                                         3     0.278    -0.057  -0.043  

                                         4     0.556     0.311   0.247  *

                                       Other   0.000    -9.000  -9.000 
Here is what is in the output. Seq No is the chronological order of the item on the test. Scale-item would give subscale order ... 0-1, 0-2, 2-1, 2-2, etc .... if there are multiple subscales. P value (item difficulty) is next ... followed by Biserial and Point Biserial values (item discrimination). Then you get item option information ... which option, how many selected each (17% opted for choice 2), and the point biserials ASSUMING each choice were correct: for choice 4 (the correct choice), r = .25 ... some positive r between item score and total scores.

Scale Statistics

----------------



  Scale:           0   

               -------

N of Items          30

N of Examinees      18

Mean            25.167

Variance         5.806

Std. Dev.        2.409

Skew             0.019

Kurtosis        -0.705

Minimum         21.000

Maximum         30.000

Median          25.000

Alpha            0.483   <---- Don't tell my class about this pathetic reliability! Have to make the test harder!

SEM              1.733

Mean P           0.839

Mean Item-Tot.   0.295

Mean Biserial    0.459



ITEMAN will allow you to work with tests with right/wrong answers, like classroom tests AND will allow you to do analysis for measures like attitude scales (multipoint items). You can define with the item inclusion key above several subscales ... and then the scale statistics at the bottom of the output file will have these separated by subscale ... and will show you the intercorrelations of the scale scores. This is a nice program and is easy to use. Again, for more info .. follow the link above to Assessment Systems.
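If you want to see where the numbers in the output come from, the classical statistics can be recomputed from the data file shown above with a few lines of Python. This is only a sketch, not how ITEMAN itself is written. Two choices are assumptions on my part, made because they reproduce the printed values: the variance divides by N (not N - 1), and the point biserial correlates the item with the total score that still includes that item.

```python
from math import sqrt

KEY = "141131113241231311424331324133"
RECORDS = """
01141431113243232313224333324233
02141131113242243311424331322234
03141131123241242311424331322132
04121131213441232311424331322323
05141131113243231311424331324333
06131132113411232313424332124233
07131131413241231311424331324123
08141131213243211311424111324123
09141132213244232311124331324133
10131131113241223312424331324134
11131132113241232211324331324133
12141131113244243313424331312134
13141131213241233311324331324133
14121432213244234313424341324333
15121131113241233311424331324333
16141131113241231311424331324133
17141132113241233311424331324133
18131131113241232311424331324333
""".split()

def pearson_r(x, y):
    """Pearson r; returns None when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Drop the 2-digit examinee code, then score each response against the key.
responses = [rec[2:] for rec in RECORDS]
scored = [[int(r == k) for r, k in zip(resp, KEY)] for resp in responses]
totals = [sum(row) for row in scored]

n, k = len(totals), len(KEY)
mean = sum(totals) / n
variance = sum((t - mean) ** 2 for t in totals) / n      # divide by N

# Item difficulties (proportion correct) and KR-20 / alpha.
p_values = [sum(row[i] for row in scored) / n for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(p * (1 - p) for p in p_values) / variance)

# Item 2: option proportions and point biserial with the total score.
item2_props = {opt: sum(resp[1] == opt for resp in responses) / n for opt in "1234"}
item2_rpb = pearson_r([row[1] for row in scored], totals)

# Item 1: everyone correct, so the correlation is undefined (ITEMAN prints -9.000).
item1_rpb = pearson_r([row[0] for row in scored], totals)

print(round(mean, 3), round(variance, 3), round(alpha, 3))   # 25.167 5.806 0.483
print(round(p_values[1], 3), round(item2_rpb, 3))             # 0.556 0.247
print(item1_rpb)                                              # None
```

The mean, variance, alpha, and the item 2 values line up with the ITEMAN output above, which is a nice check that the data file was entered correctly.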

RESEARCH ON NONE OF THE ABOVE OPTIONS

My buddy Bob Frary some time ago asked about the empirical verification for the advice that is sometimes given discouraging item writers from using none of the above (NOTA) as a choice on items. He argued:

  1. When NOTA is the answer to a MC item, it tends to prevent correct answers by examinees who might recognize the correct completion of the item but who could NOT produce the answer otherwise
  2. It would motivate examinees who are uncertain to consider the correctness of all the (non NOTA) options rather than simply select the option that seems most plausible
  3. When a computational answer is calculated, the presence of the NOTA option prevents informing the examinee that he/she has an incorrect answer. Also, NOTA discourages examinees from working backwards from the choices because the correct answer might not be present.
For these and other reasons, Bob hypothesized that tests that had items with NOTA as choices would be more difficult than comparable tests without items using NOTA ... and also this increased difficulty would lead to potentially more discriminating items (because of wider test score spread).

At VPI, Bob then solicited instructors who either used or would be willing to use NOTA on some of the items on their tests. In addition, when the NOTA options were used for items, he made sure that the NOTA option WAS ACTUALLY THE CORRECT CHOICE about the same proportion of times that one would expect just by chance alone. In this context, he found 20 tests using NOTA from 10 instructors ... and these represented about 1000 items ... split about 725 to 275 for non NOTA items and NOTA items respectively.

For the analysis, Bob first compared the difficulties and item discriminations for the batch of non NOTA items versus the NOTA items. p values were slightly higher for non NOTA items ... .66 versus .61. There were no differences in item discrimination values. Then, he compared within the NOTA items, the difficulties and discrimination values for NOTA items when it was NOT the correct choice, to those NOTA items where it was the correct choice. Here, the NOTA items where it was the correct choice turned out to be slightly more difficult than the NOTA items where it was not the correct choice (.58 versus .61). Again, the item discrimination values were not different.

Thus, from a purely empirical point of view, it appears (if you buy his results!) that the presence of NOTA items actually makes the tests a tad more difficult ... whether this really translates into more discriminating items (which is more important really) is still up for debate. Bob and I have debated this back and forth ... and regardless of one's views ... his really is about the best empirical study around on this issue.

TRICK TEST QUESTIONS

For years, I had been hearing from students ... mine and others ... that certain tests or test items or even professors ... were tricky in their test construction practices. After taking this for so long, I became interested in just exactly what trick tests or test items were AND whether students could really tell them apart. So, as any good researcher would do ... I dug into the literature on the subject but ... you know what? There is basically none! I first went to texts on test construction and about the most I could find was some mention of the notion that some considered ambiguous items to be tricky. Other than that ... NADA!

So, the first thing I did (1st in a 3 stage study) was to ask students and faculty what they thought a trick test item was ... and also a few other related items like: did they tend to appear on hard or easy tests, did they tend to be items like MC or essay, and did these items get on tests by accident or were they written deliberately. I also asked them to try their hand at constructing an item they thought was tricky but ... that was a bust! I made up a simple front and back of a page survey form, handed it out (NO RANDOM SAMPLING PLAN HERE!) to 174 students and 41 faculty, and eagerly awaited the results. What did I find? First, most felt that selection type items like MC and TF were the places where one would most likely see trick items. Second, they also felt that trick items tended to appear on hard tests (sure .. that's what makes them tricky, right?). And third, they also almost overwhelmingly felt that trick items that got on tests were placed there DELIBERATELY ... and did not just drop down from the sky by accident.

Most important in this first phase was the compilation of the open ended responses to what they considered to be the important components of a trick item: I had them define them and then I sifted

During phase 2, I then tried to get some verification of the importance of these 7 categories by using 2 different rating techniques: absolute ratings, and paired comparisons (remember this from the scale methods discussed earlier in the course?). For the absolute ratings, I listed all 7 of the terms down the side of a page, and asked them to rate each on a scale of 1 (least important) to 7 (most important) for the definition of a trick item. In the paired comparisons approach, I took all 21 pairs (for example, trivial content with ambiguous item) ... and had them select the one in each pair that was most important. To collect the data, I used a different sample of students ... and randomly assigned them to each method. The results were as follows:
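For anyone wondering where the 21 comes from: with 7 categories, every unordered pair gives 7 * 6 / 2 = 21 comparisons. Here is a minimal Python sketch that lists them. The labels are the abbreviations from the results table (with "Mult" for the multiple correct answers category); the code itself is just an illustration and was not part of the study.

```python
from itertools import combinations

# Category labels as abbreviated in the results table of this section.
categories = [
    "Intention", "Trivial content", "2 Fine Disc", "Noise in stems",
    "Mult Corr Answers", "Opp principle", "Ambiguous",
]

# Every unordered pair of distinct categories: 7 * 6 / 2 = 21 pairs.
pairs = list(combinations(categories, 2))

print(len(pairs))
for first, second in pairs[:3]:
    print(first, "vs", second)
```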


Category               Ab Ratings     Paired Comp
---------------------------------------------------
Intention                 5.53           4.02
Trivial content           4.11           2.33
2 Fine Disc               4.87           3.09
Noise in stems            4.30           2.30
Mult Corr Answers         5.31           3.72
Opp principle             4.16           2.40
Ambiguous                 4.96           3.02
---------------------------------------------------

The two methods showed almost identical orderings of the 7 categories ... and most interestingly, the intention one was on top in both cases. None of the categories to my view were considered to be unimportant to the definition of trick questions.
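One way to put a number on "almost identical orderings" is a rank correlation between the two columns. The sketch below uses Spearman's rho, which is my choice for illustration and NOT something reported in the study; the values are copied from the table above.

```python
# (absolute rating, paired comparison) values copied from the table.
ratings = {
    "Intention": (5.53, 4.02),
    "Trivial content": (4.11, 2.33),
    "2 Fine Disc": (4.87, 3.09),
    "Noise in stems": (4.30, 2.30),
    "Mult Corr Answers": (5.31, 3.72),
    "Opp principle": (4.16, 2.40),
    "Ambiguous": (4.96, 3.02),
}


def rank_descending(values):
    """Rank 1 = largest value. Assumes no ties (true for these data)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


names = list(ratings)
abs_ranks = rank_descending([ratings[name][0] for name in names])
pc_ranks = rank_descending([ratings[name][1] for name in names])

# Spearman's rho from the rank differences (valid when there are no ties).
n = len(names)
d_squared = sum((a - b) ** 2 for a, b in zip(abs_ranks, pc_ranks))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))

for name, a, b in zip(names, abs_ranks, pc_ranks):
    print(f"{name:20s} abs rank {a}   paired rank {b}")
print(f"Spearman rho = {rho:.3f}")
```

Intention and multiple correct answers rank 1 and 2 under both methods; the remaining categories swap places by at most two ranks.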

Now, the real TRICKY part of this study was phase 3 where we tried to construct some trick and non trick items, and then test whether students could actually correctly sort them into the appropriate piles. First, we picked intro stat .... and then constructed some items that would have covered material that all students would have been exposed to in their courses. We started by making up 20 trick and 20 non trick items, and then narrowed these down to 35 ... and then picked at random 25 to make up the final form. As it turned out, 13 were trick items and 12 were non trick, and these were mixed up on a single test form. Then, we gave the form to over 100 students. It is important to note that students were NOT asked to work the problems ... but rather were to indicate on a scale of 1 to 4 ... their feeling about whether it was a trick item or not: (4=def think it is a trick item, ..... 1=def think it is NOT a trick item).

For scoring, we had a predetermined key as to which items were trick and non trick. Then, we would score each person's responses as: if they put 4 or 3 AND it was a trick item, they got it correct and a value of 1 or if they put 2 or 1 and it was NOT a trick item, they also got a score of 1. The other response matches were given 0 and considered to be incorrect.
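The scoring rule above can be sketched in a few lines of Python. The function names and the tiny 4 item example are hypothetical, made up only to show the rule; they are not the actual key or data from the study.

```python
def score_response(rating, is_trick):
    """Return 1 if the 1-4 rating matches the key, else 0.

    Ratings of 3 or 4 count as "judged to be a trick item";
    ratings of 1 or 2 count as "judged NOT to be a trick item".
    """
    judged_trick = rating >= 3
    return 1 if judged_trick == is_trick else 0


def total_score(ratings, key):
    """Sum the item scores for one person across all items."""
    return sum(score_response(r, k) for r, k in zip(ratings, key))


# Hypothetical 4 item illustration (True = keyed as a trick item).
example_key = [True, False, True, False]
example_ratings = [4, 1, 2, 3]
print(total_score(example_ratings, example_key))  # items 1 and 2 match
```

With 25 items and a yes/no decision on each, someone sorting purely at random would expect 25 * .5 = 12.5 correct, which is the null value used in the test below.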

The first thing we did was to test the hypothesis that Ss were able to correctly sort items into trick or non trick using a null value of 12.5 (chance in this situation). They were ... garnering a mean score of 14.6. They did do better than 12.5, significantly, BUT .... considering that 25 was the max score, NOT much better by some absolute criterion. We also looked at whether they had a more or less difficult time sorting trick (out of 13) or non trick (out of 12) items. They correctly sorted 5.4 out of 13 for trick, and 9.2 out of 12 for non trick. They sorted non trick items significantly better than trick items. Finally, since the intentionality of the item was considered to be the most important component in defining a trick item, we sorted the trick items into those that appeared to have some deliberate intentionality factor versus those that did not. Our hypothesis was that since the intention was in the eye of the test item constructor, it would be harder for a S to find them, compared to other more obviously tricky items. However, there was no significant difference in the correct sort rates for these 2 types of trick items.

Anyone interested in more detail about this study should check out: Roberts, DM (1993). An empirical study on the nature of trick test questions, Journal of Educational Measurement, V30, #4, pp331-344.

TESTQUEST ITEM BANKING PROGRAM

There are many item banking programs on the market but, the one I use quite frequently (sorry folks ... only in IBM format) ... is called TestQuest, put out by, get this ... Snowflake Software in Rhinebeck NY. The last phone and address I have is: AC 914-876-3328 ... 8 Cedar Heights Road, Rhinebeck NY 12572. I can't vouch for the up to datedness of these now but, you can give it a try.

TQ allows you to work in 2 broad areas: item maintenance, and printing tests/items. It can work easily with recognition types of items ... and allows the input of essay types ... and helps you form an item pool where you can select subsets of items for a test. Item entry is easy and the printed copy of tests is good.

For item entry, you are given a screen that allows 10 or so lines of input for the stem ... word wrap is in operation ... and there are some minimal graphing capabilities. First you type the stem. Then, when that is ok, you hit ESC and go to the first alternative. You see the stem the whole time you are typing each alternative. After typing the alternative, you will have to indicate if it is a fixed or moving type of alternative ... fixed means it will be printed in that EXACT position ... moving means it could be randomly put in any spot. So, you will categorize the options as fixed incorrect or fixed correct, or moving correct or moving incorrect. After you finish the complete item, you go on to the next ... or exit.

When you want to print a test, you will go into a test creation screen where you will indicate which topics and items you want to use: ta 1,2,4,5 tb 3,4,6,7, ... etc. ... and keep specifying items until your test is complete. After this is done, you will be able to add headers, and other information about the test. Also, you can get an answer sheet printed out for students to fill in their answers, if you want. After one test is printed, you will be asked if you want another form ... and if you answer yes, whether to randomize the answers. For those items that are "moving" items, it will rearrange the alternatives for the item ... though the sequence of the items will stay the same. Overall, this is an easy program to use ... has reasonable flexibility, and makes a nicely formatted printed test. I use it in my stat courses and it works fine.
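To make the fixed versus moving idea concrete, here is a minimal Python sketch of one way such a rearrangement could work. This is NOT TestQuest's actual code or file format. I am assuming that fixed alternatives keep their slots and that moving alternatives get shuffled among the slots left over; the function name and the sample item are hypothetical.

```python
import random


def arrange_alternatives(alternatives, rng):
    """Return a new ordering of (text, is_fixed) alternatives.

    Fixed alternatives stay in their original positions; moving
    alternatives are shuffled among the remaining positions.
    """
    moving = [alt for alt in alternatives if not alt[1]]
    rng.shuffle(moving)
    next_moving = iter(moving)
    return [alt if alt[1] else next(next_moving) for alt in alternatives]


# Hypothetical item: three moving options, NOTA fixed in the last slot.
item = [
    ("the mean", False),
    ("the median", False),
    ("the mode", False),
    ("none of the above", True),
]

form_b = arrange_alternatives(item, random.Random(1))
for letter, (text, _) in zip("abcd", form_b):
    print(f"{letter}. {text}")
```

Marking an option like "none of the above" as fixed is what keeps it from landing in the middle of the list on a second form.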

TestQuest is ONLY an item banking program ... it does no analysis nor does it allow the inputting of analysis information (like item difficulty) which you could later use to make item selections. But, I don't see that as a problem. Anyone interested might contact Chris Mayer at the above phone/address ... and ask for more info. I have written a letter recently and will post an update here if I hear of any additional information about price, availability, support, etc.

SHORT DISCUSSION OF FACTOR ANALYSIS ... BY DEMAND OF CLASS!


THE FACTOR ANALYSIS BASIC TOUR

	FACTOR PATTERNS IN CORRELATIONAL DATA



Note: Decimal points are omitted in the matrices (72 means r = .72). Assume that r's of about .3 or .4 and up mean the variables go together.

Matrix 1                            Matrix 2

      1   2   3   4   5   6               1   2   3   4   5   6



 1    -   72  62  58  74  82        1     -  17  56  72  21  03

 2        -   53  62  47  58        2         -  14  03  53  64

 3             -  63  71  58        3             -  63  01  09

 4                 -  53  54        4                 -  17  09

 5                     -  68        5                     -  72

 6                         -        6                         -

 

Matrix 3                            Matrix 4

      1   2   3   4   5   6               1   2   3   4   5   6



 1    -  03  19  02  11  07        1      -  03  72  01  19  15

 2        -  53  63  16  01        2          -  17  03  14  59

 3            -  51  08  20        3              -  20  08  13

 4                -  22  08        4                  -  68  09

 5                    -  17        5                      -  17

 6                        -        6                          -



Matrix 5

      1   2   3   4   5   6



 1    -   03  07  02  13  09

 2         -  21  17  09  17

 3             -  03  14  07

 4                 -  02  09

 5                     -  16

 6                         -
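The "eyeball" approach to these matrices can be sketched in Python. The example below uses Matrix 2 with the decimals restored and a cutoff of .30 (from the note above). The chaining rule it uses (put two variables in the same group whenever their r meets the cutoff) is my simplification for illustration, NOT a real factor extraction method.

```python
# Upper triangle of Matrix 2, decimal points restored.
matrix2 = {
    (1, 2): .17, (1, 3): .56, (1, 4): .72, (1, 5): .21, (1, 6): .03,
    (2, 3): .14, (2, 4): .03, (2, 5): .53, (2, 6): .64,
    (3, 4): .63, (3, 5): .01, (3, 6): .09,
    (4, 5): .17, (4, 6): .09,
    (5, 6): .72,
}


def group_variables(corrs, n_vars, threshold=0.30):
    """Chain together variables whose correlation meets the threshold."""
    group_of = {v: {v} for v in range(1, n_vars + 1)}
    for (a, b), r in corrs.items():
        if r >= threshold and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for v in merged:
                group_of[v] = merged
    unique = []
    for group in group_of.values():
        if group not in unique:
            unique.append(group)
    return [sorted(group) for group in unique]


groups = group_variables(matrix2, n_vars=6)
print(groups)
```

For Matrix 2 this sorts variables 1, 3 and 4 into one pile and 2, 5 and 6 into another, which is the two factor pattern you would pick out by eye.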



        

So, here are some data from 4 variables with n=10



 MTB > prin c16-c19 <--- Cols in MTB where data are

  ROW    M1    M2    M3    M4



    1    36    41    45    45

    2    52    58    57    41

    3    51    51    48    65

    4    47    46    40    63

    5    56    49    48    36

    6    66    72    65    56

    7    48    48    44    43

    8    61    55    50    70

    9    50    39    50    55

   10    33    40    37    46

  MTB > corr c16-c19

              M1       M2       M3

 M2        0.805

 M3        0.782    0.843

 M4        0.329    0.178    0.055

If you examine the intercorrelations, M1-M3 correlate highly with each other but, M4 seems not to correlate with M1-M3.
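If you don't have Minitab handy, the same correlations can be checked with a short Python sketch. The data are the M1-M4 values listed above; the pearson helper is written out by hand so nothing beyond the standard library is needed.

```python
from math import sqrt

# The 10 rows of data for M1-M4, copied from the Minitab listing.
data = {
    "M1": [36, 52, 51, 47, 56, 66, 48, 61, 50, 33],
    "M2": [41, 58, 51, 46, 49, 72, 48, 55, 39, 40],
    "M3": [45, 57, 48, 40, 48, 65, 44, 50, 50, 37],
    "M4": [45, 41, 65, 63, 36, 56, 43, 70, 55, 46],
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Print the lower triangle, laid out like the Minitab output.
names = list(data)
for i, row in enumerate(names[1:], start=1):
    cells = [f"{pearson(data[row], data[col]):8.3f}" for col in names[:i]]
    print(f"{row:4s}" + "".join(cells))
```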

A FACTOR IS A RULE THAT COMBINES THE VARIABLES TOGETHER IN SOME LINEAR FASHION

Note Assume that we define Factor A as .8M1 + .7M2 + .8M3 and give 0 weight to M4 ... since it seems not to fit.



 MTB > let c20=(.8*c16)+(.7*c17)+(.8*c18)

 MTB > prin c16-c20

  ROW    M1    M2    M3    M4       A

 

    1    36    41    45    45    93.5

    2    52    58    57    41   127.8

    3    51    51    48    65   114.9

    4    47    46    40    63   101.8

    5    56    49    48    36   117.5

    6    66    72    65    56   155.2

    7    48    48    44    43   107.2

    8    61    55    50    70   127.3

    9    50    39    50    55   107.3

   10    33    40    37    46    84.0

THE VALUES IN A ... THE RESULTS OF APPLYING THE FACTOR RULE ... ARE CALLED FACTOR SCORES.

How does our set of factor scores correlate back with each of the original measures ... M1-M4?

 MTB > corr c16-c20

 

              M1       M2       M3       M4

 M2        0.805

 M3        0.782    0.843

 M4        0.329    0.178    0.055

 A         0.933    0.942    0.928    0.211
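Here is the same Factor A step as a Python sketch: apply the rule .8M1 + .7M2 + .8M3 to get the factor scores, then correlate those scores back with M1-M4. The weights and data come straight from the session above; the helper names are mine.

```python
from math import sqrt

data = {
    "M1": [36, 52, 51, 47, 56, 66, 48, 61, 50, 33],
    "M2": [41, 58, 51, 46, 49, 72, 48, 55, 39, 40],
    "M3": [45, 57, 48, 40, 48, 65, 44, 50, 50, 37],
    "M4": [45, 41, 65, 63, 36, 56, 43, 70, 55, 46],
}

# The Factor A rule: weights chosen by eye, with M4 given 0 weight.
weights_a = {"M1": 0.8, "M2": 0.7, "M3": 0.8, "M4": 0.0}


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Factor scores: one weighted sum per row (person).
n_rows = len(data["M1"])
factor_a = [
    sum(weights_a[name] * data[name][row] for name in data)
    for row in range(n_rows)
]

# Correlation of the factor scores with each original measure.
loadings_a = {name: pearson(data[name], factor_a) for name in data}

print([round(score, 1) for score in factor_a])
for name, r in loadings_a.items():
    print(f"{name}  {r:.3f}")
```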

Well, the way we defined Factor A and the resulting r's between the factor A scores and M1-M4 show that our definition worked well for M1-M3 but, didn't do much for predicting or explaining M4. Residuals give you some idea about how good the "fit" was (Factor A rule and factor scores TO the data ... M1-M4). But, to see those, you have to do regressions and then examine the residuals. The A factor scores variable is used to predict each of M1-M4 ... then residuals are examined.

MTB > regr c16 1 c20;
SUBC> resi c21. <---- leftover M1 after Fact A scores removed

MTB > regr c17 1 c20;
SUBC> resi c22. <--- leftover M2 after Fact A scores removed

MTB > regr c18 1 c20;
SUBC> resi c23. <--- leftover M3 after Fact A scores removed

MTB > regr c19 1 c20;
SUBC> resi c24. <---- leftover M4 after Fact A scores removed

MTB > name c21='Er1A' c22='Er2A' c23='Er3A' c24='Er4A'

MTB > prin c21-c24

 

  ROW      Er1A      Er2A      Er3A      Er4A

 

    1  -4.56888   0.52419   4.11021   -4.5705

    2  -4.62285   1.48202   3.32608  -12.7061

    3   0.41494   0.51537  -0.86589   12.8493

    4   2.54634   1.64227  -3.98333   12.4288

    5   4.19802  -2.70065  -1.83495  -16.4642

    6  -3.44730   2.66698   1.11369   -1.0098

    7   1.01890   1.11668  -1.99599   -8.2223

    8   4.61118  -1.28413  -3.48756   16.3542

    9   2.97209  -7.93010   3.96674    3.7656

   10  -3.12245   3.96736  -0.34900   -2.4250

 

Above is called first residual matrix. Note that the errors are relatively smallER using factor A scores to predict M1-M3 but, the errors are much larger for M4. This means that when we applied the Factor A rule, not that much was left over (residuals) when partialing it from M1-M3 but, substantial residuals were there when partialing it from M4. BACK TO THE DRAWING BOARD FOR M4!!
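The first residual matrix can be reproduced with a Python sketch: regress each of M1-M4 on the Factor A scores (simple regression with an intercept, as the Minitab regr command does) and keep what is left over. The root mean square summary at the end is my addition, just to put one number on "smaller" versus "much larger".

```python
from math import sqrt

data = {
    "M1": [36, 52, 51, 47, 56, 66, 48, 61, 50, 33],
    "M2": [41, 58, 51, 46, 49, 72, 48, 55, 39, 40],
    "M3": [45, 57, 48, 40, 48, 65, 44, 50, 50, 37],
    "M4": [45, 41, 65, 63, 36, 56, 43, 70, 55, 46],
}

# Factor A scores: .8*M1 + .7*M2 + .8*M3 (M4 gets 0 weight).
factor_a = [
    0.8 * m1 + 0.7 * m2 + 0.8 * m3
    for m1, m2, m3 in zip(data["M1"], data["M2"], data["M3"])
]


def residuals(y, x):
    """Residuals from the simple regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# First residual matrix: what is left of each measure after Factor A.
first_resid = {name: residuals(data[name], factor_a) for name in data}

# Root mean square residual per measure, as a rough size summary.
rms = {
    name: sqrt(sum(e ** 2 for e in errs) / len(errs))
    for name, errs in first_resid.items()
}

for name in data:
    print(f"{name}  first residual = {first_resid[name][0]:9.5f}"
          f"   RMS = {rms[name]:6.2f}")
```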

We now need to start the cycle over again but, remember, we now have a first residual matrix so ... this is the point at which we need to work. Taking Factor A out of the original M1-M4 data matrix left us with the first set of residuals: how much of THAT can we now explain or eliminate by defining a second Factor B?

Working with the first residual matrix, let's define Factor B as: .7 * Er4A

MTB > let c25=.7*c24

Now we use new FACTOR scores to predict first residual matrix

MTB > regr c21 1 c25;
SUBC> resi c26. <--- leftover for resid M1 after Fact B taken out

MTB > regr c22 1 c25;
SUBC> resi c27. <--- leftover for resid M2 after Fact B removed

MTB > regr c23 1 c25;
SUBC> resi c28. <-- leftover for resid M3 after Fact B removed

MTB > regr c24 1 c25;
SUBC> resi c29. <--- leftover for resid M4 after Fact B taken out

MTB > name c26='Er1B' c27='Er2B' c28='Er3B' c29='Er4B'

How do the factor scores from the Factor B rule relate to the first residual matrix?


 MTB > corr c21-c25

 

            Er1A     Er2A     Er3A     Er4A

 Er2A     -0.601

 Er3A     -0.624   -0.249

 Er4A      0.373   -0.065   -0.388

 C25       0.373   -0.065   -0.388    1.000

Note that our Factor B rule and factor scores don't correlate too well with residuals on M1-M3 but perfectly (and this clearly is not usually the case) with the residuals on M4. What about the second residual matrix?


 MTB > prin c26-c29

 

  ROW      Er1B      Er2B      Er3B   Er4B

 

    1  -4.01530   0.43479   3.63486      0

    2  -3.08386   1.23347   2.00458      0

    3  -1.14138   0.76672   0.47050      0

    4   1.04095   1.88539  -2.69066      0

    5   6.19220  -3.02272  -3.54732      0

    6  -3.32499   2.64723   1.00866      0

    7   2.01480   0.95584  -2.85115      0

    8   2.63033  -0.96422  -1.78664      0

    9   2.51599  -7.85643   4.35839      0

   10  -2.82872   3.91993  -0.60121      0

So, how do the factor scores correlate with the variables, M1-M4? Earlier, you saw how the Factor A scores correlated with M1-M4. Just above, you see how the Factor B scores correlate with the first residual matrix. We can put this information in a table.

                        Fac A      Fac B



              M1        .933        .373

              M2        .942       -.065

              M3        .928       -.388

              M4        .211        1.00

CORRELATIONS BETWEEN FACTOR SCORES AND VARIABLES OR MEASURES ARE CALLED FACTOR LOADINGS.

For our data, a two factor solution seems reasonable and, the variables M1-M3 load highly on Factor A, but M4 does not ... while M4 loads highly on Factor B ... but M1-M3 don't. Keep in mind of course that we could have defined the factors differently and the numbers we got above would not be the same.
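The whole two step cycle can be strung together in one Python sketch: Factor A from the raw data, first residuals, Factor B as .7 times the M4 residual, then the Factor B correlations and the second residuals. This only mirrors the hand worked procedure in these notes; it is NOT what a real principal components or factor analysis routine does, and the helper names are mine.

```python
from math import sqrt

data = {
    "M1": [36, 52, 51, 47, 56, 66, 48, 61, 50, 33],
    "M2": [41, 58, 51, 46, 49, 72, 48, 55, 39, 40],
    "M3": [45, 57, 48, 40, 48, 65, 44, 50, 50, 37],
    "M4": [45, 41, 65, 63, 36, 56, 43, 70, 55, 46],
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y, x):
    """Residuals from the simple regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Step 1: Factor A rule, its loadings, and the first residual matrix.
factor_a = [
    0.8 * m1 + 0.7 * m2 + 0.8 * m3
    for m1, m2, m3 in zip(data["M1"], data["M2"], data["M3"])
]
loadings_a = {name: pearson(data[name], factor_a) for name in data}
first_resid = {name: residuals(data[name], factor_a) for name in data}

# Step 2: Factor B rule applied to the first residuals (.7 * Er4A).
factor_b = [0.7 * e for e in first_resid["M4"]]
loadings_b = {name: pearson(first_resid[name], factor_b) for name in data}
second_resid = {
    name: residuals(first_resid[name], factor_b) for name in data
}

print("        Fac A    Fac B")
for name in data:
    print(f"{name}    {loadings_a[name]:6.3f}   {loadings_b[name]:6.3f}")
```

Note that the Fac B column correlates the Factor B scores with the FIRST RESIDUALS for each measure, not with the raw M1-M4 scores, just as in the table above.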

Anyone who wants to input the data from above for the variables M1-M4 ... and subject them to a principal components and/or factor analysis using something like Minitab will see the same GENERAL pattern come out ... though obviously the exact values for loadings, etc. will NOT be the same. All this goes to show that with a small intercorrelation matrix AND a fairly clear pattern ... one could do a logical FACTOR ANALYSIS and have it be close to the "real thing" ... but try that with a 60 by 60 matrix for example where there is NO way to look at the correlations and figure out what is happening.

FINAL COGNITIVE TEST PROJECTS

The class has turned in the final versions of their classroom type cognitive tests ... each had a draft that was copied and circulated to all class members, including myself, and we made comments about the tests .... gave them back, and what is listed below represents revisions to the drafts based on our feedback. If you have any questions about any of these ... send me a note and I will pass it along to the test author. For more information ... Note To Roberts