NOTE: The above file is getting a little large to scroll through so, I have started a second file ... for the next few weeks of classes.

Go to Jan 22 ... Jan 22 Start Gain score grading, scaling methods: Levels of meas, Q sort, Guttman, Likert, IRT, Classical test theory, Semantic differential, Paired comparisons

Feb 21 ... A new file for Cognitive Test Topics had been started ... Cognitive Test Topics 1 Scale construction model, table of specifications

A second cognitive topics file was created, including none of the above research, Iteman item analysis program, TestQuest item banking program, trick test questions, and factor analysis ... go to ... Cognitive Topics 2

April 5, 1996 ... Beginning of Attitude Scale Construction ... Attitude Scale Construction

In the 550 class, we are currently working on "Attitudes about Salaries of College Faculty" and, in the 450 class down at our graduate center in Great Valley (outside of Philly), we are working on a scale about "Attitudes about English as the Official Language in the US of A" ... Have a look!

1. Jan 6, 1996! and Happy New Year!

Welcome to the test construction course summary page. It is Sunday in snowy State College PA, and I thought I would load up the syllabus for the 550 course ... to get things going. Take a look and if you have any questions or comments, please send me a message at ... Dennis Roberts

Note: At times, I will NOT be too fussy about how the documents are formatted ... seeing as how I think it is more important to get the documents IN to the page ... and not worry and spend too much time making them look "sharp". My standard practice will be to make a text file from my original documents prepared for the course, then save as .txt ... then import into my HotDogPro program, and then add a few quick format changes prior to sending it out on the web. For more info about the excellent HotDogPro html editor (Windows) program, see ... HotDogPro


Class Meetings: MWF 9:05-9:55AM, 214 Rackley Building

Instructor: Dennis Roberts, 208 Cedar Building, AC 814-863-2401, email dmr at psuvm.psu.edu, world wide web http://www2.ed.psu.edu/espse/staff/droberts/drober~1.htm

NOTE: You MUST make sure that you have access to your ACCESS accounts since, we will be using email and the World Wide Web from time to time. If you have any doubts/questions about this ... go to the CAC and make sure your account is activated. Besides, you cannot do stuff in the labs without your id and password.

Required Books:

Here are a few other books that you may want to look at from time to time.

Some helpful journals And, note: EVERYONE will have to read

I WILL TRY TO GET A PHOTOCOPY AND PUT IN EDUCATION LIBRARY

Nature of Course: This course is practice oriented and focuses on you developing two instruments: one a cognitive achievement test and the other an affective instrument. The hope is that you will take ONE of these to the point of trying it out and doing an item analysis. This is a "hands on" course; reading will NOT be sufficient (nor anywhere close!) to pass this course.

Class Sessions: This is up in the air at this point. But, as a general guide, I would like to subdivide the course in the following "very rough" modules.

If you can think of a better arrangement, let me know.

Cheating: University policy states that I am to state my policy on cheating. Here it is: if I catch you cheating, you are "out o here" with an F!

Computer Programs: I will discuss TESTQUEST (an item banking program) and ITEMAN (an analysis program) in this course. Both have been "loaded" on the NEC machine in 217 Cedar. I will give you handouts on using the programs.

Computer Accounts: Use your ACCESS accounts to contact me: DMR at PSUVM.PSU.EDU. I am an email junkie!

Tests: There will be TWO tests in the course; 1 in the middle, and 1 at the end. Tests will be of the short answer discussion type.

Projects: Here is where you earn your "spurs" in 550. The handout and class presentation will count for about 20 percent of the grade. Creating an achievement test will count for about 20 percent of the grade. Developing the affective instrument will count for about 20 percent of the grade, and each of the tests will count for 20 percent of the grade.

Grades: Unless you force me into it, I plan on NOT giving grades other than A's and/or B's in this course BUT ... Don't make me add to the categories (and I can!)!

World Wide Web: I have a home WWW page, at the URL listed above. Now, what I have decided to do this term is to put some (perhaps weekly) of our discussions in a section on my page. I have already mentioned this on a couple of email nets and, there has been a lot of interest in my doing this. BUT, IT MEANS WORK ... FOR ME AND FOR YOU! At the moment, I am not exactly sure about how much I will do related to this ... or exactly how. My hope is to have the main theme topics put on the web ... and hopefully some of the "readers" will be able to comment directly to us via email. If one or two of you have some special interest in webbing, let me know ... I am sure I will need some help in pulling this off.

Final Comments

1. If you need help, ask.

2. On documents you produce for me, I expect "professional looking" reports. Handwritten stuff and/or lots of cuts and pastes are NOT acceptable.

3. You MUST set aside enough time to do outside reading. YOU MAY NOT CHECK ANY OF THE MATERIALS IN THE EDUCATION LIBRARY ON RESERVE FOR 550 OUT!!!

4. The instruments you develop are assumed to be ORIGINAL, and started from "scratch". The best of the lot I plan on putting on my home page section MEAS SCALES .... at the end of the semester.

5. There are many item banking and analysis programs; I am simply introducing you to TWO of them.

List of Course Activities

I provide you with the following list of activities to be accomplished in this course (I hope!)

Activities in 550

Topics for Module 1

Rules of the game

A. Class presentation to take full class period

B. Topics to be covered include: overview to method, example of items or method given in class, and how scores are arrived at for scaling people and/or objects

C. Prepare a 1 or 2 page handout to distribute to class members that includes A from above and at least 3 external references (not including the books on reserve) for student's followup study

D. Presentations to take place during weeks 3 and 4; I don't really care who goes first. If you can't decide, I will decide for you.

2. Jan 10, 1996

Since Monday Jan 8 was cancelled due to snow, this was our first class session. 6 students were there ... and 1 more is coming. I hope to have a "pic" next week.

We went over the syllabus ... and the stress was on the idea that 550 is a course where we DEVELOP SCALES ... not just talk about them (though we talk a lot too!). The class has till Monday Jan 15 to decide on one of the Module 1 topics ... and then I am hoping to begin having class presentations by Monday Jan 22 if possible. After class presentations, I will try to post some short summary of the discussion.

At some point in time, we will begin to discuss cognitive and affective scale development. Any ideas you might have out there in cyberland ... for class discussion purposes ... would be welcomed ... contact me at ... Roberts the prof

That is about all to report on at the moment ... not a very content rich message but ... all there is. Keep tuned!

3. Jan 12, 1996 ... Intro Notes on Scaling

Today, I went over a little handout I have related to some intro notes on scaling. I have scanned some of the images ... they are not great and I might not do this again ... but bear with me for this time. Here is the handout with some additional comments.

SOME INTRO NOTES FOR SCALING What if you want to "scale" 4 different objects (books) in terms of weight? We could put each book on a weight scale, read off the values, and then put them on the PHYSICAL weight scale as follows. See part A on graph.

But, what if no physical weight scale existed; Toledo was on strike! We could get people to "psychologically" estimate the weights and then, based on averages across people for the same book, order the books on a psychological scale of weight as follows. See part B on graph above.
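Here is a minimal sketch of that averaging step in Python. The book names, the physical weights, and the three judges' estimates are all made up for illustration; none of these numbers come from the handout or the graph.

```python
from statistics import mean

# Hypothetical physical weights in pounds (what a weight scale would read).
physical = {"Book A": 1.0, "Book B": 2.5, "Book C": 3.0, "Book D": 5.0}

# Hypothetical estimates from three judges who only heft each book.
estimates = {
    "Book A": [1.5, 1.0, 2.0],
    "Book B": [2.0, 3.0, 2.5],
    "Book C": [3.5, 4.0, 3.0],
    "Book D": [4.0, 4.5, 5.0],
}

# Psychological scale value = average estimate across judges for each book.
psychological = {book: mean(vals) for book, vals in estimates.items()}


def ordering(scale):
    """Return the books sorted from lightest to heaviest on a given scale."""
    return sorted(scale, key=scale.get)


physical_order = ordering(physical)
psychological_order = ordering(psychological)

for book in physical_order:
    print(f"{book}: physical {physical[book]:.1f}, "
          f"psychological {psychological[book]:.2f}")
```

With these invented numbers the two orderings agree, while the exact values on the two scales do not all match.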

As you can see, the relative positioning of the books is the same when using the physical scale versus the psychological scale but, there still is some discrepancy between the EXACT placements under the two systems. PSYCHOPHYSICAL methods examine the relationship between the placement of objects on the two scales and attempt to establish principles or laws that connect the two. On the other hand, many stimuli or scalings we want to do involve some psychological (latent) dimension of people without any connection to a direct counterpart "physical" dimension. That is, we are interested in the second scale without any reference to the first scale. Such interests are called PSYCHOLOGICAL SCALINGS, and this is what this course is primarily about. A psychological scale looks similar to what is above and may be generically represented as follows: See part C on the graph above.

If the "dimension" is permissiveness or dogmatism or intelligence or introversion ... where do people A, B, C, and D fall along this scale? Are they in the positions shown above ... or located differently? If their relative positions are correct, are the DISTANCES amongst them correct? In general, the goals of psychological scaling are: 1) to create a meaningful and valid scale, 2) to locate persons [or objects] in their correct relative positions, and 3) to arrive at distances amongst persons [or objects] that reflect true differences amongst them.

A scaling MODEL is a plan to develop a scale on which to place people or objects. For example, a scale could be:

Low ___________________________________________ High (permissiveness)

Where do people A, B, C, and D fall on this scale? That is the task of developing a scale using some scaling method or model. While we normally want to scale people, scaling is not limited to that. Concepts or notions could be scaled: where do different educational philosophies fit in on the scale of "permissiveness"?

If your ultimate goal is to scale people on some scale (permissiveness for example), how can we do it? One way of course would be to DIRECTLY ASK THEM to place themselves on the scale. You could give them the scale above with several reference points, and then see where they place themselves. Another way would be to DIRECTLY OBSERVE them in situations where permissiveness plays a role, and then make an assessment of the extent to which each person is "permissive". Another way would be to develop stimulus items that have been calibrated for their "degree of permissiveness" and then see which items each person agrees with or feels are most like him or herself. In this way, the subject him or herself is NOT the one to decide on the extent of permissiveness but rather, some totalling of the item calibration values does. There are positives and negatives with all these approaches; what are some?
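The calibrated-items approach can be sketched in a few lines of Python. Everything here is hypothetical: the item names, the calibration values, and the choice to AVERAGE the values of the endorsed items (a sum or a median would be other ways of "totalling").

```python
from statistics import mean

# Hypothetical calibration values: higher means a more permissive statement.
item_values = {
    "item1": 2.0,
    "item2": 4.5,
    "item3": 6.0,
    "item4": 8.5,
}


def person_score(endorsed, values=item_values):
    """Average the calibration values of the items a person agreed with."""
    if not endorsed:
        raise ValueError("person endorsed no items, so no score can be assigned")
    return mean(values[item] for item in endorsed)


# Two hypothetical people and the items each one agreed with.
joe = person_score(["item1", "item2"])
sue = person_score(["item3", "item4"])

print(f"Joe: {joe:.2f}  Sue: {sue:.2f}")
```

Note that neither Joe nor Sue ever says how permissive they are; the score comes entirely from which calibrated items they endorse.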

A general schematic (see graph E above ... yeah, I know I got them labelled wrong!) for a scaling problem can be represented (in a simple form) as above. When scaling, there are at least 3 dimensions that come into play: people or objects, stimuli, and responses. "People" are those who you want to scale (P1, P2, etc.), by presenting them with various stimuli (S1, S2, etc.), and based on that ... people make responses (R1, R2, etc.) to the stimuli.

One example would be the presentation of the S1 ... "A good teacher gives assignments back in a timely manner"; that you give to P1 (Joe), who then makes R1 ... "I agree with that". Another example would be to present S1 and S2 simultaneously (two different objects) to P1, and ask which object is heavier ... "P1 says S2 is heavier".

It would be advantageous in this model to be able to simplify it by "collapsing" one of the 3 dimensions. For example, it is common to keep the response format or type CONSTANT in a given study; each item may be responded to on a Likert-type scale where 1 means "Strongly Disagree" up to 7 that means "Strongly Agree". In this case, there is only one type of R ... R1 is it. Thus, instead of having multiple possible responses (rating scale, observable response to stimulus that occurs in realistic context, writing an essay self describing how one views him or herself on the dimension, etc.), there is only ONE slice on the R dimension. If the response dimension is collapsed, then this model becomes primarily a two dimensional model, looking like Graph D below.
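As a rough sketch of what the collapsed model looks like as data: once the response format is held constant, everything fits in a people by stimuli table. The people, stimuli, and responses below are invented, and totalling each row is just one assumed scoring rule.

```python
# Response format held constant: every item is answered on the same 1-7 scale.
LOW, HIGH = 1, 7

people = ["P1", "P2", "P3"]
stimuli = ["S1", "S2", "S3", "S4"]

# Hypothetical responses: one row per person, one column per stimulus.
responses = [
    [7, 6, 5, 6],  # P1
    [4, 4, 3, 5],  # P2
    [1, 2, 2, 1],  # P3
]

# Check that every response is legal under the single response format.
for row in responses:
    assert len(row) == len(stimuli)
    assert all(LOW <= r <= HIGH for r in row)

# One simple (assumed) scoring rule: total each person's row.
totals = {person: sum(row) for person, row in zip(people, responses)}

print(totals)
```

If several response types were in play (ratings, observations, essays), a third index would be needed; holding R constant is what lets a flat table do the job.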

There are several basic and universal steps involved in a scaling problem. First and foremost is the "scale concept". What is it that you are trying to scale people or objects on? Intelligence? Dogmatism? Attitudes about statistics? Second, what is the "target" of the scaling? People or objects? Third, what is the plan or method by which the scaling will take place? If people, direct questioning, observation, or using calibrated stimulus items? Fourth, what instrument and/or collection of stimulus items will be used? Fifth, what is the mechanism by which scale values or scores for people (or objects) will be determined and assigned? And finally, given the purpose of the scaling, how does one determine if the scaling model is implemented in a reliable and valid way? Does this scaling model or plan accomplish what it is supposed to do? Well, the above are SOME initial things to worry about!!!!!

Questions or comments? Send a note to me ... Roberts

3. Monday Jan 15, 1996

The class today dealt with some administrative details ... deciding who would be doing what topic for module 1, going over a few more points about the intro notes re: scaling, and "alerting" students that the picture person would be in Wednesday to "snap a pic" ... need to get this class out there as a "visual" on the Web!

Another short handout was distributed that dealt with some matters related to behavioral observation. This material was taken from my colleague's books ... Hoi Suen ... Principles of test theories, Erlbaum, 1990 .... and Analyzing quantitative behavioral observation data, by Suen and Ary, Erlbaum, 1989.

The main point I wanted to make with this material is the notion that measurement/scaling can be near or far ... from the real behavior or characteristic under study. For example, let's assume that we are interested in the brake and tune-up ability of a number of "mechanics". Now ... if we had the right probes, we could "tap" into the brain and seek out the section there that has the deposit of "brake and tune-up" ability ... check its volume ... and this would give us about as direct a measurement of the trait under consideration as we could get. At this level, we would be dealing with the TRUE SCORES of the mechanics and by this method, we would be able to order the mechanics in terms of their ability ... and the scaling is over.

However, this process would be rather impossible in this case ... don't you imagine? So, what are some alternatives? One possibility would be to WATCH the mechanics work on a car that supposedly had brake and tune-up problems ... and make various recordings and observations as to how they went about fixing the brakes and tuning up the car. Usually, at this level, we call the process one of "behavioral observation" and, as we can see .... it is about as close as we can come to knowing the mechanics' true ability ... other than that internal probe. While the behavioral observation does not provide us with exactly the same assessment of the true skill as the internal probe would, it is not very far away from it. So ... this is an example of "close" measurement or scaling ... though not THE closest it could be.

As one last example, we might administer to each of these mechanics a paper and pencil test where they have to respond to a series of multiple-choice items about various aspects of fixing brakes and tuning up the car. In this way, and without too much effort, we could find out the relative ordering of the mechanics on this test ... and let THAT proxy measure tell us about their brake and tune-up ability. Of course, it should be obvious that this way of doing the "scaling" is a far cry from either actually observing the mechanics do the work ... or implanting that internal probe. Thus, the paper and pencil test is a long way off from the real thing ... but we use it more or less as a good estimator of the real differences among the different mechanics.

Thus ... in the previous cube model that had "responses" as one dimension of the overall scaling problem, different responses are analogous to the different ways of making the assessment, and it is important to realize that some of these response modes provide an assessment that is closer to ... and others one that is further away from ... the REAL characteristic under consideration. And ... the general importance of this distinction is that the further away the proxy measure is from measuring DIRECTLY the trait or behavior in question ... the greater the chance that the value (ie, score) we give to one person based on our assessment ... could be in error.
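A small simulation can illustrate the near versus far idea. This is only a sketch under assumptions that are mine, not the handout's: each observed score is a true score plus random error (the usual classical test theory setup), and a "further away" measure is modeled simply as one with a larger error spread. The error sizes 0.3 and 1.5 are arbitrary.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(error_sd, n=2000, seed=0):
    """Correlate simulated true scores with observed = true + random error."""
    rng = random.Random(seed)
    true = [rng.gauss(0, 1) for _ in range(n)]
    observed = [t + rng.gauss(0, error_sd) for t in true]
    return pearson(true, observed)


# Assumed error sizes: small for a "close" measure, large for a "far" proxy.
r_observation = simulate(error_sd=0.3)  # e.g. watching the mechanic work
r_paper_test = simulate(error_sd=1.5)   # e.g. a multiple-choice test

print(f"close measure r = {r_observation:.2f}")
print(f"far proxy r     = {r_paper_test:.2f}")
```

The larger the error spread, the weaker the link between the observed scores and the simulated true scores, which is another way of saying that any one mechanic's score is more likely to misplace him or her.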

Questions or comments ... send note to .. Dennis

4. Midweek Jan 17, 1996

In addition to administrative matters, our class focused on 2 issues today. Photographer comes Monday for class pic ... keep tuned! .... and scaling topics have been divided up ... more on that later.

Why Scale?

The first issue I threw out for discussion was the question: Why scale in the first place? Or, stated differently, if we somehow stopped all scaling tomorrow ... would that be any great loss? What I was hoping for ... and got ... was some discussion about why scaling is important ... some justifications for it. And here is a little of what came out.

The first thing was that it is important when making decisions about allocating people to tasks that we have some methods by which to make GOOD ALLOCATIONS ... and to do this in an efficient manner. We tossed out several examples such as ... if you were going to have your gall bladder out, heaven forbid ... it would be nice to have some assurance that the person who was going to do it had training and COULD perform the task with minimal risk to YOU, etc.

Another reason was to make diagnoses of a personal nature ... to better know oneself ... regardless of whether or not that impacts on decisions of what you might do. For example, scales can provide a person with personal information that might enable him or her to change what he or she does ...

A third reason was to provide some baseline information about things. For example, what if we had a task that needed to be done and, we had decided or come to the conclusion that it would take X amount of skill in order to perform the task. A scale might provide baseline information about whether ANYONE had sufficient skill to carry out the task ... or the baseline might simply provide information about how many might be available in a pool that COULD do the task IF we needed it done.

Finally, while no student mentioned this ... the old prof suggested that ... we might scale to MAKE MONEY! In the free market system ... we probably don't REALLY need another IQ test (as an example) but ... some think they can build a better mousetrap so go into the business to see if they can convince others that they have ... and hence contribute to their own livelihood.

We also mentioned that while scaling is valuable ... it can be done well or not ... and the act itself falls on some continuum ... from "Done Awfully" to "Done Really Well" and, if segments in society demand the use of scaling ... society should make sure that people are trained well to do it correctly. That's one reason why we have this course!

Measurement of Academic Learning

Another topic I raised was that of the link between scaling efforts and whether we measure "learning" in a typical classroom. I published a paper many moons ago .... Measurement of academic learning, EDUCATIONAL FORUM, January 1969, pp 207-211 .... that discussed this point. I first asked students to babble out some views on what they thought the definition of learning was and, we came to some agreement that at a minimum ... CHANGE IN BEHAVIOR POTENTIAL ... was a good start. In order to claim that person X learned "something" ... either he/she had to demonstrate a change from not being able to do it at time 1 to NOW knowing how to do it at time 2 ... or at least agree that there was a change in the potential for doing it at time 2 .... even though we might not ASK the person to perform the task.

Then ... we looked at how we normally "measure" in typical classrooms and all agreed that grades that are given normally are assumed to reflect how MUCH the students had learned .... but, we readily admitted that if we did not know what the students knew coming in to the class .... we would not be in a position to assess how MUCH they gained from time 1 (start of course) to time 2 (end of course). Thus .... in general, what we assessed in classes or courses was where students appeared to be AT THAT POINT IN TIME but ... that the level they demonstrate at time 2 does NOT mean that they gained more than other students nor does the grade indicate how MUCH was gained.

Now, there are a host of both technical measurement related problems and ethical problems with instituting an across the board pretest and posttest system to assess change (see Roberts and Burrill 1995, GAIN SCORE GRADING REVISITED in Educational Measurement: Issues and Practice, V14, #1, Spring, pp 29 and 30), but I did suggest that instructors might want to try now and then to give a final exam at the beginning of a course ... followed with some alternate form of that at posttest ... to see how well THEIR perceptions of how much students learned are supported by the evidence. I also stressed that at the least we need to realize that HOW we usually measure does NOT assess how much learning occurred.
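To make the level versus gain distinction concrete, here is a small Python sketch with invented students and invented percent-correct scores. It uses the simplest possible gain (posttest minus pretest) and sets aside the technical problems with difference scores mentioned above.

```python
# Hypothetical percent-correct scores on a pretest and an alternate-form posttest.
scores = {
    "Ann": {"pre": 85, "post": 92},
    "Bob": {"pre": 40, "post": 78},
    "Cal": {"pre": 60, "post": 65},
}

# Gain = posttest minus pretest (a simple difference score).
gains = {name: s["post"] - s["pre"] for name, s in scores.items()}

# Rank students two ways: by where they end up, and by how much they changed.
by_final = sorted(scores, key=lambda name: scores[name]["post"], reverse=True)
by_gain = sorted(gains, key=gains.get, reverse=True)

print("Ranked by final level:", by_final)
print("Ranked by gain:       ", by_gain)
```

In this made-up class, Ann ends the course on top but Bob changed the most, so a grade based on the time 2 score alone would say nothing about who gained more.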

Web Pages

Friday is a day where I have invited students into my office to demo a bit about making web documents ... and we will look at how one creates source code (if you pull down VIEW and click on SOURCE ... you can see it too!) ... and look at the code for my first page ... and then I will show them about the HotDogPro program I use to keep adding to my pages.

Everyone out there have a nice weekend! And ... if you need to contact me ... That Roberts guy