Assessment Literacy: Item Statistics

At the 2018 Connecting Communities of Education Stakeholders Conference in Greensboro, I was pleased to see several sessions on using Schoolnet data to identify skill weaknesses and to form student groups for remediation. The presenter began by listing the reasons teachers struggle with data analysis:

  • Not enough time
  • Not knowing how to interpret the data
  • Too many numbers, and
  • Can’t get past the negative data (results).

The presenter then outlined the steps and strategies essential for making the most of the Schoolnet student scores. These steps included looking at the overall average score on the assessment and the percentage of students who correctly answered each item. I completely endorse the approach the presenter recommended for examining the assessment data and identifying students in need. I also recommend that someone at the district level take a look at the item statistics and communicate information about “weak” items to the teachers, so that the student assessment results are not misinterpreted.

Digging Deeper Into the Scores

I sat next to an instructional coach who was following along with her own district’s benchmark data, and she shared her reports with me for discussion. As we looked at the student scores, I asked if she had examined the item discrimination values. She was not aware of this term. I explained that item discrimination is the ability of an item to differentiate high performing students from low performing students. For example, let’s say an item had a p-value of .50, meaning that 50% of the students answered it correctly; we could conclude it was an item of medium difficulty. However, if that same item has a very low or negative discrimination value, the item is not separating the two groups; in the negative case, the higher performing students tended to get the item wrong while the lower performing students tended to get it right.

Here is the sample data for the item:
  • Answer A (incorrect) – selected by 40% of the high performing students.
  • Answer B (correct) – selected by 42% of the lower performing students.
  • Answer C (incorrect) – selected by 10% of the lower performing students.
  • Answer D (incorrect) – selected by 8% of the lower performing students.

As you can see in this oversimplified example, the item was probably answered correctly by chance, and it did not add to our understanding of student performance on that skill. The question to ask is: What led the high performing students to select answer A?

  1. Was the item worded in a way that led high performing students to mistakenly select answer A?
  2. Was there some gap between what was taught and what was learned that resulted in the high performing students’ error?
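
A quick way to see this kind of answer-choice pattern in your own benchmark data is to tabulate each choice separately for the upper and lower scoring groups. The short sketch below is only an illustration and is not pulled from Schoolnet; the student responses, the answer key (B), and the top/bottom grouping rule are all assumptions made for the example.

import pandas as pd

# Hypothetical raw data: one row per student, with the answer the student
# chose on this item and the student's total percent correct on the test.
responses = pd.DataFrame({
    "choice":    ["A", "A", "A", "B", "B", "B", "C", "D", "B", "A"],
    "total_pct": [95, 90, 88, 52, 48, 45, 40, 38, 35, 80],
})

# Split students into upper and lower groups on total score
# (top and bottom thirds here; the top/bottom 27% is another common rule).
cut_hi = responses["total_pct"].quantile(2 / 3)
cut_lo = responses["total_pct"].quantile(1 / 3)
upper = responses[responses["total_pct"] >= cut_hi]
lower = responses[responses["total_pct"] <= cut_lo]

# Percentage of each group selecting each answer choice.
print(upper["choice"].value_counts(normalize=True).mul(100).round(1))
print(lower["choice"].value_counts(normalize=True).mul(100).round(1))

# Classical discrimination index for the keyed answer (assumed to be B):
# D = proportion correct in the upper group minus proportion correct in the lower group.
key = "B"
d_index = (upper["choice"] == key).mean() - (lower["choice"] == key).mean()
print(f"Discrimination index D = {d_index:.2f}")   # negative here: the lower group did better
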
Item Discrimination Statistics

The statistical procedure for computing item discrimination is called a point biserial correlation. This procedure transforms the responses to the item from A-D into a 0 or a 1 for incorrect and correct, and then correlates the 0/1 item score with each student’s total percent correct score. For this example, the point biserial correlation is -0.511.
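
To make the calculation concrete, here is a minimal sketch; the ten students and their scores are invented for illustration, arranged so that the lower scoring students tend to get the item right and the point biserial comes out negative, as in the example above.

import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical data: 1 = answered this item correctly, 0 = answered it incorrectly.
item_score = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
# Each student's total percent correct on the whole assessment.
total_score = np.array([92, 88, 85, 80, 55, 50, 48, 45, 40, 35])

# SciPy computes the point biserial correlation directly.
r_pb, sig = pointbiserialr(item_score, total_score)  # sig is statistical significance, not the item p-value
print(f"point biserial r = {r_pb:.3f}")               # negative: the lower scorers got the item right

# The same value by hand: r = (M1 - M0) / SD * sqrt(p * (1 - p)), where M1 and M0
# are the mean total scores of students who got the item right and wrong.
m1 = total_score[item_score == 1].mean()
m0 = total_score[item_score == 0].mean()
p = item_score.mean()
r_hand = (m1 - m0) / total_score.std() * np.sqrt(p * (1 - p))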

When more students in the lower performing group than in the upper performing group select the right answer to an item, the item actually has negative validity. Assuming that the criterion itself has validity, the item is not only useless but is actually serving to decrease the validity of the test.
See http://ericae.net/ft/tamu/Espy.htm for an excellent discussion of this topic.

My Recommendations:
  1. Design your assessments so that each answer choice will provide meaningful information about the students’ understanding of the underlying skill.
  2. Include enough items on the assessment so that the skills are adequately sampled. The instructional coach showed me an assessment of 15 items on which the average score was 45% correct. Such a small number of items means that decisions were being made about students who got only 5-7 items correct.
  3. Field test each benchmark with a few students to identify problems before administering the test to hundreds of students.
  4. Always look at the p-value and the item discrimination value for each item (see the sketch after this list).
  5. Ask students why they selected an answer, especially if the item has a low discrimination value.
  6. Train the teachers in how to interpret the data.
    1. Provide protocols to guide the teachers in the interpretation process.
    2. Provide item analysis information, such as noting that if a student selected the incorrect foil A, it probably means there is a misunderstanding of a particular concept. That way, all students who got that item incorrect by choosing A can be remediated on the misunderstanding.
    3. Provide the data in a way that is easy to view and manipulate.
    4. Have teachers COLLABORATIVELY examine their data in a shared discussion so that one teacher can scaffold other teachers’ understanding of the test results.
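
For recommendation 4, the small sketch below shows one way those two statistics could be computed for every item at once. It assumes the district can export a simple students-by-items table of 0/1 item scores; the table here is fabricated for illustration, and the discrimination column is a corrected point biserial (each item is correlated with the total of the other items).

import pandas as pd

# Hypothetical export: rows are students, columns are items, cells are 0/1 scores.
scores = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 1]],
    columns=["item1", "item2", "item3", "item4"],
)

total = scores.sum(axis=1)        # each student's total number correct
p_values = scores.mean()          # proportion correct per item (the p-value)

# Corrected point biserial: correlate each item with the total of the OTHER items,
# so an item's own score does not inflate its discrimination value.
discrimination = scores.apply(lambda col: col.corr(total - col))

summary = pd.DataFrame({"p_value": p_values, "discrimination": discrimination})
print(summary.round(2))   # items with low or negative discrimination deserve a closer look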

Dr. Lewis Johnson
Lead Consultant
Data Smart LLC