Thursday, July 11, 2013

Construction, administration and quality assurance of student assessment in HPE - Part II

Kathmandu, Nepal

In the last post, I described the rationale and steps for creating the SISAQ or PBQ. This post deals with cut-off score (pass mark) determination for the SISAQ. Item analysis of the SISAQ will be discussed in the next post.

SISAQs mostly test Higher Order Thinking Skills (HOTS) and are usually difficult for the minimally competent "borderline" candidates. Since test-takers do not see options as in MCQs, the chance of "guessing" is eliminated. Further, SISAQ questions are NOT direct, so test-takers must extract the answers from the clinical/laboratory/public health scenarios and vignettes, which is again a challenging task for borderline students.

Unlike the OSPE/OSCE, where students are observed for 5 - 8 minutes while performing a task and a holistic "global" rating is awarded at the end to facilitate calculation of the pass mark, SISAQ performance cannot be observed over such a short span of time for each student. Thus, the SISAQ pass mark is determined by a compromise method that combines standard setting done a priori by the concerned faculty with the students' performance (marks obtained) in the same SISAQ.

SISAQ standard setting by faculty:
 
Once the SISAQ has been constructed as per the rules laid down in the previous blog post in this thread, the group of faculty responsible for that SISAQ decides on the range of expected marks for five mutually exclusive groups of students, based on the required competencies.
 
The five groups are as follows:

Very Poor:
Poor:
Borderline:
Good:
Very Good:   

The most important group is the borderline group, as it is used to determine the pass mark. It is therefore crucial to assign the range of marks for this group carefully, based on the difficulty level of the scenario, vignettes and questions from each discipline. Faculty trained in criterion-referenced student assessment also know that the "borderline" group refers to those students who could either barely pass or fail the SISAQ.

Once the SISAQ has been administered and marked by the faculty, a compromise method known as the "borderline regression method" is used to calculate the pass mark. It is simply a linear regression analysis in which the marks obtained by the students in the SISAQ are taken as the dependent variable (y) and the five mutually exclusive standard-setting groups as the independent variable (x). As the five groups are ordinal in nature, they are coded 1 to 5 for analysis. The code of the borderline group (i.e. 3) is then plugged into the regression equation to estimate the pass mark.
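To make this concrete, here is a minimal sketch of the borderline regression calculation in Python. The group codes and marks below are made-up illustrative data (not the ones in the spreadsheet linked later in this post), and numpy's polyfit stands in for the spreadsheet regression.

```python
# Borderline regression sketch with hypothetical data (20 students, SISAQ out of 15).
import numpy as np

# x: standard-setting group code assigned to each student (1 = Very Poor ... 5 = Very Good)
group_code = np.array([1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 4, 3])
# y: marks obtained by the same students in the SISAQ (hypothetical)
marks      = np.array([2, 4, 5, 5, 6, 7, 6, 8, 9, 8, 10, 9, 12, 13, 11, 14, 12, 13, 10, 7])

# Fit marks = slope * code + intercept, then predict the mark at the borderline code (3).
slope, intercept = np.polyfit(group_code, marks, 1)
pass_mark = slope * 3 + intercept   # Google Sheets equivalent: =FORECAST(3, <marks range>, <codes range>)

print(f"Pass mark = {pass_mark:.3f} / 15 ({100 * pass_mark / 15:.2f}%)")
print(f"Failures  = {(marks < pass_mark).sum()} of {len(marks)} students")
```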
 
Illustration:
 
SISAQ # 2
 
Full marks: 15 marks
Total time: 25 minutes

Criterion-referenced standard setting:

1. Very Poor:    0 - 3
2. Poor:             3 - 5
3. Borderline:   5 - 7
4. Good:           7 - 11
5. Very Good:  11 - 15

I found Google Docs very handy for running the simple linear regression, using the built-in "FORECAST" function to determine the pass mark of a SISAQ. You can see the details by following this link:

https://docs.google.com/spreadsheet/ccc?key=0AoJkXbvD07ECdHh0NmUyUkFjNnA4aXozWnlJajM4QWc&usp=sharing

The borderline regression (BLR) method gives the pass mark of this hypothetical SISAQ as 6.242 (41.61%); based on this pass mark, 4 students (20%) fail and 16 (80%) pass. The pass mark (6.242) was obtained from the students' actual performance in the exam and lies between 5 and 7, the lower and upper limits of the borderline range set by the faculty.
 
The Google Spreadsheet linked above also shows the outcome for this SISAQ if the conventional 50% pass mark is used: only 60% of the students would pass and 40% would fail, a different result from the BLR method above. This happens because the "fixed" criterion of 50% does not take account of the difficulty level of the SISAQ with reference to the "borderline" group of students and is therefore "biased". We all know that every SISAQ cannot have the same level of difficulty, so it is unjust to have a fixed pass mark for each and every item and test. I therefore strongly suggest a fluctuating pass mark for SISAQs, based on their difficulty in terms of content and construct.
 
In sum, a theory paper should have both MCQs and SISAQs to test knowledge from LOTS to HOTS. They should use the Angoff, Hofstee or Beuk cut-off score for MCQs and the BLR cut-off score for SISAQs in order to get a "holistic" and "criterion-referenced" assessment of knowledge for any competency-based curriculum. These methods are defensible because they are always based on the decision of a group of faculty rather than a single faculty member, and mostly they also use students' exam scores.
 
Most importantly, these criterion-referenced methods provide defensible and justifiable grounds for scientific, evidence-based student assessment in undergraduate as well as postgraduate health professions education programs.
 
Rest later, happy SISAQing ...

Friday, July 5, 2013

Construction, administration and quality assurance of student assessments in HPE - Part I

Kathmandu, Nepal

Two years back, I delivered a talk on "Ensuring quality of student assessment" at KIST Medical College (KISTMC), Lalitpur, Nepal. The Basic Sciences coordinator at KISTMC, Prof. Dr. P Ravi Shankar, specifically asked me to focus on the construction, administration and quality assurance of Short Answer Questions (SAQs). I am sharing the talk I delivered there in writing here.
 
SAQs are not new to health professions educators, particularly in South Asia, as they have been used independently and extensively to assess knowledge in depth and breadth across the various basic sciences disciplines as well as community health sciences/community medicine. This is logical, as these disciplines are delivered independently and the concerned discipline faculty or department want to test the knowledge imparted mostly through large-group lecture sessions. These SAQs are mostly unstructured (i.e. model answers are neither prepared nor discussed), apart from not being integrated with other disciplines and having little relevance to the clinical sciences. Consequently, the SAQ becomes a mere tool for assessing "memorization" of knowledge, which can be done easily and reliably with MCQs. Similarly, very difficult SAQs become abundant in student assessment once the faculty exhaust the "factual recall of knowledge" type of questions. Both situations are problematic and non-defensible, because the SAQ pass mark does not change between exams owing to the fixed criterion of a 50% cut-score (pass mark) in Nepal and South Asia. Logically, SAQ papers containing mostly difficult questions should have lower pass marks and those containing mostly easy questions should have higher pass marks. Fortunately, there is a remedy for these problems called "standard setting".
 
In addition, SAQs for high-stakes exams in Nepal and South Asia are mostly constructed/collated by a single senior faculty member. Although this recognizes the integrity, honesty and teaching and research(?) experience of that respected/reputed faculty member, it may not fulfill validity and reliability criteria. Furthermore, when these SAQs are constructed in isolation and without an examination blueprint, they undermine the validity not only of the SAQs but of the whole test. Thus, it is recommended to construct/collate the SAQ items for any test only after discussing content and construct with the concerned faculty, individually and in groups. The SAQs compiled through this process should be standard set using a suitable criterion-referenced method to produce a valid and defensible cut-off score for formative and, if possible, summative exams as well. SAQs must contain concise and precise model answers with clear marking schemes to avoid inter-rater reliability bias; such SAQs are known as structured SAQs.
 
It is well understood that medical students need to appreciate the integration of the various basic sciences disciplines so that they can see the "links" between them. These links are essential for medical relevance in terms of prevention, diagnosis and management modalities, including Behavior Change Communication (BCC) at a larger scale. This integration should start right from the beginning rather than waiting for students to learn it in the clinical sciences phase, where there is already so much to learn in terms of skills and attitude. This provides a strong rationale for using integrated SAQs in the early phase of the undergraduate medical education (MBBS) course.
 
Ideally, undergraduate medical education should have an integrated basic sciences curriculum with introductory clinical medicine (early clinical exposure) and public health (community health/medicine) throughout the two-year period. This enables the introduction of appropriate and innovative teaching/learning methods that instill life-long and self-directed learning habits among students, which are crucial attributes of a good physician. SAQs should therefore assess not only what was imparted but also what was learnt through self-directed learning, which in turn advocates for teaching/learning and assessment methods that promote self-directed learning. A hybrid Problem-Based Learning (PBL) method thus becomes a rational choice as the main teaching/learning method in the integrated basic sciences phase of the curriculum, as it enables students to appreciate the teaching/learning of the various disciplines from patient and community care perspectives. Students also gain knowledge from large-group lectures delivered to cover important and difficult concepts and competencies not covered through the PBL cases. Carefully constructed and administered self-assessment tests also promote self-directed learning among students.
 
Therefore, it is recommended to use an "integrated" curriculum and appropriate teaching/learning methods, and to construct structured integrated SAQs from the beginning of the undergraduate medical education (MBBS) course, as integrated SAQs are an extremely important tool for assessing short-term and long-term retention of knowledge. They also enable testing of Higher Order Thinking Skills (HOTS), which are usually difficult to assess using Multiple Choice Questions (MCQs).


Since Bloom's Taxonomy classifies knowledge into six levels of increasing difficulty (see figure on LOTS & HOTS), using MCQs and PBQs together gives a holistic assessment of the knowledge imparted and/or acquired through the PBL tutorials and large-group lecture sessions. It is the PBL cases and sessions that provide the opportunity to assess higher levels of knowledge, as students participate in identifying the learning issues related to each PBL case and later independently prepare and present their findings in small-group tutorials facilitated by a trained faculty member.

 
A meta-analysis (Walker and Leary, 2009) and a meta-synthesis of meta-analyses (Strobel and van Barneveld, 2009) comparing PBL with conventional classrooms have shown that long-term retention of knowledge is better among PBL students than among students exposed mostly to large-group lecture sessions. The learning pyramid also shows that retention of knowledge is much higher when one prepares for discussion and/or teaching (see figure on learning pyramid), and students experience both in the PBL tutorials.

Nonetheless, there will initially be a dilemma about using SAQs to assess knowledge in a hybrid PBL curriculum, owing to the fear of assessing the same knowledge repeatedly through MCQs and then SAQs. Thus, a logical and viable solution should be developed to ensure the quality of the SAQs being used, for the learners, the teachers and the program. Such a method can be developed, and I call it the Structured Integrated Short Answer Question (SISAQ) or Problem Based Question (PBQ).
 
In this method, a half-day or full-day PBQ writing workshop is organized, and all the faculty involved in teaching and facilitating the particular organ-system based block are formally invited to participate. Once they gather in an isolated area, the curricular contents from each discipline are listed on a whiteboard/spreadsheet to produce the examination blueprint for the PBQs. This should be done by the concerned block director as far as possible. Relevant clinical cases are then listed to cover the various disciplines and contents laid down in the blueprint. Only a few (4-6) cases are chosen by consensus for constructing the required PBQs for the formative and summative exams, as laid down by the examination policies of the institute. It should be noted that PBQs demand nearly two minutes per mark, and more than 6 PBQs will be very demanding on the cognitive ability of the students in the stipulated time. Once the PBQ case themes are selected through consensus, the workshop faculty are divided into 2-6 groups, where each group has at least one faculty member from each of the basic, community and clinical sciences disciplines to cover their contents/competencies in the PBQs. They then work in groups to construct and discuss the PBQs.
 
Before breaking into groups, faculty must be informed of/made aware of these rules:
 
1) Factual recall questions are NOT allowed in the PBQs; questions should be at the application level and beyond
2) Direct questions are NOT allowed in the PBQs; questions should be constructed from the PBQ vignette, as it is desirable to test knowledge deepening and knowledge creation rather than knowledge acquisition
3) There will be at most 10 questions in each PBQ, and each question will carry a minimum of 0.5 marks
4) The total marks of the PBQ should be between 15 and 25, and the total time (in minutes) should be 1.5 - 2.0 times the total marks
5) Each item should have concise as well as precise model answer(s) with a clear marking scheme, so that any faculty member can mark it while it remains reliable and valid
6) Each PBQ question should be labeled properly with the concerned discipline to evaluate disaggregated performance, so that proper feedback can be given to specific student(s)
7) Each PBQ should be standard set using a suitable criterion-referenced method, so that the pass mark changes from exam to exam reflecting the construct (easy, moderate, difficult) rather than the content
8) Each PBQ should be handed over to the block director and must be permanently destroyed/deleted from the flip-chart and/or computer used by the group
 
Once the PBQs are constructed, the concerned block director should review them for face and content validity with the basic sciences coordinator/chair. These PBQs should be submitted to the examination section immediately, along with any comments and suggestions. The examination section should send them first to the internal and external reviewers and then to the moderation committee with all the comments/queries/suggestions gathered so far. The moderation committee should finally decide on each item of each PBQ and hand it back to the examination section for further administrative processing.
 
If done correctly, the SISAQ or PBQ will be able to assess higher levels of knowledge, which in turn will provide important information on the success of knowledge transfer, especially long-term knowledge retention and its application, among students in a traditional and/or hybrid PBL curriculum.
 
I will cover the cut-score (pass mark) determination and item analysis of the PBQs in the next blog.
 
Until then, happy PBQing ...

References:

Walker, A., & Leary, H. (2009). A Problem Based Learning Meta-Analysis: Differences Across Problem Types, Implementation Types, Disciplines, and Assessment Levels. Interdisciplinary Journal of Problem-based Learning, 3(1).

Strobel, J., & van Barneveld, A. (2009). When is PBL More Effective? A Meta-synthesis of Meta-analyses Comparing PBL to Conventional Classrooms. Interdisciplinary Journal of Problem-based Learning, 3(1).

Wednesday, July 3, 2013

Massive Open Online Course for HPE&R

Kathmandu, Nepal

The Massive Open Online Course (MOOC) has changed the way learning happens around the globe. Distance learning is becoming the norm now, as the internet has empowered us to take courses from the institutions and instructors of our choice without leaving the comfort and safety of our room, family and country. It might also contribute to reducing the much talked about/reported "brain drain" in education in the long term, as these courses are "free" so far. In my opinion, two MOOC players, EdX and Coursera, are currently the most active in this scenario. EdX is open-source whereas Coursera is closed-source.
 
Both of them have some excellent courses related to Health Professions Education & Research (HPE&R), but the recently announced Coursera course entitled "Instructional Methods in Health Professions Education" is getting a lot of attention here in Nepal and South Asia. Many colleagues from the Patan Academy of Health Sciences (PAHS), and FAIMER fellows and faculty from the PSG FAIMER Regional Institute (PSG-FRI), have already registered for this course within a month of learning about it through the "group mail" and "listserv" of PAHS and PSG-FRI respectively.
 
The introduction to this course says: "This course provides those involved in educating members of the health professions an asynchronous, interdisciplinary, and interactive way to obtain, expand, and improve their teaching skills."

With this course, the instructor claims that the learners will:

1. Understand educational theory as it relates to health professions education
2. Match instructional methods with desired educational outcomes
3. Learn a variety of applied teaching techniques
4. Share successful teaching strategies

It covers broad areas of HPE, and I especially like that the course starts with adult learning theory. There is no doubt that it will provide a theoretical and practical basis for all faculty involved in HPE.
 
 
The course starts on August 5, 2013, so kindly register if you want to improve your teaching, assessment and research in HPE; whether it turns out to be effective remains for you to judge later.
 
I would like to thank the first voter(s) who have cast their choice in the poll entitled "What do you want me to write (next)?" I will wait some more time and then prioritize my writing based on the "e-vox populi".
 
I am planning to write the next post on "Andragogy: Adult Learning Principles".

Testing post-partum hemorrhage management skills using OSCEs

Kathmandu, Nepal

Developing and establishing Reliability and Validity of OSCE stations on PPH Management skills:
 
1. Write a checklist/series of checklists that can be assessed feasibly in 5 minutes (remember, an OSCE station is usually of 5 minutes' duration). If you want to test many things, break them down into multiple checklists to achieve a "holistic" assessment of the skills related to the topic. This now becomes a "modified OSCE".
 
2. You can take advantage of currently available tools from articles, books, websites etc. to supplement/complement your own checklists. Make the skills and checklists "mutually exclusive", i.e. don't assess the same thing on different checklists.

3. Once you develop the tools, discuss them in a team (experts in PPH management) to ensure "content validity". Remember this will be an iterative process, and you will need to document the discussion at each step.

4. Give the tool(s) for review to other content and non-content experts (GPs, Emergency Medicine, Public Health including a statistician, Allied Health etc.) to ensure "face validity".

5. Think wisely about whether you want to use standardized patients or real patients. You need to train them beforehand to get "standardized" outcomes. Real patients are difficult to manage and handle (we recently faced a serious problem with our Obs/Gyn OSCE during a formative exam, as only 2 out of 8 volunteer patients were willing and available by the end of the OSCE circuit!).

6. Train the interviewers/raters rigorously to ensure inter-rater reliability.
 
7. Standard set each OSCE station using a holistic/subjective approach, i.e. use a global rating along with the objective checklist ratings. These scores combined will give you the "competency" required to pass each OSCE station using the borderline group, contrasting groups or borderline regression method. One can also use Angoffing, but it will be a tedious process.

8. Once that is done, pilot test the OSCE stations with volunteer participants (students, interns etc.) to learn about feasibility, problems in the checklists and the reliabilities. Use at least 10-15 volunteers in this process. The pilot test will also give you the internal construct reliability, commonly known as Cronbach's alpha.
 
9. Use at least 2 raters in each station to calculate inter-rater reliability.
 
10. If you use more than 1 station, you can calculate the Generalizability (G) coefficient, which shows how reliable the whole OSCE circuit is. You need special software for this, but it is available free from PERD, McMaster University, Canada.
 
11. If you also want to assess the test-retest reliability of your PPH OSCEs, run the pilot test twice using the same interviewers and the same candidates, after a break of, say, 30 to 60 minutes on the same day. If possible, make sure not to contaminate your test-retest reliability estimate, i.e. don't let the students consult OSCE-related resources or discuss the OSCEs in between.

12. Modify the checklists after the pilot test if required and document the changes properly. Re-run the pilot test if the reliability estimates are not in the acceptable range. If they are acceptable, you are ready to run the "full-fledged" OSCE on PPH management.
 
13. Calculate the relevant statistics, interpret the results, document them, and publish the work on MedEdPortal so that others can access it when required.
 
This will complete all the reliability assessments for your PPH OSCEs (a small computational sketch of the first two follows this list):

I. Inter-rater reliability
II. Internal construct reliability
III. Test-retest reliability
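As a minimal sketch of the first two estimates, with entirely hypothetical pilot data (10 candidates, 6 checklist items, two raters) and a simple Pearson correlation of the raters' station totals standing in for a fuller intra-class correlation:

```python
# Sketch: Cronbach's alpha and a simple two-rater agreement for one pilot OSCE station.
# All numbers are made-up pilot data, not taken from the post.
import numpy as np

# Rows = 10 pilot candidates, columns = 6 checklist items scored by rater 1 (0/1/2 marks).
rater1 = np.array([
    [2, 1, 2, 1, 2, 1],
    [1, 1, 1, 0, 1, 1],
    [2, 2, 2, 2, 2, 2],
    [0, 1, 1, 0, 1, 0],
    [2, 2, 1, 2, 2, 1],
    [1, 0, 1, 1, 1, 1],
    [2, 1, 2, 2, 1, 2],
    [1, 1, 0, 1, 1, 0],
    [2, 2, 2, 1, 2, 2],
    [1, 2, 1, 1, 1, 1],
])

def cronbach_alpha(items):
    """items: candidates x checklist-items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each checklist item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of candidates' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Rater 2 scored the same candidates (again hypothetical); use station totals for agreement.
rater2_totals = np.array([8, 5, 11, 3, 9, 5, 9, 4, 10, 7])
rater1_totals = rater1.sum(axis=1)
inter_rater_r = np.corrcoef(rater1_totals, rater2_totals)[0, 1]

print(f"Cronbach's alpha (internal reliability): {cronbach_alpha(rater1):.2f}")
print(f"Inter-rater correlation (2 raters):      {inter_rater_r:.2f}")
```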

And remember that "Rome was not built in a day", so go slowly and document the whole process. Your tool development and/or pilot testing phase can be your curriculum innovation project.

Once you run the full OSCEs, you can proceed with more complex statistical analyses to establish "construct validity", "predictive validity" & "criterion validity" later ...
 
N.B. -
 
a. The same process holds true for properly pre-validating and/or validating OSPE/OSCE stations.
 
b. If the tool needs to be used in the local language (say Nepali), it should first be translated from English to Nepali by a professional translator/expert bilingual, followed by back-translation of the Nepali tool into English by another professional. The original English tool and the back-translated English tool should then be compared by the study team, and modifications should be reflected in the Nepali tool as well. This process will also be iterative and must be documented properly. It can be resource-demanding as well.

Tuesday, July 2, 2013

Standard Setting in Health Professions Education

Kathmandu, Nepal

Health Professions Education (HPE) is embracing the competency-based curriculum, teaching and assessment cycle, which is quite different from the conventional objective-driven approach.
 
The competencies laid down in a course and/or curriculum require a criterion-referenced standard setting system for effective and defensible student assessment. Standard setting sessions require subject matter/content experts to determine the competency required for each item, such as a Multiple Choice Question (MCQ), Short Answer Question (SAQ) or Objective Structured Practical/Clinical Examination (OSPE/OSCE) station.
 
One of the most widely used criterion-referenced methods of standard setting in HPE is the Angoff method, where judges review each item in terms of its content and difficulty level and award a score between 0 and 1. This score represents the probability that the "Minimally Competent Borderline Candidate", or simply "borderline" candidate, would answer the item correctly; these are the students who "can either barely pass or fail" any examination. They are "hypothetical" students, and thus a common understanding among the judges must be reached prior to the actual Angoff scoring session (Angoffing).
 
When the individual judges' scores for an item are averaged, the resulting value becomes the "cut-off score" or pass mark for that particular item. So, if a test consists of 50 standard-set items, the pass mark of the test is obtained as the sum of the Angoff scores of all the items included in the test. This represents the "competency" required to pass the test.
 
Below I present a hypothetical example for a test with 5 items, where each item is judged by a "mixed" panel of six judges. With six judges, the Angoff scores become reliable, further adding weight in favor of their validity.
 
Example:
            Judge1   Judge2   Judge3   Judge4   Judge5   Judge6   Angoff
Item 1       0.45     0.50     0.55     0.40     0.60     0.55     0.51
Item 2       0.40     0.35     0.45     0.40     0.45     0.40     0.41
Item 3       0.50     0.65     0.60     0.55     0.60     0.50     0.57
Item 4       0.70     0.75     0.80     0.65     0.70     0.65     0.71
Item 5       0.65     0.55     0.75     0.70     0.55     0.60     0.63
Pass Mark    2.70     2.80     3.15     2.70     2.90     2.70     2.83

Pass % = 56.5
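As a quick computational check of the table above, each item's Angoff score is simply the mean of the six judges' ratings, and the test pass mark is the sum of those item scores; a small Python sketch:

```python
# Angoff calculation for the hypothetical 5-item, 6-judge example above.
import numpy as np

ratings = np.array([
    [0.45, 0.50, 0.55, 0.40, 0.60, 0.55],   # Item 1
    [0.40, 0.35, 0.45, 0.40, 0.45, 0.40],   # Item 2
    [0.50, 0.65, 0.60, 0.55, 0.60, 0.50],   # Item 3
    [0.70, 0.75, 0.80, 0.65, 0.70, 0.65],   # Item 4
    [0.65, 0.55, 0.75, 0.70, 0.55, 0.60],   # Item 5
])

item_angoff = ratings.mean(axis=1)     # cut-off score for each item
test_pass_mark = item_angoff.sum()     # pass mark for the whole 5-item test

print(np.round(item_angoff, 2))        # [0.51 0.41 0.57 0.71 0.63]
print(f"Pass mark = {test_pass_mark:.2f} / 5  ({100 * test_pass_mark / 5:.1f}%)")
```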
 
One of the main problems of the Angoff method is "content expert bias": when standard setting is done by a group of experts belonging to the same discipline, they tend to give higher scores, resulting in an "upward bias" in the Angoff score that shifts the pass mark higher. This has big consequences if the test is "high stakes", as adjustments to the Angoff scores are not permitted. Thus, it is recommended to use a "mixed judge" panel to balance the scores in the standard setting process.
 
Another problem occurs when Angoff scoring is done for the very first time and there is little discussion on the concept and meaning of the "borderline" student. As most HPE courses in South Asia, including Nepal, use a 50% cut-score, faculty here tend to give scores of at least 50% to maintain this norm. I call this "novice bias".
 
When "content expert bias" or "novice bias" occurs then a compromised methods like Hofstee and Beuk are recommended for formative examinations. These methods use the students' actual scores and Angoff scores to determine the adjusted pass-mark after correcting these bias if present. It is actually recommended to use these methods in the initial phase of the formative student assessment until a normalization effect takes place among the judges. For summative assessment, it is recommended to standard set the items with item analysis results and/or repeat Angoffing to reduce the biases.
 
If standard setting is done correctly, a test with predominantly easy items will have a higher pass mark, whereas a test with mostly difficult items will have a lower pass mark. In other words, the "competency" required for a test is known to the students and faculty before the test is conducted.
 
Lastly, a test should typically be compiled from an examination blueprint, as this allows the teachers/faculty to sample contents from different sections and of varied difficulty. It also allows the faculty to assess knowledge and skills using appropriate methods, which in turn increases the validity of the test. As knowledge can be assessed at varying degrees of difficulty, it is always better to use a "mixed bag" approach in sampling curricular contents, item difficulty and assessment methods for producing technically "competent" human resources for health.
 
More can be found here:
 
 
2. www.act.org/research/researchers/reports/pdf/ACT_RR89-2.pdf

Rest later ...

So, happy Angoffing!