
Evaluating Effectiveness of Training

There is a well-established and practical model for measuring whether your training is effective: Kirkpatrick's Four Levels of Evaluation. The purpose of this document is to explain the model and how it is applied to EHS training courses at Berkeley Lab.

Kirkpatrick's four levels of evaluation

Kirkpatrick's evaluation model is often represented as a set of tiers representing a hierarchy as shown below.

[Figure: Kirkpatrick's four levels of evaluation shown as a tiered hierarchy]

Kirkpatrick's model is used to evaluate the effectiveness of a training program, that is, to determine whether the training is yielding the intended outcomes and results. In simple terms, it determines the extent to which the training "hit the mark" relative to its cost and effort (was it worth the time, effort, and cost to provide the training?). The answer differs for each program, but the methods can be applied consistently across programs.

Level 4 Results

To what extent did the training improve business or performance results (ROI/ROE) or, in the case of safety, improve metrics, as evidenced, for example, by a decrease in incidents and accidents?

Level 3 Behavior

To what extent did participants apply (incorporate) what they learned during training once they were back at work (transfer)?

Level 2 Learning

To what extent did participants acquire the intended knowledge, skills, and abilities as a result of training (exams, practicals, activities)?

Level 1 Reaction

To what extent did participants respond favorably to the training?

A risk-based approach

EHS employs a risk-based approach to determine how much effort is used to evaluate training effectiveness. Evaluating training at Levels 3 and 4 can take considerable effort, time, and cost, so that effort is balanced against risk. In short, the higher the consequence of error, the more effort is used to evaluate effectiveness. Take radiological worker training as an example. It is vital that radiological workers be able to use survey instruments effectively to determine whether a work area is contaminated, and the consequence of not being able to do so is high (it impacts worker safety, impacts LBL's reputation, or results in program penalties). This makes it a good candidate for Level 3 evaluation: you would want to verify, in the place of work, that those trained are actually capable of using a survey instrument rather than leave this to chance.

Risk Level 1 hazards:
Trainings associated with risk-level 1 hazards are evaluated using Level 1 and Level 2 methods. However, this is not a hard-and-fast rule. Ergonomics, for example, has a high consequence for personal health and safety, as well as institutional cost, but is a risk-level 1 hazard. This matters because it is important to apply judgement when determining the value of performing a Level 3 evaluation, which requires greater effort. Level 3 can also help identify factors other than training that impact desired performance, such as inadequate resources, human factors, or management/supervisor issues.

Risk Level 2 hazards:
Trainings associated with risk-level 2 hazards are evaluated using Level 1 and Level 2 methods, and can benefit from Level 3 methods. Again, judgement is required, since risk-level 2 "topics" cover a wide range of hazard classes and therefore a wide range of knowledge, skill sets, and competencies. This can make evaluation difficult, so it is very important to align evaluation to the specific objectives and evaluation used in the training (discussed later).

Risk Level 3 hazards:
Trainings associated with risk-level 3 hazards are evaluated using Level 1, Level 2, and Level 3 methods. This is not a hard-and-fast rule, but rather a best practice. Why? If, for example, it is critical that workers who complete Lockout/Tagout (LOTO) training are able to perform the procedure without error, it is important not only to validate this as part of the training, but also to validate that all critical steps are being applied in the course of work. This answers the question: did the training transfer? If not, it helps determine why not. This allows a program to determine whether training is "sticking" and provides a way to refine and improve training based on discoveries. It also helps identify non-training factors that affect performance, so these can be managed separately.
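
The risk-level guidance in the three paragraphs above can be summarized as a small lookup. The following is a minimal sketch in Python, not an official EHS tool: the function name `evaluation_levels`, the `apply_judgement` flag, and the split into "baseline" and "judgement-based" levels are assumptions made for illustration. As the text stresses, none of this is a hard-and-fast rule.

```python
# Illustrative sketch of the risk-based guidance described in the text.
# Keys are hazard risk levels; values are Kirkpatrick evaluation levels.
BASELINE = {
    1: [1, 2],     # risk-level 1: Level 1 and Level 2 methods
    2: [1, 2],     # risk-level 2: Level 1 and Level 2 methods
    3: [1, 2, 3],  # risk-level 3: Levels 1, 2, and 3 as a best practice
}

# Levels the text says may be added when judgement supports the effort.
JUDGEMENT_BASED = {
    1: [3],  # e.g. ergonomics: high consequence despite risk-level 1
    2: [3],  # risk-level 2 "can benefit from" Level 3 methods
    3: [],
}


def evaluation_levels(risk_level, apply_judgement=False):
    """Return the Kirkpatrick levels suggested for a hazard risk level."""
    if risk_level not in BASELINE:
        raise ValueError(f"unknown risk level: {risk_level}")
    levels = list(BASELINE[risk_level])
    if apply_judgement:
        levels += JUDGEMENT_BASED[risk_level]
    return sorted(levels)


print(evaluation_levels(1))                        # [1, 2]
print(evaluation_levels(2, apply_judgement=True))  # [1, 2, 3]
print(evaluation_levels(3))                        # [1, 2, 3]
```

Keeping the judgement-based levels in a separate table makes the distinction between the default practice and a case-by-case decision explicit.
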
The following is an outline to help decide when to apply Levels 1, 2, 3, and 4.

| | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- |
| Risk Level 1 / 2: Standard awareness-level trainings (safety or business process) that are not critical to safety or to achieving business-critical work. | | | | |
| Risk Level 3: Electrical safety, radiological worker, fall protection, key emergency response, or key business processes where there is a meaningful consequence to lack of performance. (optional) | | | | |
| Impacts large population: Requirement impacts a large population, and the training program therefore wants to validate that it is achieving results given the impact/cost. (optional) | | | | |
| High administrative or programmatic risk: It is important to a program to validate that training is being applied, for assurance or other business reasons. | | | | |


Example of Applying Level 2 and 3

The following is an example of how to apply Level 2 and Level 3 evaluation methods. The example is based on training for the use of electrical gloves and tools. Since evaluation should align directly to the learning objectives, the example starts with three objectives.

After completing this lesson, participants will be able to:

Level 2: Methods used to evaluate whether participants learned. This example uses two parts to position context: the instruction, and evaluating learning.

Level 2 evaluation simply evaluates the extent to which participants learned what they were supposed to learn. It is directly correlated to the performance objectives which describe learning in performance-based terms.

Level 3: Methods used to evaluate whether the learning that took place in the classroom transferred to the workplace. The best method is work observation, but other methods include confidence intervals and structured interviews. It is suggested that evaluations be conducted 90 days or more after initial training to determine whether the behaviors have “stuck.” It is also suggested that evaluation be performed using a random sample of 20 or more. The following is an example of a structured work observation rubric that is aligned to the learning objectives.


| | Inadequate (1) | Developing (2) | Skilled (3) | Exceptional (4) |
| --- | --- | --- | --- | --- |
| When to inspect gloves | Unable to describe when glove inspection is needed using real-world examples. | Explained one situation for when to inspect, but did not understand other applicable situations when probed. | Provided examples of when they inspected their gloves and were able to indicate whether inspection was needed in other situations when asked. | Provided examples of when they inspected gloves, and were able to explain multiple other examples and the reasoning behind them. |
| How to inspect gloves | Missed one or more critical steps. | Performed all critical steps, but confidence and/or technique still needed further development. | Performed all critical steps and was able to explain why they are critical. Technique was good. | Showed automaticity, but with a critical eye for error checking. Able to explain all critical steps and demonstrated good technique. |
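
The sampling suggestions above (90 days or more after training, a random sample of 20 or more) and the 1 to 4 rubric scale can be combined into a simple workflow. The following is a minimal sketch under stated assumptions: the function names `pick_sample` and `summarize`, the worker IDs, the criterion keys, and the "share at Skilled or better" summary are all hypothetical choices for illustration, not part of the EHS process.

```python
import random
from datetime import date, timedelta
from statistics import mean

MIN_DAYS_SINCE_TRAINING = 90  # "90 days or more after initial training"
MIN_SAMPLE_SIZE = 20          # "a random sample of 20 or more"
SKILLED = 3                   # rubric score labeled "Skilled"


def pick_sample(completions, today, size=MIN_SAMPLE_SIZE, seed=None):
    """Randomly pick workers whose training is at least 90 days old.

    completions maps a worker ID to the date training was completed.
    """
    cutoff = today - timedelta(days=MIN_DAYS_SINCE_TRAINING)
    eligible = sorted(w for w, done in completions.items() if done <= cutoff)
    if len(eligible) < size:
        raise ValueError(f"only {len(eligible)} eligible workers, need {size}")
    return random.Random(seed).sample(eligible, size)


def summarize(observations):
    """Summarize rubric scores (1-4) per criterion.

    observations is a non-empty list of dicts mapping criterion to score.
    """
    summary = {}
    for criterion in observations[0]:
        scores = [obs[criterion] for obs in observations]
        skilled = sum(score >= SKILLED for score in scores)
        summary[criterion] = {
            "mean": mean(scores),
            "share_skilled_or_better": skilled / len(scores),
        }
    return summary


# Hypothetical data: 30 workers trained 100 or more days ago, one too recent.
today = date(2024, 6, 1)
completions = {
    f"worker-{i:02d}": today - timedelta(days=100 + i) for i in range(30)
}
completions["worker-new"] = today - timedelta(days=10)
sample = pick_sample(completions, today, seed=1)
print(len(sample))  # 20

# Hypothetical rubric scores from four work observations.
observations = [
    {"when_to_inspect": 3, "how_to_inspect": 4},
    {"when_to_inspect": 2, "how_to_inspect": 3},
    {"when_to_inspect": 4, "how_to_inspect": 1},
    {"when_to_inspect": 3, "how_to_inspect": 3},
]
print(summarize(observations))
```

A mean alone can hide a worker who missed a critical step, so the sketch also reports the share of observations scored Skilled or better for each criterion.
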


2) Confidence intervals
Confidence intervals can be a useful technique for measuring the confidence of supervisors or managers in worker performance. By itself, this is not the same as validating through observation, but it provides value to programs where work observation is difficult or not applicable (for example, knowledge work). It is best used by front-line managers, work leads, or activity leads who have direct oversight of the work being performed and therefore a close relationship to the workers. Reliability is measured in part by the consistency of responses among all who respond. Validity here is formative validity: how well the measures can be used to improve training.
Example: A survey could have the following questions

The last two questions are used to determine whether the supervisor has adequate understanding of the process to be considered a reliable source (just examples).
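
The point that reliability is measured in part by the consistency of responses can be illustrated with a simple spread calculation. This is a sketch under stated assumptions: the 1 to 5 rating scale, the question labels, the function name `response_consistency`, and the 1.0 threshold are hypothetical, and a population standard deviation is only one rough way to gauge agreement among respondents.

```python
from statistics import mean, pstdev

SPREAD_THRESHOLD = 1.0  # hypothetical cut-off for "supervisors disagree"


def response_consistency(ratings_by_question, threshold=SPREAD_THRESHOLD):
    """Return the mean, spread, and a consistency flag for each question.

    ratings_by_question maps a question label to the list of ratings
    (assumed here to be on a 1-5 scale) given by each supervisor.
    """
    report = {}
    for question, ratings in ratings_by_question.items():
        spread = pstdev(ratings)  # population standard deviation
        report[question] = {
            "mean": mean(ratings),
            "spread": spread,
            "consistent": spread <= threshold,
        }
    return report


# Hypothetical ratings from five supervisors.
ratings = {
    "Q1": [4, 4, 5, 4, 4],  # close agreement
    "Q2": [1, 5, 2, 5, 1],  # wide disagreement
}
for question, stats in response_consistency(ratings).items():
    print(question, stats)
```

A question flagged as inconsistent does not show that workers are performing poorly; it suggests the supervisors' answers to that question need follow-up before they are used as evidence.
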

3) Structured Interviews
Structured interviews are used to evaluate understanding, not skill. Rubrics are helpful, as is making sure that all participants are asked the same questions, which reinforces the validity of the results. Interviews, in comparison to confidence intervals, allow for discussion and therefore yield more meaningful data.

Q1: When did you recently wear electrical gloves, and why?
Q2: How did you know that the gloves you wore were effective?
Q3: If someone new asked you when they needed to wear gloves, what would you tell them?
Q4: How do you inspect your gloves and when?

All questions should be based on the learning objectives from the training so they are valid.


Training evaluation allows programs to determine the extent to which workers are meeting expectations and applying what they learn. Using evidence-based methods strengthens a program’s ability to provide training that yields expected results. It also provides the data necessary to show management a positive return on investment (ROI) and return on expectations (ROE), without which efforts are unsubstantiated. Finally, it allows a program to clearly define the scope of responsibility so that training is not held accountable for non-training issues.