
“Human in the Loop” Has Become the AI Safety Equivalent of “Thoughts and Prayers”

Julian Pscheid
18 min read

There's a phrase that shows up in nearly every enterprise AI pitch deck, every procurement slide, every vendor FAQ: “Don't worry, there's a human in the loop.”

It's the universal answer to AI anxiety. A hospital deploys AI to flag radiology scans. Human in the loop. A bank uses a model to score loan applications. Human in the loop. A court system adopts risk assessment algorithms for sentencing. Human in the loop. The phrase signals responsible deployment. It suggests that whatever the AI gets wrong, a qualified person will catch it before anything consequential happens.

Except the research on how humans actually interact with AI systems tells a very different story. Across multiple studied domains, from clinical decision support in healthcare to autopilot in aviation to military air defense, the pattern keeps showing up: when you put a human in charge of reviewing AI output, the human tends to defer to the AI. Not sometimes. Routinely. Even when the AI is wrong. Even when the human would have gotten it right on their own.2,5 The same dynamic has been experimentally reproduced in simulated judicial decisions, where AI recommendations given before human judgment reduced accuracy through anchoring effects.8

We build AI tools for teacher evaluations. And we think the education sector needs to pay close attention to what other industries have already learned the hard way about the limits of human oversight.

The confidence game

“Human in the loop” has become the default answer to every concern about AI in high-stakes decisions. Worried about bias? Human in the loop. Worried about accuracy? Human in the loop. Worried about liability? Human in the loop.

What it usually delivers is a busy professional glancing at a polished AI-generated document and clicking “approve.” The human is technically in the loop. They're just not doing much in it.

A 2022 study by Ben Green at the University of Michigan surveyed 41 policies worldwide that prescribe human oversight of government algorithms. His conclusion: the policies suffer from two fundamental flaws. First, evidence shows that people can't actually perform the oversight functions the policies assume they can. Second, these oversight requirements make things worse by creating a false sense of security that legitimizes the use of flawed algorithms without fixing the underlying problems.1

Green won the Future of Privacy Forum's Privacy Papers for Policymakers Award for this work. His framing is direct: human oversight policies don't protect against algorithmic harms. They provide cover for them.

Automation bias is not a theory. It's a measured effect.

The technical term for what happens when humans “review” AI outputs is automation bias: the tendency to treat automated recommendations as a replacement for independent judgment rather than as one input among many. Researchers have been documenting this for decades.

A systematic review published in the Journal of the American Medical Informatics Association pooled results from healthcare studies where clinical decision support systems gave incorrect advice. A small, indicative meta-analysis within the review found that erroneous automated recommendations increased the risk of commission errors by 26% compared to making decisions without automated support. A commission error, in this context, means a user who had the right answer switched to the wrong one after seeing the AI's recommendation.2

That finding deserves emphasis. The AI didn't just fail to help. It actively made expert decision-makers worse by leading them to override their own correct judgments.

A 2024 study in computational pathology confirmed the same pattern with more recent AI tools. Trained pathology experts overturned their own initially correct evaluations 7% of the time after receiving incorrect AI guidance. Time pressure didn't raise the rate of automation bias, but it appeared to worsen its severity, with greater reliance on erroneous recommendations and larger overall performance declines under time constraints.3

These are trained medical professionals making decisions in their core area of expertise. A separate 2024 empirical study found that non-specialists are even more susceptible. The people who stand to benefit the most from decision support tools are also the most likely to follow them off a cliff.4

A November 2024 report from Georgetown's Center for Security and Emerging Technology examined automation bias through case studies spanning Tesla's autopilot, Boeing and Airbus aviation incidents, and military air defense systems. The report's framing is plain: “human-in-the-loop” cannot prevent all accidents or errors. When users grow accustomed to a system that usually works, they stop actively monitoring it. The habit of independent verification degrades. The system becomes the default, and the review becomes a formality.5

In aviation, they call this “automation complacency.” In education, it looks like a principal reading an AI-drafted evaluation, thinking “yeah, that sounds about right,” and moving on to the next one.

Why teacher evaluations are especially vulnerable

If you were designing a scenario to maximize automation bias, it would look a lot like the current state of AI-assisted teacher evaluations.

The evaluators are time-constrained. Administrators consistently describe evaluation as one of the most painful parts of their job. In conversations with Oregon school leaders, we've heard administrators report calendaring up to four hours for a single observation cycle. The entire appeal of AI tools is saving time. But faster review means less scrutiny, and less scrutiny is exactly where automation bias thrives.

The evaluators are often not deeply familiar with the rubric. Evaluation frameworks like the Danielson Framework for Teaching run dozens of pages with detailed criteria across four domains and numerous sub-components. A veteran administrator might have these internalized after 14 years on the job. A second-year principal does not. When AI maps evidence to rubric components, a less experienced evaluator doesn't know the framework well enough to catch mapping errors. The 2024 empirical study confirmed this dynamic in healthcare: lower domain expertise correlated directly with higher rates of incorrect decision-switching after flawed automated advice.4

The output looks authoritative. AI-generated evaluation language is fluent, well-structured, and deploys appropriate professional vocabulary. It reads like it was written by someone who knows what they're doing. Research on automation bias has identified display prominence and trust calibration as key factors that increase over-reliance on automated outputs. The more prominent and polished the system's recommendations appear, the less independently users evaluate them.2

Verification is inherently difficult. To properly review an AI-drafted evaluation, an administrator would need to cross-reference the draft against their own observation notes, the raw transcript, and the rubric criteria, then determine whether the AI's interpretation of the evidence is accurate, fair, and complete. A systematic review found that automation bias occurred even in single-task environments, typically when verification complexity was medium or higher, and that the cognitive demands of verification are cumulative across tasks. Adding secondary responsibilities (like monitoring student behavior, which administrators do constantly during observations) compounds the effect.6

Every condition that makes automation bias worse is present in teacher evaluation. Every one.

The direction of the workflow changes everything

Here's the distinction that “human in the loop” obscures: it matters enormously whether the human is reviewing AI-generated work or the AI is supporting human-generated work.

In the typical AI evaluation workflow, the administrator takes brief notes during a classroom observation. They paste those notes into an AI tool (or the tool generates a draft from a transcript). The AI produces a complete evaluation. The administrator reviews it.

At that point, the AI has already made the consequential decisions. It chose which evidence to highlight. It decided which rubric components apply. It set the framing, the tone, the balance between strengths and growth areas. The administrator is now editing a document, not constructing an evaluation. They're working within the AI's frame, and everything we know about automation bias says they'll tend to stay there.

Lisanne Bainbridge explored this tension in her landmark 1983 paper “Ironies of Automation.” Her core observation was that automation creates a paradox for human oversight: by taking over the routine parts of a task, it removes the practice that keeps human operators sharp, while still expecting those operators to step in and catch problems when the system fails. The human is left responsible for the hardest part of the job (error detection and correction) while being denied the ongoing practice that would keep them good at it.7

A 2024 experimental study published in Cognitive Research: Principles and Implications tested this directly for AI decision support. Researchers simulated a judicial decision-making process and varied whether participants received an AI recommendation before or after forming their own judgment. The results were clear: when participants saw the AI's recommendation first, their accuracy dropped. The AI's output acted as an anchor, and participants shifted toward it even when it was wrong. When participants formed their own judgment first and then received AI input, the anchoring effect was significantly reduced.8

That finding maps directly onto teacher evaluations. When the AI produces a draft and the administrator reviews it, the AI's framing is the anchor. When the administrator builds the evaluation and the AI supports the process, the administrator's own judgment remains primary.

The alternative is a workflow where the human remains the primary actor throughout. The AI transcribes audio that the administrator is actively observing. It organizes notes that the administrator is actively taking. It suggests rubric connections that the administrator evaluates against their own firsthand experience in the room.

That difference isn't cosmetic. It determines whether the evaluation reflects the administrator's professional judgment or the AI's interpretation of a transcript. One is a supported evaluation. The other is a generated one with a human stamp of approval.
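To make that distinction concrete, here is a minimal sketch of the two orderings in TypeScript. Every name and type is an illustrative assumption, not any vendor's actual API; the point is only that the same ingredients produce a different anchor depending on whose judgment exists first.

```typescript
// Minimal sketch of the two workflow directions. All names and types are
// hypothetical, chosen only to illustrate the ordering; they are not
// Elevate's (or any vendor's) actual API.

type RubricScore = { component: string; level: number; rationale: string };

// AI-first: the model drafts the whole evaluation, and the human edits that
// draft. The AI's framing is the anchor the reviewer works within.
function aiFirstWorkflow(
  draftFromAI: RubricScore[],
  humanEdit: (draft: RubricScore[]) => RubricScore[],
): RubricScore[] {
  return humanEdit(draftFromAI);
}

// Human-first: the evaluator commits to their own judgment before seeing any
// AI suggestion, then resolves each suggestion against that judgment.
function humanFirstWorkflow(
  evaluatorJudgment: RubricScore[],
  aiSuggestions: RubricScore[],
  resolve: (own: RubricScore, suggestion?: RubricScore) => RubricScore,
): RubricScore[] {
  return evaluatorJudgment.map((own) =>
    resolve(own, aiSuggestions.find((s) => s.component === own.component)),
  );
}
```

In the first function there is nothing for the reviewer's independent judgment to attach to, so the AI's draft becomes the frame. In the second, the evaluator's own scores exist before the AI's suggestions do, which is the ordering the anchoring study found protective.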

The deskilling problem

There's a longer-term risk that goes beyond any individual evaluation.

Aviation research has documented for decades that pilots who rely heavily on autopilot systems perform worse when they need to fly manually. Their skills atrophy because they aren't practicing them.5 The same dynamic applies to any domain where automated tools handle the skilled portion of the work.

For teacher evaluation, this means administrators who routinely let AI map observations to rubric criteria will, over time, know the rubric less well. Not better. They'll be less able to catch AI errors, not more. The “human in the loop” gets progressively less capable of performing the oversight function that the entire safety model depends on.

Evaluation tools should make administrators better evaluators, not make the evaluation happen despite the administrator's limited engagement. If a tool works best when the evaluator pays less attention, something has gone wrong with the design.

What meaningful AI assistance looks like

The answer isn't to avoid AI in evaluations. That ship has sailed. A February 2025 RAND Corporation survey found that nearly 60% of principals reported using AI tools during the 2023–2024 school year.9 In our conversations with Oregon school leaders, we've heard that many administrators are using consumer AI tools on personal accounts, without institutional controls, data agreements, or quality safeguards. Sticking with the status quo means accepting an unregulated mess.

But responsible AI integration requires more than appending “human in the loop” to whatever workflow the software happens to use. It requires actually designing for it.

It requires a workflow where the human remains the primary actor. The AI should handle tasks that don't require professional judgment (transcription, timestamp management, note organization) and support tasks that do (rubric alignment, evidence synthesis) without replacing the evaluator's own analysis.

It requires grounding AI suggestions in verifiable evidence. When the AI suggests a rubric connection, it should point to the specific moment in the observation, the specific teacher quote, the specific evaluator note that prompted the suggestion. The administrator should be able to verify each claim against something concrete, not just assess whether the language sounds reasonable. (A sketch of what such an evidence-linked suggestion could look like appears below.)

It requires preserving the evaluator's ownership of the evaluation. The final report should reflect the administrator's professional judgment, supported and organized by AI. Verbatim quotes from the teacher's own instruction, linked to timestamps, grounded in rubric criteria that the evaluator confirmed from their own experience in the classroom. Evidence that is traceable, not generated.

And it requires building toward evaluator skill, not away from it. Every interaction with the tool should deepen the administrator's understanding of the rubric and their ability to recognize quality teaching, so that the human in the loop gets more capable over time rather than less.
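To picture the second of those requirements, grounding suggestions in verifiable evidence, here is a minimal sketch of what an evidence-linked suggestion could look like as a data structure. The field names are hypothetical assumptions for illustration, not Elevate's actual schema; what matters is that every suggestion carries a pointer to something the evaluator can check directly.

```typescript
// Hypothetical shape of an evidence-linked rubric suggestion. Field names
// are illustrative assumptions, not an actual product schema.

interface EvidencePointer {
  timestampSeconds: number;   // the moment in the observation this refers to
  verbatimQuote?: string;     // the teacher's own words from the transcript
  evaluatorNoteId?: string;   // the administrator's note that prompted it, if any
}

interface RubricSuggestion {
  rubricComponent: string;      // e.g. "3c: Engaging Students in Learning"
  suggestedLanguage: string;    // draft wording the evaluator can edit or discard
  evidence: EvidencePointer[];  // every claim points at something verifiable
  status: "pending" | "accepted" | "rejected" | "edited";
}
```

A suggestion with an empty evidence array is exactly the kind of claim the automation bias research says reviewers tend to wave through; requiring the pointer is what turns "review" into verification.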

How we designed Elevate around this problem

Elevate by Tandem Education was built around a specific conviction: the direction of the workflow determines the quality and integrity of the evaluation. No tool can force an evaluator to be thorough. But a tool can be designed to keep the evaluator in the driver's seat rather than the passenger seat.

During a classroom observation, the administrator is the primary actor. They're in the room. They're watching students, not just listening to the teacher. They're adding timestamped notes that capture what audio alone can't: student engagement, classroom environment, body language, learning targets posted on the wall. The AI transcribes audio in real time and scrubs student identifiers, but the evaluator is building the observation record from their own professional judgment.

After the observation, Elevate suggests how the evidence connects to the district's evaluation rubric. These are suggestions, not conclusions. They're presented alongside the specific evidence that prompted them. The administrator reviews each suggestion against their own experience of what actually happened in the classroom, accepts or rejects it, and edits the language to reflect their own professional voice. The draft report pulls verbatim teacher quotes from the transcript, tied to timestamps, so every claim in the evaluation traces back to something real.

Can an evaluator rush through this process? Of course. No software can substitute for professional responsibility. But the workflow is designed so that the path of least resistance still produces an evaluation grounded in evidence, attributed to specific moments, and connected to rubric criteria that the evaluator engaged with individually. The tool doesn't generate a polished draft and ask the evaluator to bless it. It builds the evaluation step by step, with the evaluator making decisions at each stage.
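As a purely illustrative sketch (not Elevate's actual implementation), here is one way to express "step by step, with the evaluator making decisions at each stage" in code: the final report can only be assembled from suggestions the evaluator has explicitly resolved, so an unreviewed suggestion never reaches the document.

```typescript
// Illustrative only: a report builder that refuses to include any suggestion
// the evaluator has not explicitly accepted or edited. Names are hypothetical.

type SuggestionStatus = "pending" | "accepted" | "rejected" | "edited";

interface ResolvedSuggestion {
  rubricComponent: string;
  finalLanguage: string;    // the evaluator's wording after review
  status: SuggestionStatus;
}

function buildReport(suggestions: ResolvedSuggestion[]): string[] {
  const unresolved = suggestions.filter((s) => s.status === "pending");
  if (unresolved.length > 0) {
    // Finalizing the report requires resolving every suggestion first, so
    // skipping review is never the path of least resistance.
    throw new Error(
      `Cannot finalize: ${unresolved.length} suggestion(s) still awaiting evaluator review.`,
    );
  }
  return suggestions
    .filter((s) => s.status === "accepted" || s.status === "edited")
    .map((s) => `${s.rubricComponent}: ${s.finalLanguage}`);
}
```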

For teachers, this means receiving an evaluation that quotes their own instruction, references specific moments they'll remember, and connects feedback to concrete evidence rather than generic AI-generated commentary.

For unions, the answer to “who wrote this evaluation?” is unambiguous. The administrator did, with AI support for transcription, organization, and rubric alignment. The AI handled the mechanical work. The evaluator made the professional judgments.

For districts, an evaluation process built this way is defensible. New York City's Local Law 144, enforced since July 5, 2023, requires bias audits for automated employment decision tools.10 Colorado's AI Act, with key obligations starting February 1, 2026, imposes impact assessment requirements for high-risk AI used in consequential employment decisions.11 The regulatory direction is clear, and the distinction between “AI-assisted” and “AI-generated” evaluations is going to matter.

The question worth asking

When any vendor (including us) tells you their tool includes human oversight, ask a follow-up question: in what direction does the workflow run?

Does the AI generate an evaluation that the administrator reviews? Or does the administrator conduct an evaluation that the AI supports?

Those two approaches sound similar. The research says they produce fundamentally different outcomes. The first creates the exact conditions under which decades of automation bias research predicts that human oversight will fail. The second keeps professional judgment where it belongs.

Your teachers deserve to know that their evaluation reflects what a trained professional actually observed in their classroom, not what an algorithm inferred from a transcript. The difference is in the design.

Get in touch to learn more about how Elevate keeps evaluators in the driver's seat.

This article is for informational purposes and does not constitute legal advice. Districts should consult qualified legal counsel regarding AI use in personnel evaluation processes.

References

  1. Green, Ben. “The Flaws of Policies Requiring Human Oversight of Government Algorithms.” Computer Law & Security Review, vol. 45, 2022. Winner of the Future of Privacy Forum Privacy Papers for Policymakers Award. ssrn.com.
  2. Goddard, Kate, Abdul Roudsari, and Jeremy C. Wyatt. “Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators.” Journal of the American Medical Informatics Association, vol. 19, no. 1, 2012, pp. 121–127. pmc.ncbi.nlm.nih.gov.
  3. Rosbach et al. “Automation Bias in AI-Assisted Medical Decision-Making under Time Pressure in Computational Pathology.” arXiv:2411.00998, November 2024. arxiv.org.
  4. Kücking et al. “Automation Bias in AI-Decision Support: Results from an Empirical Study.” Studies in Health Technology and Informatics, 2024. pubmed.ncbi.nlm.nih.gov.
  5. Kahn, Lauren, Emelia S. Probasco, and Ronnie Kinoshita. “AI Safety and Automation Bias.” Center for Security and Emerging Technology, Georgetown University, November 2024. cset.georgetown.edu.
  6. Lyell, David, and Farah Magrabi. “Automation Bias and Verification Complexity: A Systematic Review.” Journal of the American Medical Informatics Association, vol. 24, no. 2, 2017, pp. 423–431. pmc.ncbi.nlm.nih.gov.
  7. Bainbridge, Lisanne. “Ironies of Automation.” Automatica, vol. 19, no. 6, 1983, pp. 775–779. sciencedirect.com.
  8. Agudo, Ujué, Karlos G. Liberal, Miren Arrese, and Helena Matute. “The Impact of AI Errors in a Human-in-the-Loop Process.” Cognitive Research: Principles and Implications, vol. 9, no. 1, January 2024. pmc.ncbi.nlm.nih.gov.
  9. RAND Corporation, “Uneven Adoption of Artificial Intelligence Tools among U.S. Teachers and Principals in the 2023–2024 School Year,” Research Report RR-A134-25, February 2025. eric.ed.gov.
  10. NYC Local Law 144: Automated Employment Decision Tools, effective July 2023. nyc.gov.
  11. Colorado SB 24-205 (Colorado AI Act), signed 2024. leg.colorado.gov.

Written by

Julian Pscheid

Co-Founder & Chief Technology Officer at Tandem Education
