Kim Marshall Is Right About AI in Teacher Evaluation. He’s Also Not Going Far Enough.

Julian Pscheid

10 min read

Kim Marshall recently surveyed Marshall Memo readers about AI in teacher evaluation, and the findings confirm what anyone paying attention already suspected. Administrators are using AI to draft and organize classroom observation notes. A smaller number are using AI tools to transcribe lessons and draw conclusions. A few are feeding observation data into systems trained on rubrics like the Danielson Framework and asking for ratings. Almost none of this is governed by policy, training, or formal guardrails.

Marshall, a former principal himself and the person behind the Marshall Memo (a weekly digest that distills K-12 research for practitioners), drew a sharp line in a recent Education Gadfly piece between AI that supports the human work of evaluation and AI that tries to replace it. On one side: tools that help administrators document, reflect, and prepare for coaching conversations. On the other: tools that generate rubric ratings from a transcript and call it evaluation.

That distinction is the right one. And the three AI observation tools Marshall highlighted as promising examples all share a design philosophy that education technology companies should be studying closely.

What these AI teacher evaluation tools got right

Tommy Mulvoy's Inform(Ed) transcribes lessons and gives teachers confidential data on talk-time ratios, question types, wait time, and discourse patterns. It's not an evaluation tool. It's a mirror that helps teachers see their own practice more clearly. The data never goes to an administrator unless the teacher chooses to share it.

Paul Bambrick-Santoyo's Leverage Leadership App (launching in June) helps supervisors prepare for post-observation conversations by synthesizing notes, student work, assessment data, and prior interactions. It doesn't generate the evaluation. It helps the evaluator walk into the conversation with a clearer picture of what to affirm, what to ask, and where to push.

Will and Matt Krasnow's Conversation Summarizer transcribes the debrief itself, then produces a short summary that teacher and supervisor review, edit, and sign off on together. The conversation happened first. The AI captured it second.

Look at the pattern across all three. In every case, the human does the hard cognitive work. The AI handles documentation, organization, and synthesis after the human has already formed judgments and had conversations. The AI never sets the frame. It supports the frame the human already built.

Marshall also laid out conditions he considers prerequisites for responsible AI in teacher evaluation: shifting from annual high-stakes evaluations to short, frequent classroom visits; freeing supervisors from using checklists during observations; requiring face-to-face conversations after every visit; saving rubric ratings for end-of-year summatives. Those conditions matter because they change when and how the administrator engages their own professional judgment. That timing question is the one Marshall's framework raises but doesn't fully answer.

Why the AI-in-evaluation argument needs to go further

Marshall's framework is sound. But it stops at the level of principle: use AI for documentation, not for judgment. Keep the human central. That's necessary but not sufficient. It doesn't address what happens inside the tool itself.

The critical question isn't just whether a human reviews the AI's output. It's when in the process the AI generates that output, and how much cognitive work the human has already done by that point.

We wrote about this in detail in a previous post on automation bias in AI-assisted evaluation. The short version: when a human reviews a polished, AI-generated document, they tend to accept it. Not because they're lazy or careless, but because AI-generated outputs act as anchors. Once you've read a fluent, well-structured draft that uses the right vocabulary and maps to the right rubric components, your own independent judgment gets pulled toward that frame. Studies on AI-assisted decision-making have found that when people see an AI recommendation before forming their own judgment, their accuracy drops. When they form their judgment first, the anchoring effect weakens.

This is why the phrase “human in the loop” is doing less work than people think. If the loop is “AI generates, human approves,” you haven't preserved human judgment. You've given it a very comfortable place to atrophy.

Marshall gets the spirit of this right. He warns that AI-generated evaluations could lead supervisors to skip conversations entirely and just email reports. But the risk isn't only that conversations get skipped. It's that even when conversations happen, the administrator walks in with the AI's frame already in their head, and the conversation follows the AI's structure rather than the evaluator's own read of the classroom.

What “well-crafted” AI teacher evaluation tools actually require

Marshall concludes that “well-crafted AI tools can rescue the teacher evaluation process from bureaucratic purgatory.” I agree with the conclusion. But “well-crafted” is doing a lot of heavy lifting in that sentence, and most AI teacher evaluation tools on the market don't meet the bar.

Start with timing. The AI should never produce a complete evaluation before the administrator has done their own thinking. If the first thing an evaluator sees after a classroom observation is a finished draft, you've already lost the cognitive engagement that makes the evaluation meaningful. The tool should present evidence, surface connections to the rubric, and suggest language, but the evaluator should be assembling the picture, not approving a pre-assembled one.
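
To make that ordering concrete, here's a minimal sketch of what a timing gate could look like in code. The names here (Observation, requestAiSynthesis, and the field names) are invented for illustration; this is one way to enforce the principle, not a description of any shipping product.

```typescript
// Hypothetical data model: the evaluator's raw evidence and their own
// initial judgments are separate fields.
interface Observation {
  evidenceNotes: string[];      // raw evidence captured during the visit
  evaluatorJudgments: string[]; // the evaluator's own initial read
}

interface AiSynthesis {
  suggestedEvidenceGroupings: string[];
  suggestedRubricConnections: string[];
}

// The AI step is gated on the human step: no synthesis is available
// until the evaluator has recorded at least one judgment of their own.
function requestAiSynthesis(obs: Observation): AiSynthesis {
  if (obs.evaluatorJudgments.length === 0) {
    throw new Error(
      "Record your own read of the lesson before requesting AI synthesis."
    );
  }
  // A real implementation would call the model here, passing evidence
  // only — never a request for a finished evaluation.
  return { suggestedEvidenceGroupings: [], suggestedRubricConnections: [] };
}
```

The gate is trivial to implement. The point is that it has to be a deliberate architectural decision, made before the first AI call is wired up.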

Visibility matters too. When AI-suggested text looks identical to text the evaluator wrote, the evaluator loses track of which ideas are theirs and which came from the machine. That blurring accelerates automation bias. At a glance, an evaluator should always be able to tell what the AI contributed versus what they wrote themselves.
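
One way to preserve that visibility is to carry provenance on every span of draft text, so the editor can always render AI suggestions differently from the evaluator's own words. A rough sketch, with invented types:

```typescript
// Hypothetical provenance tag on each span of a feedback draft.
type Provenance = "evaluator" | "ai-suggested" | "ai-edited-by-evaluator";

interface DraftSpan {
  text: string;
  provenance: Provenance;
}

// When the evaluator edits an AI suggestion, the span's provenance is
// updated rather than silently merged into human-authored text.
function applyEvaluatorEdit(span: DraftSpan, newText: string): DraftSpan {
  if (span.provenance === "ai-suggested") {
    return { text: newText, provenance: "ai-edited-by-evaluator" };
  }
  return { ...span, text: newText };
}
```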

Then there's the rubric question. If the AI quietly maps evidence to Danielson Domain 3b and the evaluator never actively engages with that mapping, they're not learning the framework. They're outsourcing their understanding of it. Over time, that makes them worse evaluators, not better ones. Aviation researchers have documented for decades that pilots who rely on autopilot lose manual flying skills. The same deskilling effect applies to administrators who let AI handle rubric alignment without engaging with it themselves. Rubric alignment needs to be something the evaluator accepts or rejects, not a background process they never see.
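
The same idea applies at the data-model level: rubric mappings could be first-class suggestion objects that stay pending until the evaluator explicitly acts on them, rather than a background process. Another hypothetical sketch:

```typescript
// Hypothetical rubric-mapping suggestion that remains pending until the
// evaluator explicitly accepts or rejects it.
interface RubricSuggestion {
  component: string; // e.g. "Danielson 3b: Using Questioning and Discussion"
  evidence: string;
  status: "pending" | "accepted" | "rejected";
}

function resolveSuggestion(
  s: RubricSuggestion,
  decision: "accepted" | "rejected"
): RubricSuggestion {
  return { ...s, status: decision };
}

// Only evaluator-accepted mappings make it into the evaluation record.
function acceptedMappings(suggestions: RubricSuggestion[]): RubricSuggestion[] {
  return suggestions.filter((s) => s.status === "accepted");
}
```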

Tone is another thing most AI-generated teacher feedback gets wrong. It sounds professional and complete, producing text that reads well on paper. But the purpose of the evaluation isn't the document. It's the conversation the document supports. Feedback framed as invitational inquiry (“How did you decide when to transition from whole-group to small-group instruction?”) opens a conversation. Feedback framed as declarative assessment (“The transition from whole-group to small-group instruction was well-timed”) closes one.
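
Tooling can nudge this at the prompt level. Here's a hypothetical instruction fragment, not a claim about how any particular product prompts its model:

```typescript
// Hypothetical system-prompt fragment steering generated feedback
// toward invitational inquiry rather than declarative verdicts.
const FEEDBACK_STYLE_INSTRUCTION = `
Frame each piece of feedback as an open question the supervisor could
ask the teacher (for example: "How did you decide when to transition
from whole-group to small-group instruction?"). Do not issue verdicts,
ratings, or summary judgments of the lesson.
`;
```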

And none of this matters if the data isn't staying inside a controlled environment. Marshall doesn't spend much time on privacy, but it's the precondition for everything else. When administrators paste observation notes into consumer AI tools, those notes may include student-identifiable information, which means districts should treat AI-assisted evaluation workflows as FERPA-sensitive and require approved, controlled systems. We covered this at length in our February post on consumer AI in teacher evaluations.

The real test for any AI observation tool

The simplest way to evaluate whether an AI teacher evaluation tool is well-designed: does the administrator know the rubric better after six months of using it, or worse? Does the evaluator write more specific, evidence-grounded feedback over time, or do they increasingly rubber-stamp what the AI produces?
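
That second question can even be instrumented. A hypothetical sketch of the telemetry: track the fraction of AI suggestions accepted without a single edit, month over month, and watch whether it drifts upward. (The event shape here is invented for illustration.)

```typescript
// Hypothetical telemetry event: one per AI suggestion the evaluator
// accepted, recording whether they edited it first.
interface SuggestionEvent {
  month: string; // e.g. "2025-01"
  editedBeforeAccept: boolean;
}

// Rate of suggestions accepted unchanged, per month. A rising rate is
// a rubber-stamping signal worth investigating.
function unchangedAcceptRateByMonth(
  events: SuggestionEvent[]
): Map<string, number> {
  const totals = new Map<string, { total: number; unchanged: number }>();
  for (const e of events) {
    const t = totals.get(e.month) ?? { total: 0, unchanged: 0 };
    t.total += 1;
    if (!e.editedBeforeAccept) t.unchanged += 1;
    totals.set(e.month, t);
  }
  const rates = new Map<string, number>();
  for (const [month, t] of totals) rates.set(month, t.unchanged / t.total);
  return rates;
}
```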

If the tool works best when the evaluator pays less attention, something has gone wrong with the design. The whole point is to make administrators better evaluators, not to make evaluation happen despite the administrator's limited engagement.

Marshall's three promising tools all pass this test. They each keep the human doing the thinking while the AI handles the documenting. The question for every other AI tool entering this space, including ours, is whether the architecture actually supports that principle or just claims to.

Julian Pscheid is Co-Founder and CTO of Tandem Education, where we build Elevate, an AI-supported observation platform designed around the principles described in this post. Reach out at info@tandem-education.com if you want to see how it works.
