What if the people using AI were the ones grading it? | Piece #2 Misaal
Accuracy has become the comfort blanket of AI evaluation. If a model gets a 90% accuracy score, we’ve learnt to nod approvingly. In law and justice, though, that number can be about as informative as a weather forecast that says “sky present.”
There is a strange irony here. We worry AI lacks human judgment, empathy and awareness of consequences. Then we evaluate it using metrics that carefully strip it of all of those things. We check whether the model produced the answer we expected, not whether it produced an answer that would help an actual person in a real situation. In other words, we criticize AI for being detached from real life and then grade it in ways that perpetuate detachment. The future we fear is already built into how we measure success.
Think about where AI actually shows up: a citizen trying to file a grievance at midnight, because the portal keeps rejecting their form; a paralegal supporting a domestic violence survivor, who has nowhere else to turn; a legal aid worker explaining a ration card rule that changed last month to someone for whom this is not a policy question. In these moments, an answer can be perfectly correct and still completely useless. It can also be impressive in the way a dictionary is impressive when you need directions.
Take, for example, a domestic violence survivor asking ChatGPT a question. The AI responds with sections from the Domestic Violence Act, without asking whether the person wants to be connected to a helpline, wants to find the nearest shelter home, or wants the letter of the law that protects them.
Anyone who has worked in these systems can recognise this: accuracy does not tell you whether the person on the other end actually moved forward. So then, by what standard can we hold AI accountable when it becomes a silent partner in our legal decisions?
That is where benchmarks come in. In AI, benchmarks usually work like a well-marked quiz. You line up a set of questions, you decide what the correct answers are, and you check how often the system matches them. If the score is high, the system passes. If it is low, it needs work. Simple, and very satisfying for anyone who enjoys neat numbers.
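The quiz-style scoring described above can be sketched in a few lines. This is a minimal illustration, not any specific benchmark's implementation; the questions and answers are hypothetical placeholders.

```python
# A minimal sketch of a conventional benchmark: the score is simply
# the fraction of model answers that exactly match the expected ones.
# The section numbers below are hypothetical placeholders.

def accuracy(expected: list[str], predicted: list[str]) -> float:
    """Fraction of answers that exactly match the expected answer."""
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)

expected  = ["Section 12", "Section 18", "Section 23"]
predicted = ["Section 12", "Section 20", "Section 23"]

print(accuracy(expected, predicted))  # 2 of 3 match
```

Notice what the function never sees: who asked the question, why, or what they were able to do with the answer. Everything outside the string match is invisible to the score.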
The catch is that most current benchmarks reward recall more than usefulness. They check whether the AI got the fact right, the section number right, or the rule right. What they don’t measure looks quite different: was the AI’s answer actually helpful, and did it make the next step clearer for the person? Was the answer aligned with the person’s language, literacy level, and emotional state? Was it empathetic and caring?
A community-centric benchmark changes what counts as passing. It moves the bar, and more importantly, it asks different people to set it. It asks the people who face these problems every day what really matters when they receive an answer from AI.
There is also a deeper shift here, one this piece alluded to at the beginning. Metrics are values in numerical form. When we optimize only for accuracy, we are making a choice about what we think matters. A community-centric benchmark makes those decisions visible and open to scrutiny. Such benchmarks are built through participatory evaluation, where frontline community workers, last-mile paralegals, and everyday citizens, the people who rely heavily on AI models, help define what a “good” response actually is. Their feedback appears as structured criteria that shape how models are tested and improved. Thus, success is measured not by informational precision alone but by whether the system supports trust and relevance. Accuracy remains relevant, but it is situated within a broader understanding of what it means for AI to be helpful in human terms.
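One way to picture "structured criteria" is as a weighted rubric alongside accuracy. The criteria names and weights below are illustrative assumptions for this sketch, not MISAAL's actual rubric; the point is only that the choice of dimensions and weights is explicit, and therefore open to community scrutiny.

```python
# A sketch of a community-centric benchmark score. The criteria and
# weights here are illustrative assumptions, not an actual rubric.
# Each rating is on a 0-1 scale, e.g. averaged from community reviewers.

CRITERIA_WEIGHTS = {
    "factual_accuracy": 0.25,  # is the law cited correctly?
    "actionability":    0.25,  # does the answer make the next step clearer?
    "accessibility":    0.25,  # plain language, matched to literacy level
    "empathy":          0.25,  # acknowledges the person's situation
}

def community_score(ratings: dict[str, float]) -> float:
    """Weighted average of per-criterion ratings."""
    return sum(CRITERIA_WEIGHTS[c] * ratings[c] for c in CRITERIA_WEIGHTS)

# A response that is perfectly accurate but unhelpful scores poorly overall.
ratings = {"factual_accuracy": 1.0, "actionability": 0.2,
           "accessibility": 0.3, "empathy": 0.1}
print(community_score(ratings))  # well below a passing grade
```

Under a pure-accuracy metric this hypothetical response scores a perfect 1.0; under the rubric it lands around 0.4. The gap between the two numbers is exactly what the piece argues current benchmarks fail to see.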
As AI moves deeper into law, justice, and public services, the question must become whether it is useful, careful, and trustworthy in the hands of real people. Benchmarks that measure this are situated, attuned to human needs and cultural nuances. The path forward is not just more data or larger models; it is better listening to our communities.
Misaal in Punjabi and Urdu means role model, exemplar or shining light. M.I.S.A.A.L, the Multi-dimensional Intelligence Standard for Assessment of socio-Legal AI Interactions, is an effort to hear from thousands of users in diverse communities to surface the elements that truly represent a quality AI response to a query pertaining to law or justice in India.