How to compare conflicting AI detector results responsibly

Run the same paragraph through several AI detectors and you may get several different answers. One tool reports a high probability of AI-generated text, another calls it mostly human, and a third marks only a few sentences.

The tempting response is to choose the score that confirms what you already believe. A better response is to treat the disagreement as a review problem.

AI detector outputs are probabilistic signals. They can produce false positives and false negatives, and they do not prove who wrote a document. If the result could affect a student, employee, writer, or publication decision, the review process matters more than finding one authoritative-looking percentage.

Here is a repeatable protocol for comparing results without turning them into a verdict.

Disclosure: I used AI assistance while drafting this article and reviewed the final copy against the product workflow and source implementation.

Start with one stable input

A comparison is not meaningful if each detector sees different text.

Copy the exact passage into a plain-text file before testing. Preserve paragraph breaks, punctuation, headings, and lists. Do not clean one version, remove citations from another, or test only the sentences that look suspicious.

Record a few basic facts:

- total character count
- number of sentences or paragraphs
- language
- whether the text was pasted or extracted from a document
- whether references, quotations, tables, or templates are included

This creates a stable baseline. It also makes later disagreements easier to explain. A detector may react differently to a bibliography, repeated form language, translated text, or a short sample.
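The baseline facts above can be captured with a few lines of Python. This is a minimal sketch, not part of any detector: the function name `baseline_facts` and its field names are hypothetical, the sentence count comes from a rough punctuation split, and the SHA-256 fingerprint is an extra I am assuming is useful for confirming later that every tool received the same text.

```python
import hashlib
import re


def baseline_facts(text: str, language: str, source: str) -> dict:
    """Summarize one input so every detector run can be tied to the same text.

    Hypothetical helper: the field names are illustrative, not a standard.
    """
    stripped = text.strip()
    # Rough split after ., ! or ? followed by whitespace. Abbreviations and
    # decimals will be miscounted, so treat the sentence count as approximate.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    # Paragraphs are blocks separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", stripped) if p.strip()]
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "characters": len(text),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "language": language,
        "source": source,  # for example "pasted" or "extracted"
    }


if __name__ == "__main__":
    sample = "One. Two! Three?\n\nFour."
    print(baseline_facts(sample, language="en", source="pasted"))
```

If two runs produce different fingerprints, the inputs differed, and the scores should not be compared until that is explained.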

Verify the text extracted from documents

PDF and DOCX files add an extra step before detection: extraction.

Columns may be read in the wrong order. Headers and footers may repeat. Hyphenated words may split. Text inside images may be omitted. A polished document can therefore become a noisy plain-text input before the detector analyzes it.

If a tool shows the extracted text, inspect it before interpreting the score. Check the beginning, middle, and end. Confirm that paragraphs are in the expected order and that repeated page furniture has not become part of the sample.

Sentence highlights should also be understood in this context. They align with the extracted text shown in the report, not necessarily with the original page coordinates in the uploaded file.

If extraction is clearly broken, stop. Comparing detector scores at that point only compares reactions to damaged inputs.
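Two of the extraction problems described above, repeated page furniture and words split by hyphenation, can be surfaced with a quick scan before reading the text by eye. The sketch below is only a heuristic under assumptions of my own: `extraction_warnings` is a hypothetical name, the `min_repeats` default of 3 is arbitrary, and it cannot detect column-order errors or text missing from images.

```python
from collections import Counter


def extraction_warnings(extracted: str, min_repeats: int = 3) -> dict:
    """Flag lines worth checking by hand in extracted text.

    Heuristic only: a repeated line may be a legitimate refrain, and a
    trailing hyphen may be intentional.
    """
    lines = [line.strip() for line in extracted.splitlines() if line.strip()]
    counts = Counter(lines)
    # Identical lines that recur often are candidates for headers or footers.
    repeated = sorted(line for line, n in counts.items() if n >= min_repeats)
    # A line ending in a hyphen may be a word split across a line break.
    hyphen_endings = [line for line in lines if line.endswith("-")]
    return {"repeated_lines": repeated, "hyphen_line_endings": hyphen_endings}


if __name__ == "__main__":
    sample = (
        "Annual Report\nThe implemen-\ntation went well.\n"
        "Annual Report\nMore text here.\nAnnual Report\n"
    )
    print(extraction_warnings(sample))
```

An empty result does not mean the extraction is clean. It only means these two patterns were not found, so the manual check of the beginning, middle, and end still applies.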

Treat short samples as weak evidence

A detector may accept a small passage and still have too little material for a stable review.

Short inputs provide fewer opportunities to compare sentence rhythm, repetition, variation, and structure. A formal five-sentence introduction can look highly regular even when a person wrote it. A longer document may contain enough variation to change the interpretation.

In Detector de IA, text can be submitted once it reaches 300 characters, but sample reliability is reported separately. Fewer than 1,000 characters or fewer than five sentences lowers reliability; intermediate lengths and sentence counts can still be rated medium rather than high.

Those thresholds are product guardrails, not universal scientific laws. The useful principle is broader: input acceptance and evidence strength are different states.
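To make the separation between acceptance and evidence strength concrete, here is a small sketch. It is not the product's implementation. The 300-character, 1,000-character, and five-sentence values come from the description above; the labels, the function name `sample_reliability`, and the `high_chars` and `high_sentences` cut-offs are hypothetical, which is why the caller must supply those two values explicitly.

```python
def sample_reliability(
    chars: int,
    sentences: int,
    *,
    high_chars: int,      # assumption: caller-chosen boundary for "high"
    high_sentences: int,  # assumption: caller-chosen boundary for "high"
) -> str:
    """Return an acceptance/reliability label for one sample.

    Acceptance (can it be submitted?) and reliability (how much weight can
    the result carry?) are decided by separate checks.
    """
    if chars < 300:
        return "too_short"  # not accepted at all
    if chars < 1000 or sentences < 5:
        return "low"        # accepted, but weak evidence
    if chars < high_chars or sentences < high_sentences:
        return "medium"     # intermediate length or sentence count
    return "high"


if __name__ == "__main__":
    # The two "high" boundaries below are illustrative values only.
    print(sample_reliability(1200, 4, high_chars=3000, high_sentences=15))
```

The point of the sketch is the shape of the logic: a sample can pass the first check and still carry a low label into the report.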

When comparing tools, note whether any of them explain sample sufficiency. Do not give a precise percentage more authority merely because it has two digits.

Compare disagreement instead of averaging scores

Suppose three tools return 28%, 61%, and 84%. Averaging them to roughly 58% creates a new number without resolving the disagreement behind it: a 56-point gap between the lowest and highest result.

The tools may use different models, thresholds, preprocessing, training data, or definitions. Their percentages are not guaranteed to share the same scale. Treating them as interchangeable measurements can create false precision.

Use a comparison table instead:

| Review item | Tool A | Tool B | Tool C |
| --- | --- | --- | --- |
| Exact input used | Yes | Yes | Yes |
| Sample warning shown | No | Yes | No |
| Sentence evidence shown | Yes | No | Yes |
| Limitations stated | Yes | Yes | No |
| Result | Record it | Record it | Record it |

This shifts attention from "Which number wins?" to "What did each system actually return, and under what conditions?"
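For the three example scores used in this section, a short sketch shows what an average hides. The tool names and the function `summarize_scores` are placeholders, and the mean is computed only to set it beside the spread, not as a recommended output.

```python
from statistics import mean


def summarize_scores(scores: dict[str, float]) -> dict:
    """Describe disagreement between tools instead of collapsing it."""
    values = list(scores.values())
    return {
        "lowest": min(scores, key=scores.get),
        "highest": max(scores, key=scores.get),
        # Distance between the extremes, in percentage points.
        "spread": max(values) - min(values),
        # Shown for contrast only; the scales may not be comparable.
        "mean_for_reference_only": round(mean(values), 1),
    }


if __name__ == "__main__":
    summary = summarize_scores({"Tool A": 28, "Tool B": 61, "Tool C": 84})
    print(summary)  # spread is 56 points around a mean of 57.7
```

A 56-point spread is the finding worth recording. The mean sits in the middle of it and says nothing about why the tools differ.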

Inspect sentence-level signals

A document-level score is useful for triage, but it does not tell you what to review.

Sentence-level highlights can narrow the next step. Look for patterns:

- Are all tools flagging the same sentence?
- Is a repeated template or transition causing several flags?
- Are quotations or definitions being treated as original prose?
- Does one tool flag an entire section while another flags nothing?
- Does the highlighted wording appear elsewhere in the document?

Then read the surrounding paragraph. A highlighted sentence without context is easy to overinterpret.

Repeated phrasing, generic markers, limited punctuation variety, and uniform sentence length can influence a detector. Those patterns can also occur in human writing, especially in forms, academic summaries, policies, translated material, and heavily edited copy.

The highlight is a place to investigate, not an authorship finding.
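When several tools expose sentence highlights, the overlap can be tabulated before reading in context. The sketch below assumes you have already mapped each tool's highlights to sentence positions in the same saved input; that mapping is manual work, and `flag_agreement` and the tool names are hypothetical.

```python
def flag_agreement(flags: dict[str, set[int]]) -> dict:
    """Group flagged sentence positions by how many tools flagged them.

    `flags` maps a tool name to the set of sentence positions it highlighted.
    Tools that show no sentence evidence should be left out rather than
    passed in with an empty set.
    """
    if not flags:
        return {"flagged_by_all": [], "flagged_by_one": []}
    tool_sets = list(flags.values())
    every_flag = set().union(*tool_sets)
    by_all = set.intersection(*tool_sets)
    by_one = {i for i in every_flag if sum(i in s for s in tool_sets) == 1}
    return {"flagged_by_all": sorted(by_all), "flagged_by_one": sorted(by_one)}


if __name__ == "__main__":
    result = flag_agreement({"Tool A": {2, 5, 7}, "Tool C": {5, 7, 9}})
    print(result)
```

Sentences flagged by every tool are a reasonable place to start reading, and sentences flagged by only one tool are a reminder of how much the tools differ. Neither group is a finding about authorship.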

Preserve the report and the surrounding context

If the review matters, preserve more than a screenshot of the final percentage.

Keep the exact input, date, tool name, report output, reliability note, highlighted sentences, and stated limitations. Also record relevant context that a detector cannot know: drafting history, source notes, revision logs, citations, document metadata, or an explanation from the writer.

A copied summary or printable report can help organize the technical output, but it should sit beside the human evidence rather than replace it.

This record is especially important when results conflict. It prevents the review from being reduced later to "the detector said so."
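One way to keep these items together is a small structured record saved next to the input file. This is a sketch with hypothetical class and field names (`DetectorRun`, `ReviewRecord`), not a format any detector exports; adapt the fields to whatever your tools actually report.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DetectorRun:
    """What one tool returned for the saved input."""
    tool: str
    run_date: str  # for example an ISO date such as "2025-01-31"
    result: str    # recorded as shown by the tool, not normalized
    reliability_note: str = ""
    highlighted_sentences: list[str] = field(default_factory=list)
    limitations: str = ""


@dataclass
class ReviewRecord:
    """Technical output and human context, stored side by side."""
    input_text: str
    runs: list[DetectorRun] = field(default_factory=list)
    human_context: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses inside lists as well.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


if __name__ == "__main__":
    record = ReviewRecord(input_text="Exact passage goes here.")
    record.runs.append(DetectorRun(tool="Tool A", run_date="2025-01-31", result="61%"))
    record.human_context.append("Writer supplied two earlier drafts.")
    print(record.to_json())
```

Storing `result` as the string the tool displayed avoids implying that percentages from different tools share a scale.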

Define the decision boundary before testing

Do not wait for a surprising score to decide what the score means.

Before running a detector, write down the purpose of the check and the next allowed action. For example:

- low-impact editorial review: inspect highlighted passages for clarity or repetition
- classroom conversation: ask about sources, drafting, and revision process
- publication workflow: verify citations and compare the current draft with previous versions
- high-impact decision: require independent evidence and human review

An AI detector report should not be the sole basis for an accusation, penalty, rejection, or misconduct finding. Paraphrasing, translation, templates, and heavy editing can all reduce a detector's accuracy.

If the process has no room for uncertainty, the process is not ready to use detector output responsibly.

A practical seven-step protocol

The full workflow is simple:

1. Save one exact input for every tool.
2. Verify extracted text when reviewing a document.
3. Record sample length, language, and structure.
4. Capture scores, reliability notes, highlights, and limitations.
5. Compare areas of agreement and disagreement without averaging percentages.
6. Review the highlighted passages in context with human evidence.
7. Stop at a review signal; do not convert it into proof of authorship.

This protocol will not make different detectors agree. It makes the disagreement legible and keeps the final judgment in the right place.

I built Detector de IA around that review-oriented approach for pasted text and supported documents: https://detector-de-ia.net/


Ethan Cole