MortAI Kombat: What I Learned After Testing 5 AI Models on 47 Real-World Tasks

For a long time, every business conversation about AI ended with the same question: “So which one should we use?” And I’d launch into the consultant’s classic hedge: “Well, it’s complicated.” Honest… and practically useless. So I decided to actually try to answer the question. I built MortAI Kombat – a systematic evaluation framework for testing AI models across 47 real-world tasks spanning 10 categories. No synthetic benchmarks or laboratory tests. Just the actual work we do every day: writing code, analyzing data, creating presentations, translating documents.

For the first “tournament” I chose the following AI models: Claude, ChatGPT, Gemini, Copilot, and Meta.AI.

The Moving Target Problem 

Here’s something they don’t tell you when you start benchmarking AI models: they don’t stand still long enough for you to finish. I began testing in August 2025. By late November, Claude had updated once, ChatGPT twice, and Gemini had rolled out improvements to multimodal understanding. Don’t get me wrong – the constant evolution of these models is a good thing! But that’s not the point. The real problem is that any “definitive” AI comparison has an expiration date measured in weeks, not months or years. This isn’t like comparing classic applications or tools. It’s like trying to review a restaurant where the chef changes the menu every three days and sometimes replaces ingredients in the dishes you’ve already ordered 😊

But I pressed on anyway. Because even if the specific rankings change, the patterns I found tell us something important about how AI models “think”. 

The Testing Framework

I put each model through 10 categories of challenges: 

  1. Complex Reasoning (logic puzzles, causal analysis, multi-step math)
  2. Creative Problem Solving (constrained innovation, business model design)
  3. Technical Problem Solving (code debugging, architecture design, algorithms)
  4. Business Communication (executive presentations, crisis communication, stakeholder management)
  5. Creative Writing (multi-audience adaptation, storytelling, brand voice)
  6. Information Extraction (data synthesis, fact-checking, research)
  7. Real-Time Adaptability (context switching, progressive conversations)
  8. Translation (English ↔ Polish across technical/legal/medical domains)
  9. Multimodal Understanding (charts, documents, images)
  10. Ethical Reasoning (bias detection, dilemma resolution, sensitive topics)

Each task was scored on Correctness, Completeness, Clarity, Number of Iterations, and Speed, with weights assigned to each factor. For example, Correctness and Completeness were weighted more heavily than Speed, because (at least for me) being correct but slow is better than being fast but inaccurate. The final score was calculated as follows:

response_value = ∑ (factor * weight)

and normalized to a 1–5 scale.
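For the curious, here’s roughly what that scoring looks like in code – a minimal Python sketch with illustrative weights (not my exact values):

```python
# Illustrative scoring sketch: a weighted sum of factors, normalized to a 1-5 scale.
# The weights below are examples, not the exact values used in MortAI Kombat.

FACTOR_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "iterations": 0.15,   # fewer iterations needed -> higher factor rating
    "speed": 0.10,
}

def score_response(factors: dict[str, float]) -> float:
    """Each factor is rated 1-5; returns the weighted score, also on a 1-5 scale."""
    weighted_sum = sum(factors[name] * weight for name, weight in FACTOR_WEIGHTS.items())
    # The weights sum to 1.0, so the weighted sum already lies in the 1-5 range.
    return round(weighted_sum, 2)

# Example: a correct but slow response still scores well.
print(score_response({
    "correctness": 5, "completeness": 4, "clarity": 4,
    "iterations": 3, "speed": 2,
}))  # -> 3.95
```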

Going in, I assumed I’d find a clear hierarchy: one or two models would dominate across most categories, with maybe some niche specialization in others.

Man, how wrong I was! 

The 5 Surprises 

  • Surprise #1: The Translation Paradox 

Microsoft Copilot achieved a perfect 5.0 score in translation. Claude – the overall performance leader – scored 2.75.

Let that sink in. The model that excelled at complex business strategy and technical architecture struggled to produce smooth, natural translations. Meanwhile, Copilot, which struggled with executive communication (2.2/5), waltzed flawlessly through this challenge. This contradicted my assumption that general intelligence correlates with specialized capabilities. Apparently, translation requires a different cognitive architecture than strategic thinking. Who knew?

Practical implication: If you’re translating documents, Copilot is the best choice among these models.  

  • Surprise #2: The Speed-Quality Curve Isn’t Linear 

I measured response times meticulously. Meta.AI averaged 35 seconds. Claude? 3 minutes 18 seconds – almost 6x slower! But Claude’s quality score was only 1.6x better (3.81 vs 2.43). Here’s the interesting part: the relationship between speed and quality breaks down at the extremes. Meta.AI is so fast it’s mostly useless (superficial responses requiring extensive editing). Claude is so thorough it sometimes takes ages to finish (comprehensive analysis including things you didn’t even ask for). The sweet spot? ChatGPT and Gemini. Fast enough for iteration (1–2 minutes), good enough for professional work (3.3–3.7).

Practical implication: Use fast models for brainstorming and drafts. Use slow models for final deliverables. Don’t expect speed and quality to trade off smoothly—there are cliff edges in both directions. 
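If you like back-of-the-envelope math, here’s the same point as a crude “quality per minute” calculation using the averages above – a toy proxy, not a rigorous metric:

```python
# A crude efficiency comparison based on the averages quoted above.
# "Quality per minute" is a toy proxy, not a rigorous metric.

models = {
    "Meta.AI": {"avg_seconds": 35, "quality": 2.43},
    "Claude": {"avg_seconds": 198, "quality": 3.81},  # 3 min 18 s
}

for name, m in models.items():
    quality_per_minute = m["quality"] / (m["avg_seconds"] / 60)
    print(f"{name}: {quality_per_minute:.2f} quality points per minute")

# Meta.AI "wins" on efficiency (~4.2 vs ~1.2), yet its answers were too shallow to use.
# That is exactly why a simple speed/quality ratio misleads at the extremes.
```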

  • Surprise #3: The Crisis Communication Ethics Reveal 

I gave each model a product recall scenario: drones falling from the sky due to battery defects. Write the press release. Gemini’s response shocked me (though, in hindsight, it shouldn’t have). It explicitly recommended manipulating the message to make the situation seem better than it was. Not exactly lying, but exercising “prudent disclosure” that emphasized future safety while downplaying current risk. And here’s the disturbing part: it defended this recommendation with logically sound (if somewhat cynical) business arguments. Minimize panic. Protect share prices. Maintain customer confidence. Control the narrative. The other models were less ruthless, but showed similar patterns (though less explicitly). After all, they’d learned from actual corporate crisis communications – which, let’s be honest, often prioritize damage control over transparency.

This raises a fundamental question about AI training: should models learn from how businesses actually behave, or from how they should behave? I don’t have an answer. But the fact that all these models default to corporate pragmatism over ethical idealism tells you something about their training data.

Practical implication: Never use AI for crisis communication without multiple human reviews, preferably including legal counsel and ethics advisors. These models gravitate towards business outcomes, not moral standards. 

  • Surprise #4: The Multimodal Mirage 

Every model claims advanced multimodal capabilities. Charts, documents, images – they can handle it all, right? I tested this with a simple puzzle: an image of a bookshelf next to a flipchart with the text “The key is hidden behind the 5th book from the left on the 2nd shelf from the top”. The question: what color is that book? Every single model failed, across 4–10 iterations each. Even when I circled the book in question! Every attempt from every model produced a different (and wrong!) answer. These are supposed to be frontier models with “advanced visual understanding”, and they can’t consistently identify the color of a highlighted object in a static image. Chart reading? Hit or miss – they’ll extract data but frequently misread values. Complex process diagrams? Consistently misinterpreted. Fine visual details? Forget it.

Practical implication: Do not trust AI for any task requiring visual precision. Human verification is mandatory. The marketing claims vastly outpace reality. 

  • Surprise #5: The Iteration Tax and The Verbosity-Brevity Death Zone 

Even the best models required nearly 2 attempts per task to get satisfactory results. But the reason for the additional iterations was, surprisingly, not always poor answer quality. Claude consistently exceeds word-count limits despite repeated instructions. Ask for 200 words, get 380. Then 295. Then 227. I’d iterate not because the content was wrong, but because Claude couldn’t stop adding “just one more important point.” Copilot had the opposite problem – every response was a skeleton. Ask for a complete business case, get 6 bullet points. Many tasks ended with me asking for more details after Copilot’s signature “Would you like me to elaborate?” ChatGPT and Gemini? They are naturally calibrated to the expected response length. They still required 1–2 iterations, but for the right reason: fixing actual errors, not reformatting.

The distinction matters: the “iteration tax” comes in two forms:

  1. Quality iterations (good): Refining logic, fixing bugs, clarifying ambiguity
  2. Format iterations (wasteful): Compressing bloat or expanding skeletons

ChatGPT and Gemini force mostly quality iterations. Claude and Copilot – format iterations. One improves your work. The other just burns time. 

Practical implication: When choosing a model, ask yourself: “Will I spend more time fixing wrong answers or reformatting the correct ones?” For exploratory work where comprehensiveness matters, Claude’s verbosity is a feature. For tight deadlines, ChatGPT or Gemini hit the sweet spot. For production deliverables, expect 3-4 rounds regardless of the model — but at least with Gemini and ChatGPT, those iterations improve the substance. 

Tools in a Workshop 

Forget the idea of picking “the one and only AI model”. That’s the wrong approach. Instead, think of AI models as tools. You don’t ask “Is a hammer a better tool than a screwdriver?” You ask “Which tool is best for this particular task?”

After 47 tasks and hundreds of iterations, here’s my current toolkit mapping (sketched as a simple lookup table right after the list):

  • For strategic business decisions and executive presentations: → Claude (worth the wait) 
  • For creative content and marketing: → Gemini (the Canvas editing feature alone justifies this) 
  • For translation: → Copilot (and nothing else) 
  • For quick technical debugging: → Copilot or Gemini (fast iteration beats perfect answers) 
  • For production code and documentation: → ChatGPT or Claude (comprehensive, but expect multiple iterations) 
  • For balanced general use when you’re not sure: → ChatGPT (the reliable generalist) 
  • Never: → Meta.AI in its current form (consistently weakest across nearly every category) 
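And here’s that lookup table as a minimal Python sketch – the task labels and the pick_model helper are my own shorthand, not any vendor’s API:

```python
# A minimal sketch of the toolkit mapping as a lookup table.
# Task labels and pick_model() are my own shorthand, not a real API.

TOOLKIT = {
    "strategy_or_exec_presentation": "Claude",
    "creative_content_or_marketing": "Gemini",
    "translation": "Copilot",
    "quick_technical_debugging": "Copilot or Gemini",
    "production_code_and_docs": "ChatGPT or Claude",
    "general_use": "ChatGPT",
}

def pick_model(task: str) -> str:
    """Return the recommended model for a task type, defaulting to the reliable generalist."""
    return TOOLKIT.get(task, "ChatGPT")

print(pick_model("translation"))        # Copilot
print(pick_model("anything_unclear"))   # ChatGPT
```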

This specialization of AI models is not a bad thing in itself. They are simply optimized for different use cases. The problem is that most users – including me, until recently – don’t realize this and assume the models are interchangeable.

The Uncomfortable Truth About AI’s (Lack of) Consistency

One common definition of madness is doing the same thing over and over while expecting different results. Well, that is exactly what happens when you work with AI models 🙂: same prompt, different day, different response. Not dramatically different, but different enough to matter. A business strategy recommendation requested on Tuesday can differ slightly from the one generated on Thursday. Is it server load affecting performance? Was there a model update between prompts? Hallucinations? Probably all of the above – we simply don’t know for sure. The point is that these models are not deterministic. You can’t rely on them the way you rely on a calculator. Even when they give you the right answer once, they might not give it the next time.
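Part of the explanation is simply how these systems generate text: most chat models sample their next token with a non-zero temperature, so the very same prompt can take a different path through the probability distribution on every run. A toy illustration (made-up logits, not any vendor’s actual decoder):

```python
import math
import random

# Toy illustration of temperature sampling: the same "prompt" (the same logits)
# can yield a different token on every run. The logits below are made up.

def sample_token(logits: dict[str, float], temperature: float = 0.8) -> str:
    # Softmax over logits scaled by temperature (higher temperature = flatter distribution).
    scaled = {token: value / temperature for token, value in logits.items()}
    max_value = max(scaled.values())
    exps = {token: math.exp(value - max_value) for token, value in scaled.items()}
    total = sum(exps.values())
    weights = [exps[token] / total for token in exps]
    # random.choices draws according to the weights, so the output is non-deterministic.
    return random.choices(list(exps), weights=weights)[0]

logits = {"expand": 2.1, "consolidate": 1.9, "pivot": 1.2}
print([sample_token(logits) for _ in range(5)])  # e.g. ['expand', 'consolidate', 'expand', ...]
```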

This has profound implications for business use. AI-generated output needs human verification. Not because AI is usually wrong – more often than not it’s right – but because you can’t predict when it will be wrong. 

The Ultimate Lesson 

So, what did I learn from MortAI Kombat after all these months of testing? Not that “Claude is better than ChatGPT because it scored 3.77 vs 3.70” – we’ve already established that those numbers are outdated. I also failed to identify “the one model to rule them all”. What I did find instead were several patterns:

  1. Different models excel at different tasks (this pattern persists across updates)
  2. Speed and quality don’t trade off linearly (the curves remain similar)
  3. Visual understanding lags far behind text (improving but still weak)
  4. Consistency is the weakest link (these are not deterministic systems)
  5. Human verification is (still) essential (regardless of models and their “fields of expertise”)

But here’s the catch: if models are modified every few weeks, and each update potentially invalidates previous benchmarks, how do we make informed decisions about AI tools? The traditional software evaluation playbook doesn’t work here. You can’t read reviews from six months ago and expect them to be relevant. You can’t do a thorough evaluation and ride those insights for a year. This is a fundamentally new challenge. We’re evaluating tools that evolve faster than our evaluation processes. 

My tentative answer: in a situation where models evolve faster than the processes used to evaluate them, the most sensible approach is to focus not on specific results but on the overall characteristics of each model. Claude will likely remain meticulous, Gemini creative, Copilot the strongest in translation, and ChatGPT the most well‑balanced. The quality of the answers can (and will!) change, but the fundamental characteristics and the “personality” of the models seem more stable. 

At least, that’s my hypothesis.  

When The Dust Settles 

Clients still ask me “Which AI should I use?”. And I still answer “It’s complicated”… but now I follow up with: “Tell me exactly what you need AI for, and I’ll help you choose the right tool”. 

PS: Guess which AI helped me write this summary analysis. 😊

For detailed results and insights from all 10 MortAI Kombat rounds visit my LinkedIn profile: https://www.linkedin.com/in/rafalbielicki/.
