Microsoft experiment: AI and documents

A Microsoft study published on April 17, 2026, and updated in an official blog post on May 15, has become one of the most revealing tests of how reliably modern language models perform under conditions close to autonomous office work. Its main conclusion is measured but firm: current AI systems are not yet ready to fully replace humans in tasks that require long, continuous, unsupervised work with documents.

The experiment was not based on standard question-answer benchmarks or isolated tasks. Instead, it simulated a real working environment where a document is gradually edited, refined, rewritten, and extended over multiple steps. This is the closest approximation to how AI is currently being used in office workflows, from report writing to legal and analytical documentation.

Researchers tested 19 modern models, including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4. The tasks covered 52 professional domains: programming, accounting, data analysis, scientific writing, crystallography, musical notation, and other specialized formats where not only text generation matters, but also structural consistency, meaning, and accuracy across long chains of modifications.

The key difference from standard benchmarks was the absence of intermediate human supervision. The model had to independently carry a document through a sequence of edits, simulating a scenario in which a worker fully delegates the task to the system and only reviews the final output.

This is where the main issue emerged.

Even the most advanced models gradually distorted content. In the best-performing systems, about 25% of the information was altered or lost after 20 consecutive editing steps. Across the full model set, the average distortion rate reached approximately 50%, meaning that roughly every second piece of information in a long workflow could be changed, simplified, or reinterpreted relative to the original.

Importantly, these are not obvious or catastrophic errors. On the contrary, researchers note that changes often appear reasonable at each individual step. The problem arises from accumulation: small deviations early in the process gradually compound into significant semantic drift. The document does not break suddenly but slowly “drifts,” losing accuracy over time.
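To see how small per-step deviations add up, here is a minimal back-of-the-envelope sketch in Python. It assumes, purely for illustration, that every editing step preserves the same fraction of the content. The study does not claim that drift is this uniform, and the function names are hypothetical. Under that assumption, the per-step loss implied by the reported 20-step figures can be computed directly.

```python
def per_step_retention(total_loss: float, steps: int) -> float:
    """Return the constant per-step retention r with r**steps == 1 - total_loss.

    Assumption (not from the study): each step keeps the same fraction
    of the information that survived the previous step.
    """
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return (1.0 - total_loss) ** (1.0 / steps)


def retained_after(retention: float, steps: int) -> float:
    """Fraction of the original information left after `steps` edits."""
    return retention ** steps


# Figures quoted in the article: 25% lost (best models) and
# 50% lost (average over all models) after 20 editing steps.
for label, loss in (("best models", 0.25), ("all-model average", 0.50)):
    r = per_step_retention(loss, 20)
    print(f"{label}: about {100 * (1 - r):.2f}% lost per step")
```

Under this simplification, a loss of roughly 1.4% per step is enough to produce the 25% figure, and roughly 3.4% per step produces the 50% figure. Deviations that small are easy to overlook at any single step.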

This effect is especially pronounced in complex tasks with large context sizes. The longer the document and the more sequential operations the model performs, the higher the likelihood that the final output will significantly diverge from the original intent.

At the same time, the study identified a key exception. In Python programming tasks, most models maintained accuracy above 98% even after 20 iterations. The authors attribute this to the formal structure of programming languages: strict syntax rules and unambiguous verification reduce the space for cumulative drift. In simple terms, where rules are precise, AI drifts much less.

The study’s authors, Philippe Laban, Tobias Schnabel, and Jennifer Neville, conclude cautiously but firmly that current language models are not sufficiently reliable for long autonomous execution without human oversight. They note that in real-world applications some of these issues can be mitigated through intermediate checks, but the absence of such supervision was precisely what the experiment aimed to test.

The key point is the nature of the error itself: it is not random failure but a systematic accumulation of deviations. Models may perform accurately on short tasks yet lose stability in long, self-sustaining processes.

The methodology has been published openly, allowing independent teams to reproduce results and test the boundaries of the observed effect.

A useful historical analogy helps clarify the scale of the problem. In the 1970s, early industrial robots showed similar behavior: long-running operations without recalibration led to accumulated mechanical deviations that disrupted assembly lines. The solution was not abandoning automation, but introducing regular checkpoints and recalibration systems. In essence, today’s language models face a similar class of limitation, but at the informational rather than mechanical level.

This is why the industry is increasingly moving toward a hybrid model: AI performs the core work, while humans remain points of verification and final control.
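The effect of periodic human checks can be illustrated with a toy model in Python. This is a sketch under strong assumptions that do not come from the study: each editing step preserves a fixed fraction of the content, and a human review fully restores the document. The function name and the checkpoint interval are hypothetical.

```python
from typing import Optional


def worst_fidelity(retention: float, steps: int,
                   checkpoint_every: Optional[int] = None) -> float:
    """Lowest fidelity reached at any point during `steps` edits.

    Assumptions (illustrative only): every edit multiplies fidelity by
    `retention`, and a human checkpoint resets fidelity to 1.0.
    """
    fidelity = 1.0
    worst = 1.0
    for step in range(1, steps + 1):
        fidelity *= retention            # drift introduced by one edit
        worst = min(worst, fidelity)     # track the lowest point seen
        if checkpoint_every and step % checkpoint_every == 0:
            fidelity = 1.0               # idealized human correction
    return worst


# Per-step retention chosen so that 20 unchecked steps lose 25%,
# matching the best-model figure quoted in the article.
RETENTION = 0.75 ** (1 / 20)

print(f"no checkpoints:       {worst_fidelity(RETENTION, 20):.3f}")
print(f"checkpoint every 5:   {worst_fidelity(RETENTION, 20, 5):.3f}")
```

In this toy model the document never falls below about 93% fidelity when a person checks it every five steps, compared with 75% when all 20 steps run unchecked. Real reviews are not perfect, so the actual benefit is likely smaller.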

Practical consequences are already emerging. There have been documented cases where errors in long automated chains led to real financial losses, including an incident involving a mistaken transfer of about $441,000 due to a single AI agent error. This demonstrates that cumulative error effects are not confined to text — they directly translate into economic outcomes.

In conclusion, the Microsoft study effectively defines the current boundary of AI development: models are already capable of performing complex tasks, but they are not yet capable of reliably maintaining accuracy during long autonomous workflows without external control.

