This Week in Cloud: AKS, Azure Files, AI Debt

It has been a week where Microsoft shipped two quietly significant pieces of infrastructure plumbing while the broader industry got a reminder that moving fast with AI is not the same as moving well. Google I/O brought the usual wave of model announcements, and the GitHub bug bounty story turned out to be more interesting than the headline suggested.

Azure Kubernetes Fleet Manager gets a proper network backbone

Microsoft announced the public preview of cross-cluster networking for Azure Kubernetes Fleet Manager this week, built on a managed Cilium layer delivered through Advanced Container Networking Services. The feature lets AKS clusters registered to a fleet communicate directly across virtual networks, regions, and subscriptions, with global service discovery, identity-based policy enforcement, and transparent encryption handled by the platform rather than by hand-rolled VPNs or custom mesh tooling. A fleet can include up to 255 member clusters, and services can be published globally so that any connected cluster can consume them as if they were local.

Why it matters: If you are running workloads across multiple AKS clusters today, you are almost certainly stitching them together with gateways and bespoke glue that nobody is proud of. This preview is an early but serious attempt to make the fleet the unit of design, rather than something you bolt together after the fact. Worth evaluating now, with realistic expectations given it is still preview-grade.
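
As a rough mental model of the globally published services described above, the toy Python sketch below treats a fleet as a shared registry that any member cluster can resolve against. Everything in it (the Fleet class, its methods, and the cluster names) is hypothetical and purely illustrative. It is not the Fleet Manager API or any Cilium configuration, and only the 255-member limit comes from the announcement.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAX_MEMBER_CLUSTERS = 255  # fleet size limit quoted in the announcement


@dataclass
class Fleet:
    """Toy model of a fleet as a shared service registry (not an Azure API)."""

    members: set[str] = field(default_factory=set)
    # Maps a globally published service name to the cluster that backs it.
    global_services: dict[str, str] = field(default_factory=dict)

    def join(self, cluster: str) -> None:
        """Register a cluster, enforcing the member limit."""
        if cluster not in self.members and len(self.members) >= MAX_MEMBER_CLUSTERS:
            raise ValueError("fleet already has the maximum number of members")
        self.members.add(cluster)

    def publish(self, cluster: str, service: str) -> None:
        """Publish a service from a member cluster to the whole fleet."""
        if cluster not in self.members:
            raise KeyError(f"{cluster} is not a fleet member")
        self.global_services[service] = cluster

    def resolve(self, caller: str, service: str) -> str:
        """Return the backing cluster; any member can resolve any global service."""
        if caller not in self.members:
            raise KeyError(f"{caller} is not a fleet member")
        return self.global_services[service]


# Hypothetical cluster and service names, for illustration only.
fleet = Fleet()
fleet.join("aks-uksouth")
fleet.join("aks-westeurope")
fleet.publish("aks-uksouth", "orders")
backing_cluster = fleet.resolve("aks-westeurope", "orders")
```

The point of the model is the design shift rather than the mechanics: the caller asks the fleet for a service by name and never needs to know which cluster, region, or subscription hosts it.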

Azure Files drops the domain controller dependency

Microsoft reached general availability this week for Entra-Only identities on Azure Files SMB shares. The change means organisations can grant identity-based access to file shares using cloud-only Microsoft Entra ID accounts, with no on-premises Active Directory, no hybrid sync, no Entra Domain Services, and no managed domain controllers required. Authentication uses Kerberos tickets issued by Microsoft Entra ID rather than by a domain controller, and the feature supports FIDO2 keys, Windows Hello for Business, and MFA through the standard Entra authentication stack.

Why it matters: File shares have been one of the stickiest pieces of legacy infrastructure precisely because they forced a domain controller dependency even in otherwise cloud-native environments. For anyone running Azure Virtual Desktop or planning a final push to retire on-premises identity infrastructure, this closes a gap that has been sitting there for years.

Operational debt is already breaking AI strategies

A piece from The New Stack this week, backed by PagerDuty research, put some numbers behind what many architects already suspect. According to the data, 84% of companies have already experienced at least one AI-related outage, and 68% lose more than £240,000 per hour when systems go down. The article identifies three compounding debt types: technical and automation debt from unautomated, non-standardised processes; integration debt from AI tools dropped into siloed environments that cannot correlate signals; and human-AI partnership debt, which is the costly failure to define which decisions belong to machines and which belong to humans.

Why it matters: AI failures do not behave like traditional incidents. Models drift, agents misinterpret context, and root causes are harder to trace. If your incident management process was designed for conventional infrastructure, it is already behind. The article’s point about MCP servers reducing integration debt without months of project work is worth a closer look for teams juggling multiple AI tooling investments.
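
To put the per-hour figure above in concrete terms, the short Python sketch below converts an outage duration into a cost floor. It assumes losses accrue linearly at the £240,000 hourly rate, which is a simplification of ours and not a claim from the research; the function name is hypothetical.

```python
HOURLY_LOSS_GBP = 240_000  # lower bound reported for 68% of companies


def outage_cost_floor(minutes: float, hourly_loss: float = HOURLY_LOSS_GBP) -> float:
    """Estimate the minimum cost of an outage, assuming linear accrual."""
    if minutes < 0:
        raise ValueError("outage duration cannot be negative")
    return hourly_loss * minutes / 60


quarter_hour = outage_cost_floor(15)  # 60000.0
ninety_minutes = outage_cost_floor(90)  # 360000.0
```

Under that assumption, even a 15-minute incident clears £60,000, which is a useful figure to have to hand when arguing for investment in incident response.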

Google I/O brings Gemini 3.5 and a proper agent control plane

Google’s I/O announcements this week centred on two things relevant to cloud architects: a new generation of models and an expanded Antigravity platform. Gemini 3.5 Flash is positioned as the strongest agentic and coding model Google has shipped, delivering frontier-level intelligence at significantly lower cost than comparable large models. Antigravity 2.0 arrives as a standalone desktop application for building and orchestrating complex agent workflows, accompanied by a CLI, Python SDK, and a dynamic subagent capability that lets a primary agent spawn specialised child agents for parallel tasks. The Managed Agents API, available via both the Gemini API and Google Cloud, handles backend infrastructure so teams can define agent behaviour without managing the runtime.

Why it matters: The Antigravity updates shift the conversation from “which model do we use” to “how do we operate agents consistently at scale.” For platform engineers evaluating where to invest in agentic infrastructure, the managed agent primitives are the most practically interesting announcement here.
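
The dynamic subagent idea is easier to picture as code. The Python sketch below shows the general fan-out pattern using only the standard library's asyncio module. It does not use the Antigravity SDK or the Managed Agents API, whose interfaces are not described in the announcement coverage; the function names and task strings are hypothetical stand-ins.

```python
import asyncio


async def child_agent(role: str, task: str) -> str:
    """Stand-in for a specialised child agent; a real one would call a model."""
    await asyncio.sleep(0)  # yield control, simulating I/O-bound work
    return f"[{role}] done: {task}"


async def primary_agent(subtasks: dict[str, str]) -> list[str]:
    """Spawn one child per subtask and run them concurrently."""
    children = [child_agent(role, task) for role, task in subtasks.items()]
    # gather returns results in the same order the children were passed in.
    return await asyncio.gather(*children)


results = asyncio.run(
    primary_agent({"tests": "write unit tests", "docs": "update README"})
)
```

The operational questions start where this sketch stops: timeouts, retries, cost limits, and audit trails for each child are exactly the concerns a managed runtime is meant to take off your hands.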

GitHub starts paying bug bounty hunters in swag for AI-generated noise

GitHub announced this week that it is tightening standards across its bug bounty programme in response to a surge in AI-assisted submissions that lack proof-of-concept validation or demonstrated impact. Working PoC demonstrations are now required, verbose AI-generated reports will be deprioritised in triage, and lower-severity findings that result in a fix may receive company merchandise rather than a cash payout. The company was clear that AI-assisted research is welcome, but that the researcher remains accountable for validating findings before submission. This follows cURL shutting down its own bug bounty programme earlier this year for the same reason.

Why it matters: This is an early signal of a broader quality problem that will affect any security programme that relies on external researchers. If you run or advise on a vulnerability disclosure programme, the GitHub guidance on shared responsibility boundaries and ineligible vulnerability categories is worth reading directly. The prompt injection and malicious repository rulings in particular clarify where platform owners are drawing the line.

Looking ahead

The Azure networking and identity announcements this week both move in the same direction: reducing the number of legacy dependencies that organisations carry as an invisible tax on their cloud estates. The AI operational debt story asks a harder question. Most teams can deploy AI tooling faster than they can build the processes to operate it safely. Which piece of your current AI strategy would survive an honest audit of whether your incident management, integration model, and human oversight boundaries are actually fit for purpose?