From tenant-aware to job-aware: scaling shared AI clusters with Cisco Nexus One

Discover how Job-ID-based segmentation secures and optimizes backend AI fabrics, ensuring high-performance connectivity for demanding AI and machine learning workloads at scale.

AI clusters are becoming shared infrastructure. Neoclouds, enterprise AI platform teams, financial services organizations, life sciences teams, and research groups need to share GPU capacity. Without the right network foundation, this shared infrastructure can suffer from lower monetization, increased operational complexity, and limited control and visibility across tenants, workloads, hosts, and the network fabric.

EVPN/VXLAN is the practical network foundation. It provides tenant-scoped overlay segmentation using VRFs, VNIs, route distinguishers, and route targets. However, tenant-aware segmentation is not job-aware segmentation. The scheduler understands jobs; the network typically understands routes, interfaces, queues, drops, and flows.

Why AI clusters need multitenancy

Dedicated GPU clusters are simple to isolate, but they are inefficient to operate at scale. As GPU estates grow, organizations want a shared resource pool that can serve multiple teams, customers, and workload classes without forcing every group into its own physical cluster. Otherwise, one group can have stranded GPUs in a dedicated island while another waits in queue.

The requirement appears in several patterns:

A GPU-as-a-Service provider maps each tenant to an external customer with its own address and policy domain (per-customer isolation while keeping the GPU pool shareable).
An enterprise platform team maps tenants to development, testing, production fine-tuning, model evaluation, or regulated analytics (consistent environment boundaries without building separate clusters).
A financial services department separates fraud analytics, risk modeling, and research workloads on one training cluster (stronger control boundaries and auditability without duplicating GPU islands).
A research organization assigns shared GPU capacity to independent research groups (clearer quota, usage, and troubleshooting accountability across competing projects).

This is why multitenancy cannot stop at compute allocation. Distributed training depends on east-west GPU communication, typically over Ethernet fabrics, so the network becomes an integral part of the isolation and performance boundary.

How the industry solves it today

Current AI multitenancy is usually implemented across three layers:

Orchestration and scheduler layer. Kubernetes-based platforms, GPU cloud orchestration systems, and Slurm schedulers define the logical ownership model for the cluster. They track tenants or projects, users, queues or namespaces, job requests, node placement, and GPU allocation. For example, Tenant A might submit Job 100 requesting eight GPUs across two servers, while Tenant B submits Job 200 requesting four GPUs on a different set of nodes. With an orchestration platform such as Rafay, the platform can own tenant onboarding and infrastructure intent, while the actual job scheduling may happen in Kubernetes, Slurm, or a tenant-operated scheduler.
Host isolation layer. The host enforces the local device boundary for each workload. If a tenant receives whole servers, isolation is simpler because the server, GPU set, and NIC set can be treated as one tenant-owned unit. If multiple tenants or jobs share GPUs within the same server, the runtime must expose only the assigned GPU devices and bind the workload’s communication libraries, such as NCCL or UCX, to the intended NICs. This host-side mapping matters because a GPU server may have multiple NICs connected to different switches or tenant-facing network segments. Fabric segmentation can isolate traffic once it enters the network, but it cannot correct an incorrect local assignment where the workload is allowed to use the wrong GPU or NIC.
Network segmentation layer. EVPN/VXLAN provides scalable tenant segmentation across the fabric. VXLAN encapsulates tenant traffic and uses VNIs to identify the overlay segment or routing domain. EVPN uses BGP to advertise endpoint and prefix reachability and to control which VTEPs import a tenant’s routes through route targets. In a routed AI fabric, a tenant commonly maps to a VRF and one or more VNIs, with route distinguishers keeping tenant routes unique and route targets controlling import-export policy. This allows overlapping tenant address space and scoped reachability across a shared underlay.
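The route-target behavior described in the network segmentation layer can be sketched in a few lines of Python. This is an illustrative model only, not switch configuration: the class names, VNI numbers, route distinguisher values, and route-target strings are all hypothetical. It shows two tenants advertising the same prefix, with route distinguishers keeping the routes distinct and route targets deciding which VRF imports which route.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantVrf:
    """Hypothetical model of one tenant's VRF on a VTEP."""
    name: str
    l3_vni: int
    rd: str                  # route distinguisher: keeps tenant routes unique
    import_rts: frozenset    # route targets this VRF accepts
    export_rts: frozenset    # route targets attached to advertised routes


@dataclass(frozen=True)
class EvpnRoute:
    rd: str
    prefix: str
    route_targets: frozenset


def advertise(vrf: TenantVrf, prefix: str) -> EvpnRoute:
    """Build the route a VTEP would advertise for a prefix in this VRF."""
    return EvpnRoute(rd=vrf.rd, prefix=prefix, route_targets=vrf.export_rts)


def imports(vrf: TenantVrf, route: EvpnRoute) -> bool:
    """A VRF imports a route only if at least one route target matches."""
    return bool(vrf.import_rts & route.route_targets)


# Illustrative values only.
tenant_a = TenantVrf("tenant-a", 50001, "10.0.0.1:50001",
                     frozenset({"65000:50001"}), frozenset({"65000:50001"}))
tenant_b = TenantVrf("tenant-b", 50002, "10.0.0.1:50002",
                     frozenset({"65000:50002"}), frozenset({"65000:50002"}))

# Both tenants use the same address space.
route_a = advertise(tenant_a, "10.1.0.0/24")
route_b = advertise(tenant_b, "10.1.0.0/24")

for vrf in (tenant_a, tenant_b):
    for route in (route_a, route_b):
        print(vrf.name, "imports", route.rd, route.prefix, "->", imports(vrf, route))
```

Because the route key includes the route distinguisher, the two identical prefixes never collide, and each VRF only sees its own tenant's reachability.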

ACLs or security group ACLs can add source and destination policy, especially in brownfield L3 designs or where the fabric cannot yet consume richer workload identity. Their limitation is operational scale: static or manually updated ACL and VRF policies do not naturally follow fast-changing AI job placement.

Together, these layers provide a workable tenant-level model. The remaining gap is job context: the network can usually see tenant context, interfaces, routes, queues, and counters, but not the specific scheduler job running inside a tenant. Tenant segmentation itself does not automatically isolate Job 100 from Job 101 inside the same tenant unless job identity is also carried, derived, or programmed into network policy.

Cisco Nexus One integration with AI infrastructure orchestration platforms

Cisco Nexus One is well positioned as the broader foundation for making tenant-aware AI fabrics more deterministic. In this architecture, Nexus One is the automation, integration, and visibility surface for the entire fabric.

Figure 1. Nexus One delivers secure multitenant isolation and automated onboarding for backend AI fabrics, enabling efficient XPU infrastructure monetization.

Nexus One can provide fabric topology context to an AI infrastructure orchestration platform such as Rafay through integration workflows or APIs. That lets teams map tenant VRFs, VLANs, and port assignments directly to a tenant, rather than managing them only as an abstract tenant label.

In addition, Nexus One extends the model beyond provisioning. Tenant-level visibility can show the tenant’s fabric path and relevant health signals such as congestion and drops. This complements AI job observability: job-aware views can correlate scheduler, topology, optics, GPU telemetry, analytics, and anomalies, while tenant-specific Job-ID enforcement remains a separate future-facing policy capability.

Tenant-aware is not job-aware

Tenant segmentation answers the question, “Which customer or organization owns this traffic?” AI operations often need, “Which training job is creating or experiencing this traffic within a tenant?”

This distinction matters for segmentation as well as during troubleshooting. A scheduler can identify the job, allocated nodes, GPUs, and runtime state. The network can identify interfaces, routes, queues, drops, ECN marks, PFC events, optics health, and paths. Without correlation, operators must manually connect these two views.

The result is a common operational problem: the fabric shows a hot uplink or lossy interface, while the platform team sees a slow training job. The missing link is the workload identity in the network operating model.
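The manual correlation described above amounts to a join between the scheduler view and the network view. The following Python sketch assumes three hypothetical inputs: job-to-node placement from the scheduler, node-to-port attachment from fabric topology, and per-interface counters from telemetry. All node names, interface names, and counter values are made up for illustration, and no real scheduler or controller API is used.

```python
from collections import defaultdict

# Scheduler view (hypothetical): which nodes each job runs on.
job_nodes = {
    "job-100": ["gpu-node-1", "gpu-node-2"],
    "job-101": ["gpu-node-3"],
}

# Topology view (hypothetical): which switch ports each node attaches to.
node_ports = {
    "gpu-node-1": [("leaf-1", "Ethernet1/1")],
    "gpu-node-2": [("leaf-1", "Ethernet1/2")],
    "gpu-node-3": [("leaf-2", "Ethernet1/1")],
}

# Network view (hypothetical): per-interface counters.
interface_counters = {
    ("leaf-1", "Ethernet1/1"): {"drops": 0, "ecn_marks": 12},
    ("leaf-1", "Ethernet1/2"): {"drops": 340, "ecn_marks": 9800},
    ("leaf-2", "Ethernet1/1"): {"drops": 0, "ecn_marks": 3},
}


def jobs_by_interface(job_nodes, node_ports):
    """Index every switch port by the set of jobs whose nodes attach to it."""
    index = defaultdict(set)
    for job, nodes in job_nodes.items():
        for node in nodes:
            for port in node_ports.get(node, []):
                index[port].add(job)
    return index


def impacted_jobs(interface_counters, index, drop_threshold=1):
    """Return {job: [ports]} for ports whose drop count meets the threshold."""
    impacted = defaultdict(list)
    for port, counters in interface_counters.items():
        if counters.get("drops", 0) >= drop_threshold:
            for job in sorted(index.get(port, ())):
                impacted[job].append(port)
    return dict(impacted)


index = jobs_by_interface(job_nodes, node_ports)
print(impacted_jobs(interface_counters, index))
```

With this kind of index in place, a lossy interface can be reported in terms of the job it affects rather than only as a port name.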

Future direction: AI Job-ID-aware segmentation

Job-ID-aware segmentation, a patent-pending technology from Cisco, is the logical next step. (Note that this describes our architectural direction, not a shipping feature.) The goal is for the intent expressed by the infrastructure orchestrator (such as Rafay) and the scheduler (such as Slurm) to carry both tenant identity and job identity into the fabric control-plane and data-plane model.

In that model, the fabric controller translates job intent into policy. The switch data plane carries or derives a job ID, for example through VXLAN GPO bits, and enforces that only endpoints in the same authorized tenant and job can exchange RoCEv2 traffic.
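To make the idea of carrying an identifier in VXLAN GPO bits concrete, here is a minimal Python sketch that packs and unpacks an 8-byte VXLAN header with the Group Policy extension. It assumes the header layout from the VXLAN Group Policy Option Internet-Draft: a G flag, an I flag, a 16-bit Group Policy ID, and a 24-bit VNI. Using the Group Policy ID to hold a job-derived value is purely an assumption for illustration. The actual encoding and the job-to-identifier mapping are not specified here, and the function names are hypothetical.

```python
import struct

VXLAN_FLAG_G = 0x80  # Group Policy ID field is present
VXLAN_FLAG_I = 0x08  # VNI field is valid


def pack_vxlan_gpo(vni: int, group_policy_id: int) -> bytes:
    """Pack an 8-byte VXLAN header carrying a Group Policy ID.

    Assumed layout: flags (8 bits), policy flags (8 bits, left at zero),
    Group Policy ID (16 bits), VNI (24 bits), reserved (8 bits).
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    if not 0 <= group_policy_id < 2 ** 16:
        raise ValueError("Group Policy ID must fit in 16 bits")
    flags = VXLAN_FLAG_G | VXLAN_FLAG_I
    return struct.pack("!BBHI", flags, 0, group_policy_id, vni << 8)


def unpack_vxlan_gpo(header: bytes):
    """Return (vni, group_policy_id); the ID is None if the G flag is clear."""
    flags, _policy_flags, group_policy_id, vni_word = struct.unpack("!BBHI", header)
    if not flags & VXLAN_FLAG_I:
        raise ValueError("VNI flag not set")
    gid = group_policy_id if flags & VXLAN_FLAG_G else None
    return vni_word >> 8, gid


# Hypothetical mapping: tenant VNI 50001, job-derived policy ID 100.
header = pack_vxlan_gpo(vni=50001, group_policy_id=100)
print(header.hex())
print(unpack_vxlan_gpo(header))
```

One consequence of this layout is that the identifier space is 16 bits wide, so a controller would need to allocate and recycle values rather than copy arbitrary scheduler job numbers directly.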

The expected benefits are operationally important:

Simpler operations, because teams can reason in tenants and jobs instead of translating every change into static network objects—crucial for NetOps and fabric operations teams.
Deeper visibility, because drops, congestion, paths, and optics can be correlated to workload context rather than only to interfaces or tenant VRFs—beneficial for platform engineering and SRE teams.
More granular segmentation, because policy can follow the lifecycle of a job rather than stopping at the tenant boundary—important for security, compliance, and tenant governance teams.

This approach is built on open standards—not a proprietary overlay. EVPN/VXLAN is IETF-defined, and the Group Policy Option (GPO) follows the same path: an IETF-defined mechanism that encodes a group/policy identifier in the VXLAN header alongside the VNI, which Cisco NX-OS implements in alignment with the open specification. Tenant scope (VNI) and workload/job scope (GPO) are therefore expressed in constructs a standards-compliant fabric can interpret—letting operators evolve from tenant-aware to job-aware enforcement without a fabric forklift.

Technical example: tenant and job boundaries

Consider a GPU-as-a-Service environment with two customers, Tenant A and Tenant B. Each tenant is mapped to its own VRF/VNI in the EVPN/VXLAN fabric. Tenant-level segmentation prevents Tenant B from reaching Tenant A.

Figure 2. Nexus One integrates with job schedulers to provide granular, AI job-level segmentation, delivering deeper visibility and faster troubleshooting for AI fabrics.

Now assume Tenant A runs two concurrent training jobs. Job 100 uses GPUs on servers 1 and 2. Job 101 uses different GPUs on the same shared fabric. Tenant-level EVPN/VXLAN still treats both jobs as Tenant A traffic. Job-ID-aware segmentation would add another enforcement dimension: Job 100 endpoints could exchange RoCEv2 traffic with other Job 100 endpoints, but not with Job 101 endpoints, even inside the same tenant.
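The difference between the two enforcement dimensions can be expressed as two small predicates. This Python sketch is a conceptual model of the example above, not an implementation of any fabric feature. The endpoint names and the `Endpoint` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    tenant: str
    job_id: int


def tenant_allows(src: Endpoint, dst: Endpoint) -> bool:
    """Tenant-level segmentation: same tenant (VRF/VNI) can communicate."""
    return src.tenant == dst.tenant


def job_allows(src: Endpoint, dst: Endpoint) -> bool:
    """Job-level segmentation: same tenant and same job can communicate."""
    return tenant_allows(src, dst) and src.job_id == dst.job_id


a_100_s1 = Endpoint("server-1-gpu0", "tenant-a", 100)
a_100_s2 = Endpoint("server-2-gpu0", "tenant-a", 100)
a_101 = Endpoint("server-3-gpu0", "tenant-a", 101)
b_200 = Endpoint("server-4-gpu0", "tenant-b", 200)

for src, dst in [(a_100_s1, a_100_s2), (a_100_s1, a_101), (a_100_s1, b_200)]:
    print(f"{src.name} -> {dst.name}: "
          f"tenant={tenant_allows(src, dst)} job={job_allows(src, dst)}")
```

The middle pair is the interesting case: tenant-level policy permits Job 100 to reach Job 101 because both belong to Tenant A, while the job-level predicate denies it.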

That is the architectural shift: EVPN/VXLAN remains the tenant foundation, while Job ID becomes the future workload-level policy and observability attribute.

Advancing security from tenant-level to job-level segmentation

AI data center multitenancy starts with EVPN/VXLAN tenant segmentation, but it does not end there. The stronger operating model combines topology-aware provisioning, tenant-level enforcement, and end-to-end visibility today, then evolves toward Job-ID-aware segmentation as scheduler and orchestrator integration matures.

If you are designing a shared AI cluster today, tenant-aware EVPN/VXLAN is the foundation. Job-aware enforcement and observability are the next frontier.

*Special thanks to Ramesh Ponnapalli and his team, whose engineering leadership has been instrumental in bringing this technology to life.
